Context
DTPR has now been used in dozens of cities to bring transparency to data collection in public space. It has proven effective at communicating to citizens the who, how, where, and why of data collection, increasing their understanding of and trust in the sensors deployed. For organizations that manage public spaces, the process of deploying DTPR has also broken down internal silos by bringing together disparate departments (public works, smart cities, IT, public engagement, etc.), getting all staff on the same page about the purpose of a sensor deployment, its risks, its value, and the lifecycle of the data it will produce.
Problem
DTPR was designed to describe sensors collecting data in public space. But as deployers of DTPR have grown more ambitious, aiming to develop comprehensive digital transparency programs for their constituencies, public data collection has become only one part of what they need to describe.
Today, a comprehensive digital transparency program would be incomplete if it did not include transparency about the role of AI and algorithms in public decision-making. DTPR is not currently suited to describing these.
Proposal
Grow and adapt DTPR’s framework, methodology, and taxonomy to describe AI models and algorithms.
The current DTPR datachain, which I’ll call the Sensor datachain, is designed to describe a sensor collecting data in public space and the story of the data collected: how it is processed, who has access to it, and where and for how long it is stored.
An AI datachain would describe an AI or algorithm: who is accountable for it and its purpose, the primary AI/algorithm technologies being used, what decisions will be made using it and what level of autonomy it has in those decisions, what risks it may pose, who has access to it, and where the AI is run and its results stored.
Datachain Categories
| Shape | Sensor datachain | AI datachain | Contextual color |
|---|---|---|---|
| hexagon | Accountable | Accountable | |
| hexagon | Purpose | Purpose | |
| hexagon | | Decision Making 🆕 | Level of autonomy in decision-making |
| hexagon | (Data-collection) Technology | | Data collected is personally identifiable |
| circle | Data Type [deprecate] 🗑️ | | |
| circle | | Input Datasets 🆕 | |
| circle | Processing (Technology) | Processing (Technology) ♻️ with location context | |
| circle | Output Datasets 🆕 | Output Datasets 🆕 | |
| square | Access | Access | |
| square | Storage with location context | Storage with location context | |
| octagon | | Risks & Mitigation 🆕 | |
| octagon | Rights 🆕 | Rights 🆕 | |
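To make the proposed categories concrete, here is a minimal sketch of how an AI datachain record might be modeled, written in TypeScript purely for illustration. The type and field names are my assumptions, not part of any existing DTPR schema.

```typescript
// Hypothetical sketch only: names are illustrative, not an existing DTPR schema.
type AutonomyColor = "orange" | "yellow" | "blue"; // see "Level of Autonomy" below

interface AIDatachain {
  accountable: string;            // hexagon: who is accountable for the AI/algorithm
  purpose: string;                // hexagon: why it is being used
  decisionMaking: {               // hexagon (new): decisions the AI/algorithm informs
    decisions: string[];
    autonomy: AutonomyColor;      // contextual color: level of autonomy
  };
  inputDatasets: string[];        // circle (new): datasets fed into the trained model/algorithm
  processingTechnology: {         // circle (reused): type of AI/algorithm being used
    technology: string;
    locationContext: string;      // e.g. "Processed locally"
  };
  outputDatasets: string[];       // circle (new): datasets the AI/algorithm produces
  access: string;                 // square: who has access to it and its results
  storage: {                      // square: where results are stored
    description: string;
    locationContext: string;
  };
  risksAndMitigation: string[];   // octagon (new): identified risks and their mitigations
  rights: string;                 // octagon (new): rights in relation to the AI/algorithm
}
```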
“Decision Making” taxonomy category 🆕
What decisions does this AI/algorithm inform?
Possible Elements
- Allocation of resources
- For example, optimization of trash pickup routes, traffic plans
- Accept or deny
- For example, civil service application, or a child’s acceptance to a school
- Ranking of priority
Contextual color
Level of Autonomy
- Orange: The AI/algorithm processes, decides, and executes without human involvement
- Yellow: The AI/algorithm processes and generates a decision, but a human is required to execute it
- Blue: The AI/algorithm processes and flags information for a human to evaluate, decide, and act on
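As a sketch, the contextual colors could map to plain-language descriptions roughly like this (the type and constant names are illustrative assumptions, not defined by DTPR):

```typescript
// Hypothetical mapping of "Level of Autonomy" contextual colors to their meanings.
type AutonomyColor = "orange" | "yellow" | "blue";

const autonomyLevels: Record<AutonomyColor, string> = {
  orange: "Processes, decides, and executes without human involvement",
  yellow: "Processes and generates a decision, but a human is required to execute it",
  blue: "Processes and flags information for a human to evaluate, decide, and act on",
};
```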
“Input/Output Datasets” taxonomy category 🆕
What datasets are inputs into the algorithm/AI/data processing pipeline?
What datasets are produced by the algorithm/AI/data processing pipeline?
Possible Elements
This category would reuse the Data Type icons, but require a user to name and define a dataset based on its type.
Example Usage
Let's take the example of an algorithm or model that assigns children to the elementary school they will attend in a city. Potential input datasets might include school capacity and current class sizes, the locations of students' homes, the locations of schools, school bus routes, public transportation routes, etc. Potential output datasets might include the list of children and their assigned schools, an optimization analysis showing how the algorithm optimized for class size or for transportation distance, etc.
Note that input datasets are not meant to describe training data, but rather the inputs to an already trained model or algorithm. Training data and its potential risks and mitigations should be addressed in the "Risks & Mitigation" category.
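As an illustration, the school-assignment example above might declare its input and output datasets like this (a sketch only; the field names are my assumption, not a DTPR format):

```typescript
// Illustrative only: the school-assignment example expressed as named datasets.
const schoolAssignmentDatasets = {
  inputDatasets: [
    "School capacity and current class sizes",
    "Locations of students' homes",
    "Locations of schools",
    "School bus routes",
    "Public transportation routes",
  ],
  outputDatasets: [
    "List of children and their assigned schools",
    "Optimization analysis (class size, transportation distance)",
  ],
};
```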
“Processing Technology” category 🆕
What type of AI/Algorithm is being used?
Possible Elements
This category expands on the existing elements of the "Processing" category, adding more specific types of AI and algorithmic systems, such as:
- Regression Analysis
- Large Language Model
- Natural Language Processing
- Image recognition algorithms
Some questions I still have:
- What level of detail and precision should we aim for to strike the balance between technical accuracy that builds trust and explanatory simplicity that helps with understanding?
- For example, there are many different types of image recognition algorithms (Convolutional Neural Networks, Residual Networks, Scale-Invariant Feature Transform, etc.). Should we be that precise, or does “image recognition algorithm” suffice?
Location Context
Like the "Storage" category, which already has elements that define where data is stored, "Processing Technology" should have what I'm calling "Location Context" elements that allow a user to define where the processing is taking place. Examples include:
- Processed locally
- Processed internationally
- Processed on a 3rd party cloud
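A sketch of how a "Processing Technology" element might be paired with a "Location Context" element (the names and structure here are assumptions for illustration, not an existing DTPR format):

```typescript
// Hypothetical pairing of a Processing Technology element with a Location Context element.
type LocationContext =
  | "Processed locally"
  | "Processed internationally"
  | "Processed on a 3rd party cloud";

const processingTechnology: { element: string; locationContext: LocationContext } = {
  element: "Image recognition algorithm",
  locationContext: "Processed on a 3rd party cloud",
};
```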
“Risks & Mitigation” taxonomy category 🆕
What identified risks does this AI/algorithm pose?
A team at MIT conducted a meta-review of AI risks. In their own words,
The AI Risk Repository has three parts:
- The AI Risk Database captures 1000+ risks extracted from 56 existing frameworks and classifications of AI risks
- The Domain Taxonomy of AI Risks classifies these risks into 7 domains (e.g., “Misinformation”) and 23 subdomains (e.g., “False or misleading information”)
You can see the taxonomies here.
Leveraging this existing, open source work, we could generate elements for each risk in the MIT taxonomy.
Possible Elements
A user of these elements would outline the mitigation to the given risk in the DTPR element's "additional description" field.
- False or misleading information
- Over-reliance and unsafe use
- Unequal performance across groups
- Compromise of privacy by obtaining, leaking, or correctly inferring sensitive information
- Power centralization and unfair distribution of benefits
- AI system security vulnerabilities and attacks
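For example, a deployer might document one of these risk elements together with its mitigation in the element's "additional description" field, roughly like this (a sketch with assumed field names and invented mitigation text, not an existing DTPR format):

```typescript
// Illustrative sketch: a risk element from the MIT taxonomy with its mitigation
// recorded in the "additional description" field, as proposed above.
const riskElement = {
  name: "Unequal performance across groups",
  additionalDescription:
    "Mitigation: assignment outcomes are audited each year for disparities " +
    "across neighborhoods and demographic groups, and the model is adjusted " +
    "where disparities are found.",
};
```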
“Rights” taxonomy category 🆕
What rights do I have in relation to this AI/Algorithm?
See the open proposal on adding a Rights category to DTPR here: #230
What’s missing and why
Define what training data was used to train the AI/algorithm
I'd argue it should not be its own category. Here are my reasons:
- not all algorithms use training data
- it can be difficult to even know what training data was used (e.g., for some LLMs)
- the general, non-technical public's understanding of the risks and benefits of an AI/algorithm is, I would argue, not helped much by knowing what data a given model was trained on. A much more effective way to communicate this information would be to document what risks the training data may introduce, and to explain the mitigation strategies used to avoid those risks.
Request for Comments
This RFC is meant to spark a discussion amongst users of DTPR and other interested parties around how to adapt DTPR to AI. This is just the beginning. There is a lot more to define and decide. Consider this a jumping off point.
The first questions I would appreciate thoughts on are:
- Do the AI datachain categories capture what the general public wants to know about an AI/algorithm? Are we missing anything?
- What other research and resources might we learn from?