Context
DTPR has now been used in dozens of cities to bring transparency to data collection in public space. It has proven effective at communicating to citizens the who, how, where, and why of data collection, increasing their understanding of and trust in the sensors deployed. For organizations that manage public spaces, the process of deploying DTPR has also broken down internal silos by bringing together disparate departments (public works, smart cities, IT, public engagement, etc.), getting all staff on the same page about the purpose of a sensor deployment, its risks, its value, and the lifecycle of the data it will produce.
Problem
DTPR was designed to describe sensors collecting data in public space. But as deployers of DTPR have grown more ambitious, aiming to develop comprehensive digital transparency programs for their constituencies, public data collection has become only one part of what they need to describe.
Today, a comprehensive digital transparency program would be incomplete if it did not include transparency about the role of AI and algorithms in public decision-making. DTPR is not currently suited to describing these.
Proposal
Grow and adapt DTPR’s framework, methodology, and taxonomy to describe AI models and algorithms.
The current DTPR datachain, which I’ll call the Sensor datachain, is designed to describe a sensor collecting data in public space and the story of the data collected: how it is processed, who has access to it, and where and for how long it is stored.
An AI datachain would describe an AI or algorithm: who is accountable for it and its purpose, the primary AI/algorithm technologies being used, what decisions will be made using it and what level of autonomy it has in those decisions, what risks it may pose, who has access to it, and where the AI is run and its results stored.
Datachain Categories
| Shape | Sensor datachain | AI datachain | Contextual color |
|---|---|---|---|
| hexagon | Accountable | Accountable | |
| hexagon | Purpose | Purpose | |
| hexagon | | Decision Making 🆕 | Level of autonomy in decision-making |
| hexagon | (Data-collection) Technology | | Data collected is personally identifiable |
| circle | Data Type [deprecate] 🗑️ | | |
| circle | | Input Datasets 🆕 | |
| circle | Processing (Technology) | Processing (Technology) ♻️ with location context | |
| circle | Output Datasets 🆕 | Output Datasets 🆕 | |
| square | Access | Access | |
| square | Storage with location context | Storage with location context | |
| octagon | | Risks & Mitigation 🆕 | |
| octagon | Rights 🆕 | Rights 🆕 | |
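To make the proposed categories concrete, here is a minimal sketch of how an AI datachain record might be modeled, written in TypeScript purely for illustration. The type and field names are my assumptions, not part of any existing DTPR schema.

```typescript
// Hypothetical sketch only: names are illustrative, not an existing DTPR schema.
type AutonomyColor = "orange" | "yellow" | "blue"; // see "Level of Autonomy" below

interface AIDatachain {
  accountable: string;            // hexagon: who is accountable for the AI/algorithm
  purpose: string;                // hexagon: why it is being used
  decisionMaking: {               // hexagon (new): decisions the AI/algorithm informs
    decisions: string[];
    autonomy: AutonomyColor;      // contextual color: level of autonomy
  };
  inputDatasets: string[];        // circle (new): datasets fed into the trained model/algorithm
  processingTechnology: {         // circle (reused): type of AI/algorithm being used
    technology: string;
    locationContext: string;      // e.g. "Processed locally"
  };
  outputDatasets: string[];       // circle (new): datasets the AI/algorithm produces
  access: string;                 // square: who has access to it and its results
  storage: {                      // square: where results are stored
    description: string;
    locationContext: string;
  };
  risksAndMitigation: string[];   // octagon (new): identified risks and their mitigations
  rights: string;                 // octagon (new): rights in relation to the AI/algorithm
}
```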
“Decision Making” taxonomy category 🆕
What decisions does this AI/algorithm inform?
Possible Elements
- Allocation of resources
- For example, optimization of trash pickup routes, traffic plans
- Accept or deny
- For example, civil service application, or a child’s acceptance to a school
- Ranking of priority
Contextual color
Level of Autonomy
- Orange: The AI/algorithm processes, decides, and executes without human involvement
- Yellow: The AI/algorithm processes and generates a decision, but a human is required to execute it
- Blue: The AI/algorithm processes and flags information for a human to evaluate, decide, and act on
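As a sketch, the contextual colors could map to plain-language descriptions roughly like this (the type and constant names are illustrative assumptions, not defined by DTPR):

```typescript
// Hypothetical mapping of "Level of Autonomy" contextual colors to their meanings.
type AutonomyColor = "orange" | "yellow" | "blue";

const autonomyLevels: Record<AutonomyColor, string> = {
  orange: "Processes, decides, and executes without human involvement",
  yellow: "Processes and generates a decision, but a human is required to execute it",
  blue: "Processes and flags information for a human to evaluate, decide, and act on",
};
```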
“Input/Output Datasets” taxonomy category 🆕
What datasets are inputs into the algorithm/AI/data processing pipeline?
What datasets are produced by the algorithm/AI/data processing pipeline?
Possible Elements
This category would reuse the Data Type icons, but require a user to name and define a dataset based on its type.
Example Usage
Let's take the example of an algorithm or model that assigns children to the elementary school they will attend in a city. Potential input datasets might include school capacity and current class sizes, the locations of students' homes, the locations of schools, school bus routes, public transportation routes, etc. Potential output datasets might include the list of children and their assigned schools, an optimization analysis showing how the algorithm optimized for class size or for transportation distance, etc.
Note that input datasets are not meant to describe training data, but rather the inputs to an already trained model or algorithm. Training data and its potential risks and mitigations should be addressed in the "Risks & Mitigation" category.
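As an illustration, the school-assignment example above might declare its input and output datasets like this (a sketch only; the field names are my assumption, not a DTPR format):

```typescript
// Illustrative only: the school-assignment example expressed as named datasets.
const schoolAssignmentDatasets = {
  inputDatasets: [
    "School capacity and current class sizes",
    "Locations of students' homes",
    "Locations of schools",
    "School bus routes",
    "Public transportation routes",
  ],
  outputDatasets: [
    "List of children and their assigned schools",
    "Optimization analysis (class size, transportation distance)",
  ],
};
```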
“Processing Technology” category 🆕
What type of AI/Algorithm is being used?
Possible Elements
This category expands on the existing elements of the "Processing" category, adding more specific types of AI and algorithmic systems, such as:
- Regression Analysis
- Large Language Model
- Natural Language Processing
- Image recognition algorithms
Some questions I still have:
- What level of detail and precision should we aim for to strike the balance between technical accuracy that builds trust and explanatory simplicity that helps with understanding?
- For example, there are many different types of image recognition algorithms (Convolutional Neural Networks, Residual Networks, Scale-Invariant Feature Transform, etc.). Should we be that precise, or does “image recognition algorithm” suffice?
Location Context
Like the "Storage" category, which already has elements that define where data is stored, "Processing Technology" should have what I'm calling "Location Context" elements that allow a user to define where the processing is taking place. Examples include:
- Processed locally
- Processed internationally
- Processed on a 3rd party cloud
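A sketch of how a "Processing Technology" element might be paired with a "Location Context" element (the names and structure here are assumptions for illustration, not an existing DTPR format):

```typescript
// Hypothetical pairing of a Processing Technology element with a Location Context element.
type LocationContext =
  | "Processed locally"
  | "Processed internationally"
  | "Processed on a 3rd party cloud";

const processingTechnology: { element: string; locationContext: LocationContext } = {
  element: "Image recognition algorithm",
  locationContext: "Processed on a 3rd party cloud",
};
```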
“Risks & Mitigation” taxonomy category 🆕
What identified risks does this AI/algorithm pose?
A team at MIT conducted a meta-review of AI risks. In their own words,
The AI Risk Repository has three parts:
- The AI Risk Database captures 1000+ risks extracted from 56 existing frameworks and classifications of AI risks
- The Domain Taxonomy of AI Risks classifies these risks into 7 domains (e.g., “Misinformation”) and 23 subdomains (e.g., “False or misleading information”)
You can see the taxonomies here.
Leveraging this existing, open source work, we could generate elements for each risk in the MIT taxonomy.
Possible Elements
A user of these elements would outline the mitigation to the given risk in the DTPR element's "additional description" field.
- False or misleading information
- Over-reliance and unsafe use
- Unequal performance across groups
- Compromise of privacy by obtaining, leaking, or correctly inferring sensitive information
- Power centralization and unfair distribution of benefits
- AI system security vulnerabilities and attacks
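For example, a deployer might document one of these risk elements together with its mitigation in the element's "additional description" field, roughly like this (a sketch with assumed field names and invented mitigation text, not an existing DTPR format):

```typescript
// Illustrative sketch: a risk element from the MIT taxonomy with its mitigation
// recorded in the "additional description" field, as proposed above.
const riskElement = {
  name: "Unequal performance across groups",
  additionalDescription:
    "Mitigation: assignment outcomes are audited each year for disparities " +
    "across neighborhoods and demographic groups, and the model is adjusted " +
    "where disparities are found.",
};
```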
“Rights” taxonomy category 🆕
What rights do I have in relation to this AI/Algorithm?
See the open proposal on adding a Rights category to DTPR here: #230
What’s missing and why
Define what training data was used to train the AI/algorithm
I'd argue it should not be its own category. Here are my reasons:
- not all algorithms use training data
- it can be difficult to even know what training data was used (e.g., for some LLMs)
- the general, non-technical public's understanding of the risks and benefits of an AI/algorithm is, I would argue, not helped much by knowing what data a given model was trained on. A much more effective way to communicate this information would be to document what risks the training data may introduce, and to explain the mitigation strategies used to avoid those risks.
Request for Comments
This RFC is meant to spark a discussion amongst users of DTPR and other interested parties around how to adapt DTPR to AI. This is just the beginning. There is a lot more to define and decide. Consider this a jumping off point.
The first questions I would appreciate thoughts on are:
- Do the AI datachain categories capture what the general public wants to know about an AI/algorithm? Are we missing anything?
- What other research and resources might we learn from?