Data Governance Agent is an agentic data governance system built using a multi-agent architecture. The system allows users to connect data sources, such as databases, file systems, and S3 buckets (future), and perform governance tasks including compliance auditing, PII detection, and natural-language querying.
The architecture consists of:
- A main orchestrator agent
- Multiple sub-agents connected as tools through an MCP Server
- A dedicated MCP server per sub-agent, with tools designed specifically for that sub-agent's task
The system provides the following capabilities:
- Connect structured and unstructured data sources
- Perform automated data governance audits
- Detect PII and quasi-PII data
- Validate data sources against federal, state, and internal policy frameworks
- Query data sources and the metadata about them using natural language
- Generate structured governance and audit reports
- Multi-agent architecture with dynamic tool routing
The system uses two execution strategies depending on the task.
When a user submits a natural-language query, the orchestrator agent:
- Receives the query
- Dynamically decides the workflow at runtime
- Hands off the next step to the relevant sub-agent
- Executes the task autonomously
This allows flexible querying without requiring predefined workflows.
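A minimal sketch of this dynamic flow, assuming an LLM-backed tool selector; all names here are illustrative rather than the project's actual API:

```python
# Dynamic routing sketch: the orchestrator asks an LLM which sub-agent tool
# to invoke next instead of following a fixed script. All names here
# (SUB_AGENT_TOOLS, llm_pick_tool, route_query) are illustrative.

SUB_AGENT_TOOLS = {
    "db_agent": lambda q: f"[db_agent] answering: {q}",
    "classification_agent": lambda q: f"[classification_agent] answering: {q}",
    "policy_agent": lambda q: f"[policy_agent] answering: {q}",
}

def llm_pick_tool(query: str) -> str:
    """Stand-in for an LLM call that selects the next tool at runtime."""
    if "schema" in query or "table" in query:
        return "db_agent"
    if "policy" in query:
        return "policy_agent"
    return "classification_agent"

def route_query(query: str) -> str:
    tool_name = llm_pick_tool(query)          # decided per query, at runtime
    return SUB_AGENT_TOOLS[tool_name](query)  # hand off to the chosen sub-agent

print(route_query("Which tables contain email addresses?"))
```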
The audit pipeline follows a predefined workflow.
This approach is used because allowing an LLM to fully determine an audit workflow can introduce errors. Instead, the workflow is defined programmatically, while the LLM is used only for the tasks it performs reliably, such as classification, policy matching, and reasoning.
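A sketch of what that looks like, with the step order pinned down in code and the LLM confined to individual steps; every function below is a hypothetical stub, not the project's real interface:

```python
# Sketch of the programmatic audit workflow: the step order is fixed in
# code, and the LLM runs only inside individual steps. Every function
# below is a hypothetical stub, not the project's real interface.

def describe_source(source_id):                  # DB agent step
    return {"source": source_id, "columns": ["username", "email"]}

def classify(source_report):                     # LLM-backed classification
    return {"username": "PII", "email": "PII"}

def check_policies(classification, frameworks):  # LLM + vector-store lookup
    return [{"field": "username", "rule": "user names must be encrypted"}]

def suggest_remediations(violations):            # LLM + vector-store lookup
    return [{"field": v["field"], "fix": "encrypt or redact"} for v in violations]

def run_audit(source_id, framework_ids):
    source_report = describe_source(source_id)
    classification = classify(source_report)
    violations = check_policies(classification, framework_ids)
    remediations = suggest_remediations(violations)
    return {
        "source_analysis": source_report,
        "classification": classification,
        "violations": violations,
        "remediations": remediations,
    }

print(run_audit("customers_db", ["internal_policy_v2"]))
```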
The user specifies:
- The data source to audit
- The frameworks to validate the source against
Frameworks are documents that contain federal, state, or internal policy rules related to data governance and compliance.
These documents are:
- Uploaded through the frontend
- Processed by the Chunker in services/RAG_services
- Chunked, vectorized, and stored in the vector database
- Assigned a per-chunk hash, so that if the same chunk is uploaded again, it is not embedded twice (see the sketch below)
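A minimal sketch of that deduplication, assuming a SHA-256 content hash; the actual hashing scheme and where the hashes are persisted are not specified here:

```python
import hashlib

# Chunk-level deduplication sketch: hash each chunk's content and skip
# embedding when the hash is already known. The in-memory set stands in
# for hashes persisted alongside the vector database.

seen_hashes: set[str] = set()

def chunk_hash(chunk: str) -> str:
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def embed_if_new(chunk: str) -> bool:
    """Return True if the chunk was embedded, False if it was a duplicate."""
    h = chunk_hash(chunk)
    if h in seen_hashes:
        return False                     # already stored; skip re-embedding
    seen_hashes.add(h)
    # embed_and_store(chunk, h)          # hypothetical vector-DB call
    return True

print(embed_if_new("All user names must be encrypted."))   # True
print(embed_if_new("All user names must be encrypted."))   # False (duplicate)
```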
Once the request reaches the API, the audit pipeline begins.
The DB agent generates a report describing the data source, including:
- Structure and schema
- Column information
- Sample values
- Metadata
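For illustration, such a report might take the following shape; every field name here is an assumption rather than the DB agent's actual output format:

```python
# Illustrative shape of a DB agent source report; all field names are
# assumptions made for the sake of the example.
source_report = {
    "source": "customers_db.public.users",
    "schema": {
        "username":    {"type": "varchar(64)",  "nullable": False},
        "email":       {"type": "varchar(255)", "nullable": False},
        "signup_date": {"type": "date",         "nullable": True},
    },
    "sample_values": {
        "username": ["jdoe", "asmith"],
        "email": ["jdoe@example.com", "asmith@example.com"],
    },
    "metadata": {
        "data_retention_policy": None,
        "data_source_owner": None,
        "access_control": "all_teams",
    },
}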
The classification agent analyzes the source report and classifies the data into:
- PII
- Quasi-PII
- Safe data that can be stored as plain text
This step produces a classification report.
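Continuing the hypothetical example, a classification report might look like:

```python
# Illustrative classification report over the hypothetical source report
# above, using the PII / quasi-PII / safe split described in the text.
classification_report = {
    "email": "PII",              # directly identifies a person
    "username": "PII",
    "signup_date": "quasi-PII",  # identifying only in combination with other fields
    "country": "safe",           # can be stored as plain text
}
```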
The classification report is then passed to the policy agent. This agent queries the vector store and searches for policies related to the detected data types in the frameworks selected by the user.
Example: If an internal policy document says "all user names should be encrypted", but the data report shows raw values, the policy agent will mark this as a violation.
This agent also checks metadata-level violations.
Example: If metadata fields such as the following exist:
- data_retention_policy = None
- data_source_owner = None
- access_control = all_teams
and there is a policy such as "all databases must have an owner, a data retention policy, and controlled access", the agent will mark these as violations.
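A minimal sketch of such a metadata check, hard-coding the rules above for brevity; in the project, the required fields come from policy text retrieved through the RAG services rather than from code:

```python
# Metadata-level check sketch, using the hypothetical fields above.

def check_metadata(metadata: dict) -> list[str]:
    violations = []
    if metadata.get("data_retention_policy") is None:
        violations.append("no data retention policy set")
    if metadata.get("data_source_owner") is None:
        violations.append("no data source owner assigned")
    if metadata.get("access_control") == "all_teams":
        violations.append("access is not restricted (open to all teams)")
    return violations

print(check_metadata({
    "data_retention_policy": None,
    "data_source_owner": None,
    "access_control": "all_teams",
}))
```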
The remediation agent reviews all identified violations and searches the policy documents for recommended fixes.
Example: If a violation is identified as "user names stored as raw values", and the policy document contains a remediation such as "all user names must be encrypted or redacted", the agent will add this to the remediation list.
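A rough sketch of that lookup; the keyword match below is only a stand-in for the semantic search the agent actually performs against embedded policy chunks:

```python
# Remediation lookup sketch: for each violation, search the policy text for
# a recommended fix. POLICY_SNIPPETS stands in for retrieved policy chunks.

POLICY_SNIPPETS = [
    "all user names must be encrypted or redacted",
    "all databases must have an owner and a data retention policy",
]

def find_remediation(violation: str) -> str | None:
    # Crude keyword overlap in place of a real vector-store similarity search.
    for snippet in POLICY_SNIPPETS:
        if any(word in snippet for word in violation.lower().split()):
            return snippet
    return None

print(find_remediation("user names stored as raw values"))
```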
Future Development:
A planned enhancement is to allow the system to automatically implement remediations after receiving human approval through MCP server connections.
All outputs from the previous steps are combined into a single structured report that includes:
- Source analysis
- Data classification
- Policy violations
- Recommended remediations
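An illustrative shape for the combined report, with placeholder values; the keys mirror the four sections above:

```python
# Illustrative shape of the final audit report; values are placeholders.
final_report = {
    "source_analysis": {"source": "customers_db.public.users"},
    "classification": {"email": "PII", "signup_date": "quasi-PII"},
    "violations": ["user names stored as raw values"],
    "remediations": ["encrypt or redact all user names"],
}
```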
User → Frontend → API → Orchestrator Agent
                            │
                            ├── DB Agent
                            ├── Classification Agent
                            ├── Policy Agent
                            └── Remediation Agent
The vector store is used by the policy and remediation agents through the RAG services and is not part of the agent layer itself.
project-root/
│
├── agents/
│   ├── orchestrator
│   └── sub_agents
│       ├── db_agent
│       ├── classification_agent
│       ├── policy_agent
│       └── remediation_agent
│
├── mcp_connection/
│   ├── mcp_client.py
│   └── server/
│       ├── db_server.py
│       ├── file_server.py
│       └── main_mcp_server.py   # exposes the sub-agents as tools to the orchestrator
│
├── services/
│   └── RAG_services/
│
├── tools/
│   └── tool_executor/
│
├── api/
├── frontend/
└── docker/
The project follows several engineering best practices:
- Multi-agent architecture with sub-agents exposed as tools
- Tool executor layer between the agents and the MCP server
- Loosely coupled components for easier testing and debugging
- Object-oriented design principles (see the sketch below), including:
  - Abstract classes
  - Polymorphism through method overriding
  - Design patterns such as Factory and Adapter
- Containerized components
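A condensed sketch of that structure, with illustrative class names rather than the project's actual classes:

```python
from abc import ABC, abstractmethod

# Sketch of an abstract agent base class, concrete sub-agents overriding
# run(), and a simple factory. All names here are illustrative.

class Agent(ABC):
    @abstractmethod
    def run(self, task: str) -> str:
        """Each sub-agent overrides this with its own behavior."""

class DBAgent(Agent):
    def run(self, task: str) -> str:
        return f"[db_agent] described source for: {task}"

class PolicyAgent(Agent):
    def run(self, task: str) -> str:
        return f"[policy_agent] checked policies for: {task}"

def agent_factory(kind: str) -> Agent:
    """Factory pattern: map a name to a concrete agent class."""
    registry = {"db": DBAgent, "policy": PolicyAgent}
    return registry[kind]()

print(agent_factory("db").run("audit customers_db"))
```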
Planned improvements include:
- Completing the frontend with an improved UI/UX
- Improving model performance
- Adding Redis caching for faster reads
- Optimizing database indexing and read/write performance for scalability
- Fixing concurrency issues that currently affect output quality in some parts of the audit pipeline