
Add data source registry and update related configurations #90

Draft
AjayThorve wants to merge 4 commits into NVIDIA-AI-Blueprints:develop from AjayThorve:feat/data-source-registry


@AjayThorve
Collaborator

Replace hardcoded data source pattern matching with a config-driven registry.

Problem

Data source filtering (web search, knowledge base, etc.) relied on fragile string pattern matching — checking if tool names contain substrings like "web", "tavily", or "knowledge". This caused real bugs:

  • Knowledge layer's inner function is named search, so the check "knowledge" in "search" fails
  • A customer naming their tool awesome_retriever silently breaks filtering
  • "web" in "web_calculator" produces false positives
  • Adding any new data source required editing 3 central files (data_sources.py, jobs.py, pattern lists)
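
The failure modes listed above can be reproduced with a small sketch. The function below is a hypothetical reconstruction of substring-based matching, not the code this PR removes; the pattern tuples and the name guess_source are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical pattern lists, standing in for the hardcoded ones described above.
WEB_PATTERNS = ("web", "tavily")
KNOWLEDGE_PATTERNS = ("knowledge",)


def guess_source(tool_name: str) -> Optional[str]:
    """Guess a data source ID from substrings of the tool name."""
    name = tool_name.lower()
    if any(pattern in name for pattern in WEB_PATTERNS):
        return "web_search"
    if any(pattern in name for pattern in KNOWLEDGE_PATTERNS):
        return "knowledge_layer"
    return None


# False negative: the knowledge layer's inner function is just "search".
print(guess_source("search"))
# False negative: a custom tool name matches no pattern at all.
print(guess_source("awesome_retriever"))
# False positive: a calculator is classified as a web search tool.
print(guess_source("web_calculator"))
```

The first two calls return None and the third returns "web_search", which matches the three bugs described in the list.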

Solution

A new data_source_registry NAT function type that makes data sources fully config-driven:

functions:
  data_sources:
    _type: data_source_registry
    sources:
      - id: web_search
        name: "Web Search"
        description: "Search the web for real-time information."
        tools:
          - web_search_tool
          - advanced_web_search_tool
      - id: knowledge_layer
        name: "Knowledge Base"
        description: "Search uploaded documents and files."
        tools:
          - knowledge_search
  • Tool→source mapping is explicit — declared in YAML, no pattern matching
  • Display metadata is config-driven — name/description overridable per deployment
  • Function groups supported — auto-detected at startup via builder (e.g. ECI group expands to eci_confluence, eci_gdrive with prefix matching)
  • NAT-native — uses FunctionRef for tool references, registered as a NAT plugin via entry points, validated at config load time
  • Zero central code changes to add a source — just add a YAML entry
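
To make the explicit mapping concrete, here is a minimal sketch of how a config like the one above could be turned into lookup tables. It uses a plain dict in place of the parsed YAML and does not use NAT, FunctionRef, or the PR's actual classes; SourceMeta and build_registry are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SourceMeta:
    """Hypothetical stand-in for the display metadata of one data source."""
    id: str
    name: str
    description: str


# Plain-dict equivalent of the `sources` list in the YAML example (assumed shape).
CONFIG = {
    "sources": [
        {
            "id": "web_search",
            "name": "Web Search",
            "description": "Search the web for real-time information.",
            "tools": ["web_search_tool", "advanced_web_search_tool"],
        },
        {
            "id": "knowledge_layer",
            "name": "Knowledge Base",
            "description": "Search uploaded documents and files.",
            "tools": ["knowledge_search"],
        },
    ]
}


def build_registry(config: dict) -> Tuple[Dict[str, SourceMeta], Dict[str, str]]:
    """Return (source ID -> metadata, tool name -> source ID)."""
    registry: Dict[str, SourceMeta] = {}
    tool_map: Dict[str, str] = {}
    for entry in config["sources"]:
        meta = SourceMeta(entry["id"], entry["name"], entry.get("description", ""))
        registry[meta.id] = meta
        for tool in entry.get("tools", []):
            # Reject a tool claimed by two sources so the mapping stays unambiguous.
            if tool in tool_map:
                raise ValueError(f"tool {tool!r} is mapped to more than one source")
            tool_map[tool] = meta.id
    return registry, tool_map


registry, tool_map = build_registry(CONFIG)
print(tool_map["knowledge_search"])       # knowledge_layer
print(tool_map.get("awesome_retriever"))  # None: not declared, so not mapped
```

With this shape, a tool named awesome_retriever only needs to be listed under a source's tools to be filtered correctly; its name no longer matters.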

UI Integration

All registered data sources are automatically available via the GET /v1/data_sources API endpoint, which the new UI consumes to render source toggles. Any source added to the YAML config will appear in the UI with its configured display name and description, with no frontend changes needed. Per-message filtering via data_sources: ["web_search", "knowledge_layer"] in the chat payload continues to work as before.
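
As a rough illustration of the data flow, the sketch below serializes registered sources into a list the UI could render as toggles and builds a chat payload that selects two of them. The response shape, the SourceMeta class, the list_sources helper, and the "message" field are all assumptions; only the endpoint path and the data_sources key come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceMeta:
    """Hypothetical stand-in for one registered data source."""
    id: str
    name: str
    description: str


REGISTRY: Dict[str, SourceMeta] = {
    "web_search": SourceMeta(
        "web_search", "Web Search", "Search the web for real-time information."
    ),
    "knowledge_layer": SourceMeta(
        "knowledge_layer", "Knowledge Base", "Search uploaded documents and files."
    ),
}


def list_sources(registry: Dict[str, SourceMeta]) -> List[Dict[str, str]]:
    """Assumed body of a GET /v1/data_sources response: one dict per source."""
    return [asdict(meta) for meta in registry.values()]


# Assumed chat payload; only the "data_sources" key is taken from the PR text.
chat_payload = {
    "message": "Summarize the uploaded report.",
    "data_sources": ["web_search", "knowledge_layer"],
}

print(json.dumps(list_sources(REGISTRY), indent=2))
print(json.dumps(chat_payload))
```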

Backward Compatibility is maintained.

Test plan
  • pytest tests/ -v — 681 passed, 0 failures
  • NAT config loads successfully (data_source_registry discovered via entry point)
  • GET /v1/data_sources returns dynamically discovered sources
  • Submit query with data_sources: ["web_search"] — only web tools active
  • Submit query without data_sources — all tools active (regression)

@greptile-apps
Contributor

greptile-apps bot commented Feb 25, 2026

Greptile Summary

Replaces fragile string pattern matching ("web" in tool_name) with a config-driven registry that explicitly maps tools to data sources via YAML declarations. The new data_source_registry NAT function type auto-detects function groups at startup and supports deterministic prefix matching (longest-first) for overlapping group names.

Key improvements:

  • Eliminates false positives/negatives from substring matching (e.g., "knowledge" in "search" failing, "web" in "web_calculator" false positive)
  • Zero central code changes to add new data sources — just add YAML entry
  • Full MCP tool support via function group prefix matching with NAT separators (__ and legacy .)
  • Display metadata (name, description) now config-driven and overridable per deployment
  • Comprehensive test coverage with 193 new test lines for registry edge cases
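
The longest-first prefix matching mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: resolve_source, the group names, and the source IDs are hypothetical, and only the two separators (__ and the legacy .) are taken from the summary.

```python
from typing import Dict, Optional

# Separators between a function group name and its tool name, per the summary.
SEPARATORS = ("__", ".")


def resolve_source(
    tool_name: str, tool_map: Dict[str, str], group_map: Dict[str, str]
) -> Optional[str]:
    """Resolve a tool to a source ID: exact match first, then group prefix."""
    if tool_name in tool_map:
        return tool_map[tool_name]
    # Try longer group names first so an overlapping shorter prefix cannot win.
    for group in sorted(group_map, key=len, reverse=True):
        if any(tool_name.startswith(group + sep) for sep in SEPARATORS):
            return group_map[group]
    return None


# Hypothetical groups whose names overlap: "docs" is a prefix of "docs__internal".
GROUPS = {"docs": "public_docs", "docs__internal": "internal_docs"}
TOOLS = {"knowledge_search": "knowledge_layer"}

print(resolve_source("docs__internal__search", TOOLS, GROUPS))  # internal_docs
print(resolve_source("docs__search", TOOLS, GROUPS))            # public_docs
print(resolve_source("docs.search", TOOLS, GROUPS))             # public_docs
print(resolve_source("docsearch", TOOLS, GROUPS))               # None
```

Requiring a separator after the group name is what keeps docsearch from matching the docs group, and sorting by length makes the result independent of dict ordering.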

Backward compatibility:

  • Tools without registry mappings (e.g., "think", calculator) always included regardless of filtering
  • Existing data_sources: ["web_search"] payload filtering continues working
  • All 681 tests passing, no breaking changes
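
The backward-compatibility rules above amount to a small filter. The sketch below is a simplified assumption of that behavior, with a hypothetical filter_tools function and tool map; it is not the filter_tools_by_sources function from the PR.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical exact tool -> source ID mapping.
TOOL_MAP: Dict[str, str] = {
    "web_search_tool": "web_search",
    "knowledge_search": "knowledge_layer",
}


def filter_tools(
    tool_names: Iterable[str],
    selected_sources: Optional[List[str]],
    tool_map: Dict[str, str],
) -> List[str]:
    """Keep tools whose source is selected, plus every unmapped tool."""
    if selected_sources is None:
        # No data_sources in the request: all tools stay active.
        return list(tool_names)
    allowed = set(selected_sources)
    kept = []
    for tool in tool_names:
        source_id = tool_map.get(tool)
        # Unmapped tools (e.g. "think", "calculator") are never filtered out.
        if source_id is None or source_id in allowed:
            kept.append(tool)
    return kept


ALL_TOOLS = ["web_search_tool", "knowledge_search", "think", "calculator"]

print(filter_tools(ALL_TOOLS, ["web_search"], TOOL_MAP))
print(filter_tools(ALL_TOOLS, None, TOOL_MAP))
```

The first call keeps web_search_tool, think, and calculator; the second keeps all four tools.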

Confidence Score: 5/5

  • Safe to merge with minimal risk — well-tested architectural improvement with backward compatibility
  • Comprehensive test coverage (193 new test lines, all 681 tests passing), addresses documented real bugs, maintains backward compatibility, clean architectural separation, and thorough documentation updates
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/aiq_agent/common/data_source_registry.py | New file implementing config-driven data source registry with NAT integration, function group support, and deterministic prefix matching |
| src/aiq_agent/common/data_sources.py | Replaced string pattern matching with registry-based tool filtering, improved metadata lookup with fallback for unregistered sources |
| frontends/aiq_api/src/aiq_api/routes/jobs.py | Simplified /v1/data_sources endpoint to use registry, removed hardcoded source definitions and tool name collection logic |
| tests/aiq_agent/common/test_data_source_registry.py | Comprehensive test coverage for registry, tool mapping, group prefix matching, and overlapping prefix handling |
| docs/source/extending/adding-a-data-source.md | Updated documentation to explain registry-based data source registration, MCP tool integration, removed pattern matching references |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[YAML Config] -->|data_source_registry| B[NAT WorkflowBuilder.from_config]
    B -->|Calls register function| C[data_source_registry_fn]
    C -->|Detects function groups| D[builder._function_groups]
    C -->|Populates| E[Global Registry State]
    E -->|_registry| F[DataSourceMeta objects]
    E -->|_tool_source_map| G[Exact tool → source ID]
    E -->|_group_source_map| H[Group prefix → source ID]
    
    I[GET /v1/data_sources] -->|get_all_sources| F
    
    J[Chat Request] -->|data_sources filter| K[filter_tools_by_sources]
    K -->|get_source_id_for_tool| G
    K -->|Prefix match longest-first| H
    K -->|Filtered tools| L[Agent Execution]
    
    M[Unknown tools] -->|source_id = None| N[Always included]
    N --> L

Last reviewed commit: 0dd5bd2

Contributor

@greptile-apps greptile-apps bot left a comment


11 files reviewed, 1 comment


AjayThorve force-pushed the feat/data-source-registry branch from 77edd8f to 734ff44 on February 25, 2026 04:58
Contributor

@greptile-apps greptile-apps bot left a comment


11 files reviewed, no comments


Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, no comments


…ion and improving group name handling for deterministic prefix matching. Update related functions to utilize the new structure and ensure consistent tool mapping.
Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, no comments


@cdgamarose-nv
Collaborator

I'm wondering if we could reduce the number of places we need to change config by registering the data source in each agent code itself if the tool is set by the agent config. Right now, we already repeatedly list tools in each agent function. This would make the config harder to maintain as tools/agents increase. Is this feasible?

@AjayThorve
Collaborator Author

> I'm wondering if we could reduce the number of places we need to change config by registering the data source in each agent code itself if the tool is set by the agent config. Right now, we already repeatedly list tools in each agent function. This would make the config harder to maintain as tools/agents increase. Is this feasible?

good feedback, let me see if that can be done

AjayThorve marked this pull request as draft on February 25, 2026 22:01
AjayThorve added the enhancement and AIQ2.0 labels on Feb 25, 2026
AjayThorve added the AIQ2.1 label and removed the AIQ2.0 label on Mar 6, 2026