
Eval: Test multiple graph data formats to find optimal agent representation #87

@jonathanpopham

Description


Problem

We don't know if the current graph format is optimal for agent consumption. The raw API response might not be the best way to present structural information to an LLM.

Before changing the API, we need to prove which data format actually helps agents most.

Approach

One API call → Multiple transformations → Cached files → Comparative evals

┌─────────────┐
│  API Call   │  (one time, expensive)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Raw Graph  │  (cache as file)
└──────┬──────┘
       │
       ├──► Format A (e.g., nested JSON)
       ├──► Format B (e.g., flat adjacency list)
       ├──► Format C (e.g., natural language summary)
       ├──► Format D (e.g., markdown with code blocks)
       ├──► Format E (e.g., compressed/filtered view)
       │
       ▼
┌─────────────────────────────────────┐
│  Eval: Baseline vs A vs B vs C ...  │
└─────────────────────────────────────┘

Hypotheses to Test

Format Variations

  1. Raw JSON (current) - Full graph structure as-is
  2. Adjacency list - Simple caller → [callees] format (formats 2 and 3 are illustrated after this list)
  3. Natural language - "Function X calls Y, Z. Function Y is called by X, W."
  4. Hierarchical markdown - Organized by domain/file with indentation
  5. Filtered/minimal - Only nodes relevant to the query, no noise
  6. Compressed summary - Stats + top-level structure, expandable on demand
  7. Code-centric - File paths and line numbers prominent, relationships secondary
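
To make formats 2 and 3 concrete, here is how a hypothetical three-function graph (handler calling validate and save) could render in each; the function names are invented purely for illustration.

// Hypothetical three-function graph: handler calls validate and save.
const adjacencyExample = [
  'handler -> [validate, save]',
  'validate -> []',
  'save -> []',
].join('\n');

const naturalLanguageExample =
  'Function handler calls validate, save. ' +
  'Function validate is called by handler. ' +
  'Function save is called by handler.';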

Presentation Variations

  1. Injected in system prompt - Graph as persistent context
  2. Returned from tool call - Agent explicitly requests it
  3. Hybrid - Summary in system prompt, details via tool (sketched after this list)
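
A rough sketch of how the three presentation modes could map onto prompt assembly; the helper name and message shape here are placeholders, not an existing API.

type PresentationMode = 'system_prompt' | 'tool_call' | 'hybrid';

interface PromptPlan {
  systemPrompt: string;      // text pinned as persistent context
  exposeGraphTool: boolean;  // whether the agent gets a "fetch graph details" tool
}

// Hypothetical helper: decide where the graph text lives for a given mode.
function planPresentation(mode: PresentationMode, graphSummary: string, fullGraph: string): PromptPlan {
  switch (mode) {
    case 'system_prompt':
      // Full graph pinned up front.
      return { systemPrompt: `Codebase graph:\n${fullGraph}`, exposeGraphTool: false };
    case 'tool_call':
      // Nothing pinned; the agent must request the graph explicitly.
      return { systemPrompt: '', exposeGraphTool: true };
    case 'hybrid':
      // Summary pinned, details available on demand.
      return { systemPrompt: `Codebase graph summary:\n${graphSummary}`, exposeGraphTool: true };
  }
}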

Implementation

Step 1: Agent Format Converter

Build a transformer that takes the raw graph and outputs multiple formats:

interface GraphFormatter {
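  // name identifies the format; transform renders the raw graph into agent-facing text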
  name: string;
  transform(raw: SupermodelIR): string;
}

const formatters: GraphFormatter[] = [
  { name: 'raw_json', transform: (g) => JSON.stringify(g) },
  { name: 'adjacency', transform: (g) => toAdjacencyList(g) },
  { name: 'natural_language', transform: (g) => toNaturalLanguage(g) },
  { name: 'markdown', transform: (g) => toMarkdown(g) },
  // ...
];
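
For reference, toAdjacencyList could be as small as the sketch below. It assumes SupermodelIR exposes node ids and directed caller → callee edges, which is an assumption about the type rather than its actual shape.

// Assumed shape of the raw graph; the real SupermodelIR may differ.
interface SupermodelIR {
  nodes: { id: string }[];
  edges: { from: string; to: string }[]; // caller -> callee
}

// One line per caller: "caller -> [callee1, callee2]"
function toAdjacencyList(g: SupermodelIR): string {
  const callees = new Map<string, string[]>();
  for (const node of g.nodes) callees.set(node.id, []);
  for (const edge of g.edges) callees.get(edge.from)?.push(edge.to);
  return [...callees.entries()]
    .map(([caller, targets]) => `${caller} -> [${targets.join(', ')}]`)
    .join('\n');
}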

Step 2: Cache All Formats

For each benchmark repo, generate and cache all formats:

cache/
  django/
    raw.json
    adjacency.txt
    natural_language.txt
    markdown.md
    ...
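
A rough sketch of the cache writer, reusing the GraphFormatter interface from Step 1 and the assumed SupermodelIR shape above; file naming is simplified to <formatter name>.txt here rather than the per-format extensions shown in the layout.

import { mkdirSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

// Writes one file per formatter into cache/<repo>/.
function cacheFormats(repo: string, raw: SupermodelIR, formatters: GraphFormatter[]): void {
  const dir = join('cache', repo);
  mkdirSync(dir, { recursive: true });
  for (const f of formatters) {
    // Extension choice is cosmetic; plain .txt works for every format.
    writeFileSync(join(dir, `${f.name}.txt`), f.transform(raw));
  }
}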

Step 3: Run Comparative Evals

  • Baseline: No graph
  • Treatment A: raw.json injected
  • Treatment B: adjacency.txt injected
  • Treatment C: natural_language.txt injected
  • ...
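
A runner over repos × treatments might look like the sketch below. runAgentTask and the task list are placeholders for whatever eval harness ends up being used, TaskResult is sketched under Metrics, and cached files are assumed to follow the simplified <name>.txt naming from the Step 2 sketch.

import { readFileSync } from 'node:fs';
import { join } from 'node:path';

const treatments = ['baseline', 'raw_json', 'adjacency', 'natural_language', 'markdown'];

// Placeholder: run one benchmark task with the given graph context (null = no graph).
declare function runAgentTask(task: string, graphContext: string | null): Promise<TaskResult>;

async function runEvals(repo: string, tasks: string[]): Promise<Record<string, TaskResult[]>> {
  const results: Record<string, TaskResult[]> = {};
  for (const treatment of treatments) {
    const graphContext =
      treatment === 'baseline'
        ? null
        : readFileSync(join('cache', repo, `${treatment}.txt`), 'utf8');
    results[treatment] = [];
    for (const task of tasks) {
      results[treatment].push(await runAgentTask(task, graphContext));
    }
  }
  return results;
}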

Metrics

  • Task completion rate
  • Iterations to solution
  • Accuracy of code placement
  • Token efficiency (some formats are smaller)
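
These could be captured per task run in a record along these lines (field names are suggestions, not an existing schema):

interface TaskResult {
  treatment: string;         // e.g. 'baseline', 'adjacency'
  taskId: string;
  completed: boolean;        // task completion
  iterations: number;        // agent turns until solution (or give-up)
  placementCorrect: boolean; // did the change land in the right file/function?
  graphTokens: number;       // tokens consumed by the injected graph format
}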

Why This Matters

We might have the right data but the wrong presentation.

Possible findings:

  • "Natural language summaries outperform raw JSON by 40%"
  • "Adjacency lists are 10x smaller and perform the same"
  • "Markdown format helps agents reason about structure"

This tells us what to build, not just whether graphs help.

Success Criteria

Identify at least one format that shows statistically significant improvement over:

  1. Baseline (no graph)
  2. Raw JSON (current format)
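
For the completion-rate comparison, one simple option is a two-proportion z-test over pass/fail counts; the sketch below assumes independent task runs and a one-sided test at the 5% level.

// Two-proportion z-test: is a treatment's completion rate significantly higher than a comparison's?
function completionRateZTest(
  treatmentSuccesses: number, treatmentRuns: number,
  controlSuccesses: number, controlRuns: number,
): { z: number; significant: boolean } {
  const p1 = treatmentSuccesses / treatmentRuns;
  const p2 = controlSuccesses / controlRuns;
  const pooled = (treatmentSuccesses + controlSuccesses) / (treatmentRuns + controlRuns);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / treatmentRuns + 1 / controlRuns));
  const z = (p1 - p2) / se;
  // One-sided test at alpha = 0.05: critical value ≈ 1.645.
  return { z, significant: z > 1.645 };
}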
