
Eval: Test multiple graph data formats to find optimal agent representation #87

@jonathanpopham

Description


Problem

We don't know if the current graph format is optimal for agent consumption. The raw API response might not be the best way to present structural information to an LLM.

Before changing the API, we need to prove which data format actually helps agents most.

Approach

One API call → Multiple transformations → Cached files → Comparative evals

┌─────────────┐
│  API Call   │  (one time, expensive)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Raw Graph  │  (cache as file)
└──────┬──────┘
       │
       ├──► Format A (e.g., nested JSON)
       ├──► Format B (e.g., flat adjacency list)
       ├──► Format C (e.g., natural language summary)
       ├──► Format D (e.g., markdown with code blocks)
       ├──► Format E (e.g., compressed/filtered view)
       │
       ▼
┌─────────────────────────────────────┐
│  Eval: Baseline vs A vs B vs C ...  │
└─────────────────────────────────────┘

Hypotheses to Test

Format Variations

  1. Raw JSON (current) - Full graph structure as-is
  2. Adjacency list - Simple caller → [callees] format (formats 2 and 3 are illustrated after this list)
  3. Natural language - "Function X calls Y, Z. Function Y is called by X, W."
  4. Hierarchical markdown - Organized by domain/file with indentation
  5. Filtered/minimal - Only nodes relevant to the query, no noise
  6. Compressed summary - Stats + top-level structure, expandable on demand
  7. Code-centric - File paths and line numbers prominent, relationships secondary
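
To make formats 2 and 3 concrete, here is how a hypothetical three-function graph (handler calling validate and save) could render in each; the function names are invented purely for illustration.

// Hypothetical three-function graph: handler calls validate and save.
const adjacencyExample = [
  'handler -> [validate, save]',
  'validate -> []',
  'save -> []',
].join('\n');

const naturalLanguageExample =
  'Function handler calls validate, save. ' +
  'Function validate is called by handler. ' +
  'Function save is called by handler.';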

Presentation Variations

  1. Injected in system prompt - Graph as persistent context
  2. Returned from tool call - Agent explicitly requests it
  3. Hybrid - Summary in system prompt, details via tool (sketched after this list)
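
A rough sketch of how the three presentation modes could map onto prompt assembly; the helper name and message shape here are placeholders, not an existing API.

type PresentationMode = 'system_prompt' | 'tool_call' | 'hybrid';

interface PromptPlan {
  systemPrompt: string;      // text pinned as persistent context
  exposeGraphTool: boolean;  // whether the agent gets a "fetch graph details" tool
}

// Hypothetical helper: decide where the graph text lives for a given mode.
function planPresentation(mode: PresentationMode, graphSummary: string, fullGraph: string): PromptPlan {
  switch (mode) {
    case 'system_prompt':
      // Full graph pinned up front.
      return { systemPrompt: `Codebase graph:\n${fullGraph}`, exposeGraphTool: false };
    case 'tool_call':
      // Nothing pinned; the agent must request the graph explicitly.
      return { systemPrompt: '', exposeGraphTool: true };
    case 'hybrid':
      // Summary pinned, details available on demand.
      return { systemPrompt: `Codebase graph summary:\n${graphSummary}`, exposeGraphTool: true };
  }
}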

Implementation

Step 1: Agent Format Converter

Build a transformer that takes the raw graph and outputs multiple formats:

interface GraphFormatter {
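  // name identifies the format; transform renders the raw graph into agent-facing text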
  name: string;
  transform(raw: SupermodelIR): string;
}

const formatters: GraphFormatter[] = [
  { name: 'raw_json', transform: (g) => JSON.stringify(g) },
  { name: 'adjacency', transform: (g) => toAdjacencyList(g) },
  { name: 'natural_language', transform: (g) => toNaturalLanguage(g) },
  { name: 'markdown', transform: (g) => toMarkdown(g) },
  // ...
];
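
For reference, toAdjacencyList could be as small as the sketch below. It assumes SupermodelIR exposes node ids and directed caller → callee edges, which is an assumption about the type rather than its actual shape.

// Assumed shape of the raw graph; the real SupermodelIR may differ.
interface SupermodelIR {
  nodes: { id: string }[];
  edges: { from: string; to: string }[]; // caller -> callee
}

// One line per caller: "caller -> [callee1, callee2]"
function toAdjacencyList(g: SupermodelIR): string {
  const callees = new Map<string, string[]>();
  for (const node of g.nodes) callees.set(node.id, []);
  for (const edge of g.edges) callees.get(edge.from)?.push(edge.to);
  return [...callees.entries()]
    .map(([caller, targets]) => `${caller} -> [${targets.join(', ')}]`)
    .join('\n');
}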

Step 2: Cache All Formats

For each benchmark repo, generate and cache all formats:

cache/
  django/
    raw.json
    adjacency.txt
    natural_language.txt
    markdown.md
    ...
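
A rough sketch of the cache writer, reusing the GraphFormatter interface from Step 1 and the assumed SupermodelIR shape above; file naming is simplified to <formatter name>.txt here rather than the per-format extensions shown in the layout.

import { mkdirSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

// Writes one file per formatter into cache/<repo>/.
function cacheFormats(repo: string, raw: SupermodelIR, formatters: GraphFormatter[]): void {
  const dir = join('cache', repo);
  mkdirSync(dir, { recursive: true });
  for (const f of formatters) {
    // Extension choice is cosmetic; plain .txt works for every format.
    writeFileSync(join(dir, `${f.name}.txt`), f.transform(raw));
  }
}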

Step 3: Run Comparative Evals

  • Baseline: No graph
  • Treatment A: raw.json injected
  • Treatment B: adjacency.txt injected
  • Treatment C: natural_language.txt injected
  • ...
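
A runner over repos × treatments might look like the sketch below. runAgentTask and the task list are placeholders for whatever eval harness ends up being used, TaskResult is sketched under Metrics, and cached files are assumed to follow the simplified <name>.txt naming from the Step 2 sketch.

import { readFileSync } from 'node:fs';
import { join } from 'node:path';

const treatments = ['baseline', 'raw_json', 'adjacency', 'natural_language', 'markdown'];

// Placeholder: run one benchmark task with the given graph context (null = no graph).
declare function runAgentTask(task: string, graphContext: string | null): Promise<TaskResult>;

async function runEvals(repo: string, tasks: string[]): Promise<Record<string, TaskResult[]>> {
  const results: Record<string, TaskResult[]> = {};
  for (const treatment of treatments) {
    const graphContext =
      treatment === 'baseline'
        ? null
        : readFileSync(join('cache', repo, `${treatment}.txt`), 'utf8');
    results[treatment] = [];
    for (const task of tasks) {
      results[treatment].push(await runAgentTask(task, graphContext));
    }
  }
  return results;
}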

Metrics

  • Task completion rate
  • Iterations to solution
  • Accuracy of code placement
  • Token efficiency (some formats are smaller)
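
These could be captured per task run in a record along these lines (field names are suggestions, not an existing schema):

interface TaskResult {
  treatment: string;         // e.g. 'baseline', 'adjacency'
  taskId: string;
  completed: boolean;        // task completion
  iterations: number;        // agent turns until solution (or give-up)
  placementCorrect: boolean; // did the change land in the right file/function?
  graphTokens: number;       // tokens consumed by the injected graph format
}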

Why This Matters

We might have the right data but the wrong presentation.

Possible findings:

  • "Natural language summaries outperform raw JSON by 40%"
  • "Adjacency lists are 10x smaller and perform the same"
  • "Markdown format helps agents reason about structure"

This tells us what to build, not just whether graphs help.

Success Criteria

Identify at least one format that shows statistically significant improvement over:

  1. Baseline (no graph)
  2. Raw JSON (current format)
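
For the completion-rate comparison, one simple option is a two-proportion z-test over pass/fail counts; the sketch below assumes independent task runs and a one-sided test at the 5% level.

// Two-proportion z-test: is a treatment's completion rate significantly higher than a comparison's?
function completionRateZTest(
  treatmentSuccesses: number, treatmentRuns: number,
  controlSuccesses: number, controlRuns: number,
): { z: number; significant: boolean } {
  const p1 = treatmentSuccesses / treatmentRuns;
  const p2 = controlSuccesses / controlRuns;
  const pooled = (treatmentSuccesses + controlSuccesses) / (treatmentRuns + controlRuns);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / treatmentRuns + 1 / controlRuns));
  const z = (p1 - p2) / se;
  // One-sided test at alpha = 0.05: critical value ≈ 1.645.
  return { z, significant: z > 1.645 };
}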
