Eval: CrossCodeEval benchmark for cross-file context retrieval #88

@jonathanpopham

Description

Overview

CrossCodeEval is a benchmark specifically designed to test code completion when cross-file context is required. This directly measures what Supermodel is built for.

Unlike SWE-bench (bug fixing) or HumanEval (single-file), CrossCodeEval isolates the cross-file understanding problem.

Why This Benchmark

The finding that matters: "Better retrieval methods are needed for optimal results"

Their results show:

  • Models fail without cross-file context
  • Performance jumps with good context retrieval
  • Big gap between text-based retrieval and oracle (perfect context)

Our hypothesis: Graph-based retrieval outperforms text-based retrieval for finding relevant cross-file context.

Benchmark Structure

Languages: Python, Java, TypeScript, C#

Metrics (a scoring sketch for EM and ES follows this list):

  • Exact Match (EM)
  • Edit Similarity (ES)
  • Identifier Match (API prediction accuracy)
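A minimal scoring sketch for the first two metrics, assuming EM is string equality after trimming whitespace and ES is a normalized Levenshtein ratio; the function names are ours, not the official CrossCodeEval scorer:

```python
# Minimal metric sketch (assumption: EM = trimmed string equality,
# ES = 1 - normalized Levenshtein distance). Not the official scorer.

def exact_match(prediction: str, reference: str) -> bool:
    """EM: the completion matches the reference after trimming whitespace."""
    return prediction.strip() == reference.strip()

def edit_similarity(prediction: str, reference: str) -> float:
    """ES: 1 - normalized edit distance, in [0, 1]."""
    a, b = prediction.strip(), reference.strip()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```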

Test scenarios:

  1. In-file only (baseline)
  2. Retrieved context (BM25, embedding similarity, etc.)
  3. Oracle context (upper bound - perfect retrieval)

Proposed Methods

Method 1: Import Graph Retrieval

Use the dependency graph to find context; a lookup sketch follows the steps below.

Task: Complete code that uses `UserService.authenticate()`

Text retrieval: Search for "authenticate" → noisy results
Graph retrieval: 
  1. Parse incomplete code, find undefined `UserService`
  2. Query dependency graph: "What defines UserService?"
  3. Return: `services/user_service.py` with `authenticate()` definition
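A minimal sketch of that lookup, assuming the dependency graph has already been reduced to a symbol-to-defining-file map; the `defines` dict and function names are illustrative, not an existing API:

```python
import ast

# Sketch of Method 1. Assumes a prebuilt symbol index such as
# {"UserService": "services/user_service.py"}; names are hypothetical.

def undefined_names(incomplete_code: str) -> set[str]:
    """Names the prompt reads but never binds (via assignment, import, or def)."""
    try:
        tree = ast.parse(incomplete_code)
    except SyntaxError:
        # Completion prompts may not parse; a real harness would fall back to regex.
        return set()
    loaded = {n.id for n in ast.walk(tree)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    bound = {n.id for n in ast.walk(tree)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    bound |= {alias.asname or alias.name.split(".")[0]
              for n in ast.walk(tree) if isinstance(n, (ast.Import, ast.ImportFrom))
              for alias in n.names}
    bound |= {n.name for n in ast.walk(tree)
              if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
    return loaded - bound

def import_graph_context(incomplete_code: str, defines: dict[str, str]) -> list[str]:
    """Files that define the symbols the prompt uses but does not resolve."""
    return sorted({defines[name] for name in undefined_names(incomplete_code)
                   if name in defines})
```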

Method 2: Call Chain Retrieval

Use the call graph to find related functions; a retrieval sketch follows the steps below.

Task: Complete a function body that should call helper functions

Graph retrieval:
  1. Find functions similar to the one being completed (e.g. by name or signature similarity)
  2. Query call graph: "What do similar functions call?"
  3. Return: Common callees as candidate APIs to use
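A minimal sketch, assuming a call graph keyed by fully qualified function name and a caller-supplied list of similar functions (how that similarity is computed is left open):

```python
from collections import Counter

# Sketch of Method 2. `call_graph` is assumed to look like
# {"auth.login": ["UserService.authenticate", "audit.log_event"], ...}.

def call_chain_candidates(similar_functions: list[str],
                          call_graph: dict[str, list[str]],
                          top_k: int = 10) -> list[str]:
    """APIs ranked by how many of the similar functions call them."""
    counts = Counter(callee
                     for fn in similar_functions
                     for callee in call_graph.get(fn, []))
    return [callee for callee, _ in counts.most_common(top_k)]
```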

Method 3: Domain-Scoped Retrieval

Use the domain graph to narrow the search space; a retrieval sketch follows the steps below.

Task: Complete code in the authentication module

Graph retrieval:
  1. Identify current file's domain (e.g., "Authentication")
  2. Retrieve all files/functions in same domain
  3. Prioritize context from same architectural area
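A minimal sketch, assuming the domain graph can be flattened into a file-to-domain labelling; the `file_domains` mapping is an assumption about that output, not an existing interface:

```python
# Sketch of Method 3. `file_domains` is assumed to look like
# {"auth/login.py": "Authentication", "billing/invoice.py": "Billing", ...}.

def domain_scoped_context(current_file: str, file_domains: dict[str, str]) -> list[str]:
    """Every other file labelled with the same architectural domain."""
    domain = file_domains.get(current_file)
    if domain is None:
        return []
    return [path for path, d in file_domains.items()
            if d == domain and path != current_file]
```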

Method 4: Hybrid (Graph + Text)

Use the graph to filter candidates, then rank them with text similarity; a sketch follows the list below.

1. Graph narrows the candidate set to structurally relevant files
2. Text similarity ranks within those candidates
3. Best of both worlds: structural relevance plus lexical ranking
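A minimal sketch of the two-stage pipeline; the Jaccard token overlap below is only a stand-in for BM25 or embedding similarity in the ranking stage:

```python
# Sketch of Method 4: graph output narrows the pool, a lexical score orders it.

def tokenize(text: str) -> set[str]:
    return set(text.lower().replace("_", " ").replace(".", " ").split())

def hybrid_retrieve(query_code: str,
                    graph_candidates: list[str],
                    file_contents: dict[str, str],
                    top_k: int = 5) -> list[str]:
    """Rank graph-filtered files by token overlap with the completion prompt."""
    query = tokenize(query_code)

    def score(path: str) -> float:
        doc = tokenize(file_contents.get(path, ""))
        return len(query & doc) / (len(query | doc) or 1)

    return sorted(graph_candidates, key=score, reverse=True)[:top_k]
```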

Eval Design

Treatments

| Treatment | Context Retrieval Method |
|---|---|
| Baseline | In-file only |
| BM25 | Text search (their baseline) |
| Embedding | Semantic similarity |
| Dependency Graph | Import relationships |
| Call Graph | Function call relationships |
| Domain Graph | Architectural grouping |
| Hybrid | Graph filter + text rank |
| Oracle | Perfect context (upper bound) |

Implementation

  1. Get the CrossCodeEval dataset from their repo
  2. For each task, generate the graph for its repository
  3. Implement each retrieval method
  4. Measure EM, ES, and Identifier Match
  5. Compare against their published baselines (a harness sketch follows this list)
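A sketch of the scoring loop, assuming a task dict with `prompt` and `groundtruth` fields, a per-treatment `retriever(task)` returning context snippets, and a `complete(prompt)` wrapper around the model; none of these names come from the CrossCodeEval repo:

```python
from statistics import mean

# Sketch of the eval loop. Field names, `retriever`, and `complete` are
# placeholders, not the actual CrossCodeEval schema or a real model client.

def run_treatment(tasks: list[dict], retriever, complete, scorers: dict) -> dict:
    """Score one retrieval treatment; `scorers` maps metric name -> fn(pred, ref)."""
    totals = {name: [] for name in scorers}
    for task in tasks:
        snippets = retriever(task)                       # treatment-specific context
        prompt = "\n\n".join(snippets) + "\n\n" + task["prompt"]
        prediction = complete(prompt)
        for name, score in scorers.items():
            totals[name].append(float(score(prediction, task["groundtruth"])))
    return {name: mean(values) for name, values in totals.items()}

# Usage: one row per treatment in the table above, scored with the EM/ES
# functions sketched under Metrics.
# results = {name: run_treatment(tasks, fn, complete,
#                                {"EM": exact_match, "ES": edit_similarity})
#            for name, fn in retrievers.items()}
```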

Success Criteria

Graph-based retrieval should:

  • Outperform BM25 and embedding baselines
  • Close the gap toward oracle performance
  • Show particular strength on tasks requiring API/import understanding

Resources

Why This Is Better Than SWE-bench

| Aspect | SWE-bench | CrossCodeEval |
|---|---|---|
| Task | Bug fixing | Code completion |
| Isolation | Many confounds | Isolates the cross-file problem |
| Oracle | None | Has a perfect-context upper bound |
| Training data | Popular repos (likely memorized) | Constructed to avoid memorization |
| What it tests | End-to-end agent | Specifically context retrieval |

CrossCodeEval lets us prove graphs help with retrieval, independent of other agent capabilities.
