Overview
CrossCodeEval is a benchmark specifically designed to test code completion when cross-file context is required. This directly measures what Supermodel is built for.
Unlike SWE-bench (bug fixing) or HumanEval (single-file), CrossCodeEval isolates the cross-file understanding problem.
Why This Benchmark
The finding that matters: "Better retrieval methods are needed for optimal results"
Their results show:
- Models fail without cross-file context
- Performance jumps with good context retrieval
- Big gap between text-based retrieval and oracle (perfect context)
Our hypothesis: Graph-based retrieval outperforms text-based retrieval for finding relevant cross-file context.
Benchmark Structure
Languages: Python, Java, TypeScript, C#
Metrics:
- Exact Match (EM)
- Edit Similarity (ES)
- Identifier Match (API prediction accuracy)
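A rough sketch of how these three metrics could be computed, assuming Edit Similarity is a normalized character-level similarity (difflib stands in for a Levenshtein-based ratio) and a simple regex stands in for a language-aware identifier tokenizer; the helper names are ours, not CrossCodeEval's:
```python
# Minimal metric sketch: Exact Match, Edit Similarity, and Identifier Match.
import difflib
import re


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped prediction equals the stripped reference, else 0.0."""
    return float(prediction.strip() == reference.strip())


def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; stand-in for a Levenshtein ratio."""
    return difflib.SequenceMatcher(None, prediction, reference).ratio()


def identifier_match(prediction: str, reference: str) -> float:
    """Fraction of reference identifiers that also appear in the prediction."""
    ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
    ref_ids = set(ident.findall(reference))
    if not ref_ids:
        return 1.0
    return len(ref_ids & set(ident.findall(prediction))) / len(ref_ids)


if __name__ == "__main__":
    pred = "user = UserService.authenticate(username, password)"
    ref = "user = UserService.authenticate(username, pwd)"
    print(exact_match(pred, ref), edit_similarity(pred, ref), identifier_match(pred, ref))
```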
Test scenarios:
- In-file only (baseline)
- Retrieved context (BM25, embedding similarity, etc.)
- Oracle context (upper bound - perfect retrieval)
Proposed Methods
Method 1: Import Graph Retrieval
Use the dependency graph to find context (sketch below).
Task: Complete code that uses `UserService.authenticate()`
Text retrieval: Search for "authenticate" → noisy results
Graph retrieval:
1. Parse incomplete code, find undefined `UserService`
2. Query dependency graph: "What defines UserService?"
3. Return: `services/user_service.py` with `authenticate()` definition
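A minimal sketch of this lookup, assuming the dependency graph is exposed as a plain symbol-to-file mapping; the `dependency_graph` dict and helper functions are illustrative, not an existing Supermodel API:
```python
# Sketch: resolve names used but not defined in the incomplete file, then look
# them up in a precomputed dependency graph (symbol -> defining file). Assumes
# the completion prefix parses; a real pipeline would need an error-tolerant
# parser for truncated code.
import ast


def undefined_names(code: str) -> set[str]:
    """Names referenced in `code` but never defined, assigned, or imported."""
    defined: set[str] = set()
    used: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defined.add(node.name)
            defined.update(arg.arg for arg in node.args.args)
        elif isinstance(node, ast.ClassDef):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
    return used - defined


def retrieve_defining_files(code: str, dependency_graph: dict[str, str]) -> list[str]:
    """Files that define each unresolved symbol, ready to use as context."""
    return sorted({dependency_graph[name]
                   for name in undefined_names(code)
                   if name in dependency_graph})


if __name__ == "__main__":
    prefix = ("def login(username, password):\n"
              "    return UserService.authenticate(username, password)\n")
    graph = {"UserService": "services/user_service.py"}
    print(retrieve_defining_files(prefix, graph))  # ['services/user_service.py']
```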
Method 2: Call Chain Retrieval
Use the call graph to find related functions (sketch below).
Task: Complete function body that should call helper functions
Graph retrieval:
1. Find similar functions in codebase
2. Query call graph: "What do similar functions call?"
3. Return: Common callees as candidate APIs to use
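A minimal sketch, assuming the call graph is available as a function-to-callees mapping and that the "similar functions" have already been found by some other means (all names are illustrative):
```python
# Sketch: rank candidate APIs by how often functions similar to the one being
# completed call them. `call_graph` maps function name -> callee names; finding
# the similar functions (by name, signature, or embedding) happens elsewhere.
from collections import Counter


def common_callees(similar_functions: list[str],
                   call_graph: dict[str, set[str]],
                   top_k: int = 5) -> list[str]:
    """Callees shared by the similar functions, most frequent first."""
    counts: Counter[str] = Counter()
    for fn in similar_functions:
        counts.update(call_graph.get(fn, set()))
    return [callee for callee, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    call_graph = {
        "create_user": {"validate_email", "hash_password", "save_user"},
        "update_user": {"validate_email", "save_user"},
        "delete_user": {"load_user", "save_user"},
    }
    # Candidate helpers to surface when completing a new user-management function:
    print(common_callees(["create_user", "update_user", "delete_user"], call_graph))
```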
Method 3: Domain-Scoped Retrieval
Use the domain graph to narrow the search space (sketch below).
Task: Complete code in authentication module
Graph retrieval:
1. Identify current file's domain (e.g., "Authentication")
2. Retrieve all files/functions in same domain
3. Prioritize context from same architectural area
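A minimal sketch, assuming the domain graph boils down to a file-to-domain labeling; the mapping and helper names are hypothetical:
```python
# Sketch: scope retrieval to the current file's architectural domain.
# `file_domains` (file path -> domain label) is a hypothetical export of the
# domain graph.
from collections import defaultdict


def domain_scoped_candidates(current_file: str,
                             file_domains: dict[str, str]) -> list[str]:
    """All other files labeled with the same domain as the file being completed."""
    domain = file_domains.get(current_file)
    if domain is None:
        return []
    by_domain: dict[str, list[str]] = defaultdict(list)
    for path, label in file_domains.items():
        by_domain[label].append(path)
    return [path for path in by_domain[domain] if path != current_file]


if __name__ == "__main__":
    file_domains = {
        "auth/login.py": "Authentication",
        "auth/tokens.py": "Authentication",
        "billing/invoice.py": "Billing",
    }
    print(domain_scoped_candidates("auth/login.py", file_domains))  # ['auth/tokens.py']
```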
Method 4: Hybrid (Graph + Text)
Use the graph to filter candidates, then rank them with text similarity (sketch below).
1. Graph narrows candidates (structurally relevant files)
2. Text similarity ranks within candidates
3. Result: structural filtering from the graph plus lexical ranking from text
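A sketch of the two-stage pipeline, with a crude token-overlap score standing in for BM25 or embedding similarity (a real ranker would replace `token_overlap`):
```python
# Sketch: graph retrieval proposes structurally relevant files, then a lexical
# score orders them. `token_overlap` is a cheap stand-in for BM25 or embeddings.
import re


def token_overlap(query: str, document: str) -> float:
    """Jaccard overlap of identifier-like tokens, as a cheap relevance score."""
    tokenize = lambda s: set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", s))
    q, d = tokenize(query), tokenize(document)
    return len(q & d) / len(q | d) if (q or d) else 0.0


def hybrid_retrieve(incomplete_code: str,
                    graph_candidates: list[str],
                    file_contents: dict[str, str],
                    top_k: int = 3) -> list[str]:
    """Rank graph-filtered candidate files by lexical similarity to the query."""
    ranked = sorted(graph_candidates,
                    key=lambda path: token_overlap(incomplete_code, file_contents[path]),
                    reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    code = "user = UserService.authenticate(username, password)"
    candidates = ["services/user_service.py", "services/email_service.py"]
    contents = {
        "services/user_service.py": "class UserService:\n    def authenticate(self, username, password): ...",
        "services/email_service.py": "class EmailService:\n    def send(self, to, body): ...",
    }
    print(hybrid_retrieve(code, candidates, contents))
```
The point of the split is that the graph stage controls which files can enter the candidate pool at all, while the text stage only decides their order.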
Eval Design
Treatments
| Treatment | Context Retrieval Method |
|---|---|
| Baseline | In-file only |
| BM25 | Text search (their baseline) |
| Embedding | Semantic similarity |
| Dependency Graph | Import relationships |
| Call Graph | Function call relationships |
| Domain Graph | Architectural grouping |
| Hybrid | Graph filter + text rank |
| Oracle | Perfect context (upper bound) |
Implementation
- Get CrossCodeEval dataset from their repo
- For each task, generate graph for the repo
- Implement each retrieval method
- Measure EM, ES, Identifier Match
- Compare against their published baselines
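A skeleton of the evaluation loop, with dataset loading, the model call, and graph construction stubbed out as callables; the names and task keys are placeholders, not the benchmark's actual API:
```python
# Skeleton eval loop: for each task, retrieve context with the chosen method,
# prompt the model, and score the completion. `retrieve_fn`, `complete_fn`,
# and the task dict keys are placeholders for the real dataset and model.
from statistics import mean
from typing import Callable


def evaluate(tasks: list[dict],
             retrieve_fn: Callable[[dict], str],
             complete_fn: Callable[[str, str], str],
             metrics: dict[str, Callable[[str, str], float]]) -> dict[str, float]:
    """Average each metric over all tasks for one retrieval treatment."""
    scores: dict[str, list[float]] = {name: [] for name in metrics}
    for task in tasks:
        context = retrieve_fn(task)                        # graph / BM25 / oracle / none
        prediction = complete_fn(task["prompt"], context)  # model call
        for name, metric in metrics.items():
            scores[name].append(metric(prediction, task["groundtruth"]))
    return {name: mean(values) for name, values in scores.items()}


if __name__ == "__main__":
    # Smoke test with trivial stubs; real runs plug in the dataset, a code LLM,
    # and the graph-based retrievers sketched above.
    tasks = [{"prompt": "user = UserService.", "groundtruth": "authenticate(username, password)"}]
    result = evaluate(tasks,
                      retrieve_fn=lambda task: "",
                      complete_fn=lambda prompt, context: "authenticate(username, password)",
                      metrics={"EM": lambda p, r: float(p == r)})
    print(result)  # {'EM': 1.0}
```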
Success Criteria
Graph-based retrieval should:
- Outperform BM25 and embedding baselines
- Close the gap toward oracle performance
- Show particular strength on tasks requiring API/import understanding
Resources
- Paper: CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- Dataset: Available via their GitHub
- Their baselines: BM25, UniXcoder embeddings
Why This Is Better Than SWE-bench
| Aspect | SWE-bench | CrossCodeEval |
|---|---|---|
| Task | Bug fixing | Code completion |
| Isolation | Many confounds | Isolates cross-file problem |
| Oracle | None | Has perfect-context upper bound |
| Training data | Popular repos (likely memorized) | Constructed to avoid memorization |
| What it tests | End-to-end agent | Specifically context retrieval |
CrossCodeEval lets us prove graphs help with retrieval, independent of other agent capabilities.