Eval: CrossCodeEval benchmark for cross-file context retrieval #88

@jonathanpopham

Description

Overview

CrossCodeEval is a benchmark specifically designed to test code completion when cross-file context is required. This directly measures what Supermodel is built for.

Unlike SWE-bench (bug fixing) or HumanEval (single-file), CrossCodeEval isolates the cross-file understanding problem.

Why This Benchmark

The finding that matters: "Better retrieval methods are needed for optimal results"

Their results show:

  • Models fail without cross-file context
  • Performance jumps with good context retrieval
  • Big gap between text-based retrieval and oracle (perfect context)

Our hypothesis: Graph-based retrieval outperforms text-based retrieval for finding relevant cross-file context.

Benchmark Structure

Languages: Python, Java, TypeScript, C#

Metrics (a scoring sketch for EM and ES follows this list):

  • Exact Match (EM)
  • Edit Similarity (ES)
  • Identifier Match (API prediction accuracy)
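A minimal scoring sketch for the first two metrics, assuming EM is string equality after trimming whitespace and ES is a normalized Levenshtein ratio; the function names are ours, not the official CrossCodeEval scorer:

```python
# Minimal metric sketch (assumption: EM = trimmed string equality,
# ES = 1 - normalized Levenshtein distance). Not the official scorer.

def exact_match(prediction: str, reference: str) -> bool:
    """EM: the completion matches the reference after trimming whitespace."""
    return prediction.strip() == reference.strip()

def edit_similarity(prediction: str, reference: str) -> float:
    """ES: 1 - normalized edit distance, in [0, 1]."""
    a, b = prediction.strip(), reference.strip()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```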

Test scenarios:

  1. In-file only (baseline)
  2. Retrieved context (BM25, embedding similarity, etc.)
  3. Oracle context (upper bound - perfect retrieval)

Proposed Methods

Method 1: Import Graph Retrieval

Use the dependency graph to find context; a lookup sketch follows the steps below.

Task: Complete code that uses `UserService.authenticate()`

Text retrieval: Search for "authenticate" → noisy results
Graph retrieval: 
  1. Parse incomplete code, find undefined `UserService`
  2. Query dependency graph: "What defines UserService?"
  3. Return: `services/user_service.py` with `authenticate()` definition
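A minimal sketch of that lookup, assuming the dependency graph has already been reduced to a symbol-to-defining-file map; the `defines` dict and function names are illustrative, not an existing API:

```python
import ast

# Sketch of Method 1. Assumes a prebuilt symbol index such as
# {"UserService": "services/user_service.py"}; names are hypothetical.

def undefined_names(incomplete_code: str) -> set[str]:
    """Names the prompt reads but never binds (via assignment, import, or def)."""
    try:
        tree = ast.parse(incomplete_code)
    except SyntaxError:
        # Completion prompts may not parse; a real harness would fall back to regex.
        return set()
    loaded = {n.id for n in ast.walk(tree)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    bound = {n.id for n in ast.walk(tree)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    bound |= {alias.asname or alias.name.split(".")[0]
              for n in ast.walk(tree) if isinstance(n, (ast.Import, ast.ImportFrom))
              for alias in n.names}
    bound |= {n.name for n in ast.walk(tree)
              if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
    return loaded - bound

def import_graph_context(incomplete_code: str, defines: dict[str, str]) -> list[str]:
    """Files that define the symbols the prompt uses but does not resolve."""
    return sorted({defines[name] for name in undefined_names(incomplete_code)
                   if name in defines})
```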

Method 2: Call Chain Retrieval

Use the call graph to find related functions; a retrieval sketch follows the steps below.

Task: Complete a function body that should call helper functions

Graph retrieval:
  1. Find functions similar to the one being completed (e.g. by name or signature similarity)
  2. Query call graph: "What do similar functions call?"
  3. Return: Common callees as candidate APIs to use
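A minimal sketch, assuming a call graph keyed by fully qualified function name and a caller-supplied list of similar functions (how that similarity is computed is left open):

```python
from collections import Counter

# Sketch of Method 2. `call_graph` is assumed to look like
# {"auth.login": ["UserService.authenticate", "audit.log_event"], ...}.

def call_chain_candidates(similar_functions: list[str],
                          call_graph: dict[str, list[str]],
                          top_k: int = 10) -> list[str]:
    """APIs ranked by how many of the similar functions call them."""
    counts = Counter(callee
                     for fn in similar_functions
                     for callee in call_graph.get(fn, []))
    return [callee for callee, _ in counts.most_common(top_k)]
```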

Method 3: Domain-Scoped Retrieval

Use the domain graph to narrow the search space; a retrieval sketch follows the steps below.

Task: Complete code in the authentication module

Graph retrieval:
  1. Identify current file's domain (e.g., "Authentication")
  2. Retrieve all files/functions in same domain
  3. Prioritize context from same architectural area
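A minimal sketch, assuming the domain graph can be flattened into a file-to-domain labelling; the `file_domains` mapping is an assumption about that output, not an existing interface:

```python
# Sketch of Method 3. `file_domains` is assumed to look like
# {"auth/login.py": "Authentication", "billing/invoice.py": "Billing", ...}.

def domain_scoped_context(current_file: str, file_domains: dict[str, str]) -> list[str]:
    """Every other file labelled with the same architectural domain."""
    domain = file_domains.get(current_file)
    if domain is None:
        return []
    return [path for path, d in file_domains.items()
            if d == domain and path != current_file]
```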

Method 4: Hybrid (Graph + Text)

Use the graph to filter candidates, then rank them with text similarity; a sketch follows the list below.

1. Graph narrows the candidate set to structurally relevant files
2. Text similarity ranks within those candidates
3. Best of both worlds: structural relevance plus lexical ranking
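A minimal sketch of the two-stage pipeline; the Jaccard token overlap below is only a stand-in for BM25 or embedding similarity in the ranking stage:

```python
# Sketch of Method 4: graph output narrows the pool, a lexical score orders it.

def tokenize(text: str) -> set[str]:
    return set(text.lower().replace("_", " ").replace(".", " ").split())

def hybrid_retrieve(query_code: str,
                    graph_candidates: list[str],
                    file_contents: dict[str, str],
                    top_k: int = 5) -> list[str]:
    """Rank graph-filtered files by token overlap with the completion prompt."""
    query = tokenize(query_code)

    def score(path: str) -> float:
        doc = tokenize(file_contents.get(path, ""))
        return len(query & doc) / (len(query | doc) or 1)

    return sorted(graph_candidates, key=score, reverse=True)[:top_k]
```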

Eval Design

Treatments

| Treatment | Context Retrieval Method |
|---|---|
| Baseline | In-file only |
| BM25 | Text search (their baseline) |
| Embedding | Semantic similarity |
| Dependency Graph | Import relationships |
| Call Graph | Function call relationships |
| Domain Graph | Architectural grouping |
| Hybrid | Graph filter + text rank |
| Oracle | Perfect context (upper bound) |

Implementation

  1. Get the CrossCodeEval dataset from their repo
  2. For each task, generate the graph for its repository
  3. Implement each retrieval method
  4. Measure EM, ES, and Identifier Match
  5. Compare against their published baselines (a harness sketch follows this list)
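A sketch of the scoring loop, assuming a task dict with `prompt` and `groundtruth` fields, a per-treatment `retriever(task)` returning context snippets, and a `complete(prompt)` wrapper around the model; none of these names come from the CrossCodeEval repo:

```python
from statistics import mean

# Sketch of the eval loop. Field names, `retriever`, and `complete` are
# placeholders, not the actual CrossCodeEval schema or a real model client.

def run_treatment(tasks: list[dict], retriever, complete, scorers: dict) -> dict:
    """Score one retrieval treatment; `scorers` maps metric name -> fn(pred, ref)."""
    totals = {name: [] for name in scorers}
    for task in tasks:
        snippets = retriever(task)                       # treatment-specific context
        prompt = "\n\n".join(snippets) + "\n\n" + task["prompt"]
        prediction = complete(prompt)
        for name, score in scorers.items():
            totals[name].append(float(score(prediction, task["groundtruth"])))
    return {name: mean(values) for name, values in totals.items()}

# Usage: one row per treatment in the table above, scored with the EM/ES
# functions sketched under Metrics.
# results = {name: run_treatment(tasks, fn, complete,
#                                {"EM": exact_match, "ES": edit_similarity})
#            for name, fn in retrievers.items()}
```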

Success Criteria

Graph-based retrieval should:

  • Outperform BM25 and embedding baselines
  • Close the gap toward oracle performance
  • Show particular strength on tasks requiring API/import understanding

Resources

Why This Is Better Than SWE-bench

| Aspect | SWE-bench | CrossCodeEval |
|---|---|---|
| Task | Bug fixing | Code completion |
| Isolation | Many confounds | Isolates the cross-file problem |
| Oracle | None | Has a perfect-context upper bound |
| Training data | Popular repos (likely memorized) | Constructed to avoid memorization |
| What it tests | End-to-end agent | Specifically context retrieval |

CrossCodeEval lets us prove graphs help with retrieval, independent of other agent capabilities.
