CodeForces Multi-Problem Reward Function

Setup

pip install requests pyarrow pandas filelock

# Download generated tests (~110GB) — optional but recommended
pip install -U "huggingface_hub[cli,hf_xet]"
huggingface-cli download open-r1/codeforces \
    --repo-type=dataset \
    --include='generated_tests/*.parquet' \
    --max-workers=8 \
    --local-dir /path/to/cf_data

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| PISTON_ENDPOINT | Yes | http://localhost:2000 | Piston sandbox URL |
| CF_GENERATED_TESTS | No | "" | Path to dir containing generated_tests/*.parquet |
| CF_MAX_GENERATED_TESTS | No | 3 | Max generated tests per problem (keep low for training speed) |
| PISTON_TIMEOUT | No | 30 | Seconds before a Piston request times out |
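In Python, these variables can be read with the documented defaults. This is a sketch of how the configuration might be loaded; the actual module may read them differently:

```python
import os

# Read the sandbox configuration, falling back to the documented defaults.
PISTON_ENDPOINT = os.environ.get("PISTON_ENDPOINT", "http://localhost:2000")
CF_GENERATED_TESTS = os.environ.get("CF_GENERATED_TESTS", "")  # dir with generated_tests/*.parquet
CF_MAX_GENERATED_TESTS = int(os.environ.get("CF_MAX_GENERATED_TESTS", "3"))
PISTON_TIMEOUT = float(os.environ.get("PISTON_TIMEOUT", "30"))
```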

Integration

Option A: Drop into your existing verl_reward_func

# In train_grpo_limr_zero3.py, add to verl_reward_func:
elif data_source == "open-r1/codeforces":
    from codeforces_reward import compute_score_codeforces
    return compute_score_codeforces(solution_str, ground_truth, extra_info)

Option B: Use the bundled dispatcher

from codeforces_reward import verl_reward_func
# handles all data_sources including codeforces

How It Works

Model output                    Ground truth (from dataset builder)
─────────────                   ───────────────────────────────────
## Problem 1                    gt_list[0] = {
<reasoning>                       "id": "1234_A",
```python                         "tests": [{input, output}, ...],
<code>                            "generated_checker": "...",
```                               "time_limit": 2.0,
                                  ...
## Problem 2                    }
...                             gt_list[1] = { ... }
  1. Parse: Split output on ## Problem K headers → extract last code block per section
  2. Load tests: Official tests from ground_truth dict + generated tests from parquet (lazy-cached)
  3. Execute: Send each solution to Piston with test input, expected output, checker, and limits
  4. Score: num_problems_correct / total_problems → reward ∈ [0, 1]
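Step 1 can be sketched roughly as follows. This is an illustrative parser, not the exact implementation in codeforces_reward; `split_problems` and the regexes are hypothetical:

```python
import re

def split_problems(output: str) -> list:
    """Split a multi-problem completion on '## Problem K' headers and
    return the last fenced code block in each section (None if missing)."""
    sections = re.split(r"^## Problem \d+\s*$", output, flags=re.MULTILINE)
    solutions = []
    for section in sections[1:]:  # sections[0] is any preamble before the first header
        blocks = re.findall(r"```(?:python)?\n(.*?)```", section, flags=re.DOTALL)
        solutions.append(blocks[-1].strip() if blocks else None)
    return solutions
```

Taking the last code block per section means a model that revises its solution mid-section is graded on the final version.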

Scoring Modes

  • Binary per-problem (default): 1.0 only if ALL tests pass for that problem
  • Partial credit per-problem: Uncomment the alternative return in _score_single_cf_problem
  • Overall: Always returns fraction of problems fully solved
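The overall reward described above reduces to a simple fraction. A minimal sketch, assuming each problem has already been reduced to a pass/fail boolean:

```python
def overall_reward(per_problem_pass: list) -> float:
    """Overall reward: fraction of problems fully solved (binary per problem)."""
    if not per_problem_pass:
        return 0.0
    return sum(per_problem_pass) / len(per_problem_pass)
```

For example, `overall_reward([True, False, True, True])` yields 0.75.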

Notes

  • fail_fast=True stops testing a problem on first failure (faster during training)
  • Generated tests are cached in memory per-contest after first load
  • The function handles both single-problem and multi-problem formats automatically
  • Custom checkers are passed to Piston as checker.py when present
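The request sent to Piston (step 3 above, including the checker.py attachment) might be assembled like this. The payload shape follows the public Piston `/api/v2/execute` schema; `build_piston_payload` is a hypothetical helper, not the repo's actual function:

```python
def build_piston_payload(code: str, test_input: str, checker, time_limit: float) -> dict:
    """Build a Piston execute request body (assumed shape based on the
    public Piston API; the repo's exact payload may differ)."""
    files = [{"name": "main.py", "content": code}]
    if checker:  # custom checker shipped alongside the solution
        files.append({"name": "checker.py", "content": checker})
    return {
        "language": "python",
        "version": "*",
        "files": files,
        "stdin": test_input,
        "run_timeout": int(time_limit * 1000),  # Piston expects milliseconds
    }
```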
