There are default values in the evaluation repo for DATASET / DATASET_SPLIT / GAIA_LEVEL / etc., and those same values have defaults again in the benchmarks repo. This duplication is error-prone: to update a parameter, someone has to change the value in two very different places.
- the evaluation repository should not define any of those values; it should rely on the default values of the benchmarks repo for run-infer and run-eval
- all such values should live in a single {benchmarks}/config.py (see the sketch after this list)
- all such values should be saved in the artifacts for traceability
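
A minimal sketch of what such a centralized config could look like. This is an assumption about shape, not the actual implementation: apart from the dataset ids quoted in this issue, the module layout, class/function names, the GAIA dataset id, and the split values are all hypothetical.

```python
# {benchmarks}/config.py -- single source of truth for benchmark defaults.
# Everything below except the SWT-bench / SWE-bench dataset ids is illustrative.

from dataclasses import asdict, dataclass
import json
import pathlib


@dataclass(frozen=True)
class BenchmarkConfig:
    dataset: str
    dataset_split: str = "test"
    gaia_level: int | None = None  # only meaningful for GAIA


# Per-benchmark defaults; the evaluation repo imports these instead of
# redefining them.
GAIA = BenchmarkConfig(dataset="gaia-benchmark/GAIA", dataset_split="validation", gaia_level=1)
SWTBENCH_INFER = BenchmarkConfig(dataset="eth-sri/SWT-bench_Verified_bm25_27k_zsp")
SWTBENCH_EVAL = BenchmarkConfig(dataset="princeton-nlp/SWE-bench_Verified")


def save_to_artifacts(config: BenchmarkConfig, artifacts_dir: str) -> None:
    """Write the resolved config next to the run outputs for traceability."""
    path = pathlib.Path(artifacts_dir) / "benchmark_config.json"
    path.write_text(json.dumps(asdict(config), indent=2))
```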
Also, SWTBench should have (see the usage sketch below):
- DATASET_INFER = eth-sri/SWT-bench_Verified_bm25_27k_zsp
- DATASET_EVAL = princeton-nlp/SWE-bench_Verified
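
A hypothetical sketch of how run-infer / run-eval could consume those defaults and record them in the artifacts; the function signatures and the import path are assumptions, only the dataset ids come from this issue.

```python
# Hypothetical entry points in the evaluation repo: defaults come from the
# benchmarks config and are saved with the run, never redefined here.
from config import SWTBENCH_EVAL, SWTBENCH_INFER, save_to_artifacts


def run_infer(artifacts_dir: str) -> None:
    cfg = SWTBENCH_INFER  # eth-sri/SWT-bench_Verified_bm25_27k_zsp
    save_to_artifacts(cfg, artifacts_dir)  # traceability: record what was used
    # ... run inference against cfg.dataset / cfg.dataset_split ...


def run_eval(artifacts_dir: str) -> None:
    cfg = SWTBENCH_EVAL  # princeton-nlp/SWE-bench_Verified
    save_to_artifacts(cfg, artifacts_dir)
    # ... run evaluation against cfg.dataset / cfg.dataset_split ...
```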