
How to exactly reproduce the eval results? #18

@yucc-leon

Description


Hi, I ran the training script, and it ended with an OOM error after about 2.6k steps. (Is this normal, by the way?)
When I then evaluated on LCB v5, my score didn't match the reported one: 27.5 vs. 29.5.

I suspected a template mismatch caused it, so I added these configs to the LCB code:

.../LiveCodeBench/lcb_runner/prompts/code_generation.py:

    # Adapted from `SYSTEM_PROMPT` in `.../code-r1/examples/data_preprocess/coder1.py`
    # and `SYSTEM_MESSAGE_CODEQWEN` in `code_generation.py`.
    SYSTEM_MESSAGE_CODER1QWEN = (
        "<|im_start|>system\nYou are a helpful programming assistant. "
        "The user will ask you a question and you as the assistant solve it. "
        "The assistant first thinks how to solve the task through reasoning and then provides the user with the final answer. "
        "The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively."
        "<|im_end|>\n<|im_start|>user"
    )


# Adapted from the `get_codeqwen_question_template_answer` function.
def get_coder1qwen_question_template_answer(question: CodeGenerationProblem) -> str:
    prompt = "Please solve the programming task below using a self-contained code snippet in a markdown code block.\n\n"
    prompt += f"Question: {question.question_content}\n\n"
    if question.starter_code:
        prompt += f"{PromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n"
        prompt += f"```python\n{question.starter_code}\n```\n\n<|im_end|>\n"
    else:
        prompt += f"{PromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n"
        prompt += "```python\n# YOUR CODE HERE\n```\n\n<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt
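As a quick sanity check that this template assembles as intended, here is a minimal, self-contained sketch with stand-ins for `CodeGenerationProblem` and `PromptConstants` (the stub constants are illustrative, not the exact strings from the LCB repo; the markdown fence is built programmatically only so this snippet nests cleanly):

```python
from dataclasses import dataclass

FENCE = "`" * 3  # literal ``` built programmatically to keep this snippet copy-pasteable

@dataclass
class CodeGenerationProblem:
    # Stand-in for the real class in lcb_runner; fields match the ones used above.
    question_content: str
    starter_code: str = ""

class PromptConstants:
    # Illustrative stubs; the real constants live in lcb_runner/prompts.
    FORMATTING_MESSAGE_WITH_STARTER_CODE = "Use the starter code below."
    FORMATTING_WITHOUT_STARTER_CODE = "Read from stdin and write to stdout."

def get_coder1qwen_question_template_answer(question: CodeGenerationProblem) -> str:
    prompt = "Please solve the programming task below using a self-contained code snippet in a markdown code block.\n\n"
    prompt += f"Question: {question.question_content}\n\n"
    if question.starter_code:
        prompt += f"{PromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n"
        prompt += f"{FENCE}python\n{question.starter_code}\n{FENCE}\n\n<|im_end|>\n"
    else:
        prompt += f"{PromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n"
        prompt += f"{FENCE}python\n# YOUR CODE HERE\n{FENCE}\n\n<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt

demo = get_coder1qwen_question_template_answer(
    CodeGenerationProblem(question_content="Print the sum of two integers.")
)
```

Printing `demo` makes it easy to eyeball where `<|im_end|>` and `<|im_start|>assistant` land relative to the question text.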

and registered a new `LanguageModel` entry in .../LiveCodeBench/lcb_runner/lm_styles.py:

    LanguageModel(
        "ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088",
        "CodeR1QwenInst",
        LMStyle.CodeR1QwenInst,
        datetime(2025, 4, 21),
        ""
    )
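One thing worth double-checking: registering the `LanguageModel` entry alone is typically not enough. `LMStyle.CodeR1QwenInst` also has to exist as a member of the `LMStyle` enum in `lm_styles.py`, and the prompt-formatting dispatch in `code_generation.py` needs a branch routing the new style to the template above. A rough sketch of the shape (the member names and dispatch function here are illustrative, not the repo's exact code):

```python
from enum import Enum

class LMStyle(Enum):
    # Illustrative subset; the real enum in lm_styles.py has many more members.
    CodeQwenInstruct = "CodeQwenInstruct"
    CodeR1QwenInst = "CodeR1QwenInst"  # new style for the CodeR1 checkpoint

def select_template(style: LMStyle) -> str:
    # Sketch of the dispatch branch in code_generation.py's prompt formatting.
    if style is LMStyle.CodeR1QwenInst:
        return "coder1qwen"
    if style is LMStyle.CodeQwenInstruct:
        return "codeqwen"
    raise NotImplementedError(f"no template registered for {style}")
```

If the enum member is missing, the `LanguageModel(...)` registration above fails at import time, which is an easy way to confirm whether this step was picked up.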

and then ran:

python -m lcb_runner.runner.main \
    --model ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088 \
    --local_model_path ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088 \
    --scenario codegeneration \
    --evaluate \
    --tensor_parallel_size 1 \
    --release_version release_v5

But even with these changes it still ended up at 29.17, compared to the reported 29.5.
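For context on how close 29.17 vs. 29.5 actually is: pass@1 over N problems is a binomial estimate, so its standard error is roughly sqrt(p(1-p)/N). Assuming a problem count on the order of several hundred for the v5 split (the exact number depends on the release and date filters, so N below is illustrative), a ~0.3-point gap sits well within one standard error, i.e. plausibly just sampling/decoding noise:

```python
import math

def pass_at_1_stderr(p: float, n_problems: int) -> float:
    """Binomial standard error of a pass@1 estimate, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)

# Illustrative problem count; the true release_v5 size depends on filters.
N = 800
se = pass_at_1_stderr(0.295, N)  # roughly 1.6 percentage points
gap = 29.5 - 29.17               # observed gap: about 0.33 points
```

Since `gap` is well below `se` under this assumption, the remaining difference may not indicate a template problem at all; decoding settings (temperature, seed, number of samples) are worth checking before template details.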
