
How to exactly reproduce the eval results? #18

@yucc-leon

Description


Hi, I ran the training script, and it ended with an OOM error after about 2.6k steps. (Is this normal, by the way?)
When I then evaluated on LCB v5, my score didn't match the reported one: 27.5 vs. 29.5.

I suspected a template mismatch caused it, so I added these configs to the LCB code:

.../LiveCodeBench/lcb_runner/prompts/code_generation.py:

    # Adapted from `SYSTEM_PROMPT` in `.../code-r1/examples/data_preprocess/coder1.py`
    # and `SYSTEM_MESSAGE_CODEQWEN` in `code_generation.py`.
    SYSTEM_MESSAGE_CODER1QWEN = (
        "<|im_start|>system\nYou are a helpful programming assistant. "
        "The user will ask you a question and you as the assistant solve it. "
        "The assistant first thinks how to solve the task through reasoning and then provides the user with the final answer. "
        "The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively."
        "<|im_end|>\n<|im_start|>user"
    )


# Adapted from the `get_codeqwen_question_template_answer` function.
def get_coder1qwen_question_template_answer(question: CodeGenerationProblem) -> str:
    prompt = "Please solve the programming task below using a self-contained code snippet in a markdown code block.\n\n"
    prompt += f"Question: {question.question_content}\n\n"
    if question.starter_code:
        prompt += f"{PromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n"
        prompt += f"```python\n{question.starter_code}\n```\n\n<|im_end|>\n"
    else:
        prompt += f"{PromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n"
        prompt += "```python\n# YOUR CODE HERE\n```\n\n<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt
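As a quick sanity check that this template assembles as intended, here is a minimal, self-contained sketch with stand-ins for `CodeGenerationProblem` and `PromptConstants` (the stub constants are illustrative, not the exact strings from the LCB repo; the markdown fence is built programmatically only so this snippet nests cleanly):

```python
from dataclasses import dataclass

FENCE = "`" * 3  # literal ``` built programmatically to keep this snippet copy-pasteable

@dataclass
class CodeGenerationProblem:
    # Stand-in for the real class in lcb_runner; fields match the ones used above.
    question_content: str
    starter_code: str = ""

class PromptConstants:
    # Illustrative stubs; the real constants live in lcb_runner/prompts.
    FORMATTING_MESSAGE_WITH_STARTER_CODE = "Use the starter code below."
    FORMATTING_WITHOUT_STARTER_CODE = "Read from stdin and write to stdout."

def get_coder1qwen_question_template_answer(question: CodeGenerationProblem) -> str:
    prompt = "Please solve the programming task below using a self-contained code snippet in a markdown code block.\n\n"
    prompt += f"Question: {question.question_content}\n\n"
    if question.starter_code:
        prompt += f"{PromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n"
        prompt += f"{FENCE}python\n{question.starter_code}\n{FENCE}\n\n<|im_end|>\n"
    else:
        prompt += f"{PromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n"
        prompt += f"{FENCE}python\n# YOUR CODE HERE\n{FENCE}\n\n<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt

demo = get_coder1qwen_question_template_answer(
    CodeGenerationProblem(question_content="Print the sum of two integers.")
)
```

Printing `demo` makes it easy to eyeball where `<|im_end|>` and `<|im_start|>assistant` land relative to the question text.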

and registered a new `LanguageModel` entry in .../LiveCodeBench/lcb_runner/lm_styles.py:

    LanguageModel(
        "ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088",
        "CodeR1QwenInst",
        LMStyle.CodeR1QwenInst,
        datetime(2025, 4, 21),
        ""
    )
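One thing worth double-checking: registering the `LanguageModel` entry alone is typically not enough. `LMStyle.CodeR1QwenInst` also has to exist as a member of the `LMStyle` enum in `lm_styles.py`, and the prompt-formatting dispatch in `code_generation.py` needs a branch routing the new style to the template above. A rough sketch of the shape (the member names and dispatch function here are illustrative, not the repo's exact code):

```python
from enum import Enum

class LMStyle(Enum):
    # Illustrative subset; the real enum in lm_styles.py has many more members.
    CodeQwenInstruct = "CodeQwenInstruct"
    CodeR1QwenInst = "CodeR1QwenInst"  # new style for the CodeR1 checkpoint

def select_template(style: LMStyle) -> str:
    # Sketch of the dispatch branch in code_generation.py's prompt formatting.
    if style is LMStyle.CodeR1QwenInst:
        return "coder1qwen"
    if style is LMStyle.CodeQwenInstruct:
        return "codeqwen"
    raise NotImplementedError(f"no template registered for {style}")
```

If the enum member is missing, the `LanguageModel(...)` registration above fails at import time, which is an easy way to confirm whether this step was picked up.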

and then ran:

python -m lcb_runner.runner.main \
    --model ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088 \
    --local_model_path ganler/CodeR1-Zero-Qwen2.5-7B-LC2k-1088 \
    --scenario codegeneration \
    --evaluate \
    --tensor_parallel_size 1 \
    --release_version release_v5

But even with these changes it still ended up at 29.17, compared to the reported 29.5.
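For context on how close 29.17 vs. 29.5 actually is: pass@1 over N problems is a binomial estimate, so its standard error is roughly sqrt(p(1-p)/N). Assuming a problem count on the order of several hundred for the v5 split (the exact number depends on the release and date filters, so N below is illustrative), a ~0.3-point gap sits well within one standard error, i.e. plausibly just sampling/decoding noise:

```python
import math

def pass_at_1_stderr(p: float, n_problems: int) -> float:
    """Binomial standard error of a pass@1 estimate, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)

# Illustrative problem count; the true release_v5 size depends on filters.
N = 800
se = pass_at_1_stderr(0.295, N)  # roughly 1.6 percentage points
gap = 29.5 - 29.17               # observed gap: about 0.33 points
```

Since `gap` is well below `se` under this assumption, the remaining difference may not indicate a template problem at all; decoding settings (temperature, seed, number of samples) are worth checking before template details.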
