
Discrepancy in Evaluation Results from top_scores.py #2

@w22cao

Hello, I have a question regarding the evaluation procedure described in "III. Obtain some top scores".

I tried using top_scores.py to compute the localization results, but I ran into some issues with the script. For example, it refers to a logs_path folder that does not exist in the repository. After replacing logs_path with model_logs and adjusting the script accordingly (roughly the change sketched below), I was able to run the evaluation. However, the results I obtained do not match those reported in the paper.
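For reference, this is approximately the substitution I made. The variable name is from memory and may not match top_scores.py exactly; model_logs is simply the folder that does exist in the repository:

```python
import os

# My substitution in top_scores.py (variable name may not match the script
# exactly): point the results directory at the folder that actually exists
# in the repository.
logs_path = "model_logs"  # was "logs_path", which is not present in the repo

# Sanity check that the folder is there before running the rest of the script.
assert os.path.isdir(logs_path), f"{logs_path} not found"
```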

Specifically, for the 16B model, the paper reports:

LLMAO with CodeGen-16B
top-1: 88 (22.3%)
top-3: 149 (37.7%)
top-5: 183 (46.3%)

But my output was:

Top 1,3,5 of 395 total bugs for defects4j-16B:
[57 (14.4%) & 76 (19.2%) & 90 (22.8%)]
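Both sets of percentages are consistent with the same denominator of 395 bugs, so the discrepancy is in the raw top-k counts rather than in the bug total. A quick check (numbers copied from above):

```python
# Verify that both the paper's numbers and my output use 395 total bugs.
total = 395
paper = {1: 88, 3: 149, 5: 183}   # reported in the paper
mine = {1: 57, 3: 76, 5: 90}      # from my run of top_scores.py

for k in (1, 3, 5):
    print(f"top-{k}: paper {paper[k]/total:.1%} vs mine {mine[k]/total:.1%}")
# top-1: paper 22.3% vs mine 14.4%
# top-3: paper 37.7% vs mine 19.2%
# top-5: paper 46.3% vs mine 22.8%
```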

I’m wondering if I might be using the wrong folder, or if the contents of model_logs need to be updated. Could you please take a look and clarify this?

Thank you!
