
Discrepancy in Evaluation Results from top_scores.py #2

@w22cao

Hello, I have a question regarding the evaluation procedure described in "III. Obtain some top scores".

I tried using top_scores.py to compute the localization results, but I ran into some issues with the script. For example, it refers to a logs_path folder that does not exist in the repository. After replacing logs_path with model_logs and adjusting the script accordingly (roughly the change sketched below), I was able to run the evaluation. However, the results I obtained do not match those reported in the paper.
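For reference, this is approximately the substitution I made. The variable name is from memory and may not match top_scores.py exactly; model_logs is simply the folder that does exist in the repository:

```python
import os

# My substitution in top_scores.py (variable name may not match the script
# exactly): point the results directory at the folder that actually exists
# in the repository.
logs_path = "model_logs"  # was "logs_path", which is not present in the repo

# Sanity check that the folder is there before running the rest of the script.
assert os.path.isdir(logs_path), f"{logs_path} not found"
```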

Specifically, for the 16B model, the paper reports:

LLMAO with CodeGen-16B
top-1: 88 (22.3%)
top-3: 149 (37.7%)
top-5: 183 (46.3%)

But my output was:

Top 1,3,5 of 395 total bugs for defects4j-16B:
[57 (14.4%) & 76 (19.2%) & 90 (22.8%)]
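Both sets of percentages are consistent with the same denominator of 395 bugs, so the discrepancy is in the raw top-k counts rather than in the bug total. A quick check (numbers copied from above):

```python
# Verify that both the paper's numbers and my output use 395 total bugs.
total = 395
paper = {1: 88, 3: 149, 5: 183}   # reported in the paper
mine = {1: 57, 3: 76, 5: 90}      # from my run of top_scores.py

for k in (1, 3, 5):
    print(f"top-{k}: paper {paper[k]/total:.1%} vs mine {mine[k]/total:.1%}")
# top-1: paper 22.3% vs mine 14.4%
# top-3: paper 37.7% vs mine 19.2%
# top-5: paper 46.3% vs mine 22.8%
```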

I’m wondering if I might be using the wrong folder, or if the contents of model_logs need to be updated. Could you please take a look and clarify this?

Thank you!
