* fix HumanEval_32, fix typo in prompt
* fix HumanEval_38, fix no example in prompt
* fix HumanEval_41, fix no example in prompt
* fix HumanEval_47, fix wrong example in prompt
* fix HumanEval_50, fix no example & ambiguous prompt
* fix HumanEval_57, fix ambiguous prompt & typo
* fix HumanEval_67, fix typo in prompt
* fix HumanEval_83, fix no example in prompt
* fix HumanEval_95, fix wrong canonical solution & incomplete test cases
* fix HumanEval_163, fix wrong canonical solution & wrong test cases
* fix HumanEval_75, fix wrong prompt
* fix HumanEval_116, fix wrong prompt and wrong examples in prompt
* fix HumanEval_64, fix unnecessary statement in prompt
* remove unnecessary leading spaces in prompt
Those fixes make a lot of sense; I'm surprised this was not merged.
Since this seems to be a good place to collect issues with the dataset, I'd like to add one more: in HumanEval/33, the test uses `assert tuple(candidate([1, 2, 3])) == tuple(sort_third([1, 2, 3]))`. I think the result of that function call should be inlined. Otherwise, the identity function could pass the first three assertions, and they add nothing of value.
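To illustrate the comment above, here is a minimal sketch (not from the PR itself) of why a self-referential assertion is vacuous for inputs that the reference solution leaves unchanged. The `sort_third` behavior below follows the HumanEval/33 specification (values at indices divisible by three are sorted, all other positions untouched); `identity` is a hypothetical broken candidate.

```python
def sort_third(l):
    # Reference behavior of HumanEval/33: values at indices divisible
    # by three are sorted; every other position is left untouched.
    l = list(l)
    l[::3] = sorted(l[::3])
    return l

def identity(l):
    # A deliberately wrong candidate solution.
    return l

# Self-referential check from the original test: [1, 2, 3] is a fixed
# point of sort_third, so the identity function passes it.
assert tuple(identity([1, 2, 3])) == tuple(sort_third([1, 2, 3]))

# Inlined expected value: the same check, but explicit and auditable.
assert tuple(sort_third([1, 2, 3])) == (1, 2, 3)

# An input that is NOT a fixed point would expose the identity function:
assert sort_third([5, 6, 3, 4, 8, 9, 2]) == [2, 6, 3, 4, 8, 9, 5]
assert identity([5, 6, 3, 4, 8, 9, 2]) != [2, 6, 3, 4, 8, 9, 5]
```

Inlining the expected tuples would make each assertion meaningful on its own, rather than trivially true whenever the input happens to be a fixed point.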
Dear HumanEval Maintainers,
Thank you so much for sharing this awesome Test Set!
I fully understand that, due to the nature of a test set, we want to keep it unchanged as much as possible. However, during our usage we found a few mistakes in some prompts, canonical solutions, and test cases (some were also raised in previous issues: https://github.com/openai/human-eval/issues).
These mistakes indeed affect the ability of HumanEval to accurately reflect the performance of a Code Generation Model. Therefore, here I'd love to propose an enhanced version of HumanEval, which fixes these known mistakes.
The changes made to the original repo:
* Add file `human-eval-enhanced-202307.jsonl.gz` to folder `\data`. This file is the compressed fixed dataset including the following 14 changes. Details about the mistakes and changes are documented in another file, `tests.py`, in the folder `\data`.
* Add file `tests.py` to the folder `\data`. This file includes tests for the changes in `human-eval-enhanced-202307.jsonl`, and also details about the mistakes in the original data set `human-eval-v2-20210705.jsonl`. The tests can be run as a script, using the command `python tests.py`, or they can be run by pytest, following the detailed instructions at the top of `tests.py`.
* Add `.gitignore` to the root directory. This file includes common files to ignore when building a Python project, especially `.pytest_cache` and `__pycache__`, since `tests.py` can be run by pytest. This `.gitignore` file is not really important and can be optionally removed from this PR.

Thanks for your time reviewing this PR. Any feedback would be much appreciated : )
[UPDATE] Sorry for not using compressed files to avoid data leakage in the first place; it was an honest mistake. It's fixed now in this PR, and there will be no leakage after it's squash-and-merged. However, uncompressed files are still present in the history of some other closed, accidental PRs. I can reach out to GitHub Support to delete them if necessary.
Sincerely,
marcusm117