* fix HumanEval_32, fix typo in prompt
* fix HumanEval_38, fix no example in prompt
* fix HumanEval_41, fix no example in prompt
* fix HumanEval_47, fix wrong example in prompt
* fix HumanEval_50, fix no example & ambiguous prompt
* fix HumanEval_57, fix ambiguous prompt & typo
* fix HumanEval_67, fix typo in prompt
* fix HumanEval_83, fix no example in prompt
* fix HumanEval_95, fix wrong canonical solution & incomplete test cases
* fix HumanEval_163, fix wrong canonical solution & wrong test cases
* fix HumanEval_75, fix wrong prompt
* fix HumanEval_116, fix wrong prompt and wrong examples in prompt
* fix HumanEval_64, fix unnecessary statement in prompt
* remove unnecessary leading spaces in prompt
Those fixes make a lot of sense; I'm surprised this was not merged.
Since this seems to be a good place to collect issues with the dataset, I'd like to add one more: in HumanEval/33, the test uses `assert tuple(candidate([1, 2, 3])) == tuple(sort_third([1, 2, 3]))`. I think the result of that function call should be inlined. Otherwise, the identity function could pass the first three assertions, and they add nothing of value.
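To illustrate the comment above, here is a minimal sketch (not from the PR itself) of why a self-referential assertion is vacuous for inputs that the reference solution leaves unchanged. The `sort_third` behavior below follows the HumanEval/33 specification (values at indices divisible by three are sorted, all other positions untouched); `identity` is a hypothetical broken candidate.

```python
def sort_third(l):
    # Reference behavior of HumanEval/33: values at indices divisible
    # by three are sorted; every other position is left untouched.
    l = list(l)
    l[::3] = sorted(l[::3])
    return l

def identity(l):
    # A deliberately wrong candidate solution.
    return l

# Self-referential check from the original test: [1, 2, 3] is a fixed
# point of sort_third, so the identity function passes it.
assert tuple(identity([1, 2, 3])) == tuple(sort_third([1, 2, 3]))

# Inlined expected value: the same check, but explicit and auditable.
assert tuple(sort_third([1, 2, 3])) == (1, 2, 3)

# An input that is NOT a fixed point would expose the identity function:
assert sort_third([5, 6, 3, 4, 8, 9, 2]) == [2, 6, 3, 4, 8, 9, 5]
assert identity([5, 6, 3, 4, 8, 9, 2]) != [2, 6, 3, 4, 8, 9, 5]
```

Inlining the expected tuples would make each assertion meaningful on its own, rather than trivially true whenever the input happens to be a fixed point.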
Dear HumanEval Maintainers,
Thank you so much for sharing this awesome Test Set!
I fully understand that, due to the nature of a test set, we want to keep it unchanged as much as possible. However, during our usage we found a few mistakes in some prompts, canonical solutions, and test cases (some were also raised in previous issues: https://github.com/openai/human-eval/issues).
These mistakes indeed affect the ability of HumanEval to accurately reflect the performance of a Code Generation Model. Therefore, here I'd love to propose an enhanced version of HumanEval, which fixes these known mistakes.
The changes made to the original repo:
* Add file `human-eval-enhanced-202307.jsonl.gz` to folder `\data`. This file is the compressed fixed dataset including the following 14 changes. Details about the mistakes and changes are documented in another file, `tests.py`, in the folder `\data`.
* Add file `tests.py` to the folder `\data`. This file includes tests for the changes in `human-eval-enhanced-202307.jsonl`, and also details about the mistakes in the original data set `human-eval-v2-20210705.jsonl`. The tests can be run as a script, using the command `python tests.py`, or they can be run by pytest, following the detailed instructions at the top of `tests.py`.
* Add `.gitignore` to the root directory. This file includes common files to ignore when building a Python project, especially `.pytest_cache` and `__pycache__`, since `tests.py` can be run by pytest. This `.gitignore` file is not really important and can be optionally removed from this PR.

Thanks for your time reviewing this PR. Any feedback would be much appreciated : )
[UPDATE] Sorry for not using compressed files to avoid data leakage in the first place; it was an honest mistake. It's fixed now in this PR, and there will be no leakage after it's squash-and-merged. However, uncompressed files are still present in the history of some other closed, accidental PRs. I can reach out to GitHub Support to delete them if necessary.
Sincerely,
marcusm117