Hi,
I think this is a really cool and useful project.
But I noticed that this test file leaks a golden answer, at `enterprise-deep-research/test_benchmark.py`, line 17 (commit ad0d535):

```python
expected_answer = "1988-96"
```
According to BrowseComp's release notes, all questions and answers should be kept encrypted. This leakage is polluting my model's training/eval data, and perhaps yours as well.
Could you use a dummy QA pair in the test instead of the real data from BrowseComp?
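As a concrete sketch of what I mean (the names and values here are made up for illustration, not taken from your repo), the test could exercise the same code path with a synthetic pair so no real BrowseComp answer ever appears in the repository:

```python
# Hypothetical dummy QA pair: exercises the benchmark test logic
# without embedding any real BrowseComp question or answer.
dummy_question = "In which season did the fictional Example FC win the cup?"
dummy_answer = "2000-01"  # invented value, deliberately not from BrowseComp

def test_benchmark_pipeline():
    # Same shape as the original check, but against the dummy pair
    # instead of the leaked `expected_answer = "1988-96"`.
    expected_answer = dummy_answer
    assert isinstance(expected_answer, str)
    assert expected_answer == "2000-01"

test_benchmark_pipeline()
```

This keeps the test meaningful (it still validates the answer-matching logic) while removing the real benchmark data.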
This would benefit both the community and your own project.
Thanks!