Testing Against CyberGym #45
@nickatnight96, great idea! We're 100% aligned. This is definitely on the roadmap, though the CyberGym test harness is massive to set up (~10 TB). That said, it shouldn't be too much of a lift to extend the existing EKS test harness code to cover it.

I only became aware of this benchmark last week after Anthropic dropped the blog post, but agreed, this looks like a strong benchmark. For the task of vulnerability identification in the immediate term (the next few weeks), I'm going to work off CVE-Bench, post results, and ideally move to CyberGym after, though nothing stops us from doing both at the same time. I'm building/refining a plugin now based on testing with CVE-Bench and should have an updated plugin later this week.

Not sure about your bandwidth or level of interest in helping out on this, but it would be awesome if you want to drive this benchmarking feature. If so, let me know and I can open another benchmarking issue to track it, like #52. If not, all good; it just might be a bit before we find the bandwidth.

@aggr0cr4g, any additional thoughts? I know we were talking about this over the weekend, and it sounded like you had some ideas. I suggest we move this conversation to the Discord testing channel to discuss further.
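To make the "extend the harness" idea concrete, here's a rough sketch of the kind of adapter I have in mind. This is purely illustrative: the `CyberGymPlugin` class, the `Task` fields, and the per-case directory layout are my assumptions about how a plugin could look, not CyberGym's actual API or our harness's real interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Illustrative task record: one CyberGym-style case is an unpatched source
# snapshot plus a natural-language vulnerability description.
@dataclass
class Task:
    task_id: str
    codebase: Path    # unpatched source tree; the ~10 TB dataset would sit on a mounted volume
    description: str  # vulnerability description handed to the agent

class CyberGymPlugin:
    """Hypothetical harness plugin; names here are assumptions, not a real interface."""

    name = "cybergym"

    def __init__(self, data_root: Path):
        self.data_root = data_root

    def tasks(self) -> Iterator[Task]:
        # Assume one directory per case containing the source snapshot
        # and a description file (an assumed layout, for illustration only).
        for case_dir in sorted(self.data_root.iterdir()):
            desc_file = case_dir / "description.txt"
            if desc_file.exists():
                yield Task(
                    task_id=case_dir.name,
                    codebase=case_dir / "src",
                    description=desc_file.read_text(),
                )
```

If the existing EKS harness already iterates benchmark plugins this way, CyberGym support is mostly a data-mounting problem rather than new harness code.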
CyberGym is a new large-scale framework from UC Berkeley that tests how well AI agents can find and reproduce real-world software vulnerabilities. It includes over 1,500 benchmark cases from 188 open-source projects, each with real bugs found through Google’s OSS-Fuzz. The setup gives agents an unpatched codebase and a description of the vulnerability, then checks if they can create working proof-of-concept exploits.
The results show that top agents like Claude-Sonnet-4.5 and GPT-5 can already reproduce around 20–30 percent of known vulnerabilities and even discover new ones, including several zero-days and incomplete patches. Integrating CyberGym into our testing pipeline would help us measure our agent’s real-world performance more accurately and compare results across models. It would also give us a standard way to test both vulnerability reproduction and open-ended discovery on live codebases.
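For anyone picturing how scoring might plug into our pipeline, here's a minimal sketch of the reproduction check described above. It assumes the harness has already built a sanitizer-instrumented, OSS-Fuzz-style fuzz target from the unpatched snapshot; the function name and the AddressSanitizer-banner check are assumptions for illustration, not CyberGym's actual scoring code.

```python
import subprocess
from pathlib import Path

def reproduces_vulnerability(fuzz_target: Path, poc_input: Path, timeout_s: int = 60) -> bool:
    """Return True if the agent-produced PoC input crashes the instrumented target.

    Assumes a libFuzzer-style binary that takes the input file as argv[1] and
    reports crashes via AddressSanitizer on stderr (an assumption, not
    CyberGym's documented contract).
    """
    try:
        result = subprocess.run(
            [str(fuzz_target), str(poc_input)],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Hangs are treated as non-reproductions in this sketch.
        return False
    return result.returncode != 0 and b"AddressSanitizer" in result.stderr
```

A real check would also want to recognize UBSan/MSan reports and classify the crash type, but this is the general shape of the pass/fail signal.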