Testing Against CyberGym #45
@nickatnight96, great idea! We're 100% aligned. This is definitely on the roadmap, though the CyberGym test harness is massive to set up (~10 TB). That said, it shouldn't be too much of a lift to extend the existing EKS test harness code to cover it.

I only became aware of this benchmark last week after Anthropic dropped the blog post, but agreed, this looks like a strong benchmark. For the task of vulnerability identification in the immediate term (the next few weeks), I'm going to work off CVE-Bench, post results, and ideally move to CyberGym after, though nothing stops us from doing both at the same time. I'm building/refining a plugin now based on testing with CVE-Bench and should have an updated plugin later this week.

Not sure about your bandwidth or level of interest in helping out on this, but it would be awesome if you want to drive this benchmarking feature. If so, let me know and I can open another benchmarking issue to track it, like #52. If not, all good; it just might be a bit before we find the bandwidth.

@aggr0cr4g, any additional thoughts? I know we were talking about this over the weekend, and it sounded like you had some ideas. I suggest we move this conversation to the Discord testing channel to discuss further.
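To make the "extend the harness" idea concrete, here's a rough sketch of the kind of adapter I have in mind. This is purely illustrative: the `CyberGymPlugin` class, the `Task` fields, and the per-case directory layout are my assumptions about how a plugin could look, not CyberGym's actual API or our harness's real interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

# Illustrative task record: one CyberGym-style case is an unpatched source
# snapshot plus a natural-language vulnerability description.
@dataclass
class Task:
    task_id: str
    codebase: Path    # unpatched source tree; the ~10 TB dataset would sit on a mounted volume
    description: str  # vulnerability description handed to the agent

class CyberGymPlugin:
    """Hypothetical harness plugin; names here are assumptions, not a real interface."""

    name = "cybergym"

    def __init__(self, data_root: Path):
        self.data_root = data_root

    def tasks(self) -> Iterator[Task]:
        # Assume one directory per case containing the source snapshot
        # and a description file (an assumed layout, for illustration only).
        for case_dir in sorted(self.data_root.iterdir()):
            desc_file = case_dir / "description.txt"
            if desc_file.exists():
                yield Task(
                    task_id=case_dir.name,
                    codebase=case_dir / "src",
                    description=desc_file.read_text(),
                )
```

If the existing EKS harness already iterates benchmark plugins this way, CyberGym support is mostly a data-mounting problem rather than new harness code.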
CyberGym is a new large-scale framework from UC Berkeley that tests how well AI agents can find and reproduce real-world software vulnerabilities. It includes over 1,500 benchmark cases from 188 open-source projects, each with real bugs found through Google’s OSS-Fuzz. The setup gives agents an unpatched codebase and a description of the vulnerability, then checks if they can create working proof-of-concept exploits.
The results show that top agents like Claude-Sonnet-4.5 and GPT-5 can already reproduce around 20–30 percent of known vulnerabilities and even discover new ones, including several zero-days and incomplete patches. Integrating CyberGym into our testing pipeline would help us measure our agent’s real-world performance more accurately and compare results across models. It would also give us a standard way to test both vulnerability reproduction and open-ended discovery on live codebases.
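For anyone picturing how scoring might plug into our pipeline, here's a minimal sketch of the reproduction check described above. It assumes the harness has already built a sanitizer-instrumented, OSS-Fuzz-style fuzz target from the unpatched snapshot; the function name and the AddressSanitizer-banner check are assumptions for illustration, not CyberGym's actual scoring code.

```python
import subprocess
from pathlib import Path

def reproduces_vulnerability(fuzz_target: Path, poc_input: Path, timeout_s: int = 60) -> bool:
    """Return True if the agent-produced PoC input crashes the instrumented target.

    Assumes a libFuzzer-style binary that takes the input file as argv[1] and
    reports crashes via AddressSanitizer on stderr (an assumption, not
    CyberGym's documented contract).
    """
    try:
        result = subprocess.run(
            [str(fuzz_target), str(poc_input)],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Hangs are treated as non-reproductions in this sketch.
        return False
    return result.returncode != 0 and b"AddressSanitizer" in result.stderr
```

A real check would also want to recognize UBSan/MSan reports and classify the crash type, but this is the general shape of the pass/fail signal.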