Frontier Model Red Teaming & Hard-Coding Evals Portfolio

License: MIT

Independent researcher creating the hardest publicly available, repository-grounded, multi-turn coding evaluations on the internet.

These tasks are designed to expose deep, systematic weaknesses in current frontier coding agents (Claude, GPT, Grok, Gemini, Llama, etc.) on:

  • Zero heap allocation (even under GraalVM native-image / Python tracemalloc / Rust; see the first sketch after this list)
  • Numerical drift in long chains (10⁶ – 10⁹ operations; see the second sketch after this list)
  • Correct automatic differentiation (vjp/jvp/custom primitives)
  • SIMD / AVX-512 / CUDA / Metal fusion without temporaries
  • Subtle mathematical correctness (FMA vs ADD drift, denormals, associativity grade projection)
  • Template metaprogramming / expression templates / consteval
  • Real upstream contribution quality (must pass CI of JOML, Eigen, Apache Commons Math, JAX, PyTorch, etc.)
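
To make the zero-allocation category concrete, here is a minimal sketch (not taken from any eval in this repository) of the kind of check such tasks rely on: `tracemalloc` snapshots taken around a preallocated hot loop, reporting whether the loop itself adds anything to the Python heap. The `hot_loop` kernel below is a hypothetical stand-in.

```python
# Hypothetical sketch: tracemalloc-based zero-allocation check around a hot loop.
# `hot_loop` is a stand-in kernel, not code from this repository.
import tracemalloc

def hot_loop(buf: bytearray, iters: int) -> None:
    # Mutates a preallocated buffer in place; no new objects survive an iteration.
    n = len(buf)
    for i in range(iters):
        buf[i % n] = (buf[i % n] + 1) & 0xFF

buf = bytearray(1024)        # all allocation happens up front
hot_loop(buf, 10)            # warm-up before measuring

tracemalloc.start()
before = tracemalloc.take_snapshot()
hot_loop(buf, 100_000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = sum(stat.size_diff for stat in after.compare_to(before, "lineno"))
print(f"net Python heap growth during hot loop: {growth} bytes")
```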
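
And for the long-chain drift category, a minimal sketch (again, not one of the evals) of how rounding error accumulates when the same value is summed naively ten million times, compared against a correctly-rounded sum of the same values:

```python
# Minimal sketch: naive float accumulation vs. a correctly-rounded sum.
import math

n = 10**7
xs = [0.1] * n

naive = 0.0
for x in xs:
    naive += x               # rounding error accumulates over the whole chain

compensated = math.fsum(xs)  # correctly-rounded sum of the same values

print(f"naive sum = {naive:.10f}")
print(f"fsum      = {compensated:.10f}")
print(f"drift     = {abs(naive - compensated):.3e}")
```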

📁 Repository Structure (Used Across All Evals)

Each evaluation folder strictly follows this format:

/eval-name/
├── requirements.md      # Technical constraints: hardware, compilers, flags, profilers,
│                        # memory/time caps, numeric tolerances, CI requirements
├── task.md              # Full multi-turn evaluation prompt
└── expected_result.md   # Ground-truth invariants, acceptance tests, proofs,
                         # performance ceilings, and red-team traps

This structure makes each eval:

  • Deterministic – Same inputs produce same outputs
  • Pipeline-ready – Can be automated in CI/CD systems
  • Reproducible – Clear requirements enable exact replication
  • Suitable for automated scoring – Works with internal lab eval harnesses (a minimal folder-check sketch follows this list)
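
As one illustration of the "pipeline-ready" and "automated scoring" points, the three required files can be verified mechanically before a run. The checker below is a hypothetical sketch, not part of this repository; only the file names come from the structure shown above.

```python
# Hypothetical sketch: sanity-check that each eval folder contains the three
# required files. Not part of this repository.
from pathlib import Path

REQUIRED = ("requirements.md", "task.md", "expected_result.md")

def check_eval_folder(folder: Path) -> list[str]:
    """Return a list of problems; an empty list means the folder is well-formed."""
    problems = []
    for name in REQUIRED:
        path = folder / name
        if not path.is_file():
            problems.append(f"{folder.name}: missing {name}")
        elif path.stat().st_size == 0:
            problems.append(f"{folder.name}: {name} is empty")
    return problems

if __name__ == "__main__":
    for folder in sorted(p for p in Path(".").iterdir() if p.is_dir()):
        for problem in check_eval_folder(folder):
            print(problem)
```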

📌 Evaluation Depth

Every evaluation is 8–22 turns and forces models to:

  • iteratively debug
  • derive correct algorithms
  • optimize under strict constraints
  • verify proofs or numerical stability
  • output merge-ready, CI-passing code

These are not “toy tasks.”
They’re designed to fail any model relying on shallow heuristics or pattern-matching.
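
A hypothetical sketch of how such a multi-turn evaluation can be driven by a harness: each turn applies the model's output and re-runs the acceptance tests derived from expected_result.md, feeding failures back until they pass or the turn budget runs out. None of the function names below come from this repository, and the model call and patch plumbing are deliberately left as stubs.

```python
# Hypothetical multi-turn grading loop; the model call and patch application
# are stubs to be replaced by a real harness.
import subprocess

def ask_model(prompt: str) -> str:
    # Stand-in for a real model client.
    raise NotImplementedError

def apply_patch(patch: str) -> None:
    # Stand-in: write the model's proposed changes into the working tree.
    raise NotImplementedError

def run_acceptance_tests() -> tuple[bool, str]:
    # Assumption: the eval's acceptance tests are runnable via pytest.
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def grade(task_prompt: str, max_turns: int = 22) -> bool:
    feedback = ""
    for turn in range(1, max_turns + 1):
        apply_patch(ask_model(task_prompt + feedback))
        passed, log = run_acceptance_tests()
        if passed:
            return True                      # merge-ready within the turn budget
        feedback = f"\n\nTurn {turn} failed:\n{log}"
    return False
```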


🔍 Seeking Contract / Bounty Work

Actively looking for paid work with AI labs to:

  • Build ultra-hard custom evals
  • Red-team internal coding agents
  • Design safety-relevant evaluations (cyber, finance, avionics, robotics, bio-risk)

📬 Contact

Last updated: November 18, 2025
