AI Benchmarks...
We built an agent that helped us hack eight benchmarks. We achieved near-perfect scores on all of them without solving a single task.