Show HN: New eval from SWE-bench team evaluates LMs based on goals, not tickets (codeclash.ai)
4 points by lieret 1 day ago | 1 comment
Current evals test LMs on tasks: "fix this bug," "write a test."

But we code to achieve goals: maximize revenue, cut costs, win users.

Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals.

Because real software dev isn’t about following instructions. It’s about achieving outcomes.

Here's how it works:

Two LMs enter a tournament. Each maintains its own codebase.

Every round:

1. Edit phase: LMs modify their codebases however they like.
2. Competition phase: Codebases battle in an arena.
3. Repeat.

The LM that wins the majority of rounds is declared the winner.
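
For readers who think in code, the loop above can be sketched roughly as follows. This is only a toy illustration, not CodeClash's actual implementation: `run_tournament`, the editor callbacks, `toy_arena`, and the idea of a codebase as a dict of file contents are all hypothetical names and simplifications. It also assumes each LM sees past round results, and that a tournament with no majority winner is reported as undecided.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical representation: a codebase is a mapping of file path -> contents.
Codebase = Dict[str, str]
# An "editor" stands in for an LM: it gets its codebase plus past round results.
Editor = Callable[[Codebase, List[Optional[str]]], Codebase]
# An "arena" pits two codebases against each other; returns "A", "B", or None (draw).
Arena = Callable[[Codebase, Codebase], Optional[str]]


def run_tournament(edit_a: Editor, edit_b: Editor, arena: Arena, rounds: int) -> Optional[str]:
    """Run a multi-round tournament; return the majority winner, or None."""
    codebases: Dict[str, Codebase] = {"A": {}, "B": {}}
    editors = {"A": edit_a, "B": edit_b}
    wins: Counter = Counter()
    history: List[Optional[str]] = []

    for _ in range(rounds):
        # Edit phase: each player modifies its own codebase however it likes.
        for name in ("A", "B"):
            codebases[name] = editors[name](dict(codebases[name]), list(history))
        # Competition phase: the two codebases battle in the arena.
        winner = arena(codebases["A"], codebases["B"])
        history.append(winner)
        if winner is not None:
            wins[winner] += 1

    # Assumption: "majority" means strictly more than half of all rounds.
    for name in ("A", "B"):
        if wins[name] > rounds / 2:
            return name
    return None


# Toy players and arena, purely for demonstration.
def improving_editor(codebase: Codebase, history: List[Optional[str]]) -> Codebase:
    strength = int(codebase.get("strength.txt", "0"))
    codebase["strength.txt"] = str(strength + 1)  # gets a bit better every round
    return codebase


def idle_editor(codebase: Codebase, history: List[Optional[str]]) -> Codebase:
    return codebase  # never changes anything


def toy_arena(a: Codebase, b: Codebase) -> Optional[str]:
    sa = int(a.get("strength.txt", "0"))
    sb = int(b.get("strength.txt", "0"))
    if sa == sb:
        return None
    return "A" if sa > sb else "B"


if __name__ == "__main__":
    print(run_tournament(improving_editor, idle_editor, toy_arena, rounds=5))
```

The key point the sketch tries to capture is that codebases persist across rounds, so each edit phase builds on whatever the player left behind in the previous one.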

Arenas can be anything: games, trading sims, cybersec envs. We currently have 6 arenas implemented and support 8 different programming languages.

This has been one of our biggest projects in terms of scale to date. Over the past few months, we've completed 1.5k tournaments, totalling more than 50,400 agent runs. And you can look at all of these runs right now from your browser (links below!)

You can find the rankings on our website (spoiler: Sonnet 4.5 tops the list), but what's almost more interesting: humans are still way ahead! In one of our arenas, even the worst solution from the human leaderboard is miles ahead of the best LM!

And we're not surprised: LMs consistently fail to properly adapt to outcomes, hallucinate about reasons for failure, and produce ever messier codebases with every round.

More information:

https://codeclash.ai/
https://arxiv.org/pdf/2511.00839
https://github.com/codeclash-ai/codeclash





Is competition + limited resources (e.g. Core War) = selection pressure (natural or otherwise)?

Can we integrate and bring back reinforcement learning in a framework like this?



