This is a very impressive result. OpenAI was able to achieve 72% with o3, but th...

skp1995 · 2025-01-08T22:26:39 1736375199

Hey! One of the creators of Aide here.

ngl the total expenditure was around $10k, in terms of test-time compute we ran upto 20X agents on the same problem to first understand if the bitter lesson paradigm of "scale is the answer" really holds true.

The final submission which we did ran 5X agents and the decider was based on mean average score of the rewards, per problem the cost was around $20

We are going to push this scaling paradigm a bit more, my honest gut feeling is that swe-bench as a benchmark is prime for saturation real soon

1. These problem statements are in the training data for the LLMs

2. Brute-forcing the answer the way we are doing works and we just proved it, so someone is going to take a better stab at it real soon

amrrs · 2025-01-08T22:17:33 1736374653

tbh there has been some issue with their previous reporting

https://x.com/Alex_Cuadron/status/1876017241042587964