Hacker News | slewis's comments

It would be really useful to see these evaluated on some of the same evals that the original R1 and DeepSeek's distills were evaluated on.


Hey, that's me!

Happy to answer any questions about how this works if folks are interested.


I've spent tons of time evaluating o1-preview on SWEBench-Verified.

For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.

For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.
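To make that TypeError/ValueError case concrete, here is a minimal sketch. It is a made-up illustration, not the actual SWEBench instance: `parse_port_minimal`, `parse_port_golden`, and `golden_test` are hypothetical names, and the real fix and test look different.

```python
def parse_port_minimal(value):
    """Hypothetical fix that handles only what the issue reported (TypeError)."""
    try:
        return int(value)
    except TypeError:
        return None


def parse_port_golden(value):
    """Hypothetical fix in the style of the merged one: also handles ValueError."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def golden_test(parse):
    """Stand-in for the golden test: it checks both error paths."""
    try:
        # int(None) raises TypeError, int("abc") raises ValueError.
        return parse(None) is None and parse("abc") is None
    except ValueError:
        # The fix let the ValueError escape, so the test fails.
        return False
```

A model that fixes exactly what the issue describes lands on something like `parse_port_minimal`, which fails `golden_test`, while `parse_port_golden` passes.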


So what percentage would you say falls to simple inability versus the other two factors you've mentioned?


Is it stateful? Like can I do a run, read the results, and then do another run from that point?


We currently don't save the browser state after the run has completed but that's something we can definitely add as a feature. Could you elaborate on your use case? In which scenarios would it be better to split a run into multiple steps?


Almost any process that involves the word "workflow" (my mental model is one where the user would press alt-tab to look up something else in another window). The very, very common case would be one where they have a stupid SMS-based or "click email link" login flow: one would not wish to do that a ton, versus just leaving the session authenticated for reuse later in the day.

Also, if my mental model is correct, the more browsing and mouse-movement telemetry those Cloudflare/Akamai/etc. gizmos encounter, the more likely they are to think the browser is for real, whereas a "fresh" one is almost certainly a red alert. Not a panacea, for sure, but I'd guess every little bit helps.


The way we plan to handle authenticated sessions is through a secret management service with the ability to ping an endpoint to check if the session is still valid, and if not, run a separate automation that re-authenticates and updates the secret manager with the new token. In that case, it wouldn't need to be stateful, but I can certainly see a case for statefulness being useful as workflows get even more complex.
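A rough sketch of that check-then-refresh flow, under stated assumptions: `InMemorySecretStore` is a hypothetical stand-in for the secret management service, and `is_valid` and `reauthenticate` are placeholder callables for the validity-check endpoint and the separate re-authentication automation. None of these names come from a real product or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class InMemorySecretStore:
    """Hypothetical stand-in for a secret management service."""
    _secrets: Dict[str, str] = field(default_factory=dict)

    def get(self, name: str) -> Optional[str]:
        return self._secrets.get(name)

    def put(self, name: str, value: str) -> None:
        self._secrets[name] = value


def get_valid_token(
    store: InMemorySecretStore,
    name: str,
    is_valid: Callable[[str], bool],
    reauthenticate: Callable[[], str],
) -> str:
    """Return a usable session token, refreshing it only when needed."""
    token = store.get(name)
    # Reuse the stored token if the validity check (the "ping") accepts it.
    if token is not None and is_valid(token):
        return token
    # Otherwise run the re-authentication automation and store the new token.
    new_token = reauthenticate()
    store.put(name, new_token)
    return new_token
```

Each run would call `get_valid_token` before starting, so the expensive SMS or email login only happens when the stored session has actually expired.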

As for device telemetry, my experience has been that most companies don't rely too much on it. Any heuristic used to identify bots is likely to have a high false positive rate and catch many legitimate users, who then complain about it. Captchas are much more common and effective, though if you've seen some of the newer puzzles that vendors like Arkose Labs offer, it's a tossup whether a person of median intelligence can even solve them.


This 100% matches my experience.

I like to jokingly call founder mode: "fine-grained multi-level oversight". Others might call it the derogatory "micromanagement".

That doesn't mean I control every decision, or that I don't give people space to be creative. What it means is: for whatever is most important for the business, I get involved with the details. The goal is that when I move out of that area, the team I worked with is able to operate closer to founder mode than when I started.

The issue is that vision fundamentally can't be communicated by telephone, or all at once. You're trying to get to a point on the map that most people can't see. The path to it is the integration of all of the tiny decisions everyone makes along the way.

If you only course correct from the highest level you'll never get there.


Creator describing how this works: https://youtu.be/PHQweR1z7pI?si=BpRlWxtxnRmYaeKG


Overplay one’s hand: spoil one's chance of success through excessive confidence in one's position


The memorization use case is brilliant. Put your talk track for a presentation in and say “help me memorize this by quizzing me”. Thanks!


I call this "keep the fingers moving".


Clicked the YouTube link. An hour and 20 minutes later, here I am. Thanks for sharing it, that was awesome.

