I've spent tons of time evaluating o1-preview on SWEBench-Verified.
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall into this category, but I suspect it's more than on a simpler benchmark like MATH.
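To make that concrete, here's a minimal sketch of the shape of that golden patch. This is an invented example, not the actual instance; `parse_limit` and `ConfigError` are hypothetical names:

```python
class ConfigError(Exception):
    """Raised when user-supplied configuration is invalid."""

def parse_limit(raw):
    """Coerce a user-supplied limit to an int."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # The bug report only mentioned TypeError (e.g. int(None)),
        # but the golden test also expects ValueError (e.g. int("abc"))
        # to be handled, so catching just TypeError fails the eval.
        raise ConfigError(f"invalid limit: {raw!r}")
```

A model that patches only the reported TypeError writes a perfectly reasonable fix and still fails the test.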
We currently don't save the browser state after the run has completed but that's something we can definitely add as a feature. Could you elaborate on your use case? In which scenarios would it be better to split a run into multiple steps?
Almost any process that involves the word "workflow" (my mental model is one where the user would press alt-tab to look up something else in another window). The very, very common case would be one where they have a stupid SMS-based or "click email link" login flow: one would not wish to go through that repeatedly, versus just leaving the session authenticated for reuse later in the day.
Also, if my mental model is correct, the more browsing and mouse-movement telemetry those Cloudflare/Akamai/etc. gizmos encounter, the more likely they are to think the browser is for real, whereas encountering a "fresh" one is almost certainly a red flag. Not a panacea, for sure, but I'd guess every little bit helps.
The way we plan to handle authenticated sessions is through a secret management service with the ability to ping an endpoint to check if the session is still valid, and if not, run a separate automation that re-authenticates and updates the secret manager with the new token. In that case, it wouldn't need to be stateful, but I can certainly see a case for statefulness being useful as workflows get even more complex.
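Roughly, I'm imagining something like the sketch below. All of the names here (`secrets_client`, `SESSION_KEY`, the endpoint URL, `run_login_automation`) are hypothetical stand-ins for illustration, not a real API:

```python
import requests

SESSION_KEY = "vendor_portal/session_token"  # hypothetical secret path

def run_login_automation() -> str:
    """Placeholder for the separate re-auth automation
    (e.g. a scripted browser login) that returns a fresh token."""
    raise NotImplementedError

def get_valid_token(secrets_client) -> str:
    """Fetch the stored session token, re-authenticating if it's stale."""
    token = secrets_client.get(SESSION_KEY)
    # Ping a cheap authenticated endpoint to check the session is still live.
    resp = requests.get(
        "https://portal.example.com/api/me",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    if resp.ok:
        return token
    # Session expired: re-authenticate and write the new token back
    # to the secret manager so later runs pick it up.
    token = run_login_automation()
    secrets_client.put(SESSION_KEY, token)
    return token
```

The nice property of this design is that each run stays stateless: the only shared state lives in the secret manager, and any run that finds a dead session repairs it for everyone else.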
As for device telemetry, my experience has been that most companies don't rely too heavily on it. Any heuristic used to identify bots is likely to have a high false positive rate and flag many legitimate users, who then complain about it. Captchas are much more common and effective, though if you've seen some of the newer puzzles that vendors like Arkose Labs offer, it's a tossup whether the median human can even solve them.
I like to jokingly call founder mode "fine-grained, multi-level oversight". Others might call it by the more derogatory name: "micromanagement".
That doesn't mean I control every decision, or that I don't give people space to be creative. What it means is: for whatever is most important for the business, I get involved with the details. The goal is that when I move out of that area, the team I worked with is able to operate closer to founder mode than when I started.
The issue is that vision fundamentally can't be communicated by a game of telephone, or all at once. You're trying to get to a point on the map that most people can't see. The path to it is the integration of all of the tiny decisions everyone makes along the way.
If you only course-correct from the highest level, you'll never get there.