We last submitted a SWE-bench Verified result in November 2024 - at the time I believe we were in the top 5 entrants.
We expect Engine to be as good as the other code-writing agents out there at the moment - we understand almost everyone in the space to be using very similar base models and agent scaffolding.
I know of https://modal.com/, which I believe is used by Codegen and Cognition.
Anecdotally, I hear that many companies in the LLM agent space roll their own sandbox solutions - I've heard of both Firecracker- and Kubernetes-based implementations.
I use this for work, but I keep running into edge cases all over the place (e.g. Yarn being installed on GitHub-hosted runners, but not on self-hosted ones or act - https://github.com/actions/setup-node/issues/182).
Same experience here. Edge cases everywhere, though most can be worked around.
You can specify different runners to use. The default images are a compromise to keep size down. There is a very large image that tries to include everything you might want. I would suggest trying that if you don’t mind the very large (15GB IIRC) image.
I definitely remember considering the larger images - I think we ended up not using them since my work's use case for act is running users' GitHub workflows on demand on temporary VMs. The hope was that most usage is covered by the smaller images - and in fairness that has been true so far.
I had a problem recently trying to send LLM-generated text between two web servers under my control, from AWS to Render - I was getting 403s for command injection from Render's Cloudflare protection, which is opaque to users and can't be configured by them.
The hacky workaround, which has been working reliably for a while now, was to encode the offending request body and decode it on the destination server.
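A minimal sketch of that kind of workaround, assuming a JSON body and URL-safe Base64 as the encoding (the comment above doesn't say which encoding was actually used, and the function names here are hypothetical):

```python
import base64
import json


def encode_body(payload: dict) -> str:
    """Sender side: serialise the payload and wrap it in URL-safe Base64.

    The result only contains A-Z, a-z, 0-9, '-', '_' and '=', so shell-like
    fragments in the original text are no longer visible in the request body.
    """
    raw = json.dumps(payload).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_body(encoded: str) -> dict:
    """Receiver side: reverse the encoding to recover the original payload."""
    raw = base64.urlsafe_b64decode(encoded.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Example LLM output that looks like a shell command to a pattern-based filter.
    message = {"text": "Run `cat /etc/passwd; rm -rf /tmp/build` to clean up."}
    wire = encode_body(message)
    assert decode_body(wire) == message
```

This only hides the text from pattern-matching filters in transit; the destination server still has to treat the decoded content as untrusted input.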
I've been a customer of A&A for a few years - I had the same hesitation at first but thought I'd try them anyway. The only time I've come close to the 1TB monthly usage quota is this month, entirely because (for work) I had to download a very large number of Docker images, on top of my normal usage of 500-750GB. I think it helps that some unused portion of the usage quota gets rolled over too. Out of curiosity, what are you doing that would make you regularly exceed 1TB?
I don't think I'm doing anything crazy: the odd game from Steam, some ML models from Hugging Face, running a Plex server for myself and friends/family. I checked my usage and I exceed 1TB every month, so that's why I was put off.
I've been lucky that a new ISP came to my area and I can avoid Openreach infrastructure completely. They've put an FTTH line directly into my house and offer up to 8Gbps symmetrical.
This YouTube video by No Boilerplate (https://www.youtube.com/watch?v=XcF6tvepRlg) was a great summary and really got my interest up initially, but I personally got bogged down quite quickly by the prospect of having to learn Morse code.
Turso's SQLite fork libSQL[1] has an extension/improvement that adds the ability to alter columns and drop constraints (albeit not via a DROP CONSTRAINT clause). I'm not affiliated, but have been using libSQL recently and am finding it to be a very pleasant experience.
Although conceptually I agree that SQLite's limited type system is frustrating, if your use case allows, an ORM might help with not having to think about it or touch it directly.
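To make that concrete, here is a rough sketch using only Python's standard library. `Item` and `save_item` are hypothetical names standing in for what an ORM model would give you; this is not any particular ORM's API. The first function shows what a plain (non-STRICT) table will accept, and the typed layer shows how validation can happen before the value ever reaches SQLite:

```python
import sqlite3
from dataclasses import dataclass


def raw_insert_demo() -> list:
    """Insert two strings into an INTEGER column and report what was stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (qty INTEGER)")
    # Accepted as-is: the column type is only an affinity, not a constraint.
    conn.execute("INSERT INTO items (qty) VALUES (?)", ("not a number",))
    # Text that looks like an integer is converted to one on insert.
    conn.execute("INSERT INTO items (qty) VALUES (?)", ("42",))
    rows = conn.execute(
        "SELECT qty, typeof(qty) FROM items ORDER BY rowid"
    ).fetchall()
    conn.close()
    return rows


@dataclass
class Item:
    """Hypothetical typed record, playing the role of an ORM model."""

    qty: int

    def __post_init__(self) -> None:
        # Reject anything that is not a real int (bool is a subclass of int).
        if not isinstance(self.qty, int) or isinstance(self.qty, bool):
            raise TypeError(f"qty must be an int, got {type(self.qty).__name__}")


def save_item(conn: sqlite3.Connection, item: Item) -> None:
    """Persist an already-validated Item; callers never build the SQL themselves."""
    conn.execute("INSERT INTO items (qty) VALUES (?)", (item.qty,))


if __name__ == "__main__":
    print(raw_insert_demo())
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (qty INTEGER)")
    save_item(conn, Item(qty=3))
    conn.close()
```

With the typed layer in front, a bad value fails in application code with a clear error instead of being silently stored as text.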