Yeah, I also stopped reading at that point. If I want a bunch of random, made up facts to sell lukewarm opinions or steer the uneducated masses, I'll tune in on a Trump press conference. Why does this feel like someone is desperately trying to make reality mirror his flailing market bets?
Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover longer-horizon tasks that need more than writing a bit of code in one file, and SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Plain SWE-bench and LCB are not really useful anymore because they are being gamed so hard that developers can quote high numbers in a repo README or press release.
Eh. If you get into enterprise business, this is the accepted management style. AI will now mix this up a little, but before, you basically needed to ask whether you want to blow 300k on developer salaries to maybe fix something that is already working and generating money, or add more features to a roadmap you can pin on your chest. Scaling infrastructure is the best choice for 90% of managers, especially since they are not the ones paying for it, and this kind of technical debt doesn't matter on typical bonus-check timeframes.
I used to work for AWS on a service team. I noticed we were spending way too much on provisioned capacity for Dynamo and would benefit from on-demand mode. After proving it worked, making the change, and deploying, I was rather pleased with myself. "Saved $2M in costs by switching to on-demand provisioning" barely made it onto my performance review lol.
They might just not have believed it. At the management level everyone is busy claiming to be delivering huge numbers all the time, and people stop trusting that sort of claim.
- the possibility of only partial success creating an even messier situation than the existing one
Having a way to do the whole thing on a much smaller timescale and budget lets decision makers focus more on those externalities, and can also simplify them. This kind of bit rot lurks somewhere (often everywhere) in many fast-moving businesses, as a natural consequence of the value tradeoffs we have had up to now. Now there are machines that can speedrun the grunt work of clearing it.
Wait a second. They predicted (before it even entered atmosphere) where it was coming down with such a precision that you could not just go out and photograph it, but even go and collect remains? I thought this was barely possible if you have a radar that is actively tracking it through the last stages of the atmosphere, while for anything still in orbit you'd be lucky to guess the correct country.
Things in Earth orbit come in at a very shallow entry angle. I'd expect that with a steeper angle, the atmospheric deflection would be smaller (if the object survives the hit).
I still have never seen any prediction like that which was made before the thing actually entered the atmosphere. You can see how some known remains sites were determined by clicking on them in this map: https://www.strewnify.com/map/
They predicted the angle and where you could view it from. Finding the fragments was much more difficult because it scattered into tiny pieces all over. The students searched for something like two weeks to find some pieces the size of a thumb.
Average human threat perception simply isn't useful here. People will make wild assumptions about what kind of catastrophic thing could happen in aviation and then happily get into their car to drive somewhere without a thought in the world. In fact, no one thought about designing gasoline fuel tanks safely before we had cars, and not really even then until people started burning. If we're already thinking about transporting antimatter safely today, this kind of technology will probably have an even better track record than planes.
Most software is already available on Linux. I've successfully run Linux in corporate jobs where everything runs on the MS/AD/Azure stack. The issue is not that you can't do it, the issue is that you have to spend extra work at every corner to get things running, because unlike Windows, Linux doesn't hold your hand and hide all the nasty bits from you while it juggles a million cases in the background. Windows is really great at that, until it breaks. Then you're usually screwed. If the problem is close to the kernel, you often can't fix it yourself even in theory; the best you can do is wait for an official MS patch. On Linux things break more often, but you can usually fix them without resorting to extreme measures. It's a fundamentally different usage philosophy that plays very hard into the strengths of techies. So non-technical users will always shy away from Linux.
> the issue is not that you can't do it, the issue is that you have to spend extra work at every corner to get things running, because unlike Windows, Linux doesn't hold your hand and hide all the nasty bits from you while it juggles a million cases in the background.
You may have to spend extra work to get things running; but once it's done, it runs forever without a hitch.
I know, I use Slackware. It's regarded as a very technical distribution and some manual configuration is expected but once it's done, it's done. I have configs from > 20 years ago that I still use without a hiccup.
>but once it's done, it runs forever without a hitch.
Yeah... no. If you're dealing with changing systems, you'll need continued support from maintainers. And there's a lot of stuff out there in the business world that is commonly used and breaks all the time. Stuff will break; if it doesn't, it's not getting updated, and in that case I'd be more worried about security than compatibility.
Yeah... yes. There are systems which are continuously maintained but don't break all the time. Yes, stuff will break but this is way less common in Linux.
Claws-mail has all my email for over 15 years. My inbox is several gigabytes in size, which claws handles flawlessly. And the software is continuously maintained. I'm using version 4.4.0 now, which was released 16 days ago on March 9.
Turns out email clients are quite simple (mostly because the protocol is ancient) and also something everyone in every company uses. But many OSS clients still die eventually. And once you get into the actual business-application world, you're in for a world of pain on Linux, especially if you go near AD/Azure/Entra. Heck, the fact that there isn't even a stable name for this mess of a software suite tells you enough. And yet every big company relies on it.
I don't know what these nasty bits are that Windows is supposedly hiding, or what exactly breaks more often on Linux. For me it's the exact opposite: my Linux just never breaks. I don't do anything special, I just plug the hdd into a new box bought when the old one gets too slow for new tasks, and continue as if nothing happened.
Uptimes of half a year are not uncommon, the record so far is 400+ days. I just don't shut it down unless there's a serious kernel or hardware upgrade.
It just works: non-kernel updates, stuff being plugged/unplugged; a couple of times I even swapped SATA hdds without turning off the power (which is simple, they are hotplug by design; just don't drop the screws onto the motherboard and don't forget to unmount+detach first).
Now, back when I used to build and test some cross-builds for Windows (the win7-win10 era), I had another dedicated Windows machine for that. And even though I tried to make it as stable as possible, it was a brittle piece of junk in comparison.
So in my experience, yes, Linux is a fundamentally different usage philosophy: you don't need to think about what crap Microsoft will break your workflow with next Tuesday.
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.
> It's as if someone asked you how many 1s there are in the binary representation of this text.
I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.
Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
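For anyone who wants to try it themselves, the actual check is a one-liner in Python (assuming ASCII/UTF-8 encoding; other encodings give different totals):

```python
# Count the set bits in the UTF-8 encoding of a string.
def count_set_bits(text: str) -> int:
    return sum(bin(byte).count("1") for byte in text.encode("utf-8"))

# 'a'=0b1100001 (3 ones), 'b'=0b1100010 (3), 'c'=0b1100011 (4)
print(count_set_bits("abc"))  # 10
```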
>I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964
And that's exactly the kind of reasoning an LLM does when you ask it about characters in a word. It doesn't come from the word, it comes from other heuristics it picked up during training.
Okay but, genuinely not an expert on the latest with LLMs, but isn’t tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren’t we no longer talking about LLMs?
It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.
Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
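To make the tradeoff concrete: with character-level tokens the model sees every letter, so the count is directly available; a BPE-style tokenizer instead hands it opaque chunks. The BPE split below is purely illustrative, not taken from any real tokenizer:

```python
word = "strawberry"

# Character-level tokenization: every letter is its own token,
# so counting 'r' tokens is trivial.
char_tokens = list(word)
print(char_tokens.count("r"))  # 3

# A BPE-style tokenizer might instead produce chunks like these
# (hypothetical split). No single token is 'r', so the letter
# count has to be inferred from training data, not read off.
bpe_tokens = ["str", "aw", "berry"]
print(any(t == "r" for t in bpe_tokens))  # False
```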
Nothing specific yet, but the legal groundwork has been laid both in the US and in the EU. Starting in July, all new cars sold in the EU will need to be able to fit after-market alcohol interlocks. In the US, interlocks are already mandatory for convicted DUIers in most states, but new cars will also have to come with factory installed drunk driving prevention technology in the coming years. We just don't know how far that mandate will go eventually.
obviously it will require an age verification, also you need to tell Google that you want to go somewhere 24 hours in advance, and Apple gets 30% of the revenue that gas stations make.
Manufacturers are now encrypting CAN bus traffic, voluntarily, on current and future models.
Buying or selling tools designed to break the law is already illegal, trivial or not. If a driver gets a DUI and possesses a NOOP interlock, they're getting an additional charge, plus they get to help an investigation into the illicit device supply chain.
> Buying or selling tools designed to break the law is already illegal - trivial or not.
I'm curious how this will play out. The "John Deere" exemption from the DMCA comes to mind; not sure if it's strictly for farm equipment, or whether it's still in effect.
Sleeping consolidates your memories, moving them from short-term storage in your hippocampus into long-term storage in your neocortex. If you were an LLM, sleeping would basically move the contents of your adaptive system/memory prompt into the underlying model weights. It's weird that no one has really done that yet, but I can understand why the big AI chat corpos don't: you'd have to store a new model with new weights for each user if you don't want to risk private info spilling to others. If you have a billion users, you simply can't do that (at least not without charging obscene amounts of money that would prevent you from having a billion users in the first place). Current LLM architectures that start with a clean slate for every conversation are really good for serving billions of people via cloud GPUs, because they can all run the exact same model and get all their customization purely from the input. So if we ever get this, it'll probably be for smaller, local, open models.
On a much simpler level, LLM frameworks could re-summarize their context to keep relevant, use-case-specific facts, and clean up and organize long- and short-term memory on some local storage, etc. So, kind of like sleep. I think these are low-hanging-fruit ways to improve the perceived intelligence of LLM systems (so they're probably already used somewhere).
We've already had that for a while. It works to some degree, but context tokens simply don't offer the level of compression that model weights do, at least with current approaches that keep the context human-readable.
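The re-summarization idea itself is simple enough to sketch. In this toy version, `summarize` is a stand-in for a real LLM call (here a crude first-sentence heuristic), and everything is hypothetical illustration rather than any framework's actual API:

```python
# Toy "sleep" step: compress a growing conversation context into
# a short long-term memory plus the most recent entries. In a real
# system, summarize() would be an LLM call.
def summarize(entry: str) -> str:
    return entry.split(". ")[0]  # crude stand-in: keep first sentence

def consolidate(context: list[str], max_entries: int = 2) -> list[str]:
    if len(context) <= max_entries:
        return context
    # Compress everything except the most recent entries.
    old, recent = context[:-max_entries], context[-max_entries:]
    memory = "; ".join(summarize(e) for e in old)
    return [memory] + recent

ctx = [
    "User prefers Python. They dislike Java.",
    "Project uses PostgreSQL. Version 16.",
    "User asked about indexing.",
]
print(consolidate(ctx))
# ['User prefers Python', 'Project uses PostgreSQL. Version 16.', 'User asked about indexing.']
```

The point of the toy is visible even at this scale: the compressed memory is lossy, which is exactly why it can't match the compression that baked-in weights give you.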
Same way you distill any model. Training-data efficiency matters only while you train the source model/ensemble. Once you have that, you are purely compute-bound during distillation.