I agree that de facto the biggest security flaw in Linux is "okay I'm tired of getting interrupted all day assisting you, I know you're competent, I'll put you on the sudoers list."
But there are a lot of academic and research institutions that actually do have good Linux user management. I worked at a pediatric hospital, and the RHEL HPC admins did not mess around in terms of who was allowed to access which patients' data. As someone who was not an admin, it was a huge pain, and it should have been. So this bug has pretty serious implications: it seems like anyone at that hospital could abscond with a lot of deidentified data. [research HPC not as sensitive as the clinical stuff, which I think was all Windows Server]
I think we've concluded already that user isolation is not safe and shouldn't be trusted; that's why we've invested so hard in namespacing (containers). Users should only have what they need if you really care about security and don't want to tolerate the overhead of virtualization-based security.
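To make "namespacing (containers)" concrete, here's a minimal sketch of the underlying primitive - my illustration, not any particular container runtime, and real runtimes also set up uid_map, pivot_root, cgroups, seccomp, and so on:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Enter fresh user + mount namespaces. The shell that follows gets
     * its own view of uids and mounts; "root" inside maps to an
     * unprivileged uid outside, so file permissions alone can't be
     * used to escape. Error handling and uid_map setup omitted. */
    int main(void) {
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0) {
            perror("unshare");
            return 1;
        }
        execlp("sh", "sh", (char *)NULL);
        perror("execlp");
        return 1;
    }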
FWIW I think "LLMs are semideterministic" is something of a red herring. The real difference between LLM codegen and compilers is that compilers output logically the same assembly regardless of the variable names. If you're numerically solving a differential equation the compiler does not care if the floats represent heat through a pipe or dollars through a brokerage. Compilers don't care about semantic meaning, that concern is totally separated.
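To illustrate with a toy example (mine, not from the thread): these two functions differ only in identifier names, and an optimizing compiler emits identical machine code for both, because the names are erased long before code generation:

    /* gcc -O2 produces the same instructions for both functions
     * (modulo the symbol name): identifiers never reach codegen. */
    double heat_flux_through_pipe(double conductivity, double gradient) {
        return conductivity * gradient;
    }

    double dollars_through_brokerage(double rate, double volume) {
        return rate * volume;
    }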
But even if it's putatively implementing the same algorithm, LLMs certainly do not output basically the same finance Python as they would mechanical engineering Python. The style will be a little different. Sometimes the performance/clarity tradeoffs will be different. Sometimes it'll be fairly fancy and object-oriented, other times it'll be more low-level "objects are just dicts."
It's way more than a higher abstraction layer: LLM codegen involves a nontechnical tangling of concerns that doesn't exist with even the hoitiest-toitiest proof-checking compilers. It's a complete sea change. I find it incredibly disconcerting... for the same reason, by the way, that assembly programmers found Fortran and C disconcerting, and continued to reliably find employment for a good 40 years after higher-level languages were invented :) Actually even today. The assembly programmers who got hosed by C tended to be electricians who learned on the job - it's kind of cool to read old manuals from the 70s, carefully (and correctly!) explaining to electricians that a computer program is essentially an ephemeral circuit.
But I think there are specific skills around scientific thinking (learned at a formal college) and engineering carefulness (learned via hard knocks) that aren't going anywhere.
Surely the biggest difference is that you guys are mostly testing LLMs on simpler utilities, largely involving higher-level languages, whereas ProgramBench's targets are all very complex C programs (and much older programs with much more comprehensive test cases).
Eg cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact the only program you tested which actually has anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed.
I think your results are consistent. You're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length - yes, the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.
I don't think so. The ProgramBench authors say no LLM fully resolves any task, i.e. even the easiest tasks in their benchmark are unsolved. Whereas we found Opus 4.6 successfully reimplements almost every program up to gotree’s size (around 15-20 of them).
For Pkl, the preliminary results only went up to 1bn total tokens (costing $550, which would be cheap if LLMs could do the task). It might very well be solved at higher token budgets; see the report for more discussion of this.
The preliminary results are just on 4 targets. We have several Pkl-level and harder tasks in the full set which we're releasing soon.
Multiple things in the following quote are not quite right:
> mostly involving higher-level languages, whereas ProgramBench are all very complex C programs (and much older programs with much more comprehensive test cases).
First, as I said above I think you're confusing the top-end of ProgramBench difficulty with the average. The quote in the OP is pretty clear that FFmpeg, SQLite, and PHP are the 3 hardest out of 200 in ProgramBench, and the bottom end is "compact CLI tools".
Second, I don't see the relevance of C vs higher-level languages: how does this make ProgramBench harder?
Third, for the test cases, I think you might be labouring under a misapprehension about how MirrorCode works? MirrorCode uses end-to-end tests from a variety of sources (the original program’s test suites, real-world data, and LLM-assisted generation). End-to-end means the stdout/stderr has to match exactly for each test case.
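For concreteness, here is the shape of such a check - a hypothetical sketch of mine, not MirrorCode's actual harness (a real one would also capture stderr and the exit status):

    #include <stdio.h>
    #include <string.h>

    /* Run `cmd` and require its stdout to match `expected` byte-for-byte. */
    static int passes(const char *cmd, const char *expected) {
        char buf[65536];
        FILE *p = popen(cmd, "r");
        if (!p) return 0;
        size_t n = fread(buf, 1, sizeof buf - 1, p);
        pclose(p);
        buf[n] = '\0';
        return strcmp(buf, expected) == 0;
    }

    int main(void) {
        /* A real cal test case would pin the exact calendar text;
         * any whitespace difference is a failure. */
        printf("%s\n", passes("echo hello", "hello\n") ? "PASS" : "FAIL");
        return 0;
    }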
> Eg cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal.
This is incidental to the main disagreement, but btw I also doubt this.
Let's try to make the claim more precise. E.g. are you saying the average university undergraduate studying CS would reimplement cal from scratch (only stdlib), matching the output perfectly for all 1365 MirrorCode test cases, in (say) 3 days of full-time work (without AI assistance, obviously)? I'd bet against it!
I didn't say "3 days of full-time work," that is totally unreasonable. I was giving them basically unlimited time to do whatever slow testing and research they needed. And let me qualify my statement: when I say "I would expect most sophomores to be able to do this," I mean "if most sophomores can't do this then their university is badly failing them." (If you want to split hairs about modern undergrads not learning C then I think this conversation is over.)
Of course it would take them a while to learn facts about datetime that the LLM doesn't need to learn. If your argument is about cost optimization then congrats, you win. The point is that it doesn't take a huge amount of C expertise to do this successfully - the standard implementation is nothing you wouldn't see in K&R: https://raw.githubusercontent.com/util-linux/util-linux/refs... It's routine.
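And to show what I mean by routine, here is roughly the entire core in a few dozen lines - my sketch, not the util-linux code; what's left for the student is header centering, option parsing, and the September 1752 Julian-to-Gregorian gap:

    #include <stdio.h>

    /* Sakamoto's method: 0 = Sunday .. 6 = Saturday (Gregorian only). */
    static int day_of_week(int y, int m, int d) {
        static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
        if (m < 3) y--;
        return (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;
    }

    static int days_in_month(int y, int m) {
        static const int n[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        int leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
        return (m == 2 && leap) ? 29 : n[m - 1];
    }

    static void print_month(int y, int m) {
        printf("Su Mo Tu We Th Fr Sa\n");
        int start = day_of_week(y, m, 1);
        for (int i = 0; i < start; i++)
            printf("   ");
        for (int d = 1; d <= days_in_month(y, m); d++)
            printf("%2d%c", d, (start + d) % 7 == 0 ? '\n' : ' ');
        putchar('\n');
    }

    int main(void) {
        print_month(2026, 2); /* real cal parses arguments instead */
        return 0;
    }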
But a nontrivial database, even a simple one like SQLite, really does require professional-level C expertise. It is not routine. So your comparison to ProgramBench still seems apple-to-oranges.
If I invented a machine that makes chimpanzee noises in response to input chimpanzee noise, put it in front of a chimpanzee, and watched the chimp coo and yell and screech and purr in response to the machine, I would not conclude "wow, I emulated a chimpanzee's consciousness!" I would say "huh, I made a device that's good at tricking chimpanzees."
My belief is that the Turing test (and LLMs in particular) are not categorically different. Language is a tiny part of the human brain because it's a tiny part of human cognition, despite its outsized impact socially.
This knee-jerk cynicism is badly undermined two sentences later:
> Researchers say that climate models may need to be updated to account for the warming effect of plastic, but the new study is far from conclusive.
So it's not scientific make-work, they are looking into whether climate models are missing something. That seems important. Perhaps local effects in India are more severe than "a fraction of the impact of soil" - India produces a huge amount of new plastic while also scavenging and recycling international plastic imports, all with very poor oversight and corrupt regulation.
A quick back-of-the-envelope calculation shows that airborne microplastics can't possibly be contributing significantly to global warming. That's not surprising; there are millions of other things that aren't contributing to global warming.
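Here's the shape of that envelope. Every number below is an assumption I'm making for illustration, chosen to be generous to plastic, except black carbon's mass absorption cross-section, which is a standard literature value:

    #include <stdio.h>

    int main(void) {
        double mac_bc      = 7.5; /* m^2/g, black carbon (literature value) */
        double mac_plastic = 0.1; /* m^2/g, ASSUMED; plastics mostly scatter */
        double f_bc        = 0.3; /* W/m^2, rough black carbon forcing */
        double f_total     = 2.7; /* W/m^2, rough total anthropogenic forcing */

        /* Generously assume as much plastic aloft as black carbon. */
        double f_plastic = f_bc * (mac_plastic / mac_bc);
        printf("plastic forcing ~ %.3f W/m^2 (%.2f%% of total)\n",
               f_plastic, 100.0 * f_plastic / f_total);
        return 0;
    }

That prints roughly 0.004 W/m^2, about 0.15% of the total. Rounding error, in other words.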
Despite this, someone decides to do a study, and finds that, to no one's surprise, airborne microplastic is not in fact making a significant contribution to global warming. So that should surely settle it, right?
Nope. Instead, they declare that it's far from conclusive, leaving the door open for another round of the same grift, taking away funding that could be going to things that actually _are_ contributing to the problem.
And somehow _pointing_this_out_ is "overly cynical bs"?
Darkly funny that Armstrong's Twitter bio still reads "Creating more economic freedom in the world" when he has relegated humans to "the edge" of his own organization in favor of the pseudointelligent pseudogod.
Freedom for who, exactly? Coinbase's executives, I suppose.
But there would be no basis to claim this trademark was abandoned (even before Don Ho responded to infringement). Notepad++ is famous software actively getting new features and new releases. It is well-known among technically sophisticated Windows users in the US, and until this kerfuffle Don Ho's ownership of the name was never seriously contested in OSS circles. Nobody could reasonably claim this trademark is stale or generic.
It's nuanced. I'm not an attorney and I don't have the bandwidth right now to go looking for citations. The general read I'm getting is that zealously defending your rights to the mark is the safest way to make sure a court doesn't see you as abandoning them. I gather that different US courts have treated a lack of defense differently.
This is both unethical and completely useless at the (supposed) goal of "show[ing] the current capabilities of AI." What a completely garbage case study! And what a dishonest writeup:
> We see that frontier models are intelligent enough to manage humans
Really? The manager who asked baristas to pay for things with their personal credit card is "intelligent enough to manage humans"? The manager who asked workers to put raw eggs in a high-speed oven? The one who makes such bad decisions that the workers made a Wall of Shame about them?
And this is just egregious:
> Despite the learning curve, the café is working. In the first two weeks of operation, Andon Café has brought in 44,000 SEK in sales. Mona’s inbox has been flooded with messages from customers asking questions or pitching different business proposals. In one case, a customer emailed wanting to prepay for 300 coffees to give away. Mona negotiated a deal where he paid 9,000 SEK in exchange for 300 QR codes that people could redeem for a free coffee. In another case, a startup paid her 3,000 SEK to rename a pastry after them for three months.
This only demonstrates that viciously stupid AI stunts can go viral, even in otherwise decent countries like Sweden. How stupid does Andon Labs think we are to take this as a sign of AI management success? None of this reflects normal cafe operations. It reflects the Stockholm tech scene checking out the gimmicky AI cafe.
> By running this experiment, we shift the discussion of how we want this future to look earlier in time, so we can better prepare.
Better prepare for what? Evil AI labs running experiments without any ethical oversight? Shockingly evil, by the way:
> no one’s livelihood depends on the judgment of an AI alone.
"Alone." How kind of them. By the way, it is incredibly despicable, even by the low low low standards of AI researchers, to run this sort of experiment on people looking for work. I couldn't believe the humans responsible let their stupid AI post ads on Indeed and LinkedIn. What scumbags.
An underappreciated source of nonsense in 21st century discourse is people watching YouTube instead of reading things. It doesn't appear this author read anything, preferring to be spooked and misled by a YouTube video.
> trained them to play DOOM - honestly better than I do.
Maybe the author really really sucks at DOOM, but I think this is a false embellishment:
>> While the neurons can play the game better than a randomly firing player, they’re not very good. “Right now, the cells play a lot like a beginner who’s never seen a computer—and in all fairness, they haven’t,” Brett Kagan, chief scientific officer at Cortical Labs, says in the video. “But they show evidence that they can seek out enemies, they can shoot, they can spin. And while they die a lot, they are learning.” [https://www.smithsonianmag.com/smart-news/a-clump-of-human-b... ]
> To play DOOM, the system feeds visual data to the neurons. For the neurons to react, they have to interpret that data in some way.
This is totally false - not even a misleading metaphor, just plain wrong. The neuronal computer doesn't get any visual information:
>> So how does a petri dish of brain cells play Doom when it doesn’t have any eyes? Or fingers? "We take a snapshot of the game with information like the player’s health and the position of enemies, pass it through a neural network, convert it into numbers, and send the data,” explains Cole. “This is called encoding – essentially turning the game state into signals the neurons can understand. The neurons then fire an output – move left, move right, walk forward, shoot or not shoot – which the system decodes and converts back into actions in the game." [https://www.theguardian.com/games/2026/mar/16/petri-dish-bra...]
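To spell that pipeline out - all names and shapes below are invented for illustration, this is not Cortical Labs' actual interface - note that nothing visual ever reaches the neurons, just a few pre-digested numbers in and a handful of spike counts out:

    #include <stdio.h>

    typedef struct {
        int   health;      /* player health, read straight from the game */
        float enemy_angle; /* bearing to nearest enemy, degrees */
        float enemy_dist;  /* distance to nearest enemy */
    } GameState;

    enum Action { MOVE_LEFT, MOVE_RIGHT, WALK_FORWARD, SHOOT, HOLD };

    /* "Encoding": map the digested game state onto stimulation levels. */
    static void encode(const GameState *s, float stim[3]) {
        stim[0] = s->health / 100.0f;
        stim[1] = (s->enemy_angle + 180.0f) / 360.0f;
        stim[2] = 1.0f / (1.0f + s->enemy_dist);
    }

    /* "Decoding": whichever output region fired most wins. */
    static enum Action decode(const int spikes[5]) {
        int best = 0;
        for (int i = 1; i < 5; i++)
            if (spikes[i] > spikes[best]) best = i;
        return (enum Action)best;
    }

    int main(void) {
        GameState s = { 80, 12.5f, 3.0f };
        float stim[3];
        encode(&s, stim);                  /* -> electrode stimulation */
        int spikes[5] = { 2, 7, 1, 4, 0 }; /* <- measured firing (made up) */
        printf("action = %d\n", decode(spikes)); /* 1 = MOVE_RIGHT */
        return 0;
    }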
I am also concerned about neuronal computing. But it doesn't really help anyone to spread childish ghost stories about it.
I really hate YouTube, by the way. My dad used to read newspapers and had interesting ideas. Now he watches a bunch of YouTube and he's a huge idiot. It's not (directly) because of age: nobody is immune to narcotic slop. I had to delete my account when I realized how much of my life and cognition I was wasting. I wish others would do the same.
I feel that "YouTube makes you an idiot" is a misdiagnosis. And one I hear frequently.
Books can make you an idiot too - I think of "Rich Dad, Poor Dad" or "Grit" or any number of pseudo-science bestsellers. These books end up capturing the public imagination in big ways too - Grit drove some US government policy around the time it was popular.
The difference, I suppose, is that YouTube works faster by having many different people presenting the same bad ideas that the algorithm has helped you to buy into.
On the other hand there are amazing and useful YouTube channels that I use all the time like Practical Engineering, Crafsman, Technology Connections, Park Tools, SciShow, Crash Course, and on and on.
> I feel that "YouTube makes you an idiot" is a misdiagnosis.
Signal/noise is much worse (arguably books are catching up, thanks to LLMs).
People see emotional signals in YouTube videos. They respond to vocal tone and facial expressions, which are known to circumvent critical thinking. If you examine crowds of science deniers, the usual commonality is that they are having a parasocial relationship with a bunch of YouTube creators who are nice to them and reinforce their beliefs. The actual content of the belief is irrelevant; if you are disagreeing with the belief, you are attacking their tribe. It's not limited to science deniers either - you get this hacking of human tribal psychology even in stuff like people who watch computer game videos. They pick a few champions of their tribe and follow them without critical examination of the content. At least with a book, while this is still possible, it's much harder. It's also telling that a lot of cranks who published junk science have all migrated to YouTube.
I don't think YouTube makes you an idiot, so much as YouTube content is designed to bypass your critical defenses and overwhelm you. It develops into a blind spot. People can be perfectly rational in most areas and then suddenly burp up some absolute nonsense they caught on YouTube.
Oh, and the best part is when you point this out to someone, they tend to go "Oh yeah, that totally happens... except for my favourite YouTube channel, which does x and y and z, and yes of course I buy all their products and donate to their charities."
The nice thing about books vs. YouTube is that it's much easier to critically interrogate books while you're reading them. That was the difference with my dad: he thought about what he read. He repeats what he listens to on YouTube.
I hate the proliferation of audiobooks too, by the way. It's the exact same problem.
To be fair, even reading 'good' books won't make you smart. I think the key is to be critical, which should be taught at a young age. Ikram Antaki dedicated most of her last years to teaching this in Mexico.
Anecdote: when I started studying economics I really agreed with a lot of what I read from economists like David Ricardo, Marx, Smith, etc. Then I studied what other economists had to say and I could see how they disagreed with the former. This made me realize that I agreed with those people because their arguments 'made sense' to me, but that doesn't mean that what they said is completely true. This is something that has stayed with me; I always wonder how something can be wrong.
The printing press is a good example: one of the first books was on 'witch hunting', which panicked people and led to a lot of deaths. The first 'conspiracy theory' to sweep over humans.
Humans are just highly susceptible to manipulation. YouTube is just taking it to the next level. Like the difference between eating coca leaves and snorting coke.
There are a number of studies showing that Grit is either not a thing or that there are better measures of success. It has been a long time since I thought about it, so I don't remember which papers in particular.
The point is that it doesn't really make sense to say they're "seeing" anything. You said
> So… are the neurons on that chip seeing?
> We all desperately want to say no.
But I can confidently say "no, that's totally childish, the neurons are clearly not seeing anything." And in fact it's not even especially clear that they're "playing DOOM" vs. hitting a biased random number generator in response to carefully preprocessed inputs that come from DOOM. There is a major distinction when the enemy positions are directly piped into the brain.
Again I share the ethical concern about this stuff. But your blog post is quite misleading.
You don't have to imagine too far - I made DOOM run through a series of pre-rendered images in markdown files as a stateless engine before [0], and the answer to your question is highly up to interpretation.
You move, you plan, your actions have outcomes
Same question as if you're playing a choose-your-own-adventure storybook.
That's not what I said, I said the blog post was false because the author thoughtlessly digested a YouTube video. It looks like the blog invented some details that weren't actually in the video.