
It's not a particularly fast card compared to something like a 5070, but it has lots of VRAM.

That's why they were cheap before.

Also, "some stupid game"? Who woke up and made you king of hobbies?


I'm finding Qwen 27B is comparable to Sonnet, but my self-hosting has about 5 more 9s of uptime than whatever Anthropic's vibe coding gets them. I also don't have to worry about the quality of the model I'm being served varying from day to day.

Probably the most damning fact about LLMs is just how poorly written their parent companies' systems are.


    > Probably the most damning fact about LLMs is just how poorly written their parent companies' systems are
I have been doing some work related to MCP and found gaps in the implementations in Claude and Codex. This is a relatively simple, well-defined spec, and yet both Claude Code and Codex CLI have incomplete/incorrect implementations.

During this process of investigation, I checked the CC repo and noticed they had 5000+ issues open. Out of curiosity, I skimmed through them and many point to regressions, real bugs, simple changes, etc. Maybe they have some internal tracker they are using, but you would think that a company with functionally unlimited tokens and access to the best models would be able to use those tokens to get their own house in order.

My sense now is that the industry needs to create a lot of hype right now, so we see showmanship like the kernel compiler and the agent swarms building a semi-functional browser, etc. Yet their own tooling has not fully implemented their own protocol (MCP) correctly. They need all of us to believe that these agents are more capable than they actually are; the more piles of tangled code you write and the more discipline you cede to their LLMs, the more dependent you are on those LLMs to even know what the code is doing. At some point, teams become incapable of teasing the code apart because no one understands it anymore.

Peeking at the issues in the repos and seeing big gaps in functionality like Codex's missing support for MCP prompts and resources is like looking behind the curtain at reality.


Qwen3.5-Next-Coder does wonders. Its drawbacks are that time to first token is ~30 seconds while the model loads, and OpenCode has an unsolved timeout issue on that load; otherwise, once it's warmed up, it's entirely serviceable.

I've got an AMD 395+ with 128GB, so running a ~46GB model gives me about 85k tokens of context. That handles copy/paste/find/replace behavior easily; it mocks up new components; it can wire in some functionality, but that's usually at its limits and requires more debugging.

I've been looking at scheduling it with systemd to keep a wiki up to date on a long-running project, which breaks the "blank page" problem when extending behaviors in a side project.

I understand some of these larger models can do things faster and smarter, but I don't see how they can implement novel functionality required for the type of app I'm concerned with. If I just wanted to make endless CRUD or TODO apps, I'm betting I could figure out a loop that's mostly hands off.
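For anyone curious, the scheduling side is plain systemd; here is a hedged sketch of the kind of timer that could drive such a job (unit and script names are invented for illustration, not taken from the poster's setup):

```ini
# /etc/systemd/system/wiki-update.service  -- hypothetical names throughout
[Unit]
Description=Ask the local model to refresh the project wiki

[Service]
Type=oneshot
# update-wiki.sh would invoke the local model CLI against the repo
ExecStart=/usr/local/bin/update-wiki.sh
WorkingDirectory=/home/me/project

# /etc/systemd/system/wiki-update.timer
[Unit]
Description=Nightly wiki refresh

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enabled with `systemctl enable --now wiki-update.timer`; `Persistent=true` makes systemd catch up on runs missed while the machine was off.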


But do you actually treat LLMs as glorified autocomplete or treat them as puzzle solvers where you give them difficult tasks beyond your own intellect?

Recently I wrote a data transformation pipeline and I added a note that the whole pipeline should be idempotent. I asked Claude to prove it or find a counterexample. It found one after 25 minutes of thinking; I reasonably estimate that it would take me far longer, perhaps one whole day. I couldn’t care less about using Claude to type code I already knew.


"give them difficult tasks beyond your own intellect?"

Lol no, I've yet to find a model with those properties. Sounds like a fast track to AI psychosis.

The domain I work in doesn't have enough public documentation for these models to be particularly helpful without a lot of handholding though.


I've been working on a luks+btrfs+systemd tool (for managing an encrypted raid1 pool). While I have worked with each individually, it's not obvious what kind of cases you have to handle when composing them together. A lot of it is simply emergent, and the status quo has been to do your best and then see what actually happens at runtime.

Documentation is helpful to describe high-level intentions, but the beauty is when you have access to source code. Now a good model can derive behavior from implementation instead of docs which are inherently limited.

I implemented the luks+btrfs part by hand a few years ago and resurrected the project a couple months ago. Using the source code for local reference, Claude discovered so many major cases I missed, especially in the unhappy-path scenarios, even in my own hand-written tests. And it helped me set up an amazing NixOS VM test system, including reproduction tests on the libraries to see what they do in weird undocumented cases.

So I think "tasks beyond our intellect (and/or time and energy)" can be fitting. Otherwise I'd only be capable of polishing this project if luks+btrfs+systemd were specifically my day job. I just can't fit so much in my head and working memory.


And it can fail in great ways. Latest example: I asked Claude for a non-trivial backup and recovery script using restic. I gave it the whole restic repo, and it still made up parameters that don't exist in the code (but do exist in a pull request that's been sitting unmerged for 10+ months).


Interesting. I don't think I've seen hallucinations at that level when it's referencing source code.

Though my workflow always starts in plan mode where Claude is clearly more thorough (which is the reason it takes 10x as long as going straight to impl). I rarely skip it.


> I've yet to find a model with those properties

You can just look at examples like the Knuth problem that Claude's cycles solved. I have no doubt that if Claude didn't exist, Knuth would perhaps have come up with a solution anyway, but given a limited amount of time and patience, Claude found one while Knuth did not. That's what I meant here.

Similarly, the problems I give to Claude are in that category: problems where I myself did not come up with a solution within a set amount of time, so instead of continuing to work on them manually I decided to give them to Claude.


This says more about you than the "intellect" of these nondeterministic probability programs.

Can you provide actual context to what was beyond your ability and how you're able to determine if the solution was correct?

I'm finding that all these comments referencing the "magical incantation" tend to be full of hot air. Maybe yours is different.


> how you're able to determine if the solution was correct

I had hundreds of unit tests that did not trigger an assertion I added for idempotency. Claude wrote one that triggered an assertion failure. Simple as that. A counterexample suffices.
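The shape of such a counterexample test is worth seeing. Here is a hedged, self-contained sketch; the `transform` step is invented for illustration and is not the poster's actual pipeline:

```rust
// Hypothetical pipeline step, invented for illustration: deduplicate rows,
// then append a summary record. The appended record is what breaks
// idempotency: transform(transform(x)) != transform(x).
fn transform(rows: &[String]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for r in rows {
        if !out.contains(r) {
            out.push(r.clone()); // dedupe pass: keep first occurrence only
        }
    }
    // Non-idempotent step: every run appends a fresh summary row.
    out.push(format!("summary:{}", out.len()));
    out
}

fn main() {
    let input = vec!["a".to_string(), "b".to_string(), "a".to_string()];
    let once = transform(&input);
    let twice = transform(&once);
    // One failing input suffices to disprove the idempotency property.
    assert_ne!(once, twice);
    println!("counterexample: {:?} != {:?}", once, twice);
}
```

The point is exactly the one made above: hundreds of passing tests say nothing about the property, while a single input where applying the transform twice differs from applying it once settles the question.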


I've tried a few models and some are decent, including Qwen's models. I've tried a few harnesses like Roo Code in VSCode to put things together that in theory emulate the experience I get from VSCode + Claude or Copilot, but I generally find the experience extremely limited and frustrating.

How have you set things up to have a good experience?


I'm using the qwen cli tool with a duckduckgo search skill that I made Claude write. It's like bootstrapping, I guess.

Once it can search for factual information online the smaller model size becomes less noticeable


People keep saying this and idk what I'm doing wrong. Using q8_0 on all the latest and greatest local models and they just don't come close to sonnet.

I've tried different harnesses, building my own etc.

They are reasonably close to haiku? Maybe?


You're not doing anything wrong, they are not comparable.

Claims to the tune of "this 0.5B local model running on my phone is almost as good as [large expensive model]" are common but greatly exaggerated, it's simply not true beyond the most basic use cases.

Only the much larger models (such as the 744B GLM-5) manage to come close, but nobody's running those locally.


Just to make one obvious critique: your costs per token are probably about 1000x higher than the ones they provide.

I'm pretty sympathetic to Anthropic/OpenAI just because they are scaling a pretty new technology by 10x every year. It is too bad Google isn't trying to compete on coding models though, I feel like they'd do way better on the infra and stability side.


I've owned this GPU for 5 years already, it's fine


What do you run it on? And even then, I'm guessing your tokens per second are not great?


I get about 35-40tok/sec on a 3090.

It's actually about the same speed once you account for how much more responsive my local system is than Anthropic's SaaS infrastructure.


I keep forgetting I have a 3080 lying around... Gotta figure that out.


> Probably the most damning fact about LLMs is just how poorly written their parent companies' systems are.

This seems like a popular take, but I think it's the other way around. Them dogfooding cc with cc is proof that it can work, and that "code quality" doesn't really matter in the end.

Before cc, claude.ai (their equivalent of ChatGPT) was meh. They were behind in features, behind in users, behind in mindshare. cc took them from "weirdos who use AI for coding" to "wait, you're NOT using cc? you freak" in ~1 year. And cc is a very big part of them reaching $1-2B monthly revenue.

Yes, it's buggy. Yes, the code is a mess (as per the leak, etc). But it's also the most-used coding harness. And, on the technical side, having cc as early as they did helped them immensely: real users, real-usage data, real-usage signals, and so on. They trained the models on that data, and trained the models in sync with the harness. And it shows: their models are consistently the highest ranked both on benchmarks and on "vibes" from coders. Had they not had that, they would have lacked that real-world data.

And if you look at the competition it's even more clear. Goog is kinda nowhere with their gemini-cli, all over the place with their antigravity-ex-Windsurf, and while they have really good generalist models, the general mindshare is just not there for coding. Same for oAI: they have an open-source, Rust-based, "solid" CLI and solid models (esp. in code review, planning, architecture, bug fixing, etc), but they are not #1. Claude is, with their cc.

So yeah, I think it's really the other way around. Having a vibe-coded, buggy, bad code solution, but being the first to have it, the first to push it, and the first to keep iterating on it is really what sets them apart. Food for thought on the future, and where coding is headed.


The soundtracks for SimCity 3000, 4, and the 5th one titled just "SimCity" are written specifically to be played while doing some fiddly micromanagement tasks.


Which brings us to: SpaceChem's soundtrack.


No, because as far as we know 26.04 won't enable zswap or zram, whereas Windows and macOS both have memory compression of some sort. So Ubuntu will use significantly more memory for most tasks when facing memory pressure.

Apparently it's still in discussion but it's April now so seems unlikely.

Kind of weird how controversial it is considering DOS had QEMM386 way back in 1987.
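For comparison, on distros that ship zram-generator (Fedora has enabled it by default for years), compressed swap is a few lines of config; the values below are illustrative, not a recommendation:

```ini
# /etc/systemd/zram-generator.conf
[zram0]
# cap the compressed swap device at half of RAM, at most 4 GiB
zram-size = min(ram / 2, 4096)
compression-algorithm = zstd
```

After a reboot (or `systemctl start systemd-zram-setup@zram0`), `swapon --show` should list the compressed device.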


Zswap is a no brainer. I have to wonder why the hesitancy.


Interesting, I never realized macOS compresses memory in place before swapping... Linux just dumps straight to disk, no wonder it freezes.

Found a good visual explainer on this: https://vectree.io/c/linux-virtual-memory-swap-oom-killer-vs...


QEMM386 for DOS did not have a memory compression feature. Only one of the later versions for Windows 3.1 did.


CPUs really weren't up to the job in the pre-Pentium/PowerPC world. Back then, zip files used to take an appreciable number of seconds to decompress, and there was a market for JPEG viewers written in hand-optimised assembly.

That's why SoftRAM gained infamy - they discovered during testing that swapping was so much faster than compression that the released version simply doubled the Windows swap file size and didn't actually compress RAM at all, despite their claims (and they ended up being sued into oblivion as a result...)

Over on the Mac, RAMDoubler really did do compression but it a) ran like treacle on the 030, b) needed to do a bunch of kernel hacks, so had compatibility issues with the sort of "clever" software that actually required most RAM, and c) PowerMac users tended to have enough RAM anyway.

Disk compression programs were a bit more successful - DiskDoubler, Stacker, DoubleSpace et al. ISTR that Microsoft managed to infringe on Stacker's patents (or maybe even the copyright?) in MS DOS 6.2, and had to hastily release DOS 6.22 with a re-written version free of charge as a result. These were a bit more successful because they coincided with a general reduction in HDD latency that was going on at roughly the same time.


I had a memory management problem so I introduced GC/ref counting and now I have a non-deterministic memory management problem.


Ref counting is deterministic. Rust memory management is also deterministic: the memory is freed exactly when the owner of the data goes out of scope (and the borrow checker guarantees at compile time there is no use after that).
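A minimal sketch of what "deterministic" means here: with both owned values and `Rc`, the destructor runs at a statically known point (end of scope, or when the last clone is dropped), never at some future collection pause:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A value whose destructor records its name, so we can observe
// exactly when it runs.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

fn main() {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = Noisy { name: "a", log: Rc::clone(&log) };
        let b = Rc::new(Noisy { name: "b", log: Rc::clone(&log) });
        let b2 = Rc::clone(&b);
        drop(b); // refcount 2 -> 1: destructor does NOT run yet
        assert!(log.borrow().is_empty());
        drop(b2); // last clone gone: "b" is dropped right here
        assert_eq!(*log.borrow(), ["b"]);
    } // _a leaves scope here: "a" is dropped, deterministically
    assert_eq!(*log.borrow(), ["b", "a"]);
    println!("drop order: {:?}", log.borrow());
}
```

Every assertion passes at a point the compiler could have told you about; there is no "sometime later" in this model.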


Cool, now use the reference on another thread.


If you used Rust, you would know that problem is solved too.
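Concretely, the cross-thread case is a compile error with `Rc` (it is not `Send`), and the fix the compiler points you at is `Arc`; a minimal sketch:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // An Rc here would not compile: Rc is !Send, so the spawned closure
    // couldn't capture it. Arc uses atomic reference counts and is Send.
    let data = Arc::new(vec![1, 2, 3]);

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let data = Arc::clone(&data); // one refcount bump per thread
            thread::spawn(move || data.iter().sum::<i32>())
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 6);
    }
    // Each clone was dropped when its thread finished; the Vec itself is
    // freed when this last handle goes out of scope -- still deterministic.
    println!("all threads saw sum = 6");
}
```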

Rust solves a lot of problems, and introduces others

The promiscuous package management, chiefly. It's not unusual for building a little programme in Rust to bring in 200+ crates, from unknown authors on the Internet...

What could possibly go wrong?


Even at 100k employees I’m still dumbfounded by that number. What do all these people do all day?


1. They maintain and sell one of the largest relational databases.

2. They're the primary maintainer of one of the largest programming languages.

3. They do tons of HR/ERP type software.

4. They have a supply chain division (my company is a direct competitor, and we have 2000 employees--it's a drop in the bucket, but a few thousand here, a few thousand there and it starts to add up. Afaik, their supply chain org is bigger than ours).

5. Other things I probably don't know about.

Many of these things come with swarms of consultants who implement the software for companies that don't have any internal technical competency, which swells the number of workers by a lot.

Don't get me wrong, I'm not remotely a fan, I like to quote Bryan Cantrill's rant. However, they do a lot of things.


>> Many of these things come with swarms of consultants who implement the software for companies that don't have any internal technical competency,

I have some anecdotal evidence for this. I worked at a medium sized family owned business. They were going through a massive ERP upgrade/replacement. One of the bids was from Oracle. The company was able to essentially test drive each company they were reviewing to see if the software was going to be a good fit.

Oracle's sales team was like having a football team on site. They sent over no less than about 20 people to swarm our pretty small office, barge into the dev spaces and generally annoy the fuck out of everybody for several months. The other vendors? They sent one, maybe two people to work alongside us as we test drove their software.

It was funny being in those meetings listening to people talk about the Oracle people. Nobody even remembered how good or bad their software was. Every single comment was about how overbearing and pushy their sales people were.

Needless to say, we went with a different company.


That sales process is directly tied to the type of customer they're aiming for, which is larger than a "medium-sized family-owned business".

They mis-aligned here, but for someone like Boeing or United, they'd go gaga over the footy crowd.


They also own multiple other huge companies that had tens of thousands of their own employees working in completely different areas (Netsuite, Cerner, Acme, etc)


6. Lawyers


"The first thing we do, let's AI all the lawyers" ?


Also their cloud

And all the supporting legal team of course.


No better proof that they're a huge company than that I could forget about an entire public cloud offering. Good point.


I remember reading this post years ago, and it has stuck in my brain ever since: https://news.ycombinator.com/item?id=18442941

So I suspect the answer is: they need _at least_ 10x as many engineers to get things done as you would expect. Maybe more like 50x


That was a highlights grade comment ( https://news.ycombinator.com/highlights )

And the last comment by 'oraguy' - I hope he just picked up another id because "never work for Oracle again" ...


what a horrible horrible read :|

It clearly shows that either no one understands the whole picture anymore, or that it became so diversely customized that this is the only way of handling it now.

I think though that these companies are more business companies than tech companies and move themselves into this nightmare.


It's even more wild when you realize that other similar-sized enterprise companies don't have all that and either leave bugs to sit around for decades, or randomly break shit trying to fix them.


That is really wild


Unless you have worked with Oracle or other big enterprises, you may not realize the scale of how these companies operate and the breadth of what they actually do. Just by looking at their product page[0] you can see they offer software, hardware, cloud, consulting, support, and even financing solutions. In addition to the technology and product people (of which there are many), you also need HR, sales, marketing, accounting, support, etc for the entire global organization.

Sure, 100,000 people is a lot, but Oracle also does a lot.

[0] https://www.oracle.com/products/


This! They do _everything_.

In the real world, there are a lot of things you need to run a business: HR, ERP, Financing, Cloud, Compliance, CRM, etc. There is really only one company who can sell them all to you on one piece of paper, and that's Oracle.


Salesforce does one aspect of what Oracle can do (Access as a Service) and they have 83,000 employees. Oracle may actually be pretty lean.


Oracle sells a lot of software that is accompanied by hordes of consultants to set it up.

Last F50 I was at did a PeopleSoft migration. We probably had 400 Oracle employees pass through the doors over 2 years helping to get it off the ground.

Most Enterprises don't just buy software and that's it. They buy software + support to implement it for their business.


Sure, but what did those guys do all day? 400 people is a lot of people.


Write code to connect this system with that system. Teach people what setting does what. Integrate with Entra ID. Create custom reports that hordes of Executive on our side want. Scale out the system from undersized nodes we originally gave it. That's all I picked up by just listening to them. I wasn't involved in the project, just sat nearby listening to it.

This is extremely customizable software that is designed to pretty much run your entire business and is touched by over 40k employees. It requires a ton of care and feeding. There are plenty of people who dedicate themselves to PeopleSoft; ZipRecruiter is showing 5 jobs near me for "PeopleSoft Administrator".


The need to teach people what setting does what is a sort of consulting moat that AI dismantles when it can access the right context.


They don't make any of the documentation for those settings easy to find or understand because the support contracts make them so much money.


Before, that could create a moat.

Soon, it will be table stakes to put scattered internal communications, notes, documents into an AI’s knowledge base, where the information can no longer hide.

When that fails, the AI can read the code itself, so that the settings and how to change them are easily explained in simple terms. Actually, this is possibly even better than letting the scattered internal information serve as an intermediate layer.


That works for small customers who actually want to spend time customizing things themselves. Big customers love having to sign support contracts, because it gives them someone to blame when something goes wrong. Nobody else gets to touch any of the settings or knobs to avoid breaking anything.

Being big is the actual moat.


Creating powerpoints. Presenting the powerpoints to others in synchronous meetings.


The training team and what's called "Change Management" for an F50 company that's spread across the globe implementing a new application like an ERP could be 100 people by itself. Those kinds of projects are extremely complex and hard to do, which is why many ERP migrations take a decade to complete, if not fail entirely.


Probably had a lot of meetings


"Well look, I already told you! I deal with the goddamn customers so the engineers don't have to! I have people skills! I am good at dealing with people! Can't you understand that? What the hell is wrong with you people?!"


plus yearly support maintenance


Almost certainly a large amount of support staff, so management/HR/IT etc... Then you've got your customer account managers, sales, lawyers/finance etc... Given they do an insane amount of B2B and government sales I can see this being easy to reach tbh. Government contract processes require an insane amount of bureaucracy and negotiations.


I’m guessing development is so slow that they have stacks of teams working in parallel to accomplish what 1 team could normally.


When you send your database a query, who do you think is gathering those tables?


More than 70% of the employees are probably Sales/Support/Service -- on par with any large enterprise firm (Think Cisco/Salesforce/ServiceNow etc)


Well, whatever Oracle is doing, which brings us back to a question very similar to your original one.


Solaris ?


Didn't they fire most of the Solaris devs some time ago? Incidentally, Solaris has been stuck on 11.4.x for, well, forever and a half...


Me too. Anyone here to enlighten us?


It's really funny that after going through all those fonts it landed on Ubuntu Mono for me which is what I use anyways to code in my terminal.

I wonder if it's Stockholm syndrome or if I really do prefer it. It's a totally fine font, I've never felt the need to change it. All the default open source mono fonts seem completely adequate I suppose.


> The Chinese models are open source because they are not state of the art

I think geohot is burying the lede in his post with a lot of speculation.

It's not that these specific models will become closed; it's that the hardware/hosting vendors have an incentive to train models whose inference is custom-tuned to their chips' dimensions and VRAM.

The Chinese models do a great job of showing what's possible on consumer/prosumer hardware because of export restrictions, but anyone entering the hardware space has the same incentive to undercut the frontier labs so they can sell more hardware.

It's also not clear if being at the forefront of inference quality really matters. The open source models appear to be doing a fine enough job of keeping up even if they're a few months behind. So it seems like there's not much of a technology moat for these labs other than the capital costs of training/serving.


Back when I worked at Apple I would just try it in whatever I had installed. If it didn't reproduce I'd write "Cannot reproduce in 10.x.x" and close it. Maybe a third were like that, duplicates of some other issue that was resolved long ago.

Anyone that attached a repro file to their issue got attention because it was easy enough to test. Sometimes crash traces got attention, I'd open the code and check out what it was. If it was like a top 15 crash trace then I'd spend a lot longer on it.

If the ticket was long and involved, like "make an iMovie and tween it in just such and such a way", then probably I'd fiddle around for 10-15 minutes before downgrading its priority and hoping a repro file would come along.

There were a bunch of bug reports for a deprecated codec that I closed and one guy angrily replied that I couldn't just close issues I didn't want to fix!

Guess what buddy, nobody's ever going to fix it.

The oldest bug like that I ever fixed was a QuickDraw bug originally filed when I was 8 years old, but it was just an easy one-liner bounds check.

But the mistake OP is making is assuming this one thing that annoyed him somehow applies to the whole Apple org. Most issues were up to engineers and project managers to prioritize, every team had their own process when I was there.


> But the mistake OP is making is assuming this one thing that annoyed him somehow applies to the whole Apple org. Most issues were up to engineers and project managers to prioritize, every team had their own process when I was there.

Except this same shit keeps happening with multiple teams.

Judging from your mention of QuickDraw, which was removed entirely from macOS in 2012, perhaps your Apple experience is now out of date.


[flagged]


What specifically do you claim I'm making up?


That the ~50000 engineers at Apple are conspiring to close your tickets in the exact same way. It's ridiculous


> That the ~50000 engineers at Apple are conspiring to close your tickets in the exact same way. It's ridiculous

It's pretty clear from experience that the organization's policy is to not provide feedback on bug submissions. Getting a "check if it still reproduces or we'll close it in two weeks" message after 3 years is actually a fast turnaround.

Best I've gotten was on an issue I routed to a friend who worked at Apple who promised it would get looked at, but that I wouldn't hear back.

Microsoft wouldn't fix my issues either, but at least they got back to me in a timely fashion. Usually telling me it was a known issue that they weren't going to fix.


You don’t hear back because almost always your bug is a duplicate of some other one. They can’t share the original with you because it contains data from another customer or from inside the company.

Almost nobody is the first reporter in an OS with billions of users. The only useful thing about those long dupe lists was being able to scan them for one with easier repro steps.

But sometimes that duplicate marking is wrong or some subtly different issue so they ask you if it still reproduces in whatever version contains the fix before closing it.


That makes sense. But when you take 3-5 years to respond to my bug report, I'm going to take at least 3 months to respond to your response. And I'm probably not filing more bugs, because chances are I won't be at my current employer by the time you reply.

When you consistently burn bug reporters, sooner or later there's nobody left to file bugs.


Because that's probably how long it took for someone to prioritize it.

Even if it's not fixed by the dupe ticket, the volume of bug reports makes it almost certain another ticket for the same issue will come up. And if it doesn't then it probably wasn't that relevant to anyone.


Not my tickets specifically. I don't think they're out to get me individually. On the contrary, this is a common practice, which affects many developers. I just happen to be relatively loud, as far as blogging is concerned.


Yes I understand that. ~50000 engineers aren't conspiring to close all tickets that way. It's a stupid line of thinking.

More than likely your steps to reproduce are too laborious to receive attention relative to the value fixing the bug would provide. That's why they're asking you to verify it still happens. Seems pretty simple right?

There's also a strong chance your ticket was linked as a duplicate of some other issue that was fixed in the beta and they want you to verify that's the case but they won't expose their internal issue to you for a variety of reasons.


> ~50000 engineers aren't conspiring to close all tickets that way.

I didn't say that either. It's happened to me only sporadically, but multiple times.

I agree with you that teams within Apple manage their own tickets. Perhaps some individual teams are declaring bug bankruptcy at some point, so only their bugs would go out for verification. I don't really know. I wish I did. What I do know is that multiple teams have done this at different points.

There's indisputably a company-wide DevBugs canned response for this. It's the same exact language every time. You can even Google it.

> It's a stupid line of thinking.

Please respect the HN guidelines: https://news.ycombinator.com/newsguidelines.html

> More than likely your steps to reproduce are too laborious to receive attention. That's why they're asking you to do it.

It was much more laborious for me, because I do not install the macOS betas.

> Seems pretty simple right?

No, it doesn't explain why specifically, after 3 years, they were suddenly asking me to verify with macOS 26.4 beta 4.


Would be nice if zswap could be configured to have no backing cache so it could completely replace zram. Having two slightly different systems is weird.

There's not really any difference between swap on disk being full and swap in ram being full, either way something needs to get OOM killed.

Simplifying the configuration would probably also make it easier to enable by default in most distros. It's kind of backwards that the most common Linux distros other than ChromeOS are behind Mac and Windows in this regard.
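For reference, today's zswap knobs live under sysfs (the parameter paths are real on recent kernels; the example values are illustrative), and none of them removes the need for a backing swap device:

```
/sys/module/zswap/parameters/enabled           # Y to turn zswap on
/sys/module/zswap/parameters/compressor        # e.g. zstd, lz4
/sys/module/zswap/parameters/max_pool_percent  # RAM cap for the pool, e.g. 20
/sys/module/zswap/parameters/zpool             # allocator, e.g. zsmalloc
```

The missing piece the comment asks for would effectively be `max_pool_percent` with no writeback target at all, which is exactly what zram provides today via a separate mechanism.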


This is actually something we're actively working on! Nhat Pham is working on a patch series called "virtual swap space" (https://lwn.net/Articles/1059201/) which decouples zswap from its backing store entirely. The goal is to consolidate on a single implementation with proper MM integration rather than maintaining two systems with very different failure modes. It should be out in the next few months, hopefully.


Very exciting, thanks!


> Would be nice if zswap could be configured to have no backing cache

You can technically get this behavior today using /dev/ram0 as a swap device, but it's very awkward and almost certainly a bad idea.


And you can use a zram-backed ram0 if you're still undecided :D


Very much agreed. I feel like distros still regularly get this wrong (as evidence, Ubuntu, PopOS and Fedora all have fairly different swap configs from each other).

