
It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals, but the direction things are heading in is that AI is pay-to-play, especially for frontier level capability in things like coding, and if this continues it is going to benefit the wealthy that can afford to pay and leave behind those that can't afford it.



> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent

Well, I think you are seeing that already? It's not like these models don't exist or that nobody tried to make them good; it's just that the results are not super great.

And why would they be? Why would the good models (that are barely okay at coding) be big, if it were currently possible to build good models that are small?

Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models will find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.


Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?


If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small model).

I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.

I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.


I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, general coding agent is not!) it’s possible to train a smaller, specialized model reaching the performance of a much larger generic model.

Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai but other people say similar things e.g. https://research.nvidia.com/labs/lpr/slm-agents/


Actually there are ways you might get on-device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.

One way is reusing weights in multiple decoder layers. This works and is used in many on-device models.

It is likely that we can get pretty high performance with this method. You can also combine this with low-parameter ways to create overlapped behavior on the same weights; people have done LoRA on top of shared weights.

Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.

Edit: I believe this method is used a bit for models targeted at phones. I don't think we have seen significant work targeting, say, a 3090/4090 or similar inference compute size.
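As a rough illustration of the cross-layer weight sharing idea, here's a toy PyTorch sketch (illustrative sizes, not any particular production model; ALBERT did something similar for encoders). The per-layer LoRA adapters mentioned above would sit on top of the shared block:

    import torch.nn as nn

    class SharedLayerDecoder(nn.Module):
        # Toy sketch of cross-layer weight sharing: one decoder block's weights are
        # reused at every "layer", so the parameter count is that of a single block
        # while the compute depth stays n_layers. Per-layer LoRA adapters could be
        # added on top to give each pass its own "overloaded" behaviour.
        def __init__(self, d_model=512, n_heads=8, n_layers=12):
            super().__init__()
            self.shared_block = nn.TransformerDecoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.n_layers = n_layers

        def forward(self, x, memory):
            for _ in range(self.n_layers):
                x = self.shared_block(x, memory)   # same weights applied at every depth
            return x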


The issue isn't even 'quality' per se (for many tasks a small model would do fine); it's that for "agentic" workflows it _quickly_ runs out of context. Even 32GB of VRAM is really very limiting.

And by agentic, I mean something even like this: 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user, etc. It's just not feasible on most hardware right now - even if the models are 1-2GB, you'll burn through the rest in context so quickly.
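To make the context point concrete, a rough back-of-envelope for KV-cache memory alone, assuming a Llama-2-7B-style layout (32 layers, 32 KV heads of dim 128, fp16, no GQA) - illustrative numbers only:

    # bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
    bytes_per_token = 2 * 32 * 32 * 128 * 2          # = 512 KiB per token
    for tokens in (5_000, 10_000, 50_000, 128_000):
        print(f"{tokens:>7} tokens -> {tokens * bytes_per_token / 2**30:5.1f} GiB of KV cache")

So even a small model's cache can eat a 32GB card long before the weights do; GQA and cache quantization help, but the scaling is the same.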


Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change in the future to a sale or annual-licence business model if running locally became possible.

I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this is needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog, is not of much use!).


> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.

That's not a business model choice, though. That's a reality of running SOTA models.

If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.


> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.


> > If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.

> First, they do this; that's why they release models at different price points.

No, those don't deliver the same output. The cheaper models are worse.

> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.

These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.


But they also squeezed out an 80% price cut on o3 at some point, supposedly purely from inference or infra optimization.

> delivering 97% of the performance at 10% of the cost is a distraction.

Not if you are running RL on that model, and need to do many roll-outs.


No, I don’t think it’s a business model thing; I’m saying it may be a technical limitation of LLMs themselves. Like, that there’s no way to “order a la carte” from the training process: you either get the buffet or nothing, no matter how hungry you feel.


Unless you're programming a racing sim or maybe a CRUD app for a local Kennel Club, perhaps?

I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!


> today's locally runnable models are a long way behind SOTA capability

SOTA models are larger than what can be run locally, though.

Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.

There's research happening and progress being made at every model size.


I think SLMs are developing very fast. A year ago, I couldn't have imagined a decent thinking model like Qwen, and now the space seems full of promise.

You're still missing the point. The comment you're responding to is talking about specialized models


The point is still valid. If the big companies could save money running multiple small specialised models on cheap hardware, they wouldn't be spending billions on the highest spec GPUs.


You want more research on small language models? You're confused. There is already WAY more research done on small language models (SLM) than big ones. Why? Because it's easy. It only takes a moderate workstation to train an SLM. So every curious Master's student and motivated undergrad is doing this. Lots of PhD research is done on SLMs because the hardware to train big models is stupidly expensive, even for many well-funded research labs. If you read Arxiv papers (not just the flashy ones published by companies with PR budgets), most of the research is done on 7B parameter models. Heck, some NeurIPS papers (extremely competitive and prestigious) from _this year_ are being done on 1.5B parameter models.

Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram smarts into a 70B parameter model than a 7B param model.


It's not clear if the ultimate SLMs will come from teams with less computing resources directly building them, or from teams with more resources performing ablation studies etc on larger models to see what can be removed.

I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to extent possible) from knowledge which can be external.

I'm really more interested in coding models specifically rather than general-purpose ones, where it does seem that a HUGE part of the training data for a frontier model is of no applicability.


That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.

Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.

Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.


> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...

Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.


ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common - except in scaling law research. But it’s never called “ablation”.

In CS algorithms, we have space vs time tradeoffs.

In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.


I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.

I wish this was true.

It isn't.

"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.

What you're suggesting amounts to "monkeys on typewriters will write the entire works of Shakespeare eventually" - neither in practice nor in theory is this a technical claim, or something observable, or even something stood up as a one-off misleading demo.


If "not even wrong" is more wrong than wrong, then is 'not even right" more right than right.

To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.


> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.

No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

I even get the larger hosted models stuck chasing their own tail and going in circles all the time.


It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs, so they're large in one sense but small in another.

And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
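To illustrate the MoE point, here's a toy PyTorch sketch with top-k routing (illustrative sizes; real implementations add load balancing, capacity limits, etc.): total parameters grow with the number of experts, while per-token FLOPs track top_k.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        # Toy mixture-of-experts FFN: parameters scale with num_experts,
        # but each token is processed by only top_k experts.
        def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts))

        def forward(self, x):                          # x: (tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)       # normalize over the selected experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
            return out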


> No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

That's speculative at this point. In the context of agents with external memory, this isn't so clear.


Almost all training data is on the internet. As long as the small model has enough agentic browsing ability, given enough time it will retrieve the data from the internet.


This doesn't work like that. An analogy would be giving a 5 year old a task that requires the understanding of the world of an 18 year old. It doesn't matter whether you give that child 5 minutes or 10 hours, they won't be capable of solving it.


I think the question of what can be achieved with a small model comes down to what needs knowledge vs what needs experience. A small model can use tools like RAG if it is just missing knowledge, but it seems hard to avoid training/parameters where experience is needed - knowing how to perceive then act.

There is obviously also some amount (maybe a lot) of core knowledge and capability needed even to be able to ask the right questions and utilize the answers.


Small models handle simple, low context tasks most of the time correctly. But for more complex tasks, they fail due to insufficient training capacity and too few parameters to integrate the necessary relationships.

What if you give them 13 years?


Nothing will change. They will go out of context and collapse into loops.

I mean the 5yo child, not the LLM

Then they're not a 5-year-old anymore.


but in 13 years, will they be capable?


No. They will go out of context and collapse into loops.

Actually it depends on the task. For many tasks, a smaller model can handle it, and it gets there faster!


> Why would the good models (that are barely okay at coding) be big, if it were currently possible to build good models that are small?

Because nobody has tried yet using recent developments.

> but there is no reason to assume that people who work on small models will find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.

Sure there is: they can iterate faster on small model architectures, try more tweaks, train more models. Maybe the larger companies "considered it", but a) they are more risk-averse due to the cost of training their large models, b) that doesn't mean their conclusions about a particular consideration are right, empirical data decides in the end.


"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.

i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."


They'd have to publish or link the training data, which is full of copyrighted material. So yeah, calling it open source is weird, calling it warez would be appropriate.


They should release it then. China doesn't have a problem stealing and distributing copyrighted material.

> binary form that cannot be recreated from scratch

Back in my day, we called it "freeware"


You have more rights over a freely licensed binary file than over a freeware file.


I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product and doesn't give anything of worth to people who want to actually use the model because they have no means to train it. Hell most people can barely scrape up enough hardware to even run inference.

Reproducing models is also not very ecological when it comes down to it: do we really all need to redo the training that takes absurd amounts of power just to prove that it works? At least change the dataset to try and get a better result and provide another datapoint, but most people don't have the knowhow for it anyway.

Nvidia does try this approach sometimes, funnily enough: they provide cool results with no model, in hopes of getting people to buy their rented compute and their latest training platform as a service...


> I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product

That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, creating a closed-source copy of a piece of FOSS software is easier than of an AI model, since running a compiler doesn't cost millions of dollars.


"open source" has come to mean "open weight" in model land. It is what it is. Words are used for communication, you are the one misusing the words.

You can update the weights of the model, continue to train, whatever. Nobody is stopping you.


it still doesn't sit right. sure it's different in terms of mutability from say, compiled software programs, but it still remains not end to end reproducible and available for inspection.

these words had meaning long before "model land" became a thing. overloading them is just confusing for everyone.


It's not confusing, no one is really confused except the people upset that the meaning is different in a different context.

On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned-up manner. And, when you are training on trillions of tokens, you are likely training on some awful-sounding stuff - "Facebook trained Llama 4 on nazi propaganda!" is not what they want to see published.

How about just being thankful?


i disagree. words matter. the whole point of open source is that anyone can look and see exactly how the sausage is made. that is the point. that is why the word "open" is used.

...and sure, compiling gcc is nondeterministic too, but i can still inspect the complete source from where it comes because it is open source, which means that all of the source materials are available for inspection.


The point of open source in software is as you say. It's just not the same thing though. Using words and phrases differently in different fields is common.


...and my point is that it should be.

the practice of science itself would be far stronger if it took more pages from open source software culture.


I agree that they should say "open weight" instead of "open source" when that's what they mean, but it might take some time for people to understand that it's not the same thing exactly and we should allow some slack for that.


no. truly open source models are wonderful and remarkable things that truly move the needle in education, understanding, distributed collaboration and the advancement of the state of the art. redefinition of the terminology reduces incentive to strive for the wonderful goal that they represent.

There is a big difference between open source for something like the linux kernel or gcc where anyone with a home PC can build it, and any non-trivial LLM where it takes cloud compute and costs a lot to train it. No hobbyist or educational institution is going to be paying for million dollar training runs, probably not even thousand dollar ones.

"too big to share." nope. sharing the finished soup base, even if well suited for inclusion in other recipes, is still different from sharing the complete recipe. sharing the complete recipe encourages innovation in soup bases, including bringing the cost down for making them from scratch.

There is an enormous amount of information in the public domain about building models. In fact, once you get into the weeds you'll realize there is too much and in many cases (not all, but many) the very specific way something was done or what framework they used or what hardware configuration they had was just a function of what they have or have experience with etc. One could spend a lifetime just trying to repro olmo's work or a lot of the huggingface stuff....

Weights are meaningless without training data and source.


I get a lot of meaning out of weights and source (without the training data), not sure about you. Calling it meaningless seems like exaggeration.


Can you change the weights to improve?

You can fine tune without the original training data, which for a large LLM is typically going to mean using LoRA - keeping the original weights unchanged and adding separate fine-tuning weights.
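A minimal sketch of that with the Hugging Face peft library; the checkpoint name and hyperparameters below are placeholders, not a recommendation:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("some-org/some-open-weights-7b")  # hypothetical checkpoint
    config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                        lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(base, config)   # base weights frozen, small adapter matrices added
    model.print_trainable_parameters()     # typically well under 1% of total parameters
    # ...train the adapters on your own data, then save just the adapter:
    # model.save_pretrained("my-lora-adapter")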

it's a bunch of numbers. Of course you can change them.

Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.

The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.


The meaning of Open Source

1990: Free Software

2000: Open Source: Finally we sanitized ourselves of that activism! It was scaring away customers!

2010: Source is available (under our very restrictive license)

2020: What source?


2025: What prompt?

With these things it’s always both at the same time: these super grandiose SOTA models are mostly making improvements because of optimizations, and they’re scaling out as far as they can.

In turn, these new techniques will enable much more to be possible using smaller models. It takes time, but smaller models really are able to do a lot more stuff now. DeepSeek was a very good example of a large model whose innovations in how they used transformers had a lot of benefits for smaller models.

Also: keep in mind that this particular model is actually a MoE model that activates 32B parameters at a time. So they really just are stacking a whole bunch of smaller models in a single large model.


Yes, I am also super interested in cutting the size of models.

However, in a few years today’s large models will run locally anyhow.

My home computer had 16KB RAM in 1983. My $20K research workstation had 192MB of RAM in 1995. Now my $2K laptop has 32GB.

There is still such incredible pressure on hardware development that you can be confident that today’s SOTA models will be running at home before too long, even without ML architecture breakthroughs. Hopefully we will get both.

Edit: the 90’s were exciting for compute per dollar improvements. That expensive Sun SPARC workstation I started my PhD with was obsolete three years later, crushed by a much faster $1K Intel Linux beige box. Linux installed from floppies…


> My home computer had 16KB RAM in 1983. My $20K research workstation had 192MB of RAM in 1995. Now my $2K laptop has 32GB.

You’ve picked the wrong end of the curve there. Moore’s law was alive and kicking in the 90s. Every 1-3 years brought an order of magnitude better CPU and memory. Then we hit a wall. Measuring from the 2000s is more accurate.

My desktop had 4GB of RAM in 2005. In 20 years it’s gone up by a factor of 8, but only by a factor of 2 in the past 10 years.

I can kind of uncomfortably run a 24B parameter model on my MacBook Pro. That’s something like 50-200X smaller (depending on quantization) than a 1T parameter model.

We’re a _long_ way from having enough RAM (let alone RAM in the GPU) for this size of model. If the 8x / 20 years holds, we’re talking 40-60 years. If 2X / 10 years holds, we’re talking considerably longer. If the curve continues to flatten, it’s even longer.
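Rough weight-memory arithmetic behind those numbers (weights only, ignoring KV cache and activations):

    # bytes ~= parameters * bits_per_weight / 8
    for params, label in ((24e9, "24B"), (1e12, "1T")):
        for bits in (16, 8, 4):
            print(f"{label} model @ {bits}-bit: {params * bits / 8 / 2**30:7.0f} GiB")

A 1T-parameter model is roughly 470 GiB even at 4-bit, versus roughly 11 GiB for a 24B model at 4-bit.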

Not to dampen anyone’s enthusiasm, but let’s be realistic about hardware improvements in the 2010s and 2020s. Smaller models will remain interesting for a very long time.


Moore’s Law is about transistor density, not RAM in workstations. But yes, density is not doubling every two years any more.

RAM growth slowed in laptops and workstations because we hit diminishing returns for normal-people applications. If local LLM applications are in demand, RAM will grow again.

RAM doubled in Apple base models last year.


> The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within a single source file. That's the sad truth. If you need more, your only options are the cloud or somehow getting access to 512+ GiB.


I think there is a lot of progress on efficient useful models recently.

I've seen GLM-4.6 get mentioned for good coding results from a model that's much smaller than Kimi (~350B params), and seen it speculated that Windsurf based their new model on it.

This Kimi release is natively INT4, with quantization-aware training. If that works--if you can get really good results from four-bit parameters--it seems like a really useful tool for any model creator wanting efficient inference.
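For anyone curious what quantization-aware training looks like mechanically, here's a generic fake-quantization sketch with a straight-through estimator - an illustration of the technique, not Kimi's actual recipe:

    import torch

    def fake_quant_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
        # Symmetric per-group INT4 "fake quantization": the forward pass sees
        # quantized weights, the backward pass treats rounding as identity so
        # training can proceed. Assumes w.numel() is divisible by group_size.
        w_g = w.reshape(-1, group_size)
        scale = w_g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
        q = torch.clamp(torch.round(w_g / scale), -8, 7)
        deq = (q * scale).reshape(w.shape)
        return w + (deq - w).detach()   # straight-through estimator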

DeepSeek's v3.2-Exp uses their sparse attention technique to make longer-context training and inference more efficient. Its output's being priced at 60% less than v3.1 (though that's an imperfect indicator of efficiency). They've also quietly made 'thinking' mode need fewer tokens since R1, helping cost and latency.

And though it's on the proprietary side, Haiku 4.5 approaching Sonnet 4 coding capability (at least on benches Anthropic released) also suggests legitimately useful models can be much smaller than the big ones.

There's not yet a model at the level of any of the above that's practical for many people to run locally, though I think "efficient to run + open so competing inference providers can run it" is real progress.

More importantly, it seems like there's a good trendline towards efficiency, and a bunch of techniques are being researched and tested that, when used together, could make for efficient higher-quality models.


What I do not understand is why we are not seeing specialized models that go down to single experts.

I do not need models that know how to program in Python, Rust, ... when I only use Go and HTML. So why are we not seeing models that have very specialized experts, where for instance:

* General interpreter model that holds context/memory
* Go model
* HTML model, if there is space in memory
* SQL model, if there is space in memory

If there is no space, the general interpreter model swaps out the Go model for the HTML model, depending on where it is in the agent tasks or the Edit/Ask code it's overseeing.

Because the models are going to be very small, switching in and out of memory will be ultra fast. But most of the time we get very big expert models that are still very generalized over an entire field.
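A toy sketch of that swapping idea - the model names, sizes, and loader function are hypothetical:

    from collections import OrderedDict

    class SpecialistPool:
        # Keeps small specialist models loaded up to a memory budget,
        # evicting the least-recently-used one when space runs out.
        def __init__(self, budget_gb, loader):
            self.budget_gb = budget_gb
            self.loader = loader                 # e.g. a llama.cpp or HF loading function
            self.loaded = OrderedDict()          # name -> (model, size_gb), in LRU order

        def get(self, name, size_gb):
            if name in self.loaded:
                self.loaded.move_to_end(name)    # mark as recently used
                return self.loaded[name][0]
            while sum(s for _, s in self.loaded.values()) + size_gb > self.budget_gb:
                self.loaded.popitem(last=False)  # evict least-recently-used specialist
            model = self.loader(name)
            self.loaded[name] = (model, size_gb)
            return model

    # pool = SpecialistPool(budget_gb=12, loader=load_model)   # load_model is hypothetical
    # go_model = pool.get("specialist-go-2b", size_gb=1.5)     # hypothetical specialist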

This can then be extended so that, if you have the memory, models combine their outputs on tasks... Maybe I am just too much of a noob in the field of understanding how LLMs work, but it feels like people are too often running after the large models that companies like Anthropic/OpenAI etc. deploy. I understand why those big companies use insanely big models. They have the money to load them up over a cluster, they have the fast interconnect, and for them it's more efficient.

But from the bits and pieces that I see, people are more and more going to tons of small 1-2B models to produce better results. See my argument above. Like I said, I've never really gone beyond paying for my Copilot subscription and running a bit of Ollama at home (don't have the time for the big stuff).


I think one of the issues is that LLMs can't have a "Go" model and an "HTML model". I mean, they can but what would that contain? It's not the language-specific features that make models large.

When models work on your code base, they do not "see" things like this, which is why they can go through an entire code base with variable names they have never seen before, function signatures they have never seen before, and directory structures they have never seen before and not have a problem.

You need that "this is a variable, which is being passed to a function which recursively does ..." part. This is not something language-specific; it's the high-level understanding of how languages and systems operate. A variable is a variable whether in JavaScript or C++, and LLMs can "see" it as such. The details are different, but that layer of "this is a software interface", "this is a function pointer" sits outside of the "Go" or "Python" or "C#" model.

I don't know how large the main model would have to be vs. the specialized models in order to pick this dynamic up.


You won't win much performance from a coding-language-specific tokenizer/vocabulary; everything else benefits from a larger model size. You can get distilled models that will outperform or compete with your single-domain coding model.


Even if it's pay-to-play, companies like Moonshot AI help you pay less.

You can run the previous Kimi K2 non-thinking model, e.g. on Groq, at 720 tok/s and for $1/$3 per million input/output tokens. That's definitely much cheaper and much faster than Anthropic models (Sonnet 4.5: 60 tok/s, $3/$15).


If NVIDIA had any competition we'd be able to run these larger models at home by now instead of being saddled with these 16GB midgets.


NVIDIA has tons of competition on inference hardware. They’re only a real monopoly when it comes to training new ones.

And yet…


Those are for the enterprise. In the context of discussion, end users only have Apple, AMD, and Nvidia.


It is not clear that a simple/small model with inference running on home hardware is energy or cost efficient compared to the scaled up inference of a large model with batch processing. There are dozens of optimizations possible when splitting an LLM on multiple tiny components on separate accelerator units and when one handles kv cache optimization at the data center level; these are simply not possible at home and would be a waste of effort and energy until you serve thousands to millions of requests in parallel.


I think it’s going to be a while before we see small models (defined roughly as “runnable on reasonable consumer hardware”) do a good job at general coding tasks. It’s a very broad area! You can do some specific tasks reasonably well (eg I distilled a toy git helper you can run locally here https://github.com/distil-labs/gitara), but “coding” is such a big thing that you really need a lot of knowledge to do it well.


I used to be obsessed with what's the smartest LLM, until I tried actually using them for some tasks and realized that the smaller models did the same task way faster.

So I switched my focus from "what's the smartest model" to "what's the smallest one that can do my task?"

With that lens, "scores high on general intelligence benchmarks" actually becomes a measure of how overqualified the model is, and how much time, money and energy you are wasting.


What kind of task? Simple NLP, sure. Multi-hop or complex? Bigger is better.


>The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

It's obviously valuable, so it should be coming. I expect 2 trends:

- Local GPU/NPU will have a for-LLM version that has 50-100GB VRAM and runs MXFP4 etc.

- Distillation will come for reasoning coding agents, probably one for each tech stack (LAMP, Android app, AWS, etc.)x business domain (gaming, social, finance, etc.)


I think that's where prompt engineering would be needed. Bigger models produce good output even with ambiguous prompts. Getting similar output from smaller models is an art.


This happens top down historically though, yes?

Someone releases a maxed-out parameter model. Another distills it. Another bifurcates it. With some nuance sprinkled in.


I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it.


Software development is one of the areas where LLMs really are useful, whether that's vibe coding disposable software, or more structured use for serious development.

I've been a developer for 40+ years, and very good at it, but for some tasks it's not about experience or overcoming complexity - just a bunch of grunt work that needs to come together. The other day I vibe coded a prototype app, just for one-time demo use, in less than 15 min that probably would have taken a week to write by hand, assuming one was already familiar with the tech stack.

Developing is fun, and a brain is a terrible thing to waste, but today not using LLMs where appropriate for coding doesn't make any sense if you value your time whatsoever.


"I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it."

seems like you are the one here that does not understand this

Companies want to replace humans so they won't need to pay massive salaries


I understand the companies wanting it. I hate it, but I understand.

I don’t understand the humans wanting to be replaced though.


"I don’t understand the humans wanting to be replaced though."

because the humans that replace these jobs aren't the same humans that got cut????

The humans that can replace these jobs would be rich


The electricity cost to run these models locally is already more than the equivalent API cost.


That's going to depend on how small the model can be made, and how much you are using it.

If we assume that running locally meant running on a 500W consumer GPU, then the electricity cost to run this non-stop 8 hours a day for 20 days a month (i.e. "business hours") would be around $10-20.

This is about the same as OpenAI's or Anthropic's $20/mo plans, but for all-day coding you would want their $100 or $200/mo plans, and even those will throttle you and/or require you to switch to metered pricing when you hit plan limits.
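Worked out (electricity price range assumed):

    watts, hours_per_day, days = 500, 8, 20
    kwh = watts / 1000 * hours_per_day * days          # 80 kWh per month
    for price in (0.12, 0.25):                         # assumed $/kWh
        print(f"${price}/kWh -> ${kwh * price:.0f}/month")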


Neither $20 nor $200 plans cover any API costs.

At $0.17 per million tokens, the smallest GPT model is still faster and more powerful than anything you can run locally, and cheaper than the kilowatt-hours it would cost you to run it locally even if you could.


Privacy is minimally valued by most, but not by all.



