That's what I tried to explain then as well, and I brought up stuff like path-finding algorithms for route-finding (A*/heuristic search) as a more old-school part of AI, which didn't really land, I think.
> Not really stochastic as far as I know. The whole random seed and temperature thing is a bit of a grey area for my full understanding. Let alone the topk, top p, etc. I often just accept what's recommended from the model folks.
I mean, LLMs are often treated as stochastic, but ML models in general usually aren't? Maybe you have some dropout, but that's usually disabled during inference AFAIK. I don't think a ResNet or YOLO is very stochastic, but maybe someone can correct me.
> AI for the most part has been out a couple years.
With this you just mean LLMs, right? Because I understand AI to be way more than just LLMs & ML.
Yeah, the stochasticity is there because we give up control over the order of operations for speed.
So the order in which floating-point additions happen is not fixed: it depends on how threads are scheduled and how reductions are structured (tree reduction vs. warp shuffle vs. block reduction).
Floating-point addition is not associative (because of rounding), so:
- (a + b) + c can differ slightly from a + (b + c).
- Different execution orders → slightly different results → tiny changes in logits → occasionally different argmax token.
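A minimal Python sketch of the point above, using plain doubles rather than GPU kernels. The two-token "logits" list is a contrived near-tie chosen for illustration; real logits come out of far longer reductions, but the mechanism is the same.

```python
# Same three numbers, two summation orders.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one possible reduction order
right = a + (b + c)   # another possible reduction order

def argmax(xs):
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Contrived near-tie: token 0 has a fixed logit, token 1's logit is the
# sum above, computed in whichever order the hardware happened to use.
logits_order_1 = [0.6, left]
logits_order_2 = [0.6, right]

print(left == right)                                    # False
print(argmax(logits_order_1), argmax(logits_order_2))   # 1 0
```

The difference between the two sums is a single unit in the last place, which only matters when two logits are nearly tied. That is why the token flips only occasionally.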
> someone corrected me above, it does seem to matter more then I thought
If your LLM agent makes different decisions from the same prompt, then you have to deal with it:
1) your benchmarks become stochastic, so you need multiple samples to get confidence in your A/B testing
2) if your system assumes at-least-once completion, you have to record a single run and replay it, so you don't get multiple rollouts with different actions
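Both points can be sketched in a few lines of Python. Everything here is hypothetical: `mean_with_ci`, `ReplayLog`, `flaky_agent`, the step id, and the benchmark scores are made up for illustration, and a real record-and-replay log would need durable storage rather than an in-memory dict.

```python
import math
import random
import statistics

def mean_with_ci(scores, z=1.96):
    """Mean and a rough 95% interval (normal approximation) over repeated runs."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - z * stderr, mean + z * stderr)

class ReplayLog:
    """Record the first decision for each step; replay it on every retry."""
    def __init__(self):
        self._recorded = {}

    def decide(self, step_id, make_decision):
        if step_id not in self._recorded:
            self._recorded[step_id] = make_decision()  # only runs once per step
        return self._recorded[step_id]

def flaky_agent(rng):
    """Stand-in for an LLM call that may answer differently each time."""
    return rng.choice(["refund", "escalate", "ignore"])

# 1) Several samples per variant instead of a single run (made-up scores).
scores = [0.71, 0.64, 0.69, 0.75, 0.66]
mean, (low, high) = mean_with_ci(scores)

# 2) A retry of the same step gets the recorded action, not a fresh roll.
rng = random.Random(0)
log = ReplayLog()
first = log.decide("ticket-42/step-1", lambda: flaky_agent(rng))
retry = log.decide("ticket-42/step-1", lambda: flaky_agent(rng))
print(first == retry)  # True
```

With only five samples the normal approximation is crude. The point is just that a single run per variant tells you very little.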
Building LLMs. In my mind, those engineers are the ones who have more intimate knowledge of the data and input, and can create the LLMs for their specific tasks. Everyone else is a customer to them.
I can tell you how a house is built; that doesn’t make me a builder, it makes me informed and opinionated.
I can decorate my house however I like, but I’m not a painter/decorator or a tradesman. I can assemble some IKEA furniture, but I’m not a carpenter. I’m a consumer, and I can tweak something to my liking, but I can’t do anything significant.
I try and frame things from an agency perspective.
Agencies are like a production line: they need raw materials coming in, meaning clients with cash, armed with opportunities, scraps of ideas, or formed briefs to be worked on. They need this business so they can generate the output and keep the lights on.
AI is everywhere and everything for a lot of people now. You can be sure that execs are asking their teams how they are using AI, how it is helping the business grow, etc. However, there’s so much AI news, and it’s moving so quickly and seeping into everything, that it’s difficult (from a naïve client point of view) to know what’s fantasy and what’s reality.
So my perception is…
agencies do the sifting and maintain visibility of what is real or not, because they have to start drumming up future sales and business, and AI is hot right now.
Perhaps they have some training in Copilot etc., or some experience of creating a model; maybe they have integrated something small with something big.
It may even be that, being an agency, they have a more open way of working than a corporate does, and that’s the sell.
Anyway, the sales teams will proclaim themselves experts because they have to sell.
If the alternative is a Linux distro, the UX likely won't be much better or more consistent when applications use different UI kits, styles, etc.
Even though Apple is doing a shitty job with their walled garden, a garden is still more organized than a jungle of different distros/applications/frameworks/etc.
Depends on how heavy one wants to go with the quants (for Q6-Q4 the AMD Ryzen AI MAX chips seem like a better/cheaper way to get started).
Also, the Mac Studio is a bit hampered by its low compute power, meaning you can't really use a 100B+ dense model; only MoE models are feasible without multi-minute prompt-processing times (assuming 500+ tokens etc.).
Huh? My maxed out Mac Studio gets 60-100 tokens per second on 120B models, with latency on the order of 2 seconds.
It was expensive, but slow it is not for small queries.
Now, if I want to bump the context window to something huge, it does take 10-20 seconds to respond for agent tasks, but it’s only 2-3x slower than paid cloud models, in my experience.
Still a little annoying, and the models aren’t as good, but the gap isn’t nearly as big as you imply, at least for me.
I think the Mac Studio is a poor fit for gpt-oss-120b.
On my 96 GB DDR5-6000 + RTX 5090 box, I see ~20s prefill latency for a 65k prompt and ~40 tok/s decode, even with most experts on the CPU.
A Mac Studio will decode faster than that, but prefill will be tens of times slower due to much lower raw compute vs a high-end GPU. For long prompts that can make it effectively unusable. That’s what the parent was getting at. You will hit this long before 65k context.
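To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 65k tokens and ~20 s come from the figures above; the 10x slowdown factor and the 8k prompt are purely hypothetical stand-ins for "tens of times slower" and "long before 65k", not measurements.

```python
# Figures quoted above for the RTX 5090 box.
prompt_tokens = 65_000
prefill_seconds = 20
gpu_prefill_tps = prompt_tokens / prefill_seconds  # tokens per second

def prefill_time(tokens, tokens_per_second):
    """Seconds spent processing the prompt before the first output token."""
    return tokens / tokens_per_second

# HYPOTHETICAL: assume prefill is 10x slower on the other machine.
assumed_slowdown = 10
slow_prefill_tps = gpu_prefill_tps / assumed_slowdown

print(gpu_prefill_tps)                          # 3250.0
print(prefill_time(65_000, slow_prefill_tps))   # 200.0
print(prefill_time(8_000, slow_prefill_tps))    # about 24.6
```

Under that assumption, even an 8k-token prompt waits roughly 25 seconds before the first token, no matter how fast decode is afterwards.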
If you have time, could you share numbers for something like:
I think the only exception is specifically for studying network/communication topologies.
I've seen a couple of clusters (ca. 10-50 Pis) in universities for both research and teaching.
Sounds super interesting, where do you put these diagrams?
It's an issue I'm seeing even for comments touching too much on algorithmic stuff. To take a somewhat common example, if you were dealing with a credit card payment flow, where would you put the explanation of how a transaction goes asynchronously through a few states, each of which triggers a webhook callback?
Obviously the people working on the code need to be aware of that, so documentation is needed somewhere. I've seen people put whole blocks in class headers, others sprinkle it all inside the code; personally I ended up moving it outside of the code. Where would you put it?
I personally just throw them at the top of my files as long block comments, or sometimes inside/around very heavy functions. For example, I often add little diagrams when dealing with bit-fiddly logic, to make the bit layouts easier to visualize.
But for architecture, either a whole text file for it or at the top of the module.
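As an illustration of the bit-layout case, here is a sketch in Python. The 16-bit header format, field names, and helper functions are all made up for the example; the point is only how the diagram sits next to the shifts and masks it explains.

```python
# Hypothetical 16-bit packed header:
#
#    15      12 11            4 3        0
#   +----------+---------------+----------+
#   | version  |    length     |  flags   |
#   |  4 bits  |    8 bits     |  4 bits  |
#   +----------+---------------+----------+

VERSION_SHIFT, VERSION_MASK = 12, 0xF
LENGTH_SHIFT, LENGTH_MASK = 4, 0xFF
FLAGS_SHIFT, FLAGS_MASK = 0, 0xF

def pack(version, length, flags):
    """Pack the three fields into one 16-bit word, as in the diagram above."""
    return (
        ((version & VERSION_MASK) << VERSION_SHIFT)
        | ((length & LENGTH_MASK) << LENGTH_SHIFT)
        | ((flags & FLAGS_MASK) << FLAGS_SHIFT)
    )

def unpack(word):
    """Split a 16-bit word back into (version, length, flags)."""
    return (
        (word >> VERSION_SHIFT) & VERSION_MASK,
        (word >> LENGTH_SHIFT) & LENGTH_MASK,
        (word >> FLAGS_SHIFT) & FLAGS_MASK,
    )

print(hex(pack(1, 0xAB, 0x3)))  # 0x1ab3
print(unpack(0x1AB3))           # (1, 171, 3)
```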
Thanks! Do you deal with the logic getting split/shared around the code? For instance, in the credit card example there will probably be one central class (the transaction class?), but you'd need to know the whole logic in the card registration part or the webhooks as well. I guess you don't stick a diagram everywhere?
On one hand, this could provide a lot of value as some things are just plain hard to explain using only words. On the other hand, aren't you worried about when someone else comes along and needs to update one of those comments? If they're not aware of this tool, it's either going to be incredibly tedious or simply not going to happen.
As the other commenters put it, I don't think this is a huge issue.
I usually use this for architecture-level diagrams, and those shouldn't change often, if at all. In case one does change, doing a new diagram is perfectly in scope for whoever's working on that.
Looks like Monodraw is Mac-only, BTW. That should be fine if Macs are mandatory for all the devs on a project, but it would otherwise create a kinda weird situation.
Funnily, they're far from being optimal for GEMM ops (especially in terms of power consumption).
For GEMM you need to visit each row/vec n times, so there's a bunch of data reuse going on, which isn't optimal for GPUs since you can't keep all of that close to your processing units. And while the tensor cores kinda implement this, I think they don't quite scale up to a full-sized systolic array, which is what you would want for larger matrix multiplications.
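The data reuse is easy to see in a naive triple loop. A small Python sketch (the function name and the read counter are just for illustration, and nothing here models GPU or systolic-array hardware):

```python
def gemm_with_read_counts(A, B):
    """Naive C = A @ B that also counts how often each element of A is read."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    reads_of_A = [[0] * k for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
                reads_of_A[i][p] += 1  # the same A[i][p] is needed for every column j
            C[i][j] = acc
    return C, reads_of_A

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, reads_of_A = gemm_with_read_counts(A, B)
print(C)           # [[19.0, 22.0], [43.0, 50.0]]
print(reads_of_A)  # [[2, 2], [2, 2]]
```

Every element of A is read once per column of B (and, symmetrically, every element of B once per row of A), so for large matrices the question becomes how close to the arithmetic units those operands can stay between uses.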
Also, just a simpler view: with GPUs, most of their silicon is NOT spent on tensor cores, so just from that you know it's not optimal, I guess.
Just referring to that FLOP/s number doesn't really mean much nowadays with tensor-cores and sparsity.
In my eyes the big win of GPUs is that not only are they pretty good at GEMMs, but they're also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^
And also just to nitpick/joke:
> More accurately, it is neural networks which are more "stochastic" with their predictions and decisions <...>
I would argue that NNs aren't even necessarily stochastic. I had to handwrite weights for NNs in at least two exams, to fit XOR for example ;)
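For the curious, one classic hand-written solution looks like this in Python. This is a sketch of a 2-2-1 network with step activations; the specific weights are one of many valid choices, not necessarily what any exam expected.

```python
def step(x):
    """Heaviside step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    # Hidden layer, weights picked by hand:
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)    # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only if both inputs are 1
    # Output: "OR but not AND", which is XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
# 0 0 0
# 0 1 1
# 1 0 1
# 1 1 0
```

No training, no sampling, no randomness anywhere: the same inputs give the same output every time.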