Hacker News | randomgermanguy's comments

Okay thanks for saving my sanity somewhat.

And also just to nitpick/joke:

> More accurately, it is neural networks which are more "stochastic" with their predictions and decisions <...>

I would argue NNs aren't even necessarily stochastic. I had to handwrite weights for NNs in at least two exams, to fit XOR for example ;)
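
For illustration, a minimal sketch of what such hand-written weights could look like. The specific weights, thresholds, and the `step`/`xor_net` names are assumptions picked for this example (one hidden unit acts as OR, the other as AND), not the ones from any particular exam:

```python
def step(x):
    """Heaviside step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def xor_net(a, b):
    # Hidden layer: one unit fires for OR, the other for AND.
    h_or = step(1.0 * a + 1.0 * b - 0.5)
    h_and = step(1.0 * a + 1.0 * b - 1.5)
    # Output unit: fires when OR is on and AND is off.
    return step(1.0 * h_or - 2.0 * h_and - 0.5)


# Evaluate all four input combinations.
truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

Nothing here is random: the same inputs always give the same output.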


That may be the exception that proves the rule here, though. Outside of the tiniest toy examples, is this ever true?


> machine learning is the sub field of AI.

That's what I tried to explain back then as well, and I brought up stuff like path-finding algorithms for route-finding (A*/heuristic search) as a more old-school part of AI, which didn't really land, I think.
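
As a rough illustration of that old-school, non-ML kind of AI, here is a small A* sketch on a grid. The grid, the Manhattan-distance heuristic, and the `a_star` name are assumptions for this example only:

```python
import heapq


def a_star(grid, start, goal):
    """Length of the shortest 4-connected path on a 0 (free) / 1 (wall) grid, or None."""

    def h(cell):
        # Manhattan distance: never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best = {start: 0}                    # cheapest known cost per cell
    frontier = [(h(start), 0, start)]    # entries are (cost + heuristic, cost, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best.get(cell, float("inf")):
            continue                     # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(frontier, (new_cost + h((nr, nc)), new_cost, (nr, nc)))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path_len = a_star(grid, (0, 0), (2, 0))
```

No learning or training data is involved, just search guided by a heuristic.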

> Not really stochastic as far as I know. The whole random seed and temperature thing is a bit of a grey area for my full understanding. Let alone the topk, top p, etc. I often just accept what's recommended from the model folks.

I mean, LLMs are often treated as stochastic, but ML models in general usually aren't? Like, maybe you have some dropout, but that's usually disabled during inference AFAIK. I don't think a ResNet or YOLO is very stochastic, but maybe someone can correct me.
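
A small sketch of where the randomness in LLM decoding typically enters, assuming plain softmax sampling with a temperature (the logits and function names are made up for illustration): the forward pass produces fixed logits, greedy argmax over them is deterministic, and only the sampling step depends on the seed.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Deterministic choice: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature, rng):
    """Stochastic choice: draw an index according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
greedy_picks = {greedy(logits) for _ in range(100)}
same_seed_picks = [sample(logits, 1.0, random.Random(0)) for _ in range(5)]
```

With a fixed seed the sampled token is reproducible too; it only varies once the seed (or the logits) change.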

> AI for the most part has been out a couple years.

With this you just mean LLMs, right? Because I understand AI to be way more than just LLMs & ML.


yeah, stochastic is there because we give up control of order of operations for speed

so the order in which floating-point additions happen is not fixed because of how threads are scheduled, how reductions are structured (tree reduction vs warp shuffle vs block reduction)

Floating-point addition is not associative (because of rounding), so:
- (a + b) + c can differ slightly from a + (b + c).
- Different execution orders → slightly different results → tiny changes in logits → occasionally different argmax token.
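
The non-associativity is easy to see in plain Python (IEEE 754 doubles). This only demonstrates the rounding effect itself, not any particular GPU kernel:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

# A more extreme case: the small term is lost entirely in one grouping.
big, small = 1e17, 1.0
lost = (big + small) - big   # small is rounded away before the subtraction
kept = (big - big) + small   # small survives
```

Here `left != right`, and `lost` is 0.0 while `kept` is 1.0, even though each pair is mathematically the same sum.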


Actually, that's a misconception. It's because of varying batch sizes that requests get scheduled on: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...


Oh actually yeah that's true. You have correctly out-nitpicked my nitpick lol.

But at that point i feel like we are getting close to "everything that isn't a perfect Turing-machine is somewhat-stochastic" ;)

Edit: someone corrected me above, it does seem to matter more than I thought


> someone corrected me above, it does seem to matter more than I thought

If your LLM agent makes different decisions from the same prompt, then you have to deal with it:

1) Your benchmarks become stochastic, so you need multiple samples to get confidence in your A/B testing.

2) If your system assumes at-least-once completion, you have to implement a single record-and-replay so you don't get multiple rollouts with different actions.
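
A minimal sketch of the record-and-replay idea from point 2, assuming retries carry a stable request id. `ReplayLog` and `flaky_agent` are hypothetical names, and the in-memory dict stands in for whatever durable store a real system would need:

```python
import random


class ReplayLog:
    """Records the first decision per request id and replays it on retries."""

    def __init__(self, decide):
        self._decide = decide
        self._log = {}  # request_id -> recorded decision (in-memory only)

    def run(self, request_id, prompt):
        # Only call the nondeterministic agent the first time we see this id.
        if request_id not in self._log:
            self._log[request_id] = self._decide(prompt)
        return self._log[request_id]


def flaky_agent(prompt):
    # Stand-in for an LLM call whose output can differ between runs.
    return random.choice(["refund", "escalate", "ignore"])


log = ReplayLog(flaky_agent)
first = log.run("req-1", "customer asks for money back")
retries = [log.run("req-1", "customer asks for money back") for _ in range(20)]
```

Every retry of `req-1` gets the recorded action back, so an at-least-once delivery layer cannot trigger a second, different rollout.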


What do you mean by "building this stuff"? As in building LLMs, or building applications on top of them?


Building LLMs. In my mind those engineers are the ones that have more intimate knowledge of the data and input, and can create the LLMs for their specific tasks. Everyone else is a customer to them.

I can tell you how a house is built; that doesn’t make me a builder, it makes me informed and opinionated. I can decorate my house however I like, but I’m not a painter/decorator or a tradesman. I can assemble some IKEA furniture, but I’m not a carpenter. I’m a consumer and I can tweak something to my liking, but I can’t do anything significant.


But why do you think this is? Like, is it just the money/status that comes with calling yourself an "AI" expert?


I try and frame things from an agency perspective.

Agencies are like a production line, they need raw materials coming in; clients with cash, armed with opportunities, scraps of ideas or formed briefs to be worked on. They need this business so they can generate the output and keep the lights on.

AI is everywhere and everything for a lot of people now. You can be sure that execs are asking their teams how they are using AI, how it is helping the business grow, etc. However, there’s so much AI news, and it’s moving so quickly and seeping into everything, that it's difficult (from a naïve client point of view) to know what’s fantasy and what’s reality.

So my perception is… agencies do the sifting and maintain visibility of what is real or not because they have to start drumming up future sales and business, and AI is hot right now.

Perhaps they have some training in Copilot etc., or some experience of creating a model; maybe they have integrated something small with something big. It may even be that, being an agency, they have a more open way of working than a corporate does, and that’s the sell.

Anyway, the sales teams will proclaim themselves experts because they have to sell.


If the alternative is a Linux distro, the UX likely won't be much better/more consistent when applications use different UI kits/styles etc.

Even though Apple is doing a shitty job with their walled garden, a garden is still more organized than a jungle of different distros/applications/frameworks/etc.

(at least in my limited experience)


Made me instantly think of this: https://www.nationalgeographic.com/science/article/slime-mou...

Something ironically self-referential about not only fungi growing (like) rail networks, but also IN rail networks.


Depends on how heavy one wants to go with the quants (for Q6-Q4 the AMD Ryzen AI MAX chips seem better/cheaper way to get started).

Also, the Mac Studio is a bit hampered by its low compute power, meaning you really can't use a 100B+ dense model, only MoE, without getting multi-minute prompt-processing times (assuming 500+ tokens etc.)


Given the RAM limitations of the first gen Ryzen AI MAX, you have no choice but to go heavy on the quantization of the larger LLMs on that hardware.


Huh? My maxed out Mac Studio gets 60-100 tokens per second on 120B models, with latency on the order of 2 seconds.

It was expensive, but slow it is not for small queries.

Now, if I want to bump the context window to something huge, it does take 10-20 seconds to respond for agent tasks, but it’s only 2-3x slower than paid cloud models, in my experience.

Still a little annoying, and the models aren’t as good, but the gap isn’t nearly as big as you imply, at least for me.


GPT OSS 120B only has 5B active parameters. GP specifically said dense models, not MoE.


I think the Mac Studio is a poor fit for gpt-oss-120b.

On my 96 GB DDR5-6000 + RTX 5090 box, I see ~20s prefill latency for a 65k prompt and ~40 tok/s decode, even with most experts on the CPU.

A Mac Studio will decode faster than that, but prefill will be 10s of times slower due to much lower raw compute vs a high-end GPU. For long prompts that can make it effectively unusable. That’s what the parent was getting at. You will hit this long before 65k context.

If you have time, could you share numbers for something like:

llama-bench -m <path-to-gpt-oss-120b.gguf> -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096

Edit: The only Mac Studio pp65536 datapoint I’ve found is this Reddit thread:

https://old.reddit.com/r/LocalLLaMA/comments/1jq13ik/mac_stu ...

They report ~43.2 minutes prefill latency for a 65k prompt on a 2-bit DeepSeek quant. Gpt-oss-120b should be faster than that, but still very slow.


This is a Mac Studio M1 Ultra with 128 GB of RAM.

  > llama-bench -m ./gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096       
                                                                                             
  | model                          |       size |     params | backend    | threads | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |         pp65536 |       392.37 ± 43.91 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |           tg128 |         65.47 ± 0.08 |
  
  build: a0e13dcb (6470)


Thanks. That’s better than I expected. It's only 8.3x worse than a 5090 + CPU: 167s latency.
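
For reference, the arithmetic behind those numbers, using the pp65536 throughput from the table above and the "~20s" prefill quoted earlier for the 5090 box (that 20 s figure is approximate, so the ratio is too):

```python
prompt_tokens = 65536
mac_pp_tokens_per_s = 392.37   # pp65536 result from the llama-bench table
gpu_prefill_s = 20.0           # "~20s" reported for the RTX 5090 + CPU box

# Prefill latency is prompt length divided by prompt-processing throughput.
mac_prefill_s = prompt_tokens / mac_pp_tokens_per_s   # about 167 s
slowdown = mac_prefill_s / gpu_prefill_s              # roughly 8.3-8.4x
```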


I think the only exception is specifically for studying network/communication topologies. I've seen a couple of clusters (ca. 10-50 Pis) in universities for both research and teaching.


There are so many network emulators you can use, such as Mininet or GNS3.


I'm sure pedagogically speaking it's better to use physical devices


Bought this a couple of months ago, and am now always looking for more ways to include this for inline documentation.

The fact I can export to clipboard, re-import it, and reconstruct all the shapes etc. almost flawlessly is such a big win.


Absolutely love monodraw for diagrams in documentation! All of the diagrams for Oban and Oban Pro are done this way:

Job Lifecycle: https://hexdocs.pm/oban/job_lifecycle.html

Composition: https://oban.pro/docs/pro/1.6.4/composition.html


Sidenote: thanks so much for taking the time to write the Oban docs. I'm a big user (and fan) of Oban, and the docs are fantastic.


Sounds super interesting, where do you put these diagrams?

It's an issue I'm seeing even for comments touching too much on algorithmic stuff. To take a somewhat common example, if you were dealing with a credit card payment flow, where would the explanation go of how a transaction passes through a few states asynchronously, each of which triggers a webhook callback?

Obviously the people working on the code need to be aware of that, so documentation is needed somewhere. I've seen people put whole blocks in class headers, others sprinkle it all through the code; personally I ended up moving it outside of the code. Where would you put it?


I personally just throw them at the top of my files as long block comments, or sometimes inside/around very heavy functions. For example, I often add little diagrams when dealing with some bit-fiddly logic, to more easily visualize the bit layouts. But for architecture, either a whole text file for it or at the top of the module.
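
As a sketch of what such an inline bit-layout diagram might look like (the 16-bit header layout and function names are invented for this example, and the box is hand-typed rather than exported from Monodraw):

```python
# Packed 16-bit header (hypothetical layout, for illustration only):
#
#    15      12 11           4 3       0
#   +----------+--------------+---------+
#   | version  |    length    |  flags  |
#   +----------+--------------+---------+
#      4 bits       8 bits       4 bits


def pack_header(version, length, flags):
    """Pack the three fields into one 16-bit word, per the diagram above."""
    return ((version & 0xF) << 12) | ((length & 0xFF) << 4) | (flags & 0xF)


def unpack_header(word):
    """Inverse of pack_header: returns (version, length, flags)."""
    return (word >> 12) & 0xF, (word >> 4) & 0xFF, word & 0xF


word = pack_header(1, 0xAB, 0x3)
fields = unpack_header(word)
```

The shifts and masks in the code can be checked directly against the column boundaries in the comment.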


Thanks! Do you deal with the logic getting split/shared around the code? For instance, in the credit card example there will probably be one central class (the transaction class?), but you'd need to know the whole logic in the card registration part or the webhooks as well. I guess you don't stick a diagram everywhere?


On one hand, this could provide a lot of value as some things are just plain hard to explain using only words. On the other hand, aren't you worried about when someone else comes along and needs to update one of those comments? If they're not aware of this tool, it's either going to be incredibly tedious or simply not going to happen.


As the other commenters put it, I don't think this is a huge issue. I usually use this for architecture-level diagrams, and those shouldn't change often, if at all. In case one does change, doing a new diagram is perfectly in scope for whoever's working on that.


Add a one line comment stating that it was edited by monodraw.


Looks like Monodraw is Mac-only, BTW. That should be fine if Macs are mandatory for all the devs on a project, but it would otherwise create a kinda weird situation.


Since they're text files, you can also say "Please copy to an ASCII diagram editor and update there (e.g. Monodraw, asciiflow, etc.)".


> am now always looking for more ways to include this for inline-documentation.

Same lol. Here is a blog post of mine where I used them - https://avi.im/blag/2024/disaggregated-storage

I had to convert them to images because I couldn't get them working with Hugo, the static site generator.


Funnily, they're far from being optimal for GEMM ops (especially in terms of power consumption).

For GEMM you need to visit each row/vec n times, so there's a bunch of data reuse going on, which isn't optimal for GPUs since you can't keep all of that close to your processing units. And while the tensor cores kinda implement this, I think they don't quite scale up to a full-sized systolic array, which is what you would want for larger matrix multiplications.
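
To make the data-reuse point concrete, here is a naive matrix multiply in plain Python that counts how often each element of A is read. It is only a counting sketch (the function name and tiny matrices are assumptions), not a model of how a GPU or systolic array actually schedules the work:

```python
def naive_matmul_with_read_counts(A, B):
    """Compute C = A @ B with the textbook triple loop and count reads of A."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    reads_A = [[0] * k for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
                reads_A[i][p] += 1   # row i of A is re-read for every column j
            C[i][j] = acc
    return C, reads_A


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, reads_A = naive_matmul_with_read_counts(A, B)
```

Every element of A is read once per column of B, so hardware that can keep those operands next to the multipliers avoids refetching them.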

Also just a simpler view: with GPUs most of their silicon is NOT spent on tensor cores, so just from that you know it's not optimal, I guess.

Just referring to that FLOP/s number doesn't really mean much nowadays with tensor-cores and sparsity.

In my eyes the big win of GPUs is that not only are they pretty good at GEMMs, but they're also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^

