Hacker News

They are priced as if they are the only ones who are capable of creating chips that can crunch LLM algos. But AMD, Google, Intel, and even Apple are also capable.

Apple is in talks with Google to bring Gemini to the iPhone, and it will obviously also be on Android phones. So almost every phone on earth is poised to be using Gemini in the near future, and Gemini runs entirely on Google's own custom hardware (which is at parity with or better than nVidia's offerings anyway).



This seems as good a place as any to be Corrected by the Internet, so... correct me if I'm wrong.

Making a graphics chip that is as good as Nvidia's: Very difficult. Huge moat, huge effort, lots of barriers, lots of APIs, and decades of accumulated experience to overcome.

Making something that can run a NN: Much, much easier. I'd guess, start-up-level feasible. The math is much simpler. There's a lot of it, but my biggest concern would be less about pulling it off and more about whether my custom hardware is still the correct custom hardware by the time it is released. You'd think you could even eke out a bit of a performance advantage by not having all the other graphics stuff around. LLMs in their current state are characterized by vast swathes of input data and unbelievably repetitive number crunching, not complicated silicon architectures and decades-refined algorithms. (I mean, the algorithms are decades-refined, but they're still simple as programs go.)
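To make the "simple as programs go" claim concrete, here's a rough sketch of a single attention head in plain NumPy (sizes and weights are made-up stand-ins, not any real model). Note that almost every line is a matrix multiply; that's the repetitive number crunching an AI chip has to accelerate:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
seq_len, d_model = 128, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))

# Learned projection matrices (random stand-ins here).
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))
w_v = rng.standard_normal((d_model, d_model))

def attention(x):
    # Nearly all the work is plain matmul.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d_model)            # (seq_len, seq_len)
    # Softmax over each row (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (seq_len, d_model)

out = attention(x)
print(out.shape)
```

A real model stacks dozens of layers of this plus feed-forward blocks, but the instruction mix barely changes: matmul, a cheap elementwise nonlinearity, repeat.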

I understand nVidia's graphics moat. I do not understand the moat implied by their stock valuation, that as you say, they are the only people who will ever be able to build AI hardware. That doesn't seem remotely true.

So... correct me, Internet. Explain why nVidia has persistent advantages in the specific field of neural nets that cannot be overcome. I'm seriously listening, because I'm curious; this is a deliberate Cunningham's Law invocation, not me speaking from authority.


I agree with you, but let me devil's advocate.

After 10 years of pretending to care about compute, AMD has filled the industry with burned-once experts who, when weighing nvidia against competitors, instinctively include "likely boondoggle" against every competitor's quote, because they've seen it happen, possibly several times. Combine this with nvidia's deep experience and huge rich-get-richer R&D budget keeping them always one or two architecture and software steps ahead, like it did in graphics, and their rich-get-richer TSMC budget buying them a step ahead in hardware, and you have a scenario where it continues to make sense to pay the green tax for the next generation or three. Red/blue/other rebels get zinged and join team "just pay the green tax." NV continues to dominate. Competitors go green with envy, as was foretold.


> burned-once experts

More like burned 2x/3x/4x by the "this time it's different" people.

Looking at you, Intel.


It's true that nobody has beaten nVidia yet, and that is a valid data point I don't deny.

But (as a reply to some other repliers as well), AMD was also chasing them on the entire graphics stack as well as compute. That is trying to cross the moat. Even reimplementing CUDA as a whole is trying to cross a moat, even a smaller one.

But just implementing a chip that does AI, as it stands today, full stop, seems like it would be a lot easier. There are a lot of people doing it, and I can't imagine they're all going to fail. The scenario I'd consider far more likely is that the AI research community finds something other than neural nets, so the latest hotness stops being a neural net and the chips become much less relevant, or irrelevant.

And with nVidia's valuation based not on their graphics, or CUDA, but specifically on this one feeding frenzy of LLM-based AI, it seems to me there are a lot of people with the motivation to produce a chip that can do this.


> So... correct me, Internet. Explain why nVidia has persistent advantages in the specific field of neural nets that cannot be overcome. I'm seriously listening, because I'm curious; this is a deliberate Cunningham's Law invocation, not me speaking from authority.

To become a person who writes driver infrastructure for this sort of thing, you need to be a smart person who commits, probably, several of their most productive years to becoming an expert in a particular niche skillset. This only makes sense if you get a job somewhere that has a proven commitment of taking driver work seriously and rewarding it over multiple years.

NVidia is the only company in history that has ever written non-awful drivers, and so it's not so implausible to believe that it might be the only company that can ever hire the people who write non-awful drivers, and will continue to be the only company that can write them.


CUDA is/was their biggest advantage, to be honest, not the HW. Thanks to CUDA, they caught the demand for super-high-end GPUs driven by the Bitcoin mining craze, and it transitioned gracefully to AI/ML workloads. Google was even further ahead in seeing the need, developing TPUs, for example.

I don't think they have a crazy advantage HW-wise. A couple of start-ups have been able to achieve this. If the SW infrastructure end gets standardized, we will have a more level playing field.


CUDA is a big reason for their moat. And that's not something you can build in a couple of years, no matter how much money you throw at it.

Without CUDA, you have a chip that runs on-premise without anyone having a clue how good it is, which is supposedly Google's situation. Your only offering is cloud services. As big as that is, corporations would want to build their own datacenters.


Sure, CUDA has a lot of highly optimized utilities baked in (cuDNN and the like) and, maybe more importantly, implementors have a lot of experience with it, but afaict everyone is working on their own HAL/compiler and not using CUDA directly to implement the actual models. It's part of the HAL/framework. You can probably port any of these frameworks to a new hardware platform with a few man-years' worth of work, imo, if you can spare the manpower.
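To illustrate why the port is bounded work, here's a toy sketch (all names made up, not any real framework's API) of what a framework's hardware abstraction layer looks like: model code calls a small set of abstract ops, and targeting a new chip means implementing just that interface with the vendor's kernels:

```python
# Toy HAL sketch: the interface is small, so a vendor port is a
# bounded job of implementing each op with its own kernels.

class Backend:
    """Abstract ops the framework is written against."""
    def matmul(self, a, b): raise NotImplementedError
    def relu(self, x): raise NotImplementedError

class CPUBackend(Backend):
    # Pure-Python reference implementation; a real port would call
    # the new chip's kernel library instead.
    def matmul(self, a, b):
        return [[sum(x * y for x, y in zip(row, col))
                 for col in zip(*b)] for row in a]
    def relu(self, x):
        return [[max(0.0, v) for v in row] for row in x]

def dense_layer(backend, x, w):
    # Model code is written once, against the abstract ops only.
    return backend.relu(backend.matmul(x, w))

out = dense_layer(CPUBackend(), [[1.0, -2.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)  # [[1.0, 0.0]]
```

Real frameworks have a few hundred ops rather than two, plus memory management and graph compilation, which is where the "few man-years" lands, but the shape of the job is the same.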

I think nobody had the time to port any of these architectures away from CUDA because:

* the leaders want to maintain their lead and everyone needs to catch up asap, so no time to waste,
* progress was _super_ fast, so doubly no time to waste,
* there was/is plenty of money that buys some perceived value in maintaining the lead or catching up.

But imo:

1. progress has slowed a bit, so maybe there's time to explore alternatives,
2. nvidia GPUs are pretty hard to come by, so switching vendors may actually be a competitive advantage (if performance/price pans out and you can actually buy the hardware now as opposed to later).

In terms of ML "compilers"/frameworks, afaik there's:

* Google JAX/Tensorflow XLA/MLIR,
* OpenAI Triton,
* Meta Glow,
* Apple PyTorch+Metal fork.


> CUDA is a big reason for their moat.

Zen 1 showed that absolute performance is not the end-all metric (Zen lost on single-core performance vs Intel). A lot of people care about the bang-for-buck metric. If AMD can squeak out good-enough drivers for cards with good-enough performance at a TCO[1] significantly lower than NVidia's, they break NVidia's current positive-feedback cycle.

1. Initial cost and cooling - I imagine for AI data center usage, opex exceeds capex.
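The bang-for-buck argument is just arithmetic, so here's a back-of-the-envelope sketch. Every number below is made up purely for illustration (prices, wattages, electricity rate, cooling overhead, relative throughput are not real figures):

```python
# Hypothetical TCO comparison: card price plus lifetime energy cost,
# with a cooling multiplier standing in for data-center overhead.

def tco(card_price, watts, years=4, dollars_per_kwh=0.10, cooling_overhead=1.4):
    # Lifetime energy in kWh, inflated by the cooling overhead.
    kwh = watts * cooling_overhead * 24 * 365 * years / 1000
    return card_price + kwh * dollars_per_kwh

green = tco(card_price=30000, watts=700)   # hypothetical incumbent flagship
red = tco(card_price=15000, watts=750)     # hypothetical cheaper rival

# If the rival only delivers, say, 80% of the throughput, compare
# cost per unit of relative performance instead of raw TCO.
print(green / 1.0, red / 0.8)
```

Under these invented numbers, the rival wins on cost-per-performance despite the throughput gap; the point is only that "good enough and cheaper" can beat "fastest" once you do the division.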


Anecdata... one of the folks sitting in front of me at a session at GTC claimed to be an AMD employee who also claimed to have previously worked on CUDA. He seemed skeptical that AMD would pull this off. This is the sort of fun stuff you hear at a conference and aren't sure how much of it is just technical bragging/one-upmanship.


It doesn't. If NVIDIA doesn't work with SK Hynix to integrate PIM GDDR into their products, they are going to die, because processing-in-memory is already a thing, and it is faster and more scalable than GPU-based inference.


AMD is even more hilariously overvalued, currently at a 360 P/E.


Good luck with that. Gemini Advanced is simply unusable right now... It's so bad it's hard to believe nobody has picked up on that yet.


Go to Gemini Advanced and try a common programming task in parallel with Claude and ChatGPT-4. Within two prompts, Claude and ChatGPT-4 will give you nice working code you can use as a basis, while Gemini Advanced will ignore your prompts, provide partial code, and quickly tell you it can do more, until you tell it exactly what you want. It will go from looking usable to stuck in "I can do A or I can do B, you tell me what you prefer" hell in two or three prompts... Unusable. And I say that as a paying customer who will soon cancel the service.


You're not wrong, but it wouldn't be surprising if Google irons things out with a few more updates. The point is that it would be foolish to write off Gemini right now, and Gemini is totally independent of Nvidia's dominance.



