Maintaining large-scale AI capacity at Meta (fb.com)



Meta is going hard into AI (both hardware and software), which is great to see. Something that's not super obvious is what specific features of existing apps require AI, that is, how will Meta get a return on investment?

Two uses I can think of are i) text and image content moderation on fb and instagram (won't need as many human reviewers if bots are as/more effective), and ii) chatbots for businesses (businesses could provide their business documentation to a meta LLM which could handle customer inquiries via messenger and whatsapp).

Anything else?


Meta actually has a whole separate, existing AI research effort and use case for ad targeting, which has seen much better results as their AI capabilities have improved. I don't think gen AI is used for this in the way most commenters think, but the improvements in AI architecture/infra, training, etc. all help the models that do ad targeting, while simultaneously building more powerful gen AI.


Fake profiles to boost engagement/DAU? Grandma is lonely on FB now that no one is on there anymore.


This isn't it. Actually FB, the app specifically, is seeing an unexpected uptick in users in the younger demographics, partly from their strategy of using FB Marketplace and other ancillary services to bring people back to the main app.

EDIT: Also, it's very easy to create fake profiles on FB, and other people do it all the time. Meta doesn't need to do it themselves.


Meta is building and scaling the infrastructure for AI, and then they can resell that at a premium later. All these articles do is highlight the challenges of rolling out your own infra; if Meta solves these issues, it becomes a reference and gets a bigger piece of the AI pie.

They want to become the AI backend for the Fortune 500


Their whole advertising business model gets better with LLM understanding of text. They can target ads better.


This is a fair guess on intuition, but having worked in the recommender space on both content and ad recommendation, content understanding signals have pretty consistently tended to underwhelm, across two companies and many projects. The key signals are generally engagement signals (including event sequences) and many embeddings (user embedding, creator embedding, ad embedding, etc.).

The main place I’ve seen content understanding help is cold start, especially for new items by new creators.


And you sometimes wonder whether the products being advertised don't matter more than the targets reading the ads. We've learned to recognize a crappy offering; AI can try to make me read more and more ads relevant to what I'm saying, but if it's a crap product, I won't pay anyway :(


They could make customer facing support bots for every business with a Facebook page


Generative environments on demand for their VR goggles seems like something that could drive net new revenue some day, both hardware and subscriptions. If they can find a PR safe way to provide adult content that would be even better.

Or, virtual conversational humans for their boomer userbase could be a hit.


Even if this scale is massive and second-to-none, it’s funny how some issues are the same for all of us. In particular, “bad hosts are very bad” aka “a chain is only as strong as its weakest link” can happen with as few as a handful (4) of machines, and then ruin your day.


This is really unique to the AI training clusters for reasons I'm not super clear on. Most other types of horizontally scaled workloads can sort of tolerate a slightly underperforming host, or hosts going bad every so often, with little P99/P99.9 impact. For some reason, AI training workloads really cannot.


When training a large model you're doing a single forward pass + backprop over multiple Infiniband-connected nodes for a single model instance, so if one node goes down it takes a logical unit of nodes down with it. For reference, GPT-4 was rumored to be around 1.7T, and doing some back-of-the-hand math[1], that's like 500-700 H100 GPUs per model instance, which means you need a multiple of that for any training parallelism whatsoever.

[1] back-of-the-hand math: 1.7T * 4 bytes = 6.8 TB; 3-4x that for activations + gradients = 27.2 TB (at 4x); 27.2 TB / (80 GB per H100) = 340 H100s; 1.5-2x conservative multiplier accounting for not fully using node resources + memory overhead in the machine = ~500-700 H100s.
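Same math as a quick script, for anyone who wants to poke at the assumptions (the 4x activations/gradients multiplier and the 1.5-2x overhead factor are rough guesses, not measurements):

    # Back-of-the-hand GPU count for a hypothetical 1.7T-param model.
    params = 1.7e12
    weights_tb = params * 4 / 1e12   # fp32 weights: 6.8 TB
    total_tb = weights_tb * 4        # + activations/gradients (assume 4x)
    gpus = total_tb / 0.08           # 80 GB of HBM per H100 -> ~340
    print(round(gpus), round(gpus * 1.5), round(gpus * 2))  # 340 510 680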

truly insane numbers.


That trillion+ parameter count is the sum of each of the "experts", right?


The ever-circulating rumour is 1.7T - 1.8T for the whole thing. But it is not very substantiated: it was mostly started by SemiAnalysis and geohot based on rather loose speculation (such as API latency and price), and not much solid evidence has confirmed it since.

And of course, it must have changed substantially with GPT-4-Turbo and GPT-4o. It would make sense if the cost reduction were larger than the price reduction; they probably have a higher profit margin now, and the price cuts since the GPT-4 release have been very significant.


This is because everyone is training with synchronous SGD. All GPUs need to synchronize on each gradient step, so tail latency will kill you.
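A toy simulation of the effect (all the numbers are made up): under synchronous SGD the step time is the max over all workers, so the chance of hitting at least one straggler per step grows with cluster size.

    import random

    # Synchronous SGD: every step waits for the slowest worker (a barrier).
    def step_time(n_workers, slow_prob=0.001, normal=1.0, slow=10.0):
        times = [slow if random.random() < slow_prob else normal
                 for _ in range(n_workers)]
        return max(times)

    for n in (8, 512, 16384):
        avg = sum(step_time(n) for _ in range(1000)) / 1000
        print(n, round(avg, 2))  # average step time climbs with cluster size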


I’ve worked at companies with async training. Async training does help with fault tolerance and can also assist with training throughput by being less reliant on the slowest machine. But it adds meaningful training noise: when we ran experiments against sync training, we got much more stable results with sync training, and some of our less stable models would even sometimes have loss explosions/divergence issues with async training but be fine with sync training.

Although even for async training, I generally see the dataset just sharded statically; if a worker goes down, its shard of data may be lost/skipped, rather than having some smarter dynamic file assignment that accounts for workers going down. Even basic things, like having a failed job continue from the last checkpoint with the same dataset state partway through a large epoch, are messy when major libraries like TensorFlow lack a good dataset checkpointing mechanism.
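A minimal sketch of what static sharding loses when a worker dies (file names and counts are made up):

    # Static sharding: worker i reads files[i::n_workers]. If a worker dies
    # mid-epoch, nothing reassigns its shard; that data is silently skipped.
    files = [f"part-{i:05d}" for i in range(100)]
    n_workers = 4
    alive = {0, 1, 3}  # worker 2 died

    seen = {f for i in alive for f in files[i::n_workers]}
    lost = [f for f in files if f not in seen]
    print(len(lost), "files never trained on this epoch")  # 25

    # A smarter setup keeps a central queue of unread files (checkpointed),
    # so surviving workers can pick up the dead worker's data.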


I’m old enough to remember when companies were eager to claim that their data centers (or some aspect) were finally “carbon neutral”.

Now, with the enormous data center growth for AI purposes, companies don’t even bother pretending that any of this is sustainable.

At best, they might delude themselves into believing that a glorified text autocomplete program will magically solve the world’s problems, including the unsustainability of the machines running the program.


we could exist without any of the modern conveniences. let's tear down the electric grid and return to monke.


We're way past that. Global warming requires (or will, soon enough) heat pumps for survival in many regions of the world. Plenty of regions require large amounts of electricity for life-critical functions. Degrowth isn't the answer.


Or alternatively, people will need to move to colder regions.


Yes, they will, which will drive conflict, xenophobia and economic destabilization in the countries those people move to, which will exacerbate global political tensions and probably wind up with us all getting wiped out in a nuclear conflagration sometime before the century is over, so we might as well have really nice autocomplete before we get there.


Driving up concrete production, a significant carbon contributor as it stands (low carbon concrete is still in its infancy).


Or there are just fewer people (as we see in developed countries' birthrates).


Could you really survive without trucks delivering food to stores in your city? I couldn't.


That’s right (even though you’re probably being facetious).


It's just that investors are now completely over the whole "DEI and ESG" alphabet investing type of phase after seeing that it has not helped companies produce returns at all.


ESGs are still going strong, the whole point is potentially accepting lower returns in exchange for voting with your wallet on what companies you support. Investors, the people who collect money for a living, have never been the target for ESGs.


So I was pondering: Nvidia quarterly datacenter revenue is around $18.4 billion, meaning that the raw cost input to the AI industry is somewhere around $14-24 billion per year post-depreciation. This is against known revenues of ~$3.8 billion at OpenAI and ~$800 million at Anthropic. Based on reported revenue at Cohere of $20M, I think it's a fair assumption that the only other material revenue in the industry is on the applications side, either at megacaps or at smaller startups targeting various back-office tasks.
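Spelling out the arithmetic behind that $14-24B range (my reading: annualize the quarterly figure, then depreciate over an assumed 3-5 year schedule):

    # Annualized hardware cost implied by Nvidia's datacenter revenue,
    # depreciated over an assumed 3-5 year schedule.
    annual_spend = 18.4e9 * 4  # ~$73.6B/yr of hardware bought
    for years in (3, 5):
        print(years, round(annual_spend / years / 1e9, 1))  # 24.5, 14.7 ($B/yr)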

One could make a bearish claim on NVidia, that their revenue/valuation is unsustainable unless the AI industry grows 100x over the next few years.


Yes. Also, how much can Nvidia revenue go up from here? There are only a small handful of companies that want more than 100k H100s - if there are ten, that’s $30b of GPUs, which Nvidia would sell in less than 6 months, and then who keeps buying?

Or are we expecting a few companies to be buying 1 million plus H100s each next year


What's the failure rate of GPUs over a 10 year period?


You are only counting some sources. I'm sure other cloud vendors like AWS are part of the global datacenter spend, as well as many smaller private or government entities. OpenAI will continue to spend more, taking a bigger percentage if supply remains flat.


One huge oversight in this analysis is the assumption that LLMs will not get better. Imagine there is a 10% chance that LLM capabilities increase so significantly in the next 5 years that a significant portion (say 1%) of the economy will be automated by them. That's enough to make Nvidia a multi-trillion-dollar company.


>That's enough to make Nvidia a multi-trillion-dollar company.

There are tons of chip companies breathing down its neck. AMD dropped the ball, but Huawei Ascend chips are already making a dent and NVIDIA had to drop prices in China. The rest of the industry will start eating into NVIDIA as the market cap grows.


That's a pretty big assumption. We can also assume they do not get much better and extrapolate from that, and I think that it's responsible to analyze both cases and try to ascertain which is more likely.

56% of the American economy is also based on intellectual property, so it's also a big claim that the existing status quo will have nothing to say about large tech firms trying to displace that, if AI is even good enough to do so.


It's not an assumption, it's an expected value computation. You have to choose a distribution of expected outcomes in terms of future capabilities. Nearly all outcomes that are plausible to me come out with Nvidia worth at least 50% of what it's worth today.


Raw improvements to computation aren't a guarantee that LLMs improve.


Interesting way to look at it! At least it looks like companies are investing with some expected future return that is not yet there today.

Other thoughts:

1. Current revenues might be a bit higher than you calculated. E.g. I’m not sure if Copilot and the Azure OpenAI Service are fully included in those revenues, and those might be relevant figures at this scale.

2. The 10-100x growth might actually materialize. Corporates are much slower to adopt and scale a technology than people might expect. As a result, many big potential users are only at the very beginning of adopting AI. (I am assuming there are valuable applications for them to use AI for)


Outside of your usual customer service chat bots I haven’t seen very many real genai applications in enterprises. I led a team that put one in prod that was pretty basic RAG over a knowledge base to answer questions for about 100 specialists internal to my employer. The knowledge base had to be so refined and the system prompt so tuned that any additional content added would blow it up.
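For reference, the basic shape of such a RAG system is small; here's a toy sketch (the keyword-overlap retrieval stands in for a real embedding index, and generate() stands in for whatever LLM call you use):

    # Toy RAG: retrieve the most relevant docs, stuff them into the prompt.
    KNOWLEDGE_BASE = [
        "Employees accrue 1.5 vacation days per month.",
        "Expense reports are due within 30 days of purchase.",
    ]

    def retrieve(question, k=1):
        # real systems embed and do vector search; word overlap is a stand-in
        q_words = set(question.lower().split())
        return sorted(KNOWLEDGE_BASE,
                      key=lambda d: len(q_words & set(d.lower().split())),
                      reverse=True)[:k]

    def generate(prompt):
        return "(LLM call goes here; got prompt: %r)" % prompt  # stub

    def answer(question):
        context = "\n".join(retrieve(question))
        return generate("Answer only from this context:\n%s\nQ: %s"
                        % (context, question))

    print(answer("How many vacation days do I get?"))

The plumbing is trivial; the curation problem described above is the hard part.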

There was another bot that popped up in our firm meant to answer questions about corporate policy. I’m guessing that team did something similar but over our policy docs. It vanished after about 3 months of being online. It probably gave a bad answer and someone called it out to legal.

I’m curious if anyone else has seen a real application in an enterprise worthy of the hype.


We’re automatically logging and summarizing calls. I’m in a regulated industry so it’s not an optional activity and there are quite some requirements to how and what we log.

Interesting thing we learned is that agents tend to log during the calls, not afterwards. We now see (qualitative feedback) that they are less busy logging and therefore have more attention for the client, and (quantitatively) we see the calls are getting shorter and people with the tool are doing more calls per day. We do many millions of calls a year, so it adds up to a good number.

Similarly, we have processes with 100s of analysts with very high standards for their outputs. The traditional way is to have QA teams review and provide feedback for a few rounds. We’re introducing AI for the first round(s) of feedback to shorten the cycle time (and reduce context switches) and have the QA teams focus on final reviews.

But I get your point. RAG knowledge bases for experts are in a hard spot. After a few months of employment, the experts tend to know the general knowledge well. As a result, the RAG-bot mostly gets questions about exceptions and niches, where it doesn’t perform very well, and mistakes might be expensive.


Thank you for replying. One thing my company (global 700k employees so a big firm) really REALLY doesn’t like is sending corporate IP like code and also competitive things like proposals into an LLM. We use Azure which legal begrudgingly allows but they’d prefer nothing. Remember appliances? I bet there’s a market for a metal box containing an LLM that an enterprise could stick in their data center. “Cloud” is pretty well adopted but something about an LLM API makes enterprises nervous.


People are going to want to use GPT-4o style real time conversational voice AI a lot. I can see people having hour-long conversations on a regular basis within a few years. That application alone is going to need mountains of GPUs, more than are needed for training. And it will be extended to real time video input and output within a few years with another increase in GPU requirements.


If it does play out like this, I feel that inferencing will be taken over by lower-power ASICs (e.g. Groq) or even commodity CPU hardware. GPUs are an expensive solution (both power consumption and capex) to a straightforward problem. There are already real-time TTS models running on ARM, and while the fidelity isn't on par with large GPU models yet, a year or two of hardware improvements and software optimizations will probably close that gap.

The same may not apply to audio2audio models, though, and NVIDIA will probably keep a firm grip on training, too.


Seeing this sentiment more and more. And I share it. All bubbles pop sometime.


AI generating material new revenue at megacaps is likely. If AI currently generates only 5% of Alphabet's revenue, and if 100% of the cost of generating this revenue is spent on Nvidia hardware, then this represents ~15% of Nvidia's revenue.

Seen another way, in this example, Nvidia would only need ~7 Alphabets to sustain current revenue.
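The arithmetic, spelled out (all inputs are the hypotheticals above, not reported figures):

    alphabet_rev = 300e9          # ballpark annual Alphabet revenue
    ai_rev = 0.05 * alphabet_rev  # assume 5% is AI-driven: ~$15B
    nvidia_rev = 100e9            # rough annual run rate
    print(ai_rev / nvidia_rev)    # ~0.15 -> ~15% of Nvidia's revenue
    print(nvidia_rev / ai_rev)    # ~6.7 -> "~7 Alphabets"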


It is unsustainable, but no one wants to see the music stop. AI hype is floating a lot of tech stocks right now, and NVIDIA is at the center of it.


If they can keep the narrative, like Elon has kept the Tesla narrative, they'll have made so much money that it doesn't matter if it is sustainable or not.


It matters if it's sustainable, but having products in the pipeline adds to capital and that can help things out in the short term.

Tesla's done very expensive EVs -> home/utility PV/storage -> moderately expensive EVs -> FSD (really ADAS) -> Semi/Cybertruck so far.

Besides decreasing costs/ramping/improving what they have so far, my guess is they're moving to the FSD taxi/Model 2/Optimus next.

I suspect Nvidia will try a similar approach to AI hardware/software.


As of the latest update FSD is not yet FSD, but it’s _way_ more than ADAS. Underestimate it at your own peril. It behaves like an experienced, if too timid, driver.


It's also wild to me that it appears to understand driving mostly in the context of other vehicles/people. FSD on city streets is as good as, if not better than, EAP was on highways in 2019.

Doing all this at ±200 Wh/mile is pretty good too. I remember when (2000-2008?) people claimed it wouldn't be viable to get an EV below 500 Wh/mile.


Yeah, 12.3+ is pretty good. It handles 80-90% of my mileage, and the majority of the time I take over because it's too slow.

The only place where it's worse than earlier versions is yellow lights. I'm going to dl/try 12.4 tomorrow.

Being able to move from FSD to Optimus to whatever comes next is where Tesla can really shine.


and that it grows using current day approaches and technologies...


Regardless, NVDA gonna pop Monday on this news.


What news?


A random commenter on HN just tried to do the kind of bearish analysis you can obtain from Morningstar or any other analysis group. News at 11.

How do we know that AI is going to be the end game for GPGPUs, the only thing they're used for?


Maintenance trains, interesting takeaway from this. There's a trolley problem joke in there somewhere.


We have gone full circle back to building the HPC supercomputer after a quarter century of clusters.


Sounds like this OpsPlanner software has a fair bit of autonomy.


They make people into slaves. I’d be extremely happy when Facebook gets shut down.



