I cannot overstate the performance improvement of deploying onto bare metal. We typically see a doubling of performance, as well as extremely predictable baseline performance.
This is down to several things:
- Latency - having your own local network, rather than sharing some larger datacenter network fabric, gives around an order of magnitude lower latency
- Caches – right-sizing a deployment for the underlying hardware, and so actually allowing a modern CPU to do its job, makes a huge difference
- Disk IO – Dedicated NVMe access is _fast_.
And with it comes a whole bunch of other benefits:
- Auto-scalers become less important, partly because you have 10x the hardware for the same price, partly because everything runs at 2x the speed anyway, and partly because you have a fixed pool of hardware. This makes the whole system more stable and easier to reason about.
- No more sweating the S3 costs. Put a 15TB NVMe drive in each server and run your own MinIO/Garage cluster (alongside your other workloads). We're doing about 20GiB/s sustained on a 10-node cluster, 50k API calls per second (on S3, at list prices of about $0.0004 per 1,000 GETs and $0.005 per 1,000 PUTs, that is roughly $0.02-$0.25 _per second_ on API calls, or about $50k-$650k per month!).
- You get the same bill every month.
- UPDATE: more benefits - cheap fast storage, run huge PostgreSQL instances at minimal cost, less engineering time spent working around hardware limitations and cloud vagaries.
And, if you choose to invest in the above, it all costs 10x less than AWS.
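A rough back-of-envelope for the API-call part of that bill, as a sketch only. The two per-1,000-request prices are assumed list prices, not figures from this comment, so check current pricing before relying on the output; the function names are made up for illustration.

```python
# Back-of-envelope S3 request cost at a sustained request rate.
# ASSUMPTION: list prices in USD per 1,000 requests; verify against current pricing.
GET_PRICE_PER_1K = 0.0004  # assumed price for GET-class requests
PUT_PRICE_PER_1K = 0.005   # assumed price for PUT/COPY/POST/LIST-class requests


def request_cost_per_second(requests_per_second: float, price_per_1k: float) -> float:
    """USD per second for a steady stream of requests at the given price."""
    return requests_per_second / 1000 * price_per_1k


def request_cost_per_month(requests_per_second: float, price_per_1k: float,
                           days: int = 30) -> float:
    """USD per month (default 30 days) at the same steady request rate."""
    return request_cost_per_second(requests_per_second, price_per_1k) * 86_400 * days


if __name__ == "__main__":
    rate = 50_000  # API calls per second, the figure quoted in the comment
    for label, price in (("all GETs", GET_PRICE_PER_1K), ("all PUTs", PUT_PRICE_PER_1K)):
        per_s = request_cost_per_second(rate, price)
        per_month = request_cost_per_month(rate, price)
        print(f"{label}: ${per_s:.2f}/s, ${per_month:,.0f}/month")
```

A real workload is a mix of reads and writes, so the actual figure lands somewhere between the two lines, before storage and egress are counted.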
Pitch: If you don't want to do this yourself, then we'll do it for you for half the price of AWS (and we'll be your DevOps team too):
I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.
However, I can share this, written by my colleagues! You'll find great explanations of accelerator architectures and the considerations that go into making things fast.
Edit:
Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.
The price really is eye-watering. At a glance, my first impression is that this is something like Llama 3.1 405B, where the primary value may be realized in generating high-quality synthetic data for training rather than in direct use.
I keep a little Google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful; feel free to copy it and claim it as your own.
The Dunning-Kruger effect is strong here. "What I know is what makes me the expert. What I don't know is irrelevant".
The definition of Standard deviation is in chapter 1 of Stats 101.
https://www.google.com/search?q=standard+deviation&tbm=isch
Apparently, asking a Stats 101 chapter 1 question of a so-called "Data Scientist" is too much of an irrelevant question!
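For reference, the chapter 1 definition fits in a few lines. This is a minimal sketch using only Python's standard library; the sample values are made up for illustration.

```python
from math import sqrt
from statistics import pstdev, stdev


def population_std(xs):
    """Square root of the mean squared deviation from the mean (divide by n)."""
    mean = sum(xs) / len(xs)
    return sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


def sample_std(xs):
    """Same idea with Bessel's correction (divide by n - 1), for a sample."""
    mean = sum(xs) / len(xs)
    return sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
    # The hand-written versions should agree with the standard library.
    print(population_std(data), pstdev(data))
    print(sample_std(data), stdev(data))
```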
> expect people to have very high Stats skills
Or as you have made apparent, expect people to have ZERO stats skills!
Some of the innumerate activities I have observed in "expert" data scientists and ML engineers who have years of experience without once thinking about sample sizes:
1. Using A/B tests to accept the Null hypothesis instead of rejecting it
2. Squandering $30M in annual revenue because they wanted to avoid a situation/meeting in which they might look like they don't understand statistics. This is hilarious because they simply nodded their heads as if they understood all the calculations, then dropped any other meetings or follow-ups and left $30M on the table
3. Not refreshing a key revenue-generating model for 18 months because they were "trying to figure out" why the AUC was improving when the performance on "golden set data" was dropping
4. Using thresholding and aggregation to turn rich, perfectly sampled data into poor-quality, distorted training data
5. Trying to use A/B tests to estimate impact even when the control and variant are not independent
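The sample-size point behind items 1 and 5 can be made concrete. Here is a minimal sketch of the textbook normal-approximation sample size for comparing two conversion rates; it assumes independent arms and a two-sided test, and the function name and example rates are hypothetical. A test run with far fewer users than this has little power, so a non-significant result there is not evidence that the null is true.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in EACH arm to detect p_control vs p_variant.

    Normal approximation for a two-sided, two-proportion test.
    ASSUMPTION: the two arms are independent samples (see item 5).
    """
    if p_control == p_variant:
        raise ValueError("effect size is zero; no finite sample can detect it")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Illustrative numbers: a 10% baseline and a hoped-for lift to 11%.
    print(sample_size_per_arm(0.10, 0.11))
    # Halving the detectable lift roughly quadruples the required sample.
    print(sample_size_per_arm(0.10, 0.105))
```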
All of the above at FAANGs! My coworkers at a non-FAANG company were much more sophisticated. These are the kind of candidates a "build recommendations for youtube" interview selects. Template appliers.
The list of stupidities goes on and on! But yeah, none of them think that a basic understanding of statistics is necessary for the work. The good thing about JavaScript engineers is that they don't have an understanding of statistics and are aware of it. The DS/MLEs, however, are unskilled and unaware of it.
My experience working on ML at a couple of FAANG-like companies is that GPUs actually tend to be too fast compute-wise, and models often cannot come close to Nvidia's theoretical FLOPS numbers, because profiling very frequently shows that the bottleneck is elsewhere. It is very easy for your data-reading code to be the bottleneck. I have seen some models where our networking was the bottleneck and could not keep up with the compute, and we had to adjust the model architecture to reduce the amount of data transferred across the cluster in each training step. Or maybe GPU memory bandwidth is your bottleneck.
The key idea in the flash attention work is optimizing the attention kernels to lower the amount of VRAM usage and stick to the smaller, faster SRAM. This is valuable work, but it is also the kind of work where it is pretty rare for an engineer I have worked with to have the CUDA kernel experience needed to write custom efficient kernels.
Some of the models I train use a lot of sparse tensors as features, and TensorFlow's sparse GPU kernels are rather bad: many operations fall back to the CPU, and sometimes I have had a GPU sparse kernel that was slower than the equivalent CPU kernel. Several times, densifying and padding tensors with a large fraction of 0's was faster than using the sparse kernel.
I’m sure a few companies/models are optimized enough to fit the ideal case, but it’s rare.
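A quick way to check whether input or compute is the bottleneck is to time the two halves of the loop separately. This is a minimal, framework-free sketch: `slow_batches` and `fast_step` are hypothetical stand-ins for a real data loader and training step. One loud assumption: on a real GPU stack the step is usually dispatched asynchronously, so `train_step` has to block until the device has finished (for example by fetching a result) for these numbers to mean anything.

```python
import time
from typing import Any, Callable, Iterable


def profile_training_loop(batches: Iterable[Any],
                          train_step: Callable[[Any], Any],
                          max_steps: int = 50) -> dict:
    """Split wall-clock time into 'waiting for data' and 'running the step'."""
    data_seconds = 0.0
    step_seconds = 0.0
    steps = 0
    iterator = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is input-pipeline time
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)           # ASSUMPTION: blocks until compute is done
        t2 = time.perf_counter()
        data_seconds += t1 - t0
        step_seconds += t2 - t1
        steps += 1
    total = data_seconds + step_seconds
    return {
        "steps": steps,
        "data_seconds": data_seconds,
        "step_seconds": step_seconds,
        "data_fraction": data_seconds / total if total else 0.0,
    }


def slow_batches(n: int):
    """Hypothetical loader that takes ~5 ms to produce each batch."""
    for i in range(n):
        time.sleep(0.005)
        yield [i] * 8


def fast_step(batch) -> int:
    """Hypothetical training step that is cheap compared to loading."""
    return sum(batch)


if __name__ == "__main__":
    report = profile_training_loop(slow_batches(20), fast_step)
    print(f"{report['data_fraction']:.0%} of the loop was spent waiting on data")
```

If `data_fraction` is large, a faster accelerator will not help; the fix is in the reading, decoding, or networking path.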
Edit: Another aspect of this is that which model architectures are good today is very hardware-driven. A major advantage of transformers over recurrent LSTM models is training efficiency on GPUs. The gap in training efficiency between these two architectures is much more dramatic on GPU than on CPU. Similarly, other architectures with sequential components, like tree-structured/recursive dynamic models, tend to fit GPUs badly performance-wise.
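The hardware argument comes down to data dependencies. Below is a toy sketch in plain Python with scalar "hidden states"; the functions and weights are illustrative and not any real architecture. The recurrent update needs the previous state, so its steps must run in order. An (unmasked) attention-style output for a position depends only on the inputs, so positions can be computed in any order, which is what lets a GPU do them all at once.

```python
import math


def recurrent_states(xs, w=0.5, u=0.3):
    """h_t = tanh(w * h_{t-1} + u * x_t): step t cannot start before step t-1."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def attention_output(xs, i):
    """Softmax-weighted average of xs for query position i; reads inputs only."""
    scores = [xs[i] * x for x in xs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(wt * x for wt, x in zip(weights, xs)) / total


def attention_all(xs, order=None):
    """Compute every position, visiting them in an arbitrary order."""
    order = list(range(len(xs))) if order is None else order
    out = [None] * len(xs)
    for i in order:
        out[i] = attention_output(xs, i)
    return out


if __name__ == "__main__":
    xs = [0.1, -0.4, 0.9, 0.3]
    # Any visiting order gives the same result: no position waits on another.
    print(attention_all(xs) == attention_all(xs, order=[2, 0, 3, 1]))
    # The recurrence has no such freedom; each state feeds the next.
    print(recurrent_states(xs))
```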
Does anyone have other sites like these to share?
- Domain-driven design, design patterns, and antipatterns
https://deviq.com/
- Refactoring and Design Patterns
https://refactoring.guru/
- Standard Patterns in Choice-Based Games
https://heterogenoustasks.wordpress.com/2015/01/26/standard-...