Question on the "Batching memory-bound processes on a GPU" section - it says "This enables us to reuse parts of the model that we’ve already loaded into the GPU’s SRAM", but the 10 GB we are loading goes into HBM, right? How do we overcome the HBM <-> SRAM bottleneck?
More generally, how can we find out the size of the SRAM?
Good question. Yes, the 10 GB available for batching is in HBM, and you don't really overcome the HBM <-> SRAM bottleneck so much as amortize it. In a single forward pass, you move the entire model from HBM -> SRAM exactly once. In a batched forward pass, this is still the case, so you end up doing more compute for the same amount of memory movement.
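A back-of-envelope sketch of that amortization (the model size, HBM bandwidth, and peak-FLOPs numbers below are assumptions I'm plugging in for illustration, not figures from the post):

```python
# Rough arithmetic-intensity sketch: weights move HBM -> SRAM once per forward
# pass, so a bigger batch reuses the same weight traffic for more compute.
# All numbers here are illustrative assumptions.

MODEL_BYTES = 14e9          # e.g. a ~7B-parameter model in fp16 (~14 GB of weights)
HBM_BW = 2.0e12             # ~2 TB/s HBM bandwidth (A100-80GB class)
PEAK_FLOPS = 312e12         # A100 fp16/bf16 tensor-core peak

def decode_step_time(batch_size: int) -> tuple[float, float]:
    """Return (memory_time, compute_time) in seconds for one forward step."""
    # Weights are streamed from HBM once, regardless of batch size.
    memory_time = MODEL_BYTES / HBM_BW
    # Each item in the batch does ~2 FLOPs per weight (multiply + add).
    compute_time = 2 * (MODEL_BYTES / 2) * batch_size / PEAK_FLOPS
    return memory_time, compute_time

for b in (1, 16, 128, 512):
    mem_t, comp_t = decode_step_time(b)
    bound = "memory-bound" if mem_t > comp_t else "compute-bound"
    print(f"batch={b:4d}  mem={mem_t*1e3:5.2f} ms  compute={comp_t*1e3:5.2f} ms  -> {bound}")
```

With these made-up numbers the memory time stays fixed at ~7 ms while compute time grows with batch size, crossing over somewhere in the low hundreds, which is the whole point of batching a memory-bound workload.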
You can estimate the total SRAM as follows: an A100 has 108 SMs, and each SM has 192 KB of SRAM (shared memory, aka its L1 cache) [1]. Multiplied out, this is ~20 MB of total SRAM, which happens to match up with the diagram in the Flash Attention paper [2].
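For reference, the arithmetic spelled out (same per-SM figures as above):

```python
# Total on-chip SRAM on an A100, using the per-SM figures quoted above.
SM_COUNT = 108
SRAM_PER_SM_KB = 192        # combined shared memory / L1 per SM

total_kb = SM_COUNT * SRAM_PER_SM_KB
print(f"{total_kb} KB ~= {total_kb / 1024:.1f} MB of SRAM")   # 20736 KB ~= 20.2 MB
```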
Thanks a lot for the material, Varun. Neat presentation with exhaustive computations that make it easy to follow. Question on the serving part: does the same mental model apply to frameworks like vLLM, DeepSpeed, TensorRT-LLM... ? Thanks!
Slightly different set of trade-offs, but a similar mental model. You always use large batch sizes (so you're compute bound), and the bottleneck usually ends up being communication between GPUs/nodes.
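A rough sketch of why that communication starts to matter once you're compute bound (the hidden size, tensor-parallel degree, batch size, and interconnect bandwidths below are all assumptions for illustration):

```python
# Back-of-envelope: per-layer compute vs. all-reduce traffic under tensor
# parallelism. All numbers are illustrative assumptions.

HIDDEN = 8192               # hidden size of a hypothetical large model
TP = 8                      # tensor-parallel degree (GPUs sharing each layer)
BATCH_TOKENS = 256          # tokens processed per step
PEAK_FLOPS = 312e12         # per-GPU A100 fp16/bf16 peak (assumes full utilization)
BYTES_PER_ACT = 2           # fp16 activations

# ~24 * d^2 FLOPs per token per layer (QKV/out projections + 4d MLP), split over TP GPUs.
compute_flops = 24 * HIDDEN**2 * BATCH_TOKENS / TP
compute_ms = compute_flops / PEAK_FLOPS * 1e3

# Two all-reduces of the activations per layer; a ring all-reduce moves ~2x the
# tensor size per GPU.
allreduce_bytes = 2 * (2 * BATCH_TOKENS * HIDDEN * BYTES_PER_ACT)

for name, bw in [("NVLink ~300 GB/s", 300e9), ("cross-node ~25 GB/s", 25e9)]:
    comm_ms = allreduce_bytes / bw * 1e3
    print(f"{name}: compute ~{compute_ms:.2f} ms/layer, comm ~{comm_ms:.2f} ms/layer")
```

With these assumptions the all-reduces are cheap over NVLink but dominate per-layer compute once you cross nodes, which is where the communication bottleneck shows up.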
Likely trending on the home page since this is directly relevant to LLM costs, e.g., questions like "how much would it cost to rebuild ChatGPT from scratch".