I'm curious about what kinds of workloads you see GPU-accelerated compute having a significant impact on, and what kinds still pose challenges. You mentioned that I/O is not the bottleneck; is that still true for queries that require large-scale shuffles?
Large-scale shuffles: absolutely. One of the larger queries we ran involved a 450TB shuffle -- though that may require more than just deploying the spark-rapids plugin (it depends on the query itself and the specific VMs used). Shuffling took the majority of the time and we saw 100% (...99%?) GPU utilization, which I presume is partially due to compressing the shuffle partitions. Network/disk I/O is definitely not the bottleneck here.
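For reference, a minimal spark-rapids deployment looks roughly like this (a sketch from the docs, not our production config; the shuffle-manager class name embeds the Spark version, spark341 here is just an example, and your-job.jar is a placeholder):

    spark-submit \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.rapids.sql.enabled=true \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager \
      --conf spark.rapids.shuffle.mode=MULTITHREADED \
      --jars rapids-4-spark_2.12-<version>.jar \
      your-job.jar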
It's difficult to say which workloads see significant impact; it's easier to talk about what doesn't really work, AFAIK. Large-scale shuffles might see ~4x efficiency, assuming you can somehow offload the hash-shuffle memory, have scalable fast storage, etc... which we do. Note this is even on GCP, where there isn't any "great" networking infra available.
Things that don't get accelerated include multi-column UDFs and some incompatible operations. These aren't physical/logical limitations; it's just where the software is right now: https://github.com/NVIDIA/spark-rapids/issues
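If you want to see exactly what falls back on your own queries, the plugin will tell you (the config is real; "df" below is a stand-in DataFrame, and the output format varies by version):

    // in a spark-shell with the plugin loaded
    import org.apache.spark.sql.functions.{col, udf}

    // print every operator that could not be placed on the GPU
    spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

    // a two-column UDF like this is opaque to the planner and falls back to CPU
    val hash2 = udf((a: Long, b: Long) => a * 31L + b)
    df.withColumn("h", hash2(col("a"), col("b"))).explain()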
Multi-column UDF support would likely require some compiler-esque work in Scala (which I happen to have experience in).
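To give a flavor of that work: the workaround today is to hand-translate the lambda into Catalyst column expressions, which is exactly the rewrite a UDF compiler would automate (hypothetical example, "df" again a stand-in; IIRC spark-rapids already ships an experimental udf-compiler along these lines for simple cases):

    import org.apache.spark.sql.functions.{col, sqrt, udf}

    // today: an opaque two-column lambda, runs row-by-row on the CPU
    val dist = udf((x: Double, y: Double) => math.sqrt(x * x + y * y))
    df.select(dist(col("x"), col("y")))

    // the hand-translated equivalent: pure Catalyst expressions, GPU-eligible
    df.select(sqrt(col("x") * col("x") + col("y") * col("y")))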
A few things I expect to be "very" good: joins, string aggregations (empirically), and sorting (clustering). Operations that stress memory bandwidth will likely be "surprisingly" good (surprising to most people, anyway).
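Back-of-envelope for the bandwidth point (illustrative numbers, not a benchmark): a sort is a handful of full passes over its data, so runtime is roughly passes * bytes / bandwidth. Sorting 100GB in ~3 passes at the ~300 GB/s of a T4/L4 is on the order of 1s of pure memory traffic; the same math at the ~50 GB/s a comparably-priced CPU VM slice might actually sustain is ~6s, before counting the CPU's weaker compute on things like string comparisons.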
Otherwise, Nvidia has published a bunch of really-good-looking public data, as have some other public companies.
Outside of Spark, I think many people underestimate how "low-latency" GPUs can be. Anything with a latency budget of 100 microseconds or more is highly likely to be a good fit for GPU acceleration in general, and that threshold can be as low as 10 microseconds (today).
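The rough math there (order-of-magnitude figures): a kernel launch costs single-digit microseconds, and small pinned-memory PCIe transfers are in the same ballpark, so within a 100 microsecond budget the fixed overhead is maybe 10-20% and the GPU's throughput does the rest. With CUDA graphs or persistent kernels that overhead shrinks enough that ~10 microsecond budgets become plausible.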
Very true. You can't get those numbers even with an entire single-tenant CPU VM. Minor note: the A100 40G is 1.5TB/s (and much easier to obtain).
That said, ParaQuery mainly uses T4 and L4 GPUs with "just" ~300 GB/s of memory bandwidth. I believe (correct me if I'm wrong) that's roughly on par with a 64-core VM, though it obviously depends on the actual VM family.
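Sanity-checking my own claim (rough numbers): a 64-core cloud VM is typically one to two sockets of 8-channel DDR4-3200, i.e. up to ~205 GB/s per socket theoretical (8 channels x 25.6 GB/s), with sustained numbers well below that -- so a T4 (~320 GB/s) or L4 (~300 GB/s) really is in that ballpark.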