The standard GPU utilization metric reported by nvidia-smi, nvtop, Weights & Biases, Amazon CloudWatch, Google Cloud Monitoring, and Azure Monitor is highly misleading. It reports the fraction of time that any kernel is running on the GPU, which means a GPU can report 100% utilization even if only a small portion of its compute capacity is actually being used. In practice, we've seen workloads with ~1–10% real compute throughput while dashboards show 100%.
This becomes a problem when teams rely on that metric for capacity planning or optimization decisions: it makes underutilized systems look saturated.
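If you want to see the effect yourself, here's a rough sketch (not part of Utilyze; assumes PyTorch and the nvidia-ml-py bindings are installed). It uses torch.cuda._sleep, an internal PyTorch helper that launches a kernel that just spins for N clock cycles, so the GPU stays "busy" while doing essentially no work, and then reads the same utilization counter that nvidia-smi reports:

```python
import time
import torch
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# torch.cuda._sleep is an internal PyTorch helper: it launches a kernel that
# spins for the given number of clock cycles, i.e. the GPU is "busy" but idle
# in terms of useful compute.
torch.cuda.init()
for _ in range(50):
    torch.cuda._sleep(100_000_000)

time.sleep(3)  # let the spin kernels run for a few seconds before sampling

# This is the same counter nvidia-smi / nvtop / cloud dashboards report:
# the percent of time in the sample window during which *any* kernel was running.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"reported GPU utilization: {util.gpu}%")  # typically near 100%, real throughput ~0

torch.cuda.synchronize()
pynvml.nvmlShutdown()
```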
We're releasing an open-source (Apache 2.0) tool, Utilyze, to measure GPU utilization differently. It samples hardware performance counters and reports compute and memory throughput relative to the hardware's theoretical limits. It also estimates an attainable utilization ceiling for a given workload.
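As a back-of-the-envelope illustration of what "throughput relative to theoretical limits" means (this is an MFU-style estimate, not how Utilyze is implemented, and the peak figure is an assumed constant you'd adjust for your GPU): time a known amount of work and divide achieved FLOP/s by peak FLOP/s.

```python
import time
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

# Warm up, then time a known amount of work.
_ = a @ b
torch.cuda.synchronize()
t0 = time.time()
for _ in range(iters):
    _ = a @ b
torch.cuda.synchronize()
elapsed = time.time() - t0

achieved_tflops = (2 * n**3 * iters) / elapsed / 1e12
PEAK_TFLOPS = 312.0  # assumed FP16 Tensor Core peak for an A100; adjust per GPU
print(f"{achieved_tflops:.0f} TFLOP/s achieved, "
      f"{100 * achieved_tflops / PEAK_TFLOPS:.0f}% of assumed peak")
```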
GitHub link: https://github.com/systalyze/utilyze
We'd love to hear your thoughts!
Nvidia’s toolsets and APIs are under-documented, and the commercial-grade hardware itself is super unreliable.
Developers and operators just put up with the situation because there is no alternative, to the point that they're ready to jump to things like TPUs or other custom silicon.
Say what you will about Intel, but their documentation and commercial-grade hardware were top-notch. I hope they find their footing and stay humble this time.