If it were engineered right, it would take:
- transfer model weights from NVMe drive/RAM to GPU via PCIe
- upload tiny precompiled code to GPU
- run it with tiny CPU host code
But what you get instead is gigabytes of PyTorch plus Nvidia Docker container bloatware (hi, Nvidia NeMo) that takes forever to start.
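For concreteness, here's a minimal sketch of what that path could look like using the CUDA driver API: load weights from disk, copy them over PCIe, load a precompiled kernel, launch it. The file names `weights.bin` and `model.cubin` and the kernel name `forward` are hypothetical placeholders, not anyone's actual implementation.

```c
/* Minimal sketch: weights + tiny precompiled kernel + tiny host program.
 * Assumes the kernel was compiled ahead of time into model.cubin and that
 * weights.bin holds the raw weights; both names are hypothetical. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    const char *msg; cuGetErrorString(r, &msg); \
    fprintf(stderr, "CUDA error: %s\n", msg); exit(1); } } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev; CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    /* 1. Transfer model weights from NVMe/RAM to GPU via PCIe. */
    FILE *f = fopen("weights.bin", "rb");
    fseek(f, 0, SEEK_END); long nbytes = ftell(f); fseek(f, 0, SEEK_SET);
    void *host_weights = malloc(nbytes);
    fread(host_weights, 1, nbytes, f); fclose(f);

    CUdeviceptr d_weights;
    CHECK(cuMemAlloc(&d_weights, nbytes));
    CHECK(cuMemcpyHtoD(d_weights, host_weights, nbytes));

    /* 2. Upload the tiny precompiled code to the GPU. */
    CUmodule mod; CHECK(cuModuleLoad(&mod, "model.cubin"));
    CUfunction fn; CHECK(cuModuleGetFunction(&fn, mod, "forward"));

    /* 3. Run it with tiny CPU host code. */
    void *args[] = { &d_weights };
    CHECK(cuLaunchKernel(fn, /*grid*/ 1, 1, 1, /*block*/ 256, 1, 1,
                         /*shared mem*/ 0, /*stream*/ NULL, args, NULL));
    CHECK(cuCtxSynchronize());

    free(host_weights);
    return 0;
}
```

That whole host binary links against the CUDA driver library and nothing else; no Python runtime, no container image.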