There's also less special sauce in the text models themselves these days; the proprietary edge is more in the pre-training data and the training stack (e.g. how to get 10k GPUs/TPUs running together smoothly). Multi-modal models (or adjacent ones like Sora) are less likely to be open sourced in the near term.
There's also a lot of work to make the actual infrastructure and lower-level management of large GPU/TPU fleets open as well - my team focuses on making the infrastructure piece at least a bit more approachable on GKE and Kubernetes.
The actual training is still done by a fairly small pool of very experienced people, but that's getting better. And serving models gets faster every day - you can often just build on Triton and TensorRT-LLM or vLLM and see significant wins month to month.
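To make the serving point concrete, here's a rough sketch of what "just build on vLLM" looks like in practice, using its offline batch API - facebook/opt-125m is only a stand-in here, swap in whichever open model you actually care about:

    from vllm import LLM, SamplingParams

    # Placeholder model - any HF model vLLM supports works here.
    llm = LLM(model="facebook/opt-125m")

    # Sampling settings are illustrative, tune for your use case.
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["The future of open models is"], params)

    for out in outputs:
        print(out.outputs[0].text)

The nice part is that the engine-level improvements (continuous batching, paged attention, newer kernels) land upstream, so the same few lines tend to get faster release over release without you doing anything.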