
CPU tracking is provided by the metrics API, which is served either by a component that reads kubelet metrics directly (the original, oldest, and simplest way) or by a metrics adapter that reads metrics from a third-party collector and implements the same API.
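
As a concrete illustration, a consumer of that API looks the same regardless of which backend serves it. A minimal sketch using the client-go metrics clientset (in-cluster config and the "default" namespace are assumptions, error handling is trimmed, and this is not the HPA's actual code path):

    // Sketch: listing pod CPU usage from the metrics.k8s.io API. Works
    // against metrics-server (kubelet-backed) or any adapter that
    // implements the same API - the client cannot tell the difference.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/rest"
        metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        mc, err := metricsclient.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        pods, err := mc.MetricsV1beta1().PodMetricses("default").List(context.Background(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, pm := range pods.Items {
            for _, c := range pm.Containers {
                fmt.Printf("%s/%s cpu=%s\n", pm.Name, c.Name, c.Usage.Cpu())
            }
        }
    }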

The behavior is covered by the v1 API rules, so no, it's extremely unlikely to be moved out.

That said, with GPU workloads gaining steam I wouldn’t be surprised if we added new “supported everywhere” metrics at some point.


Pretty well. Anthropic runs some of Claude's inference on GKE and TPU v5e - this talk from the last GCP conference has some details:

https://youtu.be/b87I1plPeMg?si=T4XSFUzXG8BwpphR

Ecosystem support for GPUs is very strong, so smaller teams might find TPUs to have a steeper learning curve. Sophisticated users can definitely get wins on TPUs today if they can get capacity.

And it's not as if NVIDIA is standing still, so how the strengths of each evolve over future generations isn't settled either. That is, TPUs are "simple" matrix multipliers that are also optimized for operation at scale, while GPUs are more general purpose and have strong ecosystem power.

Disclaimer: I work on GKE, enabling AI workloads.


There is a lot of work to make the actual infrastructure and lower-level management of lots and lots of GPUs/TPUs open as well - my team focuses on making that infrastructure layer at least a bit more approachable on GKE and Kubernetes.

https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main

and

https://github.com/google/xpk (a bit more focused on HPC, but includes AI)

and

https://github.com/stas00/ml-engineering (not associated with GKE, but describes training with SLURM)

The actual training expertise is still concentrated in a fairly small pool of very experienced people, but it's getting better. And serving models gets that much faster every day - you can often simply draft on Triton and TensorRT-LLM or vLLM and see significant wins month to month.


As an aside:

This principle (always shut down uncleanly) was a significant point of design discussion in Kubernetes, another one of the projects that adapted lessons learned inside Google to the outside world (and had to change as a result).

All of the core services (kubelet, apiserver, etc.) mostly expect to shut down uncleanly, because as a project we needed to handle unclean shutdowns anyway (and could fix bugs when they happened).

But quite a bit of the software run by Kubernetes (both early on and today) doesn't necessarily behave that way - most notably, Postgres in containers in the early days of Docker behaved badly when KILLed (where Linux terminates the process without giving it a chance to react).

So, faced with the expectation that Kubernetes would run a wide range of software where a Google-specific principle didn't hold and couldn't be enforced, Kubernetes always (modulo bugs, or helpful contributors regressing under-tested code paths) sends TERM, waits a few seconds, then KILLs.

And the lack of graceful shutdown in Go's HTTP server (as well as it being hard to do correctly in big, complex servers) for many years also made Kube apiservers harder to run in a highly available fashion for most deployers. If you don't fully control the load-balancing infrastructure in front of every server the way Google does (because every large company already has a general load-balancer approach built from Apache, nginx, or haproxy, or on F5 or Cisco gear, or ...), or can't enforce that all clients handle all errors gracefully, you tend to prefer draining servers in code rather than letting those errors escape to users. We ended up having to retrofit graceful shutdown onto most of Kube's server software after the fact, which was more effort than doing it from the beginning.
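
For reference, a minimal sketch of what that retrofit looks like for a plain net/http server, under the TERM-then-KILL contract described above (a real apiserver also has to drain long-running watches and coordinate with load balancer health checks, which this omits):

    // Sketch: SIGTERM-driven graceful drain. Kubernetes sends TERM, waits
    // for the pod's termination grace period, then KILLs; the server should
    // finish in-flight requests inside that window.
    package main

    import (
        "context"
        "net/http"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"}

        // ctx is cancelled when TERM (or INT, for local runs) arrives.
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
        defer stop()

        go func() {
            _ = srv.ListenAndServe() // returns http.ErrServerClosed after Shutdown
        }()

        <-ctx.Done()

        // Stop accepting new connections and let in-flight requests finish,
        // bounded comfortably below the kubelet's KILL deadline.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
        defer cancel()
        _ = srv.Shutdown(shutdownCtx)
    }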

In a very real sense, Google's economy of software scale is that it can enforce and drive consistent tradeoffs and principles across many projects, so that making a tradeoff once saves effort in multiple domains. That is similar to the design principles of a programming-language ecosystem like Go or an orchestrator like Kubernetes, but more extensive.

But those principles are inevitably under-communicated to users (because who reads the docs before picking a programming language for a new project?) and under-enforced by projects ("you must be this tall to operate your own Kubernetes cluster").


From the report:

"As part of the evaluation process, on a popular benchmark, HellaSwag (Zellers et al., 2019), we find that an additional hundred finetuning steps on specific website extracts corresponding to the HellaSwag training set (which were not included in Gemini pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to 96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated 1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data."


Great catch!


Also, I should point out that a set of machines hosting TPUs is referred to as a "pod", which is not the same thing as a Kubernetes pod (also referenced in this doc).

The term "pod" originated in early data center design and occasionally crosses over from HPC into broader use - e.g., NVIDIA calls a set of DGX machines a "pod": https://blogs.nvidia.com/blog/2021/03/05/what-is-a-cluster-p....

Kubernetes chose "pod" to represent a set of co-scheduled containers, like a "pod of whales". Other systems like Mesos and Google's Borg https://storage.googleapis.com/pub-tools-public-publication-... use "task" to refer to a single container, but didn't have a concept for heterogeneous co-scheduled tasks at the time.

Somewhat ironically, it now means TPUs on GKE are confusing, because we have TPU hosts organized into "pods", and "pods" for the software using the TPUs.

A Kubernetes pod using a TPU lands on a host which is part of a slice of a TPU pod.


As your second link mentions in section 2.4, Borg has "allocs" which are basically pods.


True. The pod’s monotonic and atomic lifecycle across containers is a significant difference, but you can broadly accomplish similar behaviors with an alloc for sharing resources.


Disclaimer: I'm associated with this team, but didn't write or review the blog post.

The article stated that the limiting factors were the throughput of scheduling the pods onto the clusters (from unrelated benchmarks, that's usually ~300 pods/sec for the kube scheduler today) and doing XLA compilation at pod launch rather than amortizing it once for all jobs.

Optimizing the throughput of the kube scheduler is a good general opportunity and something I believe we would like to see.

I believe AOT compilation was just not a critical optimization for this test; we would recommend AOT compiling when running large, long training jobs to keep pod start latency low after hardware failures and job restarts (from checkpoints).

> The start times we observed were impressive, but we believe we can improve these even further. We are working on areas such as optimizing scheduling in GKE to increase throughput and enabling ahead-of-time compilation in MaxText to avoid just-in-time compilations on the full cluster.


Thanks for the context. I remember recently reading a paper, from Baidu I think, where they claimed a container arrival rate in the millions per second; consequently, it was practical to operate their whole site in the style of lambda/cloud functions.

Actually, now that I'm searching for it, it seems Baidu has a number of papers on workload orchestration at scale specifically for learning.


I will note a trend I have observed with recent ML: as we increasingly use accelerators and models correspondingly grow in size, we are returning to a "one machine, one workload" paradigm for the biggest training and inference jobs. You might have 8k accelerators but only 1000 machines, and if you run one container per host, 300 schedules/second is fast.

While at the same time, as you note, we have functional models for container execution that are approaching millions of dispatches for highly partitionable work, especially in data engineering and ETL.


> kubernetes went ahead and implemented an oop system in golang

I don't think this was ever an objective - can you clarify what you mean by "oop system" and where we implemented it?

We aggressively used composition of interfaces up to a point, in the "Go way", but certainly in terms of serialization modeling (where I am heavily accountable for the early decisions) we also leveraged structs as behavior-less data. Everything else was more "whatever works".

> why? because they felt go's assumption about the problem space (that it can be modeled and solved as imperative programs)

You'd have to articulate which subsystems you feel support this statement - certainly, a diverse set of individuals were able to quickly adapt their existing mental models to Go and ship a mostly successful 1.0 MVP in about 12 months, which to me is much more the essential principle of Go:

pragmatism and large team collaboration


Presumably, the poster is referencing the patterns described in this talk: https://archive.fosdem.org/2019/schedule/event/kubernetesclu...

> Unknown to most, Kubernetes was originally written in Java. If you have ever looked at the source code, or vendored a library you probably have already noticed a fair amount of factory patterns and singletons littered throughout the code base. Furthermore Kubernetes is built around various primitives referred to in the code base as “Objects”. In the Go programming language we explicitly did not ever build in a concept of an “Object”. We look at examples of this code, and explore what the code is truly attempting to do. We then take it a step further by offering idiomatic Go examples that we could refactor the pseudo object oriented Go code to.


I have a lot of respect for Kris, but in this context, as the person who approved the PR adding the "Object" interface to the code base (and who participated in most of the subsequent discussions about how to expand it), I can say it was not done because we felt Go lacked a fundamental construct or forced an imperative style. We simply chose a generic name because, at the time, "not using interface{}" seemed to be idiomatic Go.
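
(For context, the interface in question is tiny - roughly the following, from memory; see k8s.io/apimachinery/pkg/runtime for the authoritative definition.)

    // Approximately the runtime.Object interface: a generic handle over any
    // API type that gives the serialization machinery access to
    // group/version/kind metadata and a deep copy - nothing more.
    package sketch

    import "k8s.io/apimachinery/pkg/runtime/schema"

    type Object interface {
        GetObjectKind() schema.ObjectKind
        DeepCopyObject() Object
    }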

The only real abstraction we need from Go is a zero-cost serialization framework, which I think is an example of something where having it would compromise Go's core principles.


I have nothing against Go or Kubernetes; I was simply citing the article linked in the thread, which at one point states:

> [Golang] Kubernetes, one of the largest codebases in Golang, has its own object-oriented type system implemented in the runtime package.

I would agree that Golang is overall a great language well-suited for solving a large class of problems. The same is true for the other languages I cited. This isn't about immutable properties of languages so much as it is about languages and problems fitting or not fitting well together.


That's fair - as the perpetrator of much of that abstraction, I believe that highlighting the type-system aspect of this code obscures the real problem we were solving, which Go is spectacularly good at: letting you get within inches of C performance, with reasonable effort, for serializing and mapping between versions of structs.

I find it amusing that the runtime registry code is cited as an example of anything other than the fact that it's possible to overcomplicate a problem in any language. We thought we'd need more flexibility than we ended up needing in practice, because the humans involved (myself included) couldn't anticipate the future.


But do you really need "near-C" performance for a container orchestrator? I would think your bottleneck would be talking to etcd.


Our bottleneck was serialization of objects to bytes to send to etcd. etcd's CPU should be about 0.001-0.01x of control plane CPU, and control plane apiserver CPU has been dominated by serialization since day 1 (except for brief periods around when we introduced inefficient custom resource code, because of the generic nature of that code).
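
To make that concrete, here is a rough sketch of how you could measure that cost yourself - plain encoding/json over a k8s.io/api Pod, for illustration only; the real apiserver uses its own codec machinery (including protobuf for built-in types):

    // Run with: go test -bench=. -benchmem
    package bench

    import (
        "encoding/json"
        "testing"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // BenchmarkPodMarshalJSON measures the cost of turning one small Pod
    // object into bytes - the kind of work that dominates apiserver CPU.
    func BenchmarkPodMarshalJSON(b *testing.B) {
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "bench", Namespace: "default"},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{Name: "app", Image: "example.com/app:latest"}},
            },
        }
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            if _, err := json.Marshal(pod); err != nil {
                b.Fatal(err)
            }
        }
    }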


Huh, I'd have guessed that distributed synchronization of etcd (which goes over the network and has multiple events, and uses raft, which has unbounded worst case latency) would be the limiting step.


Kubernetes at this point is a general API server that happens to include a container scheduler.


Agreed that is a feature and not a bug.

But! The one thing that custom orchestrators can’t do is easily get the benefit of kubelet isolation of containers and resource management. Part of slowly moving down this path is to allow those orchestrators to get isolation from the node without having to reimplement that isolation. But it will take some time.


Oh, I see - orchestrating runtimes is quite different. Good points!


In general, the intent here is to leave room open for just that.

dependsOn was proposed during the KEP review but deferred. But because init containers and regular containers share the same behavior and shape, and differ only in container restart policy (sketched below), we are taking a step towards "a tree of container nodes" without breaking forward or backward compatibility.
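
To sketch the shape in question (using the k8s.io/api types; the image names are hypothetical), a "restartable" init container differs from a regular container only by its per-container restart policy:

    // Sketch: a pod with a sidecar expressed as an init container whose
    // restart policy is Always, next to an ordinary app container.
    package main

    import (
        "encoding/json"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        always := corev1.ContainerRestartPolicyAlways
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "with-sidecar"},
            Spec: corev1.PodSpec{
                InitContainers: []corev1.Container{{
                    Name:          "log-shipper",
                    Image:         "example.com/shipper:latest",
                    RestartPolicy: &always, // keeps running alongside the app container
                }},
                Containers: []corev1.Container{{
                    Name:  "app",
                    Image: "example.com/app:latest",
                }},
            },
        }
        out, _ := json.MarshalIndent(pod, "", "  ")
        fmt.Println(string(out))
    }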

Given the success of mapping workloads to k8s, the original design goal was not to take on that complexity, and it's good to see others making the case for bringing that flexibility back in.

