During many years of operating production Kubernetes clusters with several thousand nodes, I've never seen any of these observability tools that query kube-apiserver work at that scale. Even popular tools like k9s make extremely expensive queries, like listing all pods in the cluster, which, if you don't have enough load protections in place, can tip your apiserver over and cause an incident. If you're serious about these querying capabilities, I highly recommend building your own data sources (e.g. watch objects with a controller and dump the data into a SQL db) and stop hitting the apiserver for these things. You'll be better off in the long run.
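To make the "watch and dump into your own store" idea concrete, here's a minimal sketch of the pattern. An in-memory map stands in for the real SQL database, and hand-rolled event structs stand in for client-go informers; the names and shapes are illustrative, not from any specific tool.

```go
package main

import "fmt"

// WatchEvent mirrors the shape of a Kubernetes watch event
// (Type is ADDED, MODIFIED, or DELETED). In a real controller
// these would be delivered by a client-go informer.
type WatchEvent struct {
	Type   string
	Key    string // e.g. "default/my-pod"
	Object map[string]any
}

// Store is a stand-in for the SQL database you would sync into.
type Store struct {
	objects map[string]map[string]any
}

func NewStore() *Store {
	return &Store{objects: map[string]map[string]any{}}
}

// Apply keeps the local store in sync with the watch stream, so
// read-heavy tooling queries this store instead of kube-apiserver.
func (s *Store) Apply(ev WatchEvent) {
	switch ev.Type {
	case "ADDED", "MODIFIED":
		s.objects[ev.Key] = ev.Object
	case "DELETED":
		delete(s.objects, ev.Key)
	}
}

func main() {
	s := NewStore()
	s.Apply(WatchEvent{Type: "ADDED", Key: "default/web-1", Object: map[string]any{"phase": "Running"}})
	s.Apply(WatchEvent{Type: "DELETED", Key: "default/web-1"})
	fmt.Println(len(s.objects)) // 0
}
```

The point is that the expensive LIST traffic moves off the apiserver: it pays for one long-lived watch per type, while every dashboard and CLI query hits your own store.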
There is a funny parallel I see with Kubernetes that I also saw a lot with Linux in the early years. There are thousands of packages and tools you can install on Linux (think phpmyadmin for example) and new users sometimes go wild installing every single package they read about.
After a while, the more mature Linux engineers start going the other way. Ripping out as much as possible. Stripping down to the leanest build they can, for performance but also to reduce attack surface and overall complexity.
Very similar dynamic with k8s. Early days are often about scooping up every CNCF project like you're on a shopping spree. Eventually people get to shipping slim clusters running 30 MB containers built on Alpine or Nix, using Kubernetes essentially as open-source clustering for Linux.
What's surprising to me is that there's no way to listen to any object type. You have to know the "kind" beforehand, because the watch API requires it. To watch all objects in the system, you have to start a separate watch request for every type. This may in turn be expensive.
If you have direct access to etcd (which may not be possible in a managed cloud version of Kubernetes), putting a watch on the root prefix / might scale better.
(As an aside, with the Go client you have to jump through some hoops to even deserialize objects whose kinds' schemas aren't already registered: you have to use the special "unstructured" deserializer. The Go SDK often has to deal with unknown types, e.g. for diffing, and all of the serializer/codec/conversion layers in the SDK seem incredibly overengineered for something that could have just assumed a simple nested map structure and then layered validation and parsing on top; the smell of Java programmers is pretty strong.)
The watch API has a horrible user experience on every platform. You must send a GET and keep the pipe open, waiting for a stream of responses. If the connection is lost, changes can be missed. And if you miss a resource version change, either the reconnection will fail or you end up monitoring a stale resource.
The Java client does this with blocking I/O, resulting in a large number of threads.
I truly like Kubernetes, and I think most detractors complaining about complexity simply don't want to learn it. But the K8s API, especially the watch API, needs some rigorous standards.
How are Kubernetes apiservers suffering this much from this kind of query? Surely even in huge systems the amount of data that needs to be traversed is pretty small, right?
Is this a question of Kubernetes just sticking everything into "standard" data structures instead of using a database?
My knowledge is out of date now, but the main issues IMO are/were:
- No concept of apiserver rate limiting, by design. I see there is now APF (API Priority and Fairness), but still no basic API / edge rate limiting.
- etcd scales badly. It's a very basic, strongly consistent KV store with tiny limits (an 8 GB ceiling in the latest docs, with a 2 GB default). It had major performance issues throughout the time I was using k8s; I still don't know if it's much better now.
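For reference, the storage limit mentioned above is the etcd backend quota, which is configurable (the value below is the 8 GiB figure from the etcd docs; whether your distribution exposes this flag is another matter):

```shell
# Raise the etcd backend quota from the 2 GiB default toward
# the ~8 GiB ceiling the docs recommend staying under.
etcd --quota-backend-bytes=8589934592
```

Once the quota is exhausted, etcd raises an alarm and goes read-only until it's compacted/defragmented, which is why clusters with lots of churn feel this limit well before hitting node-count limits.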
Long ago I wanted to re-implement at least part of kubectl in Python. After all, Kubernetes has a documented API... What I quickly discovered was that kubectl commands don't map to the Kubernetes API. Almost at all. Many of these commands require multiple queries going back and forth to accomplish what the command does. I quickly abandoned the project... So maybe I've overlooked something, but, again, my impression was that instead of offering a generic API with queries that can be executed server-side to retrieve the necessary information, the Kubernetes API server offers a very specialized, disjoint set of commands that can each retrieve only one small piece of interesting info at a time.
This, obviously, isn't a scalable approach, but there's no "wrapper" you could write in order to mitigate the problem. The API itself is the problem.
Pretty sure the apiserver just queries the etcd database (and maybe caches some things, not sure), but I guess it could be the apiserver itself that can't handle the data :P
Kubernetes only lets you query resources by object type, and that's just a prefix range scan on the etcd database. There are no indexes whatsoever behind the exhaustive LIST queries, and kube-apiserver handles serialization of the objects back and forth between multiple wire formats. Over the years there have been a lot of optimizations, but you still don't want to list all pods in a 5000-node high-density cluster every time you spin up a client-side tool like this.
In my experience, they don't suffer: you can just run more of them and stick them behind a load balancer (a regular HTTP reverse proxy). You can scale both etcd and the apiserver pretty easily. Of course you have less control in cloud environments; I have less experience with that.
This is the approach we took while building our Internal Developer Platform: watches (via client-go informers with client-side caching) to sync data into a Postgres database as JSONB. Changes are tracked using JSON patches and Kubernetes events. To avoid a watch on every resource kind, we instead perform incremental object fetches for the objects involved in watched events.
Getting this to perform well required several optimizations at both the Go and Postgres levels. On the Go side, we use prioritized work queues, event de-duplication, and even switched to Rust for efficient JSON diffs. For Postgres, we leverage materialized views and trigger-based optimistic locking.
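For anyone curious what the Postgres side of such a setup can look like, here's an illustrative schema (table and column names are my guesses, not the platform's actual DDL):

```sql
-- One row per synced Kubernetes object.
CREATE TABLE resources (
    uid        uuid PRIMARY KEY,   -- metadata.uid
    kind       text NOT NULL,
    namespace  text,
    name       text NOT NULL,
    object     jsonb NOT NULL      -- full object as last synced
);

-- GIN index so ad-hoc queries into the object body stay fast,
-- e.g. WHERE object @> '{"spec":{"nodeName":"node-a"}}'.
CREATE INDEX resources_object_idx ON resources USING gin (object);
```

The win over hitting kube-apiserver is exactly the missing piece from upthread: arbitrary server-side indexed queries over cluster state.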
That's how https://github.com/KusionStack/karpor did it. It has a resource-syncer component that synchronizes resources in real time to Elasticsearch, and then lets users search for K8s resources with SQL or natural language through a search bar in the web UI.
In fact, it has recently been preparing to integrate Cyphernetes as a new search method. I believe this will be a fresh start!