
Kubernetes: a complicated orchestration system that runs stateless applications brilliantly (after six months of work from your infra engineers).

Postgres: the very definition of a stateful service.

Those two sound like perfect stablemates.



Your definition of Kubernetes was accurate maybe five years ago. These days Kubernetes is perfectly capable of running stateful services and supports them as first-class citizens.


Whenever someone runs Kubernetes onprem I tell them to buy a TrueNAS or another cheap SAN. A cheap SAN costs as much as a DevOps expert setting up your Ceph infrastructure and a lot less when you actually run into issues with that software defined storage solution.

Once you do that, Kubernetes is actually quite nice, because it gives you a base configuration of Postgres that comes with automatic backups etc.
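To make that concrete: with an operator like CloudNativePG, the "base configuration with backups" is roughly this much YAML (a sketch only; field names are from memory of the CNPG docs, and the bucket, Secret, and storage class names are made up):

```yaml
# Hypothetical CloudNativePG Cluster: 3 instances, data on the SAN's CSI
# storage class, WAL archiving and base backups pushed to S3-compatible
# object storage. All names below are illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3
  storage:
    size: 50Gi
    storageClass: san-block            # made-up name for the SAN's CSI class
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://pg-backups/pg-main   # made-up bucket
      s3Credentials:
        accessKeyId:
          name: backup-creds           # made-up Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
---
# Nightly base backup; CNPG's schedule is a 6-field cron (seconds first).
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-main-nightly
spec:
  schedule: "0 0 2 * * *"
  cluster:
    name: pg-main
```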

That advice is rarely taken though ...


Solid advice. What driver/provisioner works well for this?


There used to be the external-storage provisioner[1], but what you need today is in the second link[2].

[1] https://github.com/kubernetes-retired/external-storage/tree/...

[2] https://github.com/kubernetes-sigs/sig-storage-lib-external-...


Nothing could be further from the truth.

Kubernetes has nothing to offer anyone who wants to work with storage (though there is a myriad of CSIs).

Here, let me give an example of a system that offers storage in a way similar to how Kubernetes offers compute... well, it isn't really that good at offering compute, but at least it kind of does it. So, Ceph -- that's something it makes sense to run PostgreSQL on, because it's a storage provider. Kubernetes isn't a storage provider. It doesn't even know how to manage storage...

I.e. if you think that you run PostgreSQL on Kubernetes -- you are mistaken. Something else does it. Kubernetes is a proxy there, at best (but is probably completely irrelevant).


To quote you: “nothing could be further from the truth”

Ceph itself runs very well on K8s - see the Rook project.

Of course you can run psql on k8s. Psql doesn’t need a storage orchestration system - hell it technically doesn’t need any storage at all!


Rook is about using CSI... it doesn't run Ceph on Kubernetes. That's impossible, because Ceph relies on functionality that exists in drivers (kernel modules) to run. CSI is the component that communicates between e.g. the rbd driver and user space (e.g. Kubernetes controllers), but it doesn't run Ceph.

It's in principle impossible to do anything about block devices in containers like those used by Kubernetes, because containers rely on Linux processes and their associated namespaces. There isn't a Linux namespace for block devices; the closest you can get is the filesystem (mount) namespace. In other words, you cannot manage block devices purely from containers, you need some help from the host operating system. And this is why I mentioned CSIs in my previous post.


It does run Ceph on Kubernetes. How else would you describe deploying OSDs to Linux servers via Kubernetes other than "Ceph on Kubernetes"?

> That's impossible because Ceph relies on functionality that exists in drivers (kernel modules) to run

This statement doesn't make sense. All Linux applications require kernel functionality. Yes, to deploy Ceph, you must run Linux systems with the desired kernel modules. Turns out, Rook sets that up for ya! This statement exposes a somewhat deep misunderstanding of what Kubernetes is.
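For reference, "running Ceph on Kubernetes" with Rook is roughly a CephCluster resource like this (a sketch; field names are from memory of the Rook docs, and the image version is illustrative):

```yaml
# Hypothetical minimal Rook CephCluster: Rook's operator turns this into
# mon/mgr/OSD pods running on the cluster's nodes.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18     # illustrative version
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: true                # discover nodes and their raw devices
    useAllDevices: true              # and run OSDs as pods on top of them
```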

I run into you in every thread that mentions k8s and I sense extreme vitriol and a huge lack of experience / understanding. Don't mistake my future lack of replies for an unsaid "you've misunderstood".


Look at benchmarks of Postgres on Ceph


Part of my job is to measure storage performance...

I can tell you at least this: there cannot be a meaningful "benchmark of Postgres on Ceph". Too many things will influence the benchmark way too much. You need to be a lot more specific when you talk about such benchmarks. Here are some things you will need to present:

* Are OSDs connected to the node being tested through the network, or are they closer (NVMe / SAS / SATA)? If network, what's the bandwidth? What's the latency? What if it's something like reliable Ethernet used for iSCSI / NVMe over IP?

* How much memory (relative to data at rest) does the node have?

* What is the layout of memory buffers in PostgreSQL?

* What is the setting used for synchronization in PostgreSQL?

* How much replication is going to happen (Ceph pool size)?

* Block sizes and frame sizes.

* Type of workload. Surprisingly, some queries can exploit parallelism in I/O while others cannot, and some queries need a lot of synchronization while others don't.

And there's more; it would be too tedious to give an exhaustive list of things to control for. The problem is that even just the ones mentioned here can influence performance by an order of magnitude, sometimes two...


I've run into all of these issues. In the end it's much cheaper to just buy a cheap SAN than it probably is to pay for a month of your expertise.

I guess that's why I struggle to turn these into cash-cow consulting gigs: my clients never end up long-term dependent on me.


I also only do short-term engagements. I help customers design and implement the proper environment (Ceph and the like) for their workloads, but I don't run their systems; I always hand over to the rest of the technical staff. Bringing in a cheap SAN basically shifts that responsibility from people like me to the SAN vendor, and they could well bring in hardware that does the job.

Running on K8s, I feel you need two types of storage:

- Block storage, with proper fsync (fast and reliable)

- S3 storage

Both MUST be cloud native. I don't know if cheap SANs come with proper k8s CSI providers, but if they do, they could be up to the challenge.

Note that people like me can help customers with both: choosing the proper 'cheap SAN', but also designing a proper storage environment with Ceph or other software-defined storage solutions.


Why not dedicate some worker nodes using taints/tolerations/labels, even on bare metal, with locally attached storage? I wrote this many years ago now, but that's the reason we started CloudNativePG (OpenEBS might not be the answer today, but there are many storage engines now, including topolvm, which brings LVM into the game): https://www.2ndquadrant.com/en/blog/local-persistent-volumes...
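Something along these lines is what I mean (a sketch with made-up names; in practice the operator templates the pods, but the scheduling primitives are the same):

```yaml
# Assuming the database nodes were prepared with something like:
#   kubectl label node db-node-1 workload=postgres
#   kubectl taint node db-node-1 workload=postgres:NoSchedule
# This pod only lands on those nodes and claims node-local storage;
# "local-nvme" is a made-up StorageClass (topolvm, OpenEBS local PV,
# local-path, etc. could provide it).
apiVersion: v1
kind: Pod
metadata:
  name: pg-on-local-nvme
spec:
  nodeSelector:
    workload: postgres
  tolerations:
    - key: workload
      operator: Equal
      value: postgres
      effect: NoSchedule
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pg-data-local
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data-local
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-nvme       # made-up name for the local-PV engine
  resources:
    requests:
      storage: 100Gi
```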

It is ultimately your choice; I am a big fan of a shared-nothing architecture for the database. (I am a maintainer of CloudNativePG.)


Yeah, and let Postgres take care of redundancy. I agree that this is an interesting proposition. AFAIK Portworx could do a similar thing, but then with storage redundancy. Basically:

- storage is synced to 3 local storage devices spread across 3 different k8s nodes (this could be NVMe)

- the pod is only scheduled next to one of the three

- reads are local, writes are local (for fsync) and synchronised to the other devices

I would love to test pg_tps_optimizer against Portworx.
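If I remember the Portworx model correctly, the storage side of that is roughly a StorageClass like this (the provisioner and parameter names are from memory and may differ per version; Portworx's own scheduler extension then co-locates the pod with one of the replicas):

```yaml
# Sketch of a replicated, node-local volume class in the style described
# above: three copies of each volume spread across nodes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-repl3
provisioner: pxd.portworx.com   # Portworx CSI driver name, from memory
parameters:
  repl: "3"                     # keep three replicas on three different nodes
```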


I both agree and disagree with your comments. Benchmarks should be a comparison, and one can very well compare exactly the same deployment on exactly the same infrastructure with two different storage types without going so deep into the weeds. It is crucial to understand the environment of the actual benchmark, but many of the things you mention are less important unless you want to investigate what is actually going on under the hood (hoping to improve something).

Also note that for many people looking to run database workloads on K8s / Ceph, knowing that someone was able to reach 18k TPS without pulling rabbits out of their sleeves is very helpful, and demanding all of these details basically makes people less willing to share, which is not helpful at all.

Be that as it may, as mentioned in another thread, we ran benchmarks on premise (OpenShift / Ceph), and I will try to answer as many of your questions as possible about these benchmarks. If you want more details, LMK...

* The stack is: OpenShift - RBD - network - Ceph node - VMware VMDK - SAN storage.

* The network (AFAIK) is 10G. I haven't tested network latency or storage latency, but the round trip for a commit (which pgbench and pg_tps_optimizer call latency) took about 30ms running 233 clients / 17k TPS.

* No fancy stuff like reliable Ethernet used for iSCSI / NVMe over IP or anything like that.

* I mostly ran with pg_tps_optimizer, which is designed to test storage performance (not performance from the app perspective); the way it works, things like shared buffer size matter less. But FYI, I ran with 2GB for cluster.spec.resources.limits.memory.

* What is the layout of memory buffers in PostgreSQL? I don't understand what you are trying to get at. Running on K8s, you should trust the operator to deploy as smartly as possible and not worry about stuff like this unless you are actually trying to investigate and fix problems. I ran with standard settings.

* I tested many options, including single instance, async, and sync replication, each with synchronous_commit set to remote_write, on, and remote_apply (see the sketch after this list). These tests were run on Azure VMs, but I am fairly sure running on OpenShift/Ceph does not change that much. The biggest difference was with 13 clients: 12/13k TPS with sync and 17/18k TPS with async. The difference is smaller with a higher number of clients, and since the effect is larger with fewer clients, it is probably less severe on OpenShift/Ceph.

* AFAIK we have Ceph set to keep 3 replicas. TBH, I don't see how this is of much importance: the Ceph RBD kernel driver writes to the replicas in parallel. Doing more in parallel has little impact on latency, and bandwidth is not the issue.

* I don't know the block sizes and frame sizes for sure; I expect default settings (4096).

* Type of workload. Yeah, this is important stuff. First of all, about pg_tps_optimizer: that's where I have the most interesting information. It basically runs update statements on a record in a table, and with 233 clients this is 233 tables. This really tests storage performance (we rule out things like semaphore locks). It might be compared to importing data with a separate client (which could run in parallel) for every table (or partition, if you like). With pgbench (default workload) we see similar graphs, but we run into limitations with pgbench at higher numbers of clients: since all data is in the same table(s), the clients run into contention issues (probably semaphore soft locks). As this is not a limitation of storage, I personally find it less interesting.
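Roughly, the sync/async variants above map onto a CNPG Cluster spec like this (a sketch; field names are from memory of the CNPG docs, and the storage class name is made up):

```yaml
# Sketch of the benchmarked variants: toggle min/maxSyncReplicas for
# sync vs. async, and vary synchronous_commit via postgresql.parameters.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-bench
spec:
  instances: 3
  minSyncReplicas: 1          # set both to 0 for the async variant
  maxSyncReplicas: 1
  postgresql:
    parameters:
      synchronous_commit: "remote_apply"   # or remote_write / on
  resources:
    limits:
      memory: 2Gi             # matches the 2GB limit mentioned above
  storage:
    size: 100Gi
    storageClass: ceph-rbd    # made-up name for the Ceph RBD CSI class
```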


We have run benchmarks on our environment (CNPG, OpenShift and Ceph in the Dutch government) and compared with Azure Postgres and (CNPG on) Azure AKS, using pgbench and pg_tps_optimizer. Ceph is indeed a 'high bandwidth / high latency' storage solution, and as such we could get to comparable TPS but required more clients. With 2 vCPUs the max TPS was about 17k/18k on AKS and also on OpenShift, but on AKS we needed 34 parallel clients and on OpenShift/Ceph we needed 233. More clients <=> more in parallel <=> more TPS on Ceph... If you are interested I can share some graphs.


I run a stateful metadata cache on Kubernetes with a StatefulSet and EBS as block storage. It runs SQLite just fine.
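The shape of it is roughly this (a sketch with made-up names; the gp3 class via the AWS EBS CSI driver is an assumption):

```yaml
# Hypothetical single-replica StatefulSet backing a SQLite metadata cache
# with an EBS-backed PVC. Image and class names are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: metadata-cache
spec:
  serviceName: metadata-cache
  replicas: 1
  selector:
    matchLabels:
      app: metadata-cache
  template:
    metadata:
      labels:
        app: metadata-cache
    spec:
      containers:
        - name: cache
          image: registry.example.com/metadata-cache:latest   # made-up image
          volumeMounts:
            - name: data
              mountPath: /var/lib/cache    # the SQLite file lives here
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3              # assumes the AWS EBS CSI driver
        resources:
          requests:
            storage: 20Gi
```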

As for the "setting up Kubernetes" comment, I think that may have been true of Kubernetes years ago. Nowadays platform engineers generally build on its capabilities continually, and by the time a user is using it to schedule network, compute, and storage, the setup for an application takes maybe a day or so without a template. Most of the platform engineering work I've done on Kubernetes had much more to do with lifecycle management than with agonizing over initial provisioning.


> EBS as block storage

We do the equivalent on GCP. A lot of the criticism about Kubernetes storage seems to come from people using it onprem.

In that context, I can well imagine that it's a PITA to set up well. As rjzzleep commented in this thread:

> Whenever someone runs Kubernetes onprem I tell them to buy a TrueNAS or another cheap SAN. A cheap SAN costs as much as a DevOps expert setting up your Ceph infrastructure and a lot less when you actually run into issues with that software defined storage solution. Once you do that Kubernetes is actually quite nice


Why do you use a StatefulSet? I assume every instance has its own volume with its own SQLite database backing the cache. Why not just a Deployment? Failover would be easier in that case.


I scale it vertically. Cache refresh is not optimal and takes a few hours. I could make it better by having the instances talk to each other, but frankly the service may never actually need to scale. It can handle thousands of rps off that single container.


It's okay, your managed Kubernetes provider also just happens to sell a managed database service! Isn't that convenient?


More like six days lately.



