Tortoise: Shell-Shockingly-Good Kubernetes Autoscaling (github.com/mercari)
28 points by sanposhiho on March 21, 2024 | 28 comments



At Mercari, the responsibilities of the Platform team and the service development teams are clearly distinguished. Not all service owners possess expert knowledge of Kubernetes.

Also, Mercari has embraced a microservices architecture, currently managing over 1000 Deployments, each with its dedicated development team.

To effectively drive FinOps across such a sprawling landscape, it's clear that the platform team cannot individually optimize all services. As a result, they provide a plethora of tools and guidelines to simplify the process of Kubernetes optimization for service owners.

But even with these, manually optimizing various parameters across different resources, such as resource requests/limits, HPA parameters, and Golang runtime environment variables, presents a substantial challenge.

Furthermore, this optimization demands constant engineering effort from each team: adjustments are necessary whenever there is a change impacting resource usage, which can occur frequently. Changes in implementation can alter resource consumption patterns, fluctuations in traffic volume are common, etc.

Therefore, keeping our Kubernetes clusters optimized would require mandating that all teams perpetually engage in complex manual optimization processes, indefinitely, or until Mercari goes out of business.

To address these challenges, the platform team has embarked on developing Tortoise, an automated solution designed to meet all Kubernetes resource optimization needs.

This approach shifts the optimization responsibility from service owners to the platform team (and its Tortoises), allowing comprehensive tuning by the platform team to ensure that all Tortoises in the cluster adapt to each workload. Service owners, on the other hand, are required to configure only a minimal number of parameters to initiate autoscaling with Tortoise, significantly simplifying their involvement.
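
For a sense of how little service owners have to write, a minimal Tortoise resource looks roughly like the sketch below. This is only an approximation; check the repository README for the exact API version and field names:

    # Rough sketch of a minimal Tortoise resource; API version and field
    # names are approximate, see github.com/mercari/tortoise for the real one.
    apiVersion: autoscaling.mercari.com/v1beta3
    kind: Tortoise
    metadata:
      name: lovely-tortoise
      namespace: zoo
    spec:
      updateMode: Auto          # let Tortoise apply its recommendations automatically
      targetRefs:
        scaleTargetRef:         # the workload Tortoise scales and right-sizes
          kind: Deployment
          name: sample

Everything else (HPA targets, resource requests and limits) is then tuned by Tortoise rather than by the service owner.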


I would find it annoying for the platform team to readjust the specs of the pods I'm running. Giving insights is valuable, but otherwise it's an invitation for incidents to happen.


Doing things I don't understand myself is also a recipe for disaster, and in my experience a rather greater one. The platform team is liable to make mistakes like scaling the service wrong or failing to anticipate upcoming changes. These incidents can be easily resolved by improving monitoring and communication, which are fundamentally useful things that I should already be doing for myriad other reasons. The mistakes I'm likely to make are things like "sequenced a complicated change wrong and null-routed the entire application" or "typo'd a volume name and found out that that autodeletes the entire database including backups", which I am simply not good at avoiding and constitute one of the major reasons I am in engineering instead of ops or IT. We are better off if I do the things I am best at and they do the things they are best at.


I would say both what you said and what I said are recipes for disaster, but letting another team do things behind your back, on things that you're responsible for, is not something you want to have. How would you feel if the cloud provider's engineers suddenly downgraded your nodes to different specs, causing downtime for your users? I think it's a false premise to assume that the application teams cannot observe their usage patterns and optimize themselves.


I think this mostly comes down to whether applications can handle their workloads being restarted and scaled up/down based on demand.

It happens shockingly often that applications only support running as a single replica, and it's even worse when those applications cannot run concurrently with replicas of themselves, which prevents smooth rolling updates.

IME, if applications are fault tolerant of restarts or support concurrent replicas, then scaling up and down to meet demand is absolutely fine.


The reality for most engineers is that their CTOs stopped caring about tech somewhere between the late 90s and mid 2000s. You'll have to put up with processes designed by some dude who still views platform orgs as a bunch of sysadmins and webmasters.


Treating performance and reliability (which is inescapably affected by performance characteristics) as externalities is a great way to create perverse incentives for your engineering team.

Also this reads like a cry for help:

> Therefore, keeping our Kubernetes clusters optimized would require mandating that all teams perpetually engage in complex manual optimization processes, indefinitely, or until Mercari goes out of business.


Or you could learn the platform you are deploying your software to


Wasn’t “Shellshock” the name for a very severe security bug? When I read “shell-shockingly good Kubernetes”, at first, I thought this was an article warning about container security.


I believe it was originally a name for PTSD, which is an even more curious association. In case it wasn't obvious, tortoise-related puns can be taken too far :)


Yes: shell-shock is never good, which makes "shell-shockingly good" a poor turn of phrase even in the context of terrapins.


Yes. It was a bug in bash -- the Bourne again shell -- that permitted the execution of arbitrary code by embedding a function definition into environment variables. It was a Big Deal because it was remotely exploitable in systems that set environment variables according to user input before executing the shell -- the most severe was probably the Apache HTTP Server's CGI handler, which sets environment variables from HTTP header data.

I had the same initial reaction, though. "Kubernetes? Shellshock? Oh noes!"


From this blog (https://engineering.mercari.com/en/blog/entry/20240206-3a12b...), it appears that Tortoise provides a feature that I think could be called managed HPA. It offers a way for administrators to configure and manage complex HPA policies while requiring minimal configuration from application developers (i.e., the users of HPA). BTW, I have a pet turtle and really like the pictures in the documentation.


All our scaling issues are database-bound, but all the auto-scaling solutions seem to tackle CPU-bound problems. Are we unique, or do most folks encounter CPU-bound scaling issues?


Most startups use inefficient languages like Ruby, PHP, Elixir, or JavaScript. They also generally "move fast" and don't stop to optimise anything. So it's not unusual for their code to end up 10-100x slower than it could be if developed with something like ASP.NET.

Additionally, many startups develop apps for scenarios where the database scaling can be solved. Typically they have many customers and can shard the data, or they can tolerate eventual consistency and hence can just throw caching layers at the problem.

Transactional enterprise apps can't do this: they're single-instance, and caching could cause all sorts of weird data loss or corruption. Hence, the database becomes the most common bottleneck.


Most just use the out-of-the-box resource metrics (CPU and memory) available in the HPA.

For more advanced use cases there is keda - https://keda.sh/
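
For reference, the out-of-the-box path is just a plain autoscaling/v2 HPA scaling on CPU (or memory) utilization, something like the sketch below (names are placeholders):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service                 # placeholder name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 3
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # add/remove replicas to keep average CPU near 70% of requests

Anything beyond per-pod CPU/memory (queue depth, request rate, external metrics) is where KEDA comes in.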


Is there a good example case study that shares all the tweaking this program did to modify HPAs and VPAs, even just some charts over a year of use?


I think autoscaling is an issue on k8s. The smart autoscalers are written by Google or Azure for their own services and, as far as I know, are neither open source nor available in k8s.

There should be a k8s metric like 'cluster utilization' or a 'packing metric', and the k8s autoscaler should reschedule pods based on it, but rescheduling only exists for node drain, priority, and resource issues, not for packing a cluster.

You can only do this with the Descheduler, which is not part of the k8s distribution itself...

That's it for my generic k8s rant (I do love k8s nonetheless).

What's missing is an annotation on the specs to clarify how aggressively an autoscaler may handle certain pods (like: never move this stateful workload, reschedule this type of pod only at night, or look at this metric to see when it's less active).
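
Hypothetically, something like the following on a pod template; none of these annotation keys exist today, they only illustrate the idea:

    # Purely hypothetical annotation keys, to illustrate the kind of hints meant above
    metadata:
      annotations:
        autoscaling.example.io/eviction-policy: "never"           # never move this stateful workload
        autoscaling.example.io/repack-window: "01:00-05:00"       # only reschedule it at night
        autoscaling.example.io/activity-metric: "queue_depth"     # consult this metric to judge idleness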

From an algorithm standpoint, it's probably the same issue Java has with old and young GC generations: there are plenty of stable pods which just need to run always. That's the base load: easy to pack, easy to keep aligned, seldom needing repacking. Then you have the young generation: it's unclear how long the workload is required. That might mean you have a big node running nearly empty for an additional x hours, just for the pod to finish its task or for the load to scale down again. If you know more about the type of workload, you could decide to spin up one big node or a lot of smaller nodes from a number of node pools. If you know that this job runs for 20 hours but is gone afterwards, put it on a small node.

You could also adopt a strategy (depending on the cluster size) of keeping only one big node between 1-99% utilization and making sure that all other nodes are always packed.

The project itself is, tbh, shitty at describing how it works. I wasn't able to find, in either the README or the linked blog post, HOW they are doing it. The key information is missing: what algorithm does it use?

I'd love to have more autoscalers, though.


Microsoft does a good job with KEDA, providing an open source autoscaling architecture that isn't tied to Azure.

https://keda.sh/ - project website

https://github.com/kedacore/ - code
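
A ScaledObject drives an HPA from an external metric and can scale to zero; a rough sketch (field names from memory, double-check against the KEDA docs):

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: worker-scaler              # placeholder name
    spec:
      scaleTargetRef:
        name: worker                   # Deployment to scale
      minReplicaCount: 0               # KEDA can scale to zero, unlike a plain HPA
      maxReplicaCount: 50
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring:9090
            query: sum(rate(http_requests_total[2m]))
            threshold: "100"           # target value per replica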


The problem, I suspect, is that different places have different needs from a "packing metric" and a scheduler, so there's almost certainly no one-size-fits-all solution. The big players probably have custom schedulers capable of doing all the things we wish the standard K8s scheduler could do (de-scheduling, reevaluating existing pods, purgatory pools, etc.), which nobody else has the spare time and resources to build. I love the idea, but as a counterargument: one previous product at my day job ran customer workloads. Those had extremely well-defined min and max resources, and we were primarily interested in packing as many onto a node as we possibly could. Conversely, our current product has radically different needs for ideal packing: minimal nodes, such that certain critical workloads have the space to expand as needed.


Absolutely, and it's not an easy problem; that's why I would like to see it as a core component of k8s and not implemented in random open source projects.

The decision or packing algorithm could perhaps be architected to be more extensible or fine-tuned, but the core should be there, and I do miss specs which would tell any autoscaler how to act/interact with different pods.

I would also love to see an autoscaler simulator.

It's a hard problem at the heart of k8s which gets fully ignored.



Yep, I'm aware of it.

It has the same issues as the project in this HN thread: it's not part of k8s, and it also doesn't tell you its strategy at a glance.


I'm pretty sure cluster-autoscaler evicts pods if it decides that a particular node could be removed. I saw that in my cluster and had to apply an annotation to some pods to prevent it. So it does do packing to some extent.
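
For anyone hitting the same thing, the annotation is presumably cluster-autoscaler's safe-to-evict, set on the pod template:

    spec:
      template:
        metadata:
          annotations:
            # tells cluster-autoscaler not to evict this pod when consolidating nodes
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"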

I'm using managed Kubernetes, but from a relatively small provider, and I doubt they did any custom coding for their autoscaler.

I agree that it could do more. Sometimes I have to manually juggle pods around to remove a node, if the scheduler did a bad job.


I'm pretty sure your provider uses some autoscaler from one of the projects.

It has to know how to bring up new nodes, which is provider-dependent.

But that's my criticism: I believe the autoscaler should be a k8s core component, not a matter of 'select whatever open source project you like, based on no evidence, or, if you are lucky, based on the experience someone else has built up over the years with all the choices out there'.

But as you write yourself: you still need to do things manually, and I bet you don't have that much control over telling the autoscaler to act differently depending on the pods.


That's a good analogy with GC generations.


We have a similar in-house utility that looks at the P90 resource utilization (via Prometheus) over the last 48 hours and adjusts requests to that level.

This allows our scheduler to place/pack the workload according to our preference.


The cute tortoise pictures are great!



