At Mercari, the responsibilities of the Platform team and the service developmen...

yolo3000 · on March 21, 2024

I would find it annoying for the platform team to readjust the specs of the pods I'm running on. To give insights is valuable, but otherwise it's an invitation for incidents to happen.

saulrh · on March 21, 2024

Doing things I don't understand myself is also a recipe for disaster, and in my experience a rather greater one. The platform team is liable to make mistakes like scaling the service wrong or failing to anticipate upcoming changes. These incidents can be easily resolved by improving monitoring and communication, which are fundamentally useful things that I should already be doing for myriad other reasons. The mistakes I'm likely to make are things like "sequenced a complicated change wrong and null-routed the entire application" or "typo'd a volume name and found out that that autodeletes the entire database including backups", which I am simply not good at avoiding and constitute one of the major reasons I am in engineering instead of ops or IT. We are better off if I do the things I am best at and they do the things they are best at.

yolo3000 · on March 21, 2024

I would say both of what you said and what I said are recipes for disaster, but letting another team do things behind your back on things that you're responsible for, is not something you want to have. How would you feel if the cloud provider engineers suddenly downgraded your nodes to different specs, and causing downtime for your users? I think it's a false premise to assume that the application teams cannot observe their usage patterns and optimize themselves.

AeroNotix · on March 21, 2024

I think this mostly comes down to whether applications can handle downtime if their workloads are restarted, scale up/down based on demand.

It happens shockingly often that applications only support working with a single replica and even worse when those applications cannot run concurrently with replicas of themselves which prevent smooth rolling updates.

IME if applications are fault tolerant of restarts, or support concurrent replicas then scaling up and down to meet demand is absolutely fine.

mkl95 · on March 21, 2024

The reality for most engineers is that their CTOs stopped caring about tech somewhere between the late 90s and mid 2000s. You'll have to put up with processes designed by some dude who still views platform orgs as a bunch of sysadmins and webmasters.

almostdeadguy · on March 21, 2024

Treating performance and reliability (which is inescapably impacted upon by performance characteristics) as externalities is a great way to create perverse incentives for your engineering team.

Also this reads like a cry for help:

> Therefore, to keep our Kubernetes clusters optimized, it would necessitate mandating all teams to perpetually engage in complex manual optimization processes indefinitely, or until Mercari goes out of business.

beeboobaa3 · on March 21, 2024

Or you could learn the platform you are deploying your software to