
Maybe it's normal for a company this size, but I have a hard time following much of the decision-making around these gigantic migrations or technology efforts, because the decisions don't seem to come from any user or company need. There was a similar post from Figma earlier, I think around databases, that left me feeling the same.

For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

When a decision is founded on a desire of the user, it's easy to validate that downstream decisions make sense. When a decision is founded on a technological desire, downstream decisions may make sense in the context of that technical desire, but do they still make sense in the context of the user?

Either I don't understand organizations of this scale, or it is fundamentally difficult for organizations of this scale to identify and reason about valuable work.



Hey, author here. I think you ask a good question and frame it well. I agree that, at least for some major decisions (including this one), "it is fundamentally difficult for organizations of this scale to identify and reason about valuable work."

At its core, we are a platform team building tools, often for other platform teams, that in turn build tools supporting the developers at Figma who create the actual product experience. It is often harder to reason about what the right decisions are when you are further removed from the end user, although it also gives you great leverage. If we do our jobs right, the multiplier effect of getting this platform right impacts the ability of every other engineer to do their job efficiently and effectively (many indirectly!).

You bring up good examples of why this is hard. It was certainly an alternative to say "sorry, we can't support etcd and helm and you will need to find other ways to work around this limitation." This was simply two more data points pushing us toward the conclusion that we were running our Compute platform on the wrong base building blocks.

While difficult to reason about, I do think it's still very much worth trying to do this reasoning well. It's how, as a platform team, we ensure we are tackling the right work to get to the best platform we can. That's why we spent so much time making the decision to go ahead with this, and part of why I thought it was an interesting topic to write about.


I have a constructive recommendation for you and your engineering management for future cases such as this.

First, when some team says "we want to use helm and etcd for some reason and we haven't been able to figure out how to get that working on our existing platform," start by asking them what their actual goal is. It is obscenely unlikely that helm (of all things) is a fundamental requirement to their work. Installing temporal, for example, doesn't require helm and is actually simple, if it turns out that temporal is the best workflow orchestrator for the job and that none of the probably 590 other options will do.

Second, once you have figured out what the actual goal is, and have a buffet of options available, price them out. Doing some napkin math on how many people were involved and how much work had to go into it, it looks to me that what you have spent to completely rearchitect your stack and operations and retrain everyone -- completely discounting opportunity cost -- is likely not to break even in even my most generous estimate of increased productivity for about five years. More likely, the increased cost of the platform switch, the lack of likely actual velocity accrual, and the opportunity cost make this a net-net bad move except for the resumes of all of those involved.


> we can't support etcd and helm and you will need to find other ways to work around this limitation

So am I reading this right: either downstream platform teams or devs wanted to leverage existing helm templates to provision infrastructure, being on ECS locked you out of those, and the water eventually boiled over? If so, that's a pretty strong statement about the platform effect of k8s.


Hi! Thanks for the thoughtful reply.

I understand what you're saying; the thing that worries me, though, is that the input you get from other technical teams is very hard to verify. Do you intend to measure the development velocity of the teams before and after the platform change takes effect?

In my experience it is extremely hard to measure the real development velocity (in terms of value-add, not arbitrary story points) of a single team, not to mention a group of teams over time, not to mention as a result of a change.

This is not necessarily criticism of Figma, as much as it is criticism of the entire industry maybe.

Do you have an approach for measuring these things?


You're right that the input from other technical teams is hard to verify. On the other hand, that's fundamental table stakes, especially for a platform team that has a broad impact on an organization. The purpose of the platform is to delight the paying customer, and every change should have a clear and well documented and narrated line of sight to either increasing that delight or decreasing the frustration.

The canonical way to do that is to ensure that the incoming demand comes with both the ask and also the solid justification. Even at top tier organizations, frequently asks are good ideas, sensible ideas, nice ideas, probably correct ideas -- but none of that is good enough/acceptable enough. The proportion of good/sensible/nice/probably correct ideas that are justifiable is about 5% in my lived experience of 38 years in the industry.

The onus is on the asking team to provide that full, true and complete justification with sufficiently detailed data and in the manner and form that convinces the platform team's leadership. The bar needs to be high and again, has to provide a clear line of sight to improving the life of the paying customer. The platform team has the authority and agency necessary to defend the customer, operations and their time, and can (and often should) say no.

It is not the responsibility of the platform team to try to prove or disprove something that someone wants, and it's not 'pushing back' or 'bureaucracy', it's basic sober purpose-of-the-company fundamentals. Time and money are not unlimited. Nothing is free.

Frequently the process of trying to put together the justification reveals to the asking team that they do not in fact have the justification, and they stop there and a disaster is correctly averted.

Sometimes, the asking team is probably right but doesn't have the data to justify the ask. Things like 'Let's move to K8s because it'll be better' are possibly true but also possibly not. Vibes/hacker news/reddit/etc are beguiling to juniors but do not necessarily delight paying customers. The platform team has a bunch of options if they receive something of that form. "No" is valid, but also so is "Maybe" along with a pilot test to perform A/B testing measurements and to try to get the missing data; or even "Yes, but" with a plan to revert the situation if it turns out to be too expensive or ineffective after an incrementally structured phase 1. A lot depends on the judgement of the management and the available bandwidth, opportunity cost, how one-way-door the decision is, etc.

At the end of the day, though, if you are not making a data-driven decision (or the very closest you can get to one) and doing it off naked/unsupported asks/vibes/resume enhancement/reddit/hn/etc, you're putting your paying customer at risk. At best you'll be accidentally correct. Being accidentally correct is the absolute worst kind of correct, because inevitably there will come a time when your luck runs out and you just killed your team/organization/company because you made a wrong choice, your paying customers got a worse/slower-to-improve/etc experience, and they deserted you for a more soberly run competitor.


People move to K8s (specifically from ECS) so that they can use cloud-provider-agnostic tooling and products. I suspect a lot of larger-company K8s migrations are fueled by a desire to be multicloud or hybrid on-prem, and to mitigate cost, availability, and lock-in risk.


I've heard all of these lip-service justifications before, but I've yet to see anybody actually publish data showing how they saved any money. Would love to be proven wrong by some hard data, but something tells me I won't be.


I'm confident Figma isn't paying published rates for AWS. The transition might have helped them in their rate negotiations with AWS, or it might not have. Hard data on the money saved would be difficult to attribute.


The vast majority of corporate decisions are never justified by useful data analysis, before or after the fact.

Many are so-analyzed, but usually in ways that anyone who paid attention in high school science or stats classes can tell are so flawed that they’re meaningless.

We can’t even measure manager efficacy to any useful degree, in nearly all cases. We can come up with numbers, but they don’t mean anything. Good luck with anything more complex.

Very small organizations can probably manage to isolate enough variables to know how good or bad some move was in hindsight, if they try and are competent at it (… if). Sometimes an effect is so huge for a large org that it overwhelms confounders and you can be pretty confident that it was at least good or bad, even if the degree is fuzzy. Usually, no.

Big organizations are largely flying blind. This has only gotten worse with the shift from people-who-know-the-work-as-leadership to professional-managers-as-leadership.


There are large swaths of the b2b space where (for whatever reason) being in the same cloud is a hard business requirement.


There are good technical reasons for this. Anything latency or throughput sensitive is better done within the same datacenter. There have been submissions about an ffmpeg as a service company and a GPU over TCP company on HN recently that would significantly benefit from 'same cloud'.


True, but if AWS knows your lock-in is less locked-in, I'd bet they'd be more flexible when contracts are up for renewal. I mean, it's possible the blog post's primary purpose was a shot across the bow to their AWS account manager.


> it's possible the blog post's primary purpose was a shot across the bow to their AWS account manager

Isn't it slightly depressing that this explanation is fairly (the most?) plausible?


Our state department of education is one of the biggest networks in the world with about half a million devices. They would occasionally publicly announce a migration to Linux.

This was just a Microsoft licensing negotiation tactic. Before he was CEO, Ballmer flew here to negotiate one of the contracts. The discounts were epic.


It may look like I'm implying that companies actually succeed in getting those things from a K8s transition, but I wasn't trying to say that; I was just thinking of the times I've seen these migrations happen and relaying the stated aims. I agree; I think it can be a burner of dev time and a burden on the business as devs acquire the new skill set instead of doing more valuable work.


Likewise. I'm not sure Kubernetes' famous complexity (and the resulting staffing requirements) is worth it to preemptively avoid vendor lock-in, or that the problem wouldn't be solved more efficiently by migrating to another cloud provider's native tools if the need ever arises.


Why would you assume it's lip-service?

Being vendor-locked into ECS means you must pay whatever ECS wants... using k8s means you can feasibly pick up and move if you are forced.

Even if it doesn't save money today it might save a tremendous amount in the future and/or provide a much stronger position to negotiate from.


Great in theory but in practice when you do K8s on AWS, the AWS stuff leaks through and you still have lock-in.


It doesn't have to be that way though. You can use the AWS ingress controller, or you can use ingress-nginx. You can use external secrets operator and tie it into AWS Secrets manager, or you can tie it into 1pass, or Hashicorp Vault.

Just like with picking EKS, you have to be aware of the pros and cons of picking the cloud provider's tool or not. Luckily the CNCF is doing a lot to reduce vendor lock-in, and I think that will only continue.
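To make the portability point concrete, here's a minimal sketch (names and host are illustrative): the same Ingress manifest can be served by the AWS Load Balancer Controller or by ingress-nginx just by swapping the `ingressClassName`.

```yaml
# Hypothetical Ingress: switch ingressClassName between "nginx" and "alb"
# without touching the routing rules themselves.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx   # or "alb" when using the AWS Load Balancer Controller
  rules:
    - host: app.example.com  # illustrative hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```

Controller-specific behavior (annotations, TLS handling) can still leak in, but the core spec stays portable.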


Then don't use the AWS stuff. You can bring your own anything that they provide.


This requires iron discipline. Maybe with some kind of linter for Terraform / kubectl it could be done.


Amazon ECS is and always has been free of charge. You pay for the underlying compute and other resources (just like you do with EKS, too), but not the orchestration service.


I don't understand why this "you shouldn't be vendor-locked" rationalization is taken at face value at all?

1. The time it will take to move to another cloud is proportional to the complexity of your app. For example, if you're a Go shop using managed persistence, are you more vendor-locked in any meaningful way than you would be with k8s? What's the delta here?

2. Do you really think you can haggle with the fuel producers like you're Maersk? No, you're more likely just a car driving around looking for a gas station, with increasingly diminishing returns.


This year alone we've seen significant price increases from web services, including critical ones such as Auth. If you are vendor-locked into, say Auth0, and they increase their price 300%[1]... What choice do you have? What negotiation position do you have? None... They know you cannot leave.

It's even worse when your entire platform is vendor-locked.

There is nothing but upside to working towards a vendor-neutral position. It gives you options. Even if you never use those options, they are there.

> Do you really think you can haggle

At the scale of someone like Figma? Yes, they do negotiate rates - and a competent account manager will understand Figma's position and maximize the revenue they can extract. Now, if the account rep doesn't play ball, Figma can actually move their stuff somewhere else. There's literally nothing but upside.

I swear, it feels like some people are just allergic to anything k8s and actively seek out ways to hate on it.

[1] https://auth0.com/blog/upcoming-pricing-changes-for-the-cust...


Why skip point 1 and go off on a strange tangent about a SaaS product unrelated to using k8s or not?

Most people looking into (and using) k8s who are being told the "you must avoid vendor lock-in!" selling point are nowhere near the size where it matters. But I know there's essentially bulk pricing, as we have it where I work as well. That it's because of picking k8s or not, however, is an extremely long stretch, and imo mostly rationalization. There's nothing saying that a cloud move without k8s couldn't be done within the same amount of time. Or even that k8s is the main factor; I imagine it isn't, since it's usually supposed to be stateless apps.


The point was about vendor lock-in, which you asserted is not a good reason to make a move such as this. The "tangent" about a SaaS product was to make clear what happens when you build your system in such a way as to become entirely dependent on that vendor. Just because Auth0 is not part of one of the big "cloud" providers doesn't make it any less vendor-locky. Almost all of the vendor services offered on the big clouds are extremely vendor-locked and non-portable.

Where you buy compute from is just as big of a deal as where you buy your other SaaS' from. In all of the cases, if you cannot move even if you had to (ie. it'll take 1 year+ to move), then you are not in a good position.

Addressing your #1 point: if you use a regular database that happens to be offered by a cloud provider (i.e. Postgres, MySQL, MongoDB, etc.) then you can pick up and move. If you use something proprietary like Cosmos DB, then you are stuck or face significant effort to migrate.

With k8s, moving to another cloud can be as simple as creating an account and updating your configs to point at the new cluster. You can run every service you need inside your cluster if you wanted. You have freedom of choice and mobility.

> Most people looking into (and using) k8s who are being told the "you must avoid vendor lock-in!" selling point are nowhere near the size where it matters.

This is simply wrong, as highlighted by the SaaS example I provided. If you think you are too small for it to matter, and decide to embrace all of the cloud vendor's proprietary services, what happens to you when that cloud provider decides to change their billing model or dramatically increases prices? You are screwed and have no option but to cough up more money.

There's more decisions to make and consider regarding choosing a cloud platform and services than just whatever is easiest to use today - for any size of business.

I have found that, in general, people are afraid of using k8s because it isn't trivial to understand for most developers. People often mistakenly believe k8s is only useful when you're "google scale". It solves a lot of problems, including reduced vendor-lock.


I would assume that the migration from ECS to something else would be a lot easier, compared to migrating from other managed services, such as S3/SQS/Kinesis/DynamoDB, and especially IAM, which ties everything together.


Flexibility was a big thing for us. Many different jurisdictions required us to be conscious of where exactly data was stored & processed.

K8s makes this really easy. You don't need to worry about whether country X has a local cloud data center from vendor Y.

Plus it makes hiring so much easier as you only need to understand the abstraction layer.

We don't hire people for ARM64 or x86. We have abstraction layers. Multiple even.

We'd be fooling ourselves not to use them.


People move to K8s so that their resumes and job ads are cloud provider agnostic. People's careers stagnate when their employer's platform is home-baked tech, or specific offerings from cloud providers. Employers find moving to a common platform makes recruiting easier.


Most of this, I think, is to support on-prem and cloud flexibility. Also, from the customer's point of view, they can now sell the entire Figma "box" to regulated industries for a premium.


There's a pretty direct translation from an ECS task definition to a docker-compose file.
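As a rough illustration of that translation (service name, image, and values are all hypothetical), the common ECS task-definition fields map almost one-to-one onto compose keys:

```yaml
# Hypothetical docker-compose service; comments note the roughly
# equivalent ECS task-definition fields.
services:
  web:
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest  # containerDefinitions[].image
    ports:
      - "8080:8080"          # portMappings
    environment:
      - LOG_LEVEL=info       # containerDefinitions[].environment
    logging:
      driver: json-file      # logConfiguration
    deploy:
      resources:
        limits:
          cpus: "0.5"        # cpu (ECS CPU units / 1024)
          memory: 512M       # memory
```

What doesn't translate directly is the surrounding orchestration: IAM task roles, service discovery, and capacity providers have no compose equivalent.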


Managing 500 or more VMs is a lot of work.

The VM upgrades, auth, backups, log rotation, etc., alone.

With k8s I can give everyone a namespace, policies, and volumes, and get automatic log aggregation thanks to DaemonSets and k8s/cloud-native stacks.

Self-healing and more.

It's hard to describe how much better it is.
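For what it's worth, the namespace-per-team setup described above is only a few lines of manifest; a sketch with made-up names and limits:

```yaml
# Hypothetical per-team namespace plus a resource quota capping what the
# team can request. A DaemonSet-based log agent (not shown) would then
# cover every node in the cluster automatically.
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    pods: "100"
```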


> For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

I'm no fan of Helm, but there are surprisingly few good alternatives to etcd (i.e. highly available but consistent datastores, suitable for e.g. the distributed equivalent of a .pid file) - Zookeeper is the only one that comes to mind, and it's a real pain on the ops side of things, requiring ancient JVM versions and being generally flaky even then.


Here's a theory about why at least some of these come about:

https://lethain.com/grand-migration/


If you haven't broken down your software into 50+ different separate applications written in 15 different languages using 5 different database technologies - you'll find very little use for k8s.

All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations, be they data- or schema-related.


> All you need is a way to roll out your artifact to production in a rolling or blue-green fashion, after preparations such as any required database alterations

Easier said than done.

You can start by implementing this yourself and thinking how simple it is. But then you find that you also need to decide how to handle different environments, configuration and secret management, rollbacks, failover, load balancing, HA, scaling, and a million other details. And suddenly you find yourself maintaining a hodgepodge of bespoke infrastructure tooling instead of your core product.

K8s isn't for everyone. But it sure helps when someone else has thought about common infrastructure problems and solved them for you.
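As one concrete example of "someone else has solved it": a Deployment sketch (image and probe path are hypothetical) that declaratively handles a health-gated rolling update, with rollback available via `kubectl rollout undo deployment/web`:

```yaml
# Hypothetical Deployment: new pods must pass the readiness probe before
# old ones are removed; maxUnavailable: 0 keeps capacity during rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3  # illustrative image
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```

Homerolling the equivalent means scripting the surge, health checks, traffic cutover, and rollback yourself.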


You need to remove a lot of things from that list. Almost all of that functionality is available in build tools that have been available for decades. I want to emphasize the DECADES.

And then all you're left with is scaling. Which most business do not need.

Almost everything you've written there is a standard feature of almost any CI toolchain, teamcity, Jenkins, Azure DevOps, etc., etc.

We were doing it before k8s was even written.


> Almost all of that functionality is available in build tools that have been available for decades.

Build tools? These are runtime and operational concerns. No build tool will handle these things.

> And then all you're left with is scaling. Which most business do not need.

Eh, sure they do. They might not need to hyperscale, but they could sure benefit from simple scaling, autoscaling at peak hours, and scaling down to cut costs.

Whether they need k8s specifically to accomplish this is another topic, but every business needs to think about scaling in some way.
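For scale specifically, the k8s-native version of "simple autoscaling at peak hours" is a short manifest; a hypothetical sketch, assuming a Deployment named `web` already exists:

```yaml
# Hypothetical HorizontalPodAutoscaler: scale web between 2 and 10
# replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```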

> Almost everything you've written there is a standard feature of almost any CI toolchain, teamcity, Jenkins, Azure DevOps, etc., etc.

Huh? Please explain how a CI pipeline will handle load balancing, configuration and secret management, and other operational tasks for your services. You may use it for automating commands that do these things, but CI systems are entirely decoupled from core infrastructure.

> We were doing it before k8s was even written.

Sure. And k8s isn't the absolute solution to these problems. But what it does give you is a unified set of interfaces to solve common infra problems. Whatever solutions we had before, and whatever you choose to compose from disparate tools, will not be as unified and polished as what k8s offers. It's up to you to decide the right trade-off, but I find the head-in-the-sand dismissal of it equally as silly as cargo culting it.


There's an entire part of all the build tools that handles configuration and secret management. Here's an example:

https://learn.microsoft.com/en-us/azure/devops/pipelines/pro...

They've always had this stuff. Back in the day it used to be XML transforms. I know azure have added key vault in recent years as well so you can get the values from the key vault instead, but it seems a cynical cash grab to me.

For load balancing, there's a reason the load-balancing servers all have APIs.

Almost no business will save money scaling. Think about it. Most businesses are small and medium businesses that just don't have the volume of traffic to justify anything more than buying or renting one or two servers, versus the cost of having a scaling solution in place. There are only a small number of businesses (in the scheme of things) that benefit from scaling, as you have to pay to do it. And even then the saving is often so minuscule compared to a dev salary that you lose money paying people to even set it up and will NEVER recoup that cost.

The cost of even having a single meeting with 4 people about it is more than the saving over a year.


But you do know which problems the k8s abstraction solves, right? Cause it has nothing to do with many languages nor many services but things like discovery, scaling, failover and automation …


If all you have is one single application listening on port 8080, with SSL terminated elsewhere, why would you need so many abstractions in the first place?


Yeah, all you need is a rollout system that supports blue-green! Very easy to homeroll ;)


Not easy, but already a solved problem.


> I have a hard time following much of the decision making around these gigantic migrations or technology efforts because the decisions don't seem to come from any user or company need

I mean, the blog post is written by the team that decides on the company's needs. They explained exactly why they can't easily use etcd on ECS due to technical limitations. They also talked about many other technical limitations that were causing them issues and increasing costs. What else are you expecting?



