Maybe a silly question, but why all this engineering effort when you could host the dev environment locally?
By running a Linux VM on your local machine you get a consistent environment that you can ssh to; you remove the latency issues, and you also remove all the complexity of syncing that they’ve created.
That’s a setup that’s worked well for me for 15 years but maybe I’m missing some other benefit?
I work on this at Stripe. There's a lot of reasons:
* Local dev has laptop-based state that is hard to keep in sync for everyone. Broken laptops are _really hard_ to debug as opposed to cloud servers I can deploy dev management software to. I can say with confidence what the oldest version of any software in my cloud is; the laptops skew across literally years of versions of dev tools despite a talented corpeng team managing them.
* Our cloud servers have a lot more horsepower than a laptop, which is important if a dev's current task involves multiple services.
* With a server, I can get detailed telemetry on how devs work and what they actually wait on, which helps me understand what to work on next; I'd have to install pretty invasive spyware on laptops to do the same.
* Servers in our QA environment can interact with QA services in a way that is hard for a laptop to do. Some of these are "real services", others are incredibly important to dev itself, such as bazel caches.
There's other things; this is an abbreviated list.
If a Linux VM works for you, keep working! But we have not been able to scale a thousands-of-devs experience on laptops.
I want to double check we’re talking about the same thing here. I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.
I’m sure there are a bunch of things that make it the right choice for Stripe. Obviously if you just have too many things to run at a time and a dev laptop can’t handle it then it’s a dealbreaker. What’s the size of the cloud instances you have to run on?
> I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.
I don't think there's confusion. I only have total access when the VM is provisioned, but I need to update the dev machine constantly.
Part of what makes a VM work well is that you can make changes and they're sticky. Folks will edit stuff in /etc, add dotfiles, add little cron jobs, build weird little SSH tunnels, whatever. You say "I can know versions", but with a VM, I can't! Devs will update stuff locally.
As the person who "deploys" the VM, I'm left in a weird spot after you've made those changes. If I want to update everyone's VM, I blow away your changes (and potentially even the branches you're working on!). I can't update anything on it without destroying it.
In contrast, the dev servers update constantly. There are a dozen moving parts on them and most of them deploy several times a day without downtime. There's a maximum host lifetime and well-documented hooks for how to customize a server when it's created, so it's clear how devs need to work with them for their customizations and what the expectations are.
I guess it's possible you could have a policy about when the dev VM is reset and get developers used to it? But I think that would be taking away a lot of the good parts of a VM when looking at the tradeoffs.
> What’s the size of the cloud instances you have to run on?
We have a range of options devs can choose, but I don't think any of them are smaller than a high-end laptop.
So the devs don’t have the ability to ssh to your cloud instances and change config? Other than the size issue, I’m still not seeing the difference. Take your point on it needing to start before you have control, but other than that a VM on a dev machine is functionally the same as one in a cloud environment.
In terms of needing to reset, it’s just a matter of git branch, push, reset, merge. In your world that sync complexity happens all the time, in mine just on reset.
Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.
I have no doubt Stripe does what makes sense for Stripe. I’d also wager that, on balance, it’s not the best option for most other teams.
PS thanks for chiming in. I appreciate the extra insights and context.
> So the devs don’t have the ability to ssh to your cloud instances and change config?
They do, but I can see those changes if I'm helping debug, and more importantly, we can set up the most important parts of the dev processes as services that we can update. We can't ssh into a VM on your laptop to do that.
For example, if you start a service on a stripe machine, you're sending an RPC to a dev-runner program that allocates as many ports as are necessary, updates a local envoy to make it routable, sets up a systemd unit to keep it running, and so forth. If I need to update that component, I just deploy it like anything else. If someone configures their host until that dev runner breaks, it fails a healthcheck and that's obvious to me in a support role.
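To make that concrete, here's a heavily simplified sketch of the shape of such a runner (invented names throughout; our real dev-runner is much more involved, and the Envoy-routing step is elided):

```go
// Toy sketch of a dev-runner-style service manager: allocate a free port,
// write a systemd unit so the service stays up, and start it. All names
// are invented; the real thing also updates Envoy routing, healthchecks,
// and much more.
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// freePort asks the kernel for an unused TCP port.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func startService(name, execStart string) error {
	port, err := freePort()
	if err != nil {
		return err
	}
	// A unit file keeps the service running (Restart=always) and passes
	// the allocated port in the environment.
	unit := fmt.Sprintf(
		"[Unit]\nDescription=dev service %s\n\n[Service]\nEnvironment=PORT=%d\nExecStart=%s\nRestart=always\n",
		name, port, execStart)
	path := fmt.Sprintf("/etc/systemd/system/dev-%s.service", name)
	if err := os.WriteFile(path, []byte(unit), 0o644); err != nil {
		return err
	}
	for _, args := range [][]string{{"daemon-reload"}, {"restart", "dev-" + name + ".service"}} {
		if err := exec.Command("systemctl", args...).Run(); err != nil {
			return err
		}
	}
	// Here the real runner would also update the local Envoy config so the
	// service is routable by name.
	fmt.Printf("%s is up on port %d\n", name, port)
	return nil
}

func main() {
	if err := startService("payments", "/srv/payments/bin/server"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The point is that every piece of that flow is a deployable service we own, so we can change it without touching anyone's box by hand.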
> Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.
100% agree! I think we've got something pretty cool, but this stuff is coming from a well-resourced team; the group keeping the infra for it all running is larger than many startups. There are tradeoffs involved: cost, user support, and flexibility on the dev side (i.e. it's harder to add something to our servers than to test out a new kind of database on your local VM) come immediately to mind, but there are others.
There are startups doing lighter-weight, legacy-free versions of what we're doing that are worth exploring for organizations of any size. But remote dev isn't the right call for every company!
Ah! So that’s a spot where we’re talking past each other.
I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.
The only difference between what you do and what I’m suggesting is the location of the VM. That itself creates some tradeoffs but I would expect absolutely everything inside the machine to be the same.
> I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.
Our laptops don't receive connections, but even if they could, folks go on leave and turn them off for 9 months at a time, or they don't get updated for whatever reason, or other nutty stuff.
With a few thousand laptops out there, it's surprisingly common that the management code which removes old versions of a tool is itself removed after months, but laptops still pop up with the old version as folks turn them back on after a very long time, and the old tool lingers. The services the tools interact with have long since stopped working with the old version, and the laptop behaves in unpredictable ways.
This doesn't just apply to hypothetical VMs, but to various CLI tools that we deploy to laptops, and we still have trouble there. The VMs are just one example, but a guiding principle for us has been that the less that's on the laptop, the more control we have, and thus the better we can support users with issues.
Maybe I'm missing something here but couldn't you just track the whole VM setup (dependencies, dev tools, telemetry and everything) in your monorepo? That is, the VM config would get pulled from master just like everything else, and then the developer would use something like nixos-shell[0] to quickly fire up a VM based on that config that they pulled.
Yes, but this still "freezes" the VM when the user creates it, and I've got no tools to force the software running in it to be updated. It's important that boxes can be updated, not just reliably created.
As just one reason why, many developers need to set up complex test data. We have tools to help with that, but they take time to run and each team has their own needs, so some of them still have manual steps when creating a new dev server. These devs tend to re-use their servers until our company-wide max age. Others, to be fair, spin up a new machine for every branch, multiple times per day, and spinning up a new VM might not be burdensome for them.
Isn't this a matter of not reusing old VMs after a `git pull/checkout`, though? (So not really different from updating any other project dependencies?) Moreover, shouldn't something like nixos-shell take care of this automatically if it detects the VM configuration (Nix config) has changed?
> Isn't this a matter of not reusing old VMs after a `git pull/checkout`, though?
Yes, but forcing people to rebase is disruptive. Master moves several times per minute for us, so we don't want people needing to upgrade at the speed of git. Some things you have to rebase for, like the code you're working on; other things are the dev environment around your code, and you want to keep that out of the checkout as much as possible. And as per my earlier comment, setting up a fresh VM can be quite expensive in terms of developer time if test data needs to be configured.
You seem to assume you would have to rebuild the entire VM whenever any code in git changes in any way. I don't think you do: You could simply mount application code (and test data) inside the VM. In my book, the VM would merely serve to pin the most basic dependencies for running your integration / e2e tests and I don't think those would change often, so triggering a VM rebuild should produce a cache hit in 99% of the cases.
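To sketch the idea, a small wrapper could compare a hash of the pinned VM config against the one the running VM was built from, and only rebuild on a miss (hypothetical paths and commands; in practice you'd hash the full dependency closure, not a single file):

```go
// Hypothetical wrapper: reuse the running dev VM when its pinned config is
// unchanged, rebuild it otherwise. Paths and the boot command are invented.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	const vmConfig = "dev/vm.nix" // pinned VM definition, versioned in the monorepo
	const stamp = ".dev-vm.hash"  // hash of the config the current VM was built from

	cfg, err := os.ReadFile(vmConfig)
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(cfg)
	current := []byte(fmt.Sprintf("%x", sum))

	previous, _ := os.ReadFile(stamp) // a missing stamp just means "no VM yet"
	if bytes.Equal(previous, current) {
		fmt.Println("VM config unchanged; reusing the existing VM (cache hit)")
		return
	}

	// Cache miss: (re)build and boot the VM from the pinned config, then
	// record what it was built from.
	cmd := exec.Command("nixos-shell", vmConfig)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
	if err := os.WriteFile(stamp, current, 0o644); err != nil {
		panic(err)
	}
}
```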
I think this is where our contexts may differ, and so we end up with different tradeoffs and choices :) The services running on our dev servers are updated dozens of times per day, and they roughly correspond to the non-code parts of a VM.
Or maybe we just used terminology differently. :) Why wouldn't those services be part of the code? After all, I thought we were talking about a monorepo here.
I see in another comment thread you mentioned downloading the VM iso, presumably from a central source. Your comment in this thread didn't mention that so perhaps this answer (incorrectly) assumes the VM you are talking about was locally maintained/created?
To provide historical context, 10 years ago there was a local dev infrastructure, but it was already so creaky as to be unreliable. Just getting the ruby dependencies updated was a problem.
The local dev was also already cheating: All the asynchronous work that was triggered via RabbitMQ/Kafka was getting hacked together, because trying to run everything that Infra/Queues did locally would have been very wasteful. So magic occurred in the calls to the message queue that instead triggered the crucial ruby code that would be hit in the end.
So if this was a problem back then, when the company had fewer than 1000 employees, I can't even imagine how hard it would be to get local dev working now.
Sounds like you made a massive tradeoff in code coupling if you can't easily swap out remote for local queues etc. But I get it: when you're thinking cloud-first, understanding where your abstractions start or end can be a complex topic that creates flow-on effects and often stops the whizz-bang cloud demo code from copy/paste working in your solution. Depending on the stage of your company, this could be a feature or a bug. Maybe you have so much complexity in your solution from spreading business logic across services that it only makes sense when you're developing against prod-like infra, and in that scenario I see a benefit to having cloud-first dev infra, because keeping that beast tamed otherwise would be a monumental challenge given the penchant for cloud-first to be auto-update-everything.
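For what it's worth, the seam doesn't have to be big. Here's a toy sketch (in Go, with invented names) of the kind of abstraction that makes swapping a real broker for an in-process queue possible:

```go
// Toy sketch of a queue seam: production code publishes through an
// interface, and the dev environment binds an in-process implementation
// instead of Kafka/RabbitMQ. All names are illustrative.
package main

import "fmt"

// Publisher is the seam: application code only ever sees this interface.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// InProcessQueue dispatches to registered handlers synchronously, i.e. the
// "magic" a local dev environment can use instead of a real broker.
type InProcessQueue struct {
	handlers map[string]func([]byte)
}

var _ Publisher = (*InProcessQueue)(nil) // compile-time interface check

func NewInProcessQueue() *InProcessQueue {
	return &InProcessQueue{handlers: map[string]func([]byte){}}
}

func (q *InProcessQueue) Subscribe(topic string, h func([]byte)) {
	q.handlers[topic] = h
}

func (q *InProcessQueue) Publish(topic string, payload []byte) error {
	h, ok := q.handlers[topic]
	if !ok {
		return fmt.Errorf("no handler for topic %q", topic)
	}
	h(payload) // synchronous in dev; a broker-backed Publisher would be async
	return nil
}

func main() {
	q := NewInProcessQueue()
	q.Subscribe("invoices", func(b []byte) { fmt.Printf("processing %s\n", b) })
	_ = q.Publish("invoices", []byte("invoice-123"))
}
```

Production would bind a broker-backed Publisher instead; dev binds the in-process one, and nothing upstream has to change.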
The way these problems are stated might make it seem like they're unsolvable without a lot of effort. I just want to point out that I've worked at places that do use a local, supported environment, and it works well.
Not saying it's the wrong choice for you, but it's a choice, not a natural conclusion.
In my opinion the single most important feature of any development environment is a reliable “reset” button.
The amount of time companies lose to broken development environments is incredible. A developer can easily lose half a day (or more) of productive time.
With cloud environments it’s much easier to offer a “just give me a brand new environment that works” button somewhere. That’s incredibly valuable.
For sure, but a VM has that feature too. They have to run some services directly on the laptop to handle the code syncing. So if you accept a certain amount of “need to do some dev machine setup” as a cost, installing Parallels and running a script to download an ISO is a pretty small surface area that allows for a full reset.
I don’t doubt that Stripe have a setup that works well for them, but I’d also bet they could have gone down a different path that worked well too, and I suspect that other path (local VMs) is a better fit for most other, smaller teams.
From what I remember (I left Stripe in late 2022), much of Stripe's codebase was/is a tangled Ruby "big ball of mud" monorepo due to a lack of proper modules. Basically a lot of the core modules all imported code from each other with little layering, so you couldn't deploy a lean service without pulling in almost all of the monorepo code. And due to the way imports worked it would load a ton of this code at runtime. This meant that even a simple service would have extremely high memory usage and be unsuitable for a local dev environment where you have N of these bloated services running at the same time. There was a big refactoring effort to get "strict modules" in place to cut down on this bloat which had some promising results. I'm not an expert in this area but I believe this was the gist of it.
You're limited by the resources available to you on your local laptop and when you close that laptop the dev environment stops running. Remote dev environments are more costly and complicated to maintain but they can be shared, can scale vertically (or horizontally) on demand, can persist when you exit them, and managing access to various internal services from dev environments can in some cases be simpler.
It also centralizes dev environment management to the platform team that owns them and provides them as a service which cuts down on support tickets related to broken dev environments. There are certainly some trade offs though and for most companies a local VM or docker compose file will be a better choice.
There also tend to be security advantages that help mitigate/manage dev risks. Typically hosts will have security tooling installed (AV, EDR, etc.) that may not be installed on local VMs; hosts are ephemeral, so they're quickly created and destroyed; there are network restrictions; etc.
Not even once did I want to share my dev. environment, nor did anyone else want to share theirs with me. We are talking about 25-odd years of being a developer.
Never in my life did I want to scale my dev. environment vertically or horizontally or in any other direction. Unless you work on a calculator, I don't know why would you need that.
I have no problems with my environment stopping when I close my laptop. Why is this a problem for anyone?
The overwhelming majority of programming projects out there fit on a programmer's laptop just fine. The rare exceptions are projects which require very specialized equipment not available to the developers. In any case, a simulator would usually be a preferable way of dealing with this, and the actual equipment would only be accessed for testing, not for development. Definitely not as part of the routine development process.
Never in my life did I want the development process to be centralized. All developers have different habits, tastes and preferences. The last thing I want is centralized management of all environments, which would create unwanted uniformity. I've only once been at a company that tried to institute a centrally-managed development environment in the way you describe, and I just couldn't cope with it. I quit after a few months of misery. The most upsetting aspect of these efforts is the stupidity: they solve no problems, but add a lot of pain that is felt continuously, all the time you have to do anything work-related.
I get a serious feeling that interpreted languages, monorepos, environment orchestration, snapshot ecosystem aggregators, and per-function execution environments are all pushing software development in the wrong direction.
Those things are not bad by themselves. But people tend to do bad things with them, and those bad things spread remarkably well, disrupting every place they infect.
I'm not sure why monorepos are in the list. Care to elaborate?
I've worked on projects that used a single repository for all the code written by different departments, and projects where the same department could have multiple repositories. The latter added an insane amount of busywork, an inordinate number of errors, difficulty investigating failures, and excessive use of resources to house various permutations of systems created at different times with different combinations of components. The day-to-day in such projects could be described as developers waiting for the infra people to sort out the morning problems which mysteriously broke everything all at once so that no progress could be made.
This was in stark contrast to companies working on a single repository, where days when nothing worked would happen maybe once or twice a year.
I also lived through transitions from multiple repositories to a single repository and the other way around. In operational terms, I've never seen any beneficial effects from splitting a repository. Not in the short term, nor in the long term. Complexity always went up, productivity went down, and general satisfaction with project infrastructure would also go down with such a change. Departments would start attacking and blaming the infra people for creating obstacles to their progress (while never explicitly mentioning the split repository, because usually that was a decision made by the same people complaining).
Oh, yeah, all of those issues: enforced transitive dependencies that need busywork to update, fluid APIs that make all the code around them break, a lack of semantic boundaries that makes it hard to decide whether a problem is local, inter-component interference such that you have to select component versions perfectly...
All of those are enabled by monorepos. And once people learn to do them, they seem to want to apply them everywhere.
The absurd lengths people will go to avoid learning how computers actually work because they fell for the buy now, pay later promise of 'easy' development.
Talking from the perspective of someone who worked at Google and one other similar company that shall remain nameless... as well as simply looking at places like Github where people tend to post projects they are working on: I don't know of any Github project that would even be in the size range to cause any discomfort for a laptop user.
Even when it comes to larger projects: I have multiple checkouts of GCC and the Linux kernel on my laptop, and when I run du, their existence doesn't even register in the first dozen results... Of course, proprietary projects tend to be on the bigger side due to putting a lot of not-strictly-code-related stuff in the repository, but still... it would have to be billions of LoC big to be prohibitively large for a typical laptop.
If you have 100 services in your org, you don't have to have all 100 running at the same time on your local dev machine. I only run the 5 I need for the feature I'm working on.
We have 100 Go services (with redpanda) and a few databases in docker-compose on dev laptops. It works well, and we buy the biggest-memory MacBooks available.
Your success with this strategy correlates more strongly with ‘Go’ than ‘100 services’ so it’s more anecdotal than generally-acceptable that you can run 100 services locally without issues. Of course you can.
Buying the biggest MacBook available as a baseline criteria for being able to run a stack locally with Docker Compose does not exactly inspire confidence.
At my last company we switched our dev environment from Docker Compose to Nix on those same MacBooks and CPU usage went from 300% to <10% overnight.
Have any details on how you've implemented Nix? For my personal projects I use nix without docker and the results are great. However I was always fearful that nix alone wouldn't quite scale as well as nix + docker for complicated environments.
Hi Jason! Like many others here I'm looking forward to that blog post! :-)
For now, could you elaborate on what exactly you mean by transitioning from docker-compose to Nix? Did you start using systemd to orchestrate services? Were you still using Docker containers? If so, did you build the images with Nix? Etc.
When we used docker-compose we had a CLI tool which developers put in their PATH which was able to start/stop/restart services using the regular compose commands. This didn’t accomplish much at the time other than being easy to remember and not requiring folks to know where their docker-compose files were located. It also took care of layering in other compose files for overriding variables or service definitions.
Short version of the Nix transition: the CLI tool would instead start services using nix-shell invocations behind pm2. So devs still had a way to start services from anywhere, get logs or process status with a command… but every app was running 100% natively.
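Roughly, the shape of it was something like the following sketch (simplified, with illustrative names, flags and paths; not our actual tool):

```go
// Simplified sketch: start a service natively by handing it to pm2 (for
// supervision, logs, and status), with its dependencies supplied by the
// service's in-repo shell.nix via nix-shell. All names are illustrative.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func startService(name, dir, run string) error {
	// Roughly equivalent to:
	//   pm2 start nix-shell --name <name> --cwd <dir> -- --run "<run>"
	// pm2 supervises the nix-shell process; nix-shell resolves the
	// service's pinned native dependencies and runs the command.
	cmd := exec.Command("pm2", "start", "nix-shell",
		"--name", name,
		"--cwd", dir,
		"--", "--run", run)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// e.g. a hypothetical payments service whose shell.nix lives in its dir
	if err := startService("payments", "services/payments", "bin/server"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```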
At the time I was there, containers weren’t used in production (they were doing “App” deploys still) so there was no Docker target that was necessary/useful outside of the development environment.
Besides the performance benefit, microservices owning their development environment in-repo (instead of in another repo where the compose configs were defined) was a huge win.
several nixy devtools do some process management now
something we're trying in Flox is per-project services run with process-compose.
they automatically shut down when all your activated shells exit, and it feels really cool
I've been down this path, and as soon as you work on a couple of concurrent branches you end up with 20 containers on your machine, and setting these up to run successfully ends up being its own special PITA.
What exactly are the problems created by having a larger number of containers? Since you’re mentioning branches, these presumably don’t have to all run concurrently, i.e, you’re not talking about resource limitations.
Large features can require changing protocols or altering schemas in multiple services. Different workflows can require different services, etc. Keep track of different service versions in a couple of branches (not unusual IMO) and it just becomes messy.
You could still run the proxy they have that lazy boots services - that’s a nice optimisation.
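For illustration, the core of such a lazy-booting proxy fits in a few lines; this is a hypothetical sketch, not their actual implementation:

```go
// Hypothetical lazy-booting dev proxy: the first request to a service
// triggers its startup, after which traffic is reverse-proxied to it.
// The start command and addresses are invented for illustration.
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
	"sync"
)

type lazyService struct {
	once     sync.Once
	startErr error
	start    func() error // boots the real service
	backend  *url.URL     // where it listens once booted
}

func (s *lazyService) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.once.Do(func() { s.startErr = s.start() }) // boot on first request only
	if s.startErr != nil {
		http.Error(w, "service failed to start: "+s.startErr.Error(), http.StatusBadGateway)
		return
	}
	httputil.NewSingleHostReverseProxy(s.backend).ServeHTTP(w, r)
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:9000")
	svc := &lazyService{
		start: func() error {
			// Illustrative start command; a real version would also wait
			// for the port to become ready before proxying.
			return exec.Command("my-service", "--port", "9000").Start()
		},
		backend: backend,
	}
	fmt.Println("lazy proxy listening on :8080")
	http.ListenAndServe(":8080", svc)
}
```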
I don’t think that many places are in a position where the machines would struggle. They didn’t mention that in the article as a concern - just that they struggled to keep environments consistent (brew install implies some are running on osx etc).
I think it’s safe to assume that for something with the scale and complexity of Stripe, it would be a tall order to run all the necessary services on your laptop, even stubs of them. They may not even do that on the dev boxes, I’d be a little surprised if they didn’t actually use prod services in some cases, or a canary at any rate, to avoid the hassles of having to maintain on-call for what is essentially a test environment.
I don’t know that’s safe to assume. Maybe it is an issue but it was not one of the issues they talk about in the article and not one of the design goals of the system. They have the proxy / lazy start system exactly so they can limit the services running. That suggests to me that they don’t end up needing them all the time to get things done.
Working in a configuration where your development environment isn't on your computer is always a huge downgrade. Work with a VM? -- sooner or later you'll have problems with forwarding your keyboard input to the VM. Work with containers? -- no good way to save state, no good way to guarantee all containers are in sync, etc. God forbid any sort of Web-browser-based solution. The number of times I accidentally closed the tab or did something else unintentionally because of key mappings that are impossible to modify...
However, in some situations you must endure the pain of doing this. For example, regulatory reasons. Some organizations will not allow you to access their data anywhere but on some cloud VM they give you very botched and very limited control over. While, technically, these are usually easy to side-step, you are legally required to not move the data outside of the boundaries defined for you by the IT. And so you are stuck in this miserable situation, trying to engineer some semblance of a decent utility set in a hostile environment.
Another example is when the infrastructure of your project is too vast to be meaningfully reduced to your laptop, and a lot of your work is exploratory in nature. I.e. instead of typical write-compile-upload-test you are mostly modifying stuff on the system you are working on to see how it responds. This is kind of how my day-to-day goes: someone reported they fail to install or use one of the utilities we provide in a particular AWS region with some specific network settings etc. They'd give me a tunnel to the affected cluster, and I'd have some hours to spend there investigating the problem and looking for possible immediate and long-term solutions. So, you are essentially working in a tech-support role, but you also have to write code, debug it, sometimes compile it etc.
What you describe isn't a development process in a remote environment. You are testing on some remote compute resource. Testing is a non-essential part of development, so, in a sense it "doesn't count" that you test somewhere else -- you cannot call it "developing in a remote environment".
Otherwise, you could say that, for example, reading documentation on a Web page you are doing "development in a remote environment" because, well... most likely that Web page isn't hosted on your laptop.
The essential and mandatory part of development is that a program is written. If you write the program on your laptop, you aren't doing "remote development", no matter where other tools you use for development are running.
The year of Linux on the laptop has yet to arrive for most of us. Windows and MacOS both offer better battery life, if for no other reason (and there are usually other reasons, like suspend/wake issues, graphics driver woes, etc.)
Agreed. It's so much simpler when people run Linux locally too. Most of our dev environment problems are from people who don't. When you run it locally you also get good at using it which, unsurprisingly, helps a lot when you have to figure out a problem with the deployed version. Learning MacOS/Windows is kinda pointless knowledge in the long run.