
I see this as a problem of not investing enough in the deploy process. (Disclosure: I maintain an open source deploy tool for a living).

Charity Majors gave a talk at Euruko that covered a lot of this. Deploy tooling shouldn’t be a bunch of bash scripts in a trench coat; it should be fully staffed, fully tested, and automated within an inch of its life.

If you have a deploy process with some kind of immutable architecture, tooling to monitor (failed/stuck/incomplete) rollouts, and the ability to quickly roll back to a prior known good state, then you have layers of protection and an easy course of action for when things do go sideways. It might not have made this problem impossible, but it would have made it much harder for it to happen.
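
Concretely, those layers can be as simple as keeping every release as an immutable, versioned artifact and only ever rolling back by re-pointing at one that already passed its rollout checks. A minimal Rust sketch of that idea (the types, statuses, and version strings here are my own illustration, not any particular tool):

```rust
// Sketch: immutable releases plus rollback-to-last-known-good.
// Rollback never builds anything new; it only selects an existing,
// already-verified artifact.

#[derive(Debug, PartialEq)]
enum RolloutStatus {
    InProgress,
    Healthy,
    Failed,
    Stuck,
}

#[derive(Debug)]
struct Release {
    version: String, // e.g. a git SHA or image digest; never mutated after creation
    status: RolloutStatus,
}

struct DeployHistory {
    releases: Vec<Release>, // newest last
}

impl DeployHistory {
    /// The most recent release that finished rolling out and passed checks.
    fn last_known_good(&self) -> Option<&Release> {
        self.releases
            .iter()
            .rev()
            .find(|r| r.status == RolloutStatus::Healthy)
    }

    /// If the current release is unhealthy, pick the newest earlier healthy one.
    fn rollback_target(&self) -> Option<&Release> {
        match self.releases.last() {
            Some(current) if current.status != RolloutStatus::Healthy => {
                let earlier = &self.releases[..self.releases.len() - 1];
                earlier.iter().rev().find(|r| r.status == RolloutStatus::Healthy)
            }
            _ => None, // current release is healthy (or history is empty): nothing to do
        }
    }
}

fn main() {
    let history = DeployHistory {
        releases: vec![
            Release { version: "abc123".into(), status: RolloutStatus::Healthy },
            Release { version: "def456".into(), status: RolloutStatus::Stuck },
        ],
    };
    println!("last known good: {:?}", history.last_known_good());
    println!("rollback target: {:?}", history.rollback_target());
}
```

The point of modelling it this way is that the rollback path exercises no build or migration machinery at all, which is what makes it a reliable course of action under pressure.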



I wrote a tool to automate our hotfix process, and people were somewhat surprised that you could kill the process at any step, start over, and it would almost always do the right thing. Like, how did you expect it to work? Why replace an error-prone process with an error-prone and opaque one that you can't restart?
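
For illustration, that kill-it-anywhere-and-rerun property usually comes from writing every step as check-then-act: the step first asks whether its work already happened, and only acts if it didn't. A rough Rust sketch, with a hypothetical "create a git tag if missing" step standing in for real hotfix work:

```rust
use std::process::Command;

/// A step is idempotent if `is_done` reflects reality and `run` is safe to
/// repeat whenever `is_done` is still false.
trait Step {
    fn name(&self) -> &str;
    fn is_done(&self) -> bool;
    fn run(&self) -> Result<(), String>;
}

/// Hypothetical example step: create a git tag unless it already exists.
struct TagRelease {
    tag: String,
}

impl Step for TagRelease {
    fn name(&self) -> &str {
        "tag release"
    }

    fn is_done(&self) -> bool {
        // `git rev-parse --verify <tag>` exits 0 only if the tag exists.
        Command::new("git")
            .arg("rev-parse")
            .arg("--verify")
            .arg(&self.tag)
            .status()
            .map(|s| s.success())
            .unwrap_or(false)
    }

    fn run(&self) -> Result<(), String> {
        let status = Command::new("git")
            .arg("tag")
            .arg(&self.tag)
            .status()
            .map_err(|e| e.to_string())?;
        if status.success() { Ok(()) } else { Err("git tag failed".into()) }
    }
}

/// Re-running the whole list after an interruption is safe: finished steps
/// are detected and skipped, the interrupted one is simply retried.
fn run_all(steps: &[Box<dyn Step>]) -> Result<(), String> {
    for step in steps {
        if step.is_done() {
            println!("skip: {}", step.name());
            continue;
        }
        println!("run:  {}", step.name());
        step.run()?;
    }
    Ok(())
}

fn main() {
    // Hypothetical tag name, for illustration only.
    let steps: Vec<Box<dyn Step>> =
        vec![Box::new(TagRelease { tag: "hotfix-example".into() })];
    if let Err(e) = run_all(&steps) {
        eprintln!("stopped: {e}");
    }
}
```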


> the ability to quickly roll back to a prior known good state

This is vital, but it's often not sufficient just to roll back, say, to a known good Docker image. Database migrations may have occurred that dropped columns that the old code expects to exist; feature flags may need to be changed; multiple services may need to be rolled back individually; data may have accumulated under new assumptions that breaks old assumptions when old code is applied to that new data.

One of the really subtle wins of devops as a discipline is that by allowing/forcing application teams to take responsibility for deployment, they're more exposed to thinking about how to solve these things in a maintainable way: for instance, breaking complex "the meaning of our data is changing"-type changesets/data migrations into individually reversible stages, with the stages merged onto the production branch over the course of multiple days while error rates and live data are analyzed.
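
As a concrete illustration of "individually reversible stages", here is a sketch of a hypothetical column rename done expand/backfill/contract style, where each stage ships as its own deploy with its own rollback story. The table, column names, and SQL are invented for the example:

```rust
// Sketch: one "the meaning of our data is changing" migration split into
// stages that are shipped (and can be rolled back) one at a time.

struct Stage {
    name: &'static str,
    forward: &'static str,  // applied when the stage ships
    backward: &'static str, // what rolling back just this stage looks like
}

// Hypothetical rename of `email` to `contact_email`. At every point in the
// sequence, rolling back a single stage leaves code and schema compatible,
// unlike a one-shot drop-and-rename migration.
const STAGES: &[Stage] = &[
    Stage {
        name: "expand: add new column, code writes both",
        forward: "ALTER TABLE users ADD COLUMN contact_email TEXT;",
        backward: "ALTER TABLE users DROP COLUMN contact_email;",
    },
    Stage {
        name: "backfill: copy old values, code reads new with fallback",
        forward: "UPDATE users SET contact_email = email WHERE contact_email IS NULL;",
        backward: "-- no-op: old code still reads `email`, which is untouched",
    },
    Stage {
        name: "contract: drop old column once error rates look clean for days",
        forward: "ALTER TABLE users DROP COLUMN email;",
        backward: "-- restore `email` from contact_email before old code returns",
    },
];

fn main() {
    for (i, stage) in STAGES.iter().enumerate() {
        println!("stage {}: {}", i + 1, stage.name);
        println!("  forward:  {}", stage.forward);
        println!("  backward: {}", stage.backward);
    }
}
```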


Great reply.

Counterpoint though: that automation is, in and of itself, more surface area for failure.

I can imagine a similar story where the deployment pipeline incorrectly rolled back due to some change in metric format and caused the loss anyway, for example.

The thing about these one-in-a-million chances is that there are thousands of different hypothetical causes. The more parts there are, the harder it is to predict an interaction, and we've all been blindsided by something.

I would personally hate the stress of working on such high stakes releases.


Test, test, test. If that’s not enough, pick better tools. I’m rewriting bash scripts in Rust at work because it gives me the ability to make many invalid states impossible to represent in code. Is it overkill? Maybe, but it is such a huge quality-of-life improvement.
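
As a small illustration of "invalid states impossible to represent" (my own hypothetical example, not the actual scripts being rewritten): in Rust you can make "deploy an unverified artifact" a type error rather than a convention.

```rust
// Sketch: an artifact can only be deployed once it has been verified,
// because `deploy` simply does not accept the unverified type.

struct UnverifiedArtifact {
    image: String,
}

struct VerifiedArtifact {
    image: String,
    digest: String, // only exists once checks have passed
}

fn verify(a: UnverifiedArtifact) -> Result<VerifiedArtifact, String> {
    // Stand-in for real checks (signature, tests, smoke deploy, ...).
    if a.image.is_empty() {
        return Err("empty image reference".into());
    }
    Ok(VerifiedArtifact {
        digest: format!("sha256-of({})", a.image), // placeholder digest
        image: a.image,
    })
}

// In bash this is a convention ("only call deploy.sh after verify.sh passes");
// here the compiler rejects deploying an UnverifiedArtifact outright.
fn deploy(artifact: &VerifiedArtifact) {
    println!("deploying {} ({})", artifact.image, artifact.digest);
}

fn main() -> Result<(), String> {
    let candidate = UnverifiedArtifact { image: "registry.example/app:1.2.3".into() };
    let verified = verify(candidate)?;
    deploy(&verified);
    Ok(())
}
```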

Automated things can fail. Sure. But consider that playbooks are just crappy automation run by unreliable meat computers.

Also, you can take an iterative approach to automation (a rough sketch follows the list):

- manual playbook only

- automate one step of the playbook

- if it goes well, move to another. If not, run a retro to figure out how you can improve it and try again.
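
Here is what the in-between state can look like: a playbook that mixes manual prompts and automated steps, so you convert one step at a time. The step names and the CI check are made up for the example.

```rust
use std::io::{self, Write};

// A playbook step is either instructions for the operator or code.
enum PlaybookStep {
    Manual(&'static str),
    Automated(&'static str, fn() -> Result<(), String>),
}

/// Placeholder for a real automated check (e.g. querying the CI API).
fn check_ci_green() -> Result<(), String> {
    Ok(())
}

fn run(steps: &[PlaybookStep]) -> Result<(), String> {
    for step in steps {
        match step {
            PlaybookStep::Manual(instructions) => {
                println!("MANUAL: {instructions}");
                print!("  press enter when done> ");
                io::stdout().flush().map_err(|e| e.to_string())?;
                let mut line = String::new();
                io::stdin().read_line(&mut line).map_err(|e| e.to_string())?;
            }
            PlaybookStep::Automated(name, f) => {
                println!("AUTO:   {name}");
                f()?;
            }
        }
    }
    Ok(())
}

fn main() -> Result<(), String> {
    // Only the second step has been automated so far; the others stay manual
    // until the automated one has gone well a few times.
    let playbook = [
        PlaybookStep::Manual("announce the deploy in the ops channel"),
        PlaybookStep::Automated("check that CI is green", check_ci_green),
        PlaybookStep::Manual("watch the error-rate dashboard for 10 minutes"),
    ];
    run(&playbook)
}
```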

Stress of failure at a job responsible for deployment architecture is manageable if you have a team and culture built around respecting that stress. There are some areas of code people are more careful around, but largely we make safety a product of our tools and processes and not some heroic “try harder not to screw up” attitude.

I find the impact of helping so many developers and their companies rewarding.


Part of the solution is at the level of attitude, just a more productive one than "don't fuck up".

To create a contrived example, say someone reads your note on replacing bash scripts and decides they agree with the principle.

They go into work tomorrow, their fellow engineer agrees on the technical merit, and they reimplement a bunch of bash scripts in Rust with a suite of tests bigger than anyone imagined, and life is great.

... fast forward a few months from now and suddenly a state the bash scripts were hiding flares up and everyone is lost, and type safety didn't help.

A shared culture of "conservation of value" can help in a lot of ways there. That's the attitude that creation of value is always uncertain, so you prioritize potential future value lower than currently provided value:

- instead of looking at the technical merit of the new thing, we prioritize asking: what specific shortcomings does the old way have? What can we improve downstream so that the value those systems provide is protected from the invalid states we're worried about this tool generating?

- does switching the language reduce the number of people who can work on it? Do we reduce the effective surface area of the team providing value to it? When hair is on fire, do we know the sysops guy won't balk?

- when it goes down, with a culture of "conservation of value", your plan A is always rolling back; there's no back and forth on whether we can just roll forward with this one fix. If you cause the company to lose a million-dollar trade, it's already codified that you made the right decision.

Obviously these are all extensions of a contrived example, but to me culture is heavily utilized as a way to guide better engineering.

I think these days people tend to think in terms of a culture that affirms, as a reaction to cultures that block anyone from accomplishing anything. To me, a good engineering culture is one that clashes with what people want to do just enough to be mildly annoying.


> bash scripts in a trench coat

That's an amazing turn of phrase.



