
Yes, the efficiency gains of remote automated administration and deployment make up for most of the outages they cause.

A better approach is phased deployment, so you can see whether an update causes issues in your environment before pushing it to all systems. As this incident shows, you can't trust a software vendor to have done that themselves.
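
Something like this on the admin side, to make it concrete. The ring names, hosts, and the apply_update/healthy hooks are all hypothetical placeholders, not anything the vendor actually gives you:

  # Hypothetical phased rollout rings; names and hosts are made up for illustration.
  RINGS = [
      ("canary", ["host-canary-01"]),
      ("early", ["host-a", "host-b"]),
      ("broad", ["host-c", "host-d", "host-e"]),
  ]

  def phased_rollout(update, apply_update, healthy):
      """Push `update` one ring at a time, halting if any ring looks unhealthy."""
      for ring_name, hosts in RINGS:
          for host in hosts:
              apply_update(host, update)
          if not all(healthy(host) for host in hosts):
              raise RuntimeError(f"halting rollout at ring {ring_name!r}")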


This wasn't a binary patch, though; it was a configuration change that was fed to every device. That raises a LOT of questions about how this could have happened and why it wasn't caught sooner.


Writing from the SRE side of the discipline, I'd note that it's commonly a configuration change (or a "flag flip") that ultimately winds up causing an outage. All too seldom are configuration data considered part of the same deployable surface area (and, as a corollary, part of the same blast radius) as program text.

These days I've mostly resigned myself to deploying the configuration change and watching for anomalies in my monitoring for hours or days afterward, but I acknowledge that I also have both a process supervisor that will happily let me crash-loop my programs and deployment infrastructure that will nonetheless let me roll things back. Without either of those, I'm honestly at a loss as to how I'd safely operate this product.
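
To sketch what I mean, assuming hypothetical deploy/rollback/anomaly_detected hooks into whatever deployment and monitoring stack is actually in place:

  import time

  def deploy_and_watch(version, deploy, rollback, anomaly_detected,
                       soak_seconds=4 * 3600, poll_seconds=60):
      """Deploy a config version, then watch for anomalies and roll back if any appear."""
      previous = deploy(version)  # assume deploy() returns the version we can revert to
      deadline = time.monotonic() + soak_seconds
      while time.monotonic() < deadline:
          if anomaly_detected():
              rollback(previous)
              return False  # change reverted
          time.sleep(poll_seconds)
      return True  # soaked cleanly

The supervisor that tolerates the crash loop and the infrastructure behind rollback() are doing the real work here; the watch loop itself is trivial.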


  # Update A
  
  ## config.ext
  
  foo = false
  
  ## src.py
  
  from config import config
  
  if config('foo'):
      work(2 / 0)
  else:
      work(10 / 5)
"Yep, we rigorously tested it."

  # Update B
  
  ## config.ext
  
  foo = true
"It's just a config change, let's go live."


Yeah, that's about right.

The most insidious part of this is when there are entire swaths of infrastructure in place that circumvent the usual code review process in order to execute those configuration changes. Boolean flags like your `config('foo')` here are most common, but I've also seen nested dictionaries shoved through this way.


When I was at FB, there were so many SEVs caused by config changes that the repo itself would print a huge warning when you updated configs and show you how to do a canary to avoid the problem.
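
Roughly the shape such a canary takes, with apply_to and error_rate standing in for whatever internal config-push and metrics hooks exist (made-up names, illustration only):

  def canary_config_change(apply_to, canary_hosts, control_hosts, error_rate,
                           tolerance=1.05):
      """Push the change to a small canary group, then compare its error rate
      against an untouched control group before rolling out any wider."""
      for host in canary_hosts:
          apply_to(host)
      canary = sum(error_rate(h) for h in canary_hosts) / len(canary_hosts)
      control = sum(error_rate(h) for h in control_hosts) / len(control_hosts)
      return canary <= control * tolerance  # go / no-go decision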


As in, there was no way to have configured the sensors to prevent this? They were just going to get this if they were connected to the internet? If I were an admin, that would make me very angry.



