
Yes, the efficiency gains of remote automated administration and deployment make up for most of the outages they cause.

A better approach is phased deployment, so you can see whether an update causes issues in your environment before pushing it to all systems. As this incident shows, you can't trust a software vendor to have done that themselves.
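
Something like this on the admin side, to make it concrete. The ring names, hosts, and the apply_update/healthy hooks are all hypothetical placeholders, not anything the vendor actually gives you:

  # Hypothetical phased rollout rings; names and hosts are made up for illustration.
  RINGS = [
      ("canary", ["host-canary-01"]),
      ("early", ["host-a", "host-b"]),
      ("broad", ["host-c", "host-d", "host-e"]),
  ]

  def phased_rollout(update, apply_update, healthy):
      """Push `update` one ring at a time, halting if any ring looks unhealthy."""
      for ring_name, hosts in RINGS:
          for host in hosts:
              apply_update(host, update)
          if not all(healthy(host) for host in hosts):
              raise RuntimeError(f"halting rollout at ring {ring_name!r}")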


This wasn't a binary patch, though; it was a configuration change that was fed to every device. That raises a LOT of questions about how this could have happened and why it wasn't caught sooner.


Writing from the SRE side of the discipline, I'd note that it's commonly a configuration change (or a "flag flip") that ultimately winds up causing an outage. All too seldom are configuration data considered part of the same deployable surface area (and, as a corollary, part of the same blast radius) as program text.

These days I've mostly resigned myself to deploying the configuration change and watching for anomalies in my monitoring for hours or days afterward, but I acknowledge that I also have both a process supervisor that will happily let me crash-loop my programs and deployment infrastructure that will nonetheless let me roll things back. Without either of those, I'm honestly at a loss as to how I'd safely operate this product.
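
To sketch what I mean, assuming hypothetical deploy/rollback/anomaly_detected hooks into whatever deployment and monitoring stack is actually in place:

  import time

  def deploy_and_watch(version, deploy, rollback, anomaly_detected,
                       soak_seconds=4 * 3600, poll_seconds=60):
      """Deploy a config version, then watch for anomalies and roll back if any appear."""
      previous = deploy(version)  # assume deploy() returns the version we can revert to
      deadline = time.monotonic() + soak_seconds
      while time.monotonic() < deadline:
          if anomaly_detected():
              rollback(previous)
              return False  # change reverted
          time.sleep(poll_seconds)
      return True  # soaked cleanly

The supervisor that tolerates the crash loop and the infrastructure behind rollback() are doing the real work here; the watch loop itself is trivial.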


  # Update A
  
  ## config.ext
  
  foo = false
  
  ## src.py
  
  from config import config
  
  if config('foo'):
      work(2 / 0)
  else:
      work(10 / 5)
"Yep, we rigorously tested it."

  # Update B
  
  ## config.ext
  
  foo = true
"It's just a config change, let's go live."


Yeah, that's about right.

The most insidious part of this is when there are entire swaths of infrastructure in place that circumvent the usual code review process in order to execute those configuration changes. Boolean flags like your `config('foo')` here are most common, but I've also seen nested dictionaries shoved through this way.


When I was at FB, there were so many SEVs caused by config changes that the repo itself would print a huge warning when you updated configs and show you how to do a canary to avoid the problem.
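
Roughly the shape such a canary takes, with apply_to and error_rate standing in for whatever internal config-push and metrics hooks exist (made-up names, illustration only):

  def canary_config_change(apply_to, canary_hosts, control_hosts, error_rate,
                           tolerance=1.05):
      """Push the change to a small canary group, then compare its error rate
      against an untouched control group before rolling out any wider."""
      for host in canary_hosts:
          apply_to(host)
      canary = sum(error_rate(h) for h in canary_hosts) / len(canary_hosts)
      control = sum(error_rate(h) for h in control_hosts) / len(control_hosts)
      return canary <= control * tolerance  # go / no-go decision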


As in, there was no way to have configured the sensors to prevent this? They were just going to get this if they were connected to the internet? If I were an admin, that would make me very angry.



