Unfortunately until crappy and slow performing software starts losing business t...

DeathArrow · on March 26, 2023

Entirely my experience. I worked for a very large company, and since our microservices were running in Kubernetes, it was considered totally OK for the services to crash. We just logged everything in the Elastic Stack, used Datadog for monitoring, and had some SREs restart services when things got messy.

I lost days tracking down a memory bug in one of the services, and no one cared.

The original architect and programmers were gone. The engineering managers were pushing Clean, SOLID, and design patterns, hoping that would help. And the domain was hard, because we had to deal with real money while respecting dozens of different laws and locales.

We tried our best to do the impossible; we were understaffed by a factor of 5.

And the services were crashing like crazy, but it looked like there was no downtime because we provisioned another pod in Kubernetes.

Provisioning was like: how many pods do we need? 3? Let's make it nine. How much memory do we need? Half a gig? Let's make it 4 gig to be sure.
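To put a number on that habit, here is a quick back-of-the-envelope calculation using only the figures quoted above (3 pods at half a gig versus 9 pods at 4 gig). The variable names are made up for illustration, and nothing here is a measurement from the actual system.

```python
# Figures taken from the comment above; names are illustrative only.
needed_pods, needed_gib_per_pod = 3, 0.5
provisioned_pods, provisioned_gib_per_pod = 9, 4.0

# Total memory the workload was estimated to need vs. what was requested.
needed_gib = needed_pods * needed_gib_per_pod
provisioned_gib = provisioned_pods * provisioned_gib_per_pod

# Ratio of requested memory to estimated need.
overprovision_factor = provisioned_gib / needed_gib

print(f"needed: {needed_gib} GiB, provisioned: {provisioned_gib} GiB, "
      f"factor: {overprovision_factor:.0f}x")
```

Under those assumptions, the cluster reserves roughly 24 times the memory the services were thought to need.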

So it was a mess and it worked and that mess continues to work somehow.

Not sure if mess-driven engineering is a sound business idea, though.

jacquesm · on March 26, 2023

Erlang has taken 'let it crash' to entirely new heights, but only from a top-level viewpoint: components are allowed to crash, but the service isn't. And you're still supposed to figure out why your service crashed, though I'm pretty sure not everybody does that.

So as long as your supervision mechanism is bulletproof (supervision trees), you can get away with this for a very long time.
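For readers who haven't seen the pattern, here is a minimal sketch of the idea in Python rather than Erlang. It is not OTP's supervisor, only a single-worker loop loosely modeled on the restart intensity and period settings that OTP supervisors expose. The names `supervise`, `max_restarts`, `window_seconds`, and `RestartLimitExceeded` are hypothetical.

```python
import time
from typing import Callable


class RestartLimitExceeded(RuntimeError):
    """The worker crashed more often than the supervisor is willing to tolerate."""


def supervise(worker: Callable[[], None],
              max_restarts: int = 3,
              window_seconds: float = 5.0) -> int:
    """Run `worker`, restarting it after each crash.

    Returns the number of restarts once the worker finishes normally.
    Gives up (escalates) if more than `max_restarts` crashes happen
    within `window_seconds`, so a hopeless worker cannot crash-loop forever.
    """
    crash_times = []
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception as exc:
            now = time.monotonic()
            # Only crashes inside the sliding window count against the limit.
            crash_times = [t for t in crash_times if now - t < window_seconds]
            crash_times.append(now)
            if len(crash_times) > max_restarts:
                raise RestartLimitExceeded(
                    f"{len(crash_times)} crashes within {window_seconds}s"
                ) from exc
            # Record the cause: restarting is not a substitute for diagnosis.
            print(f"worker crashed with {exc!r}; restarting")
            restarts += 1


# Hypothetical worker that fails twice and then succeeds.
attempts = {"count": 0}

def flaky_worker() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("simulated crash")

restart_count = supervise(flaky_worker)
```

The point of the restart limit is the distinction made above: the component may crash, but when it keeps crashing the failure is escalated instead of being hidden, which is roughly what a parent in a supervision tree would then handle.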