Ask HN: Best practices for self-healing apps?
54 points by KingOfCoders on Feb 8, 2023 | 49 comments
What are your best practices for self-healing apps for low maintenance of servers for a solo founder? I use go but general patterns are welcome.


>I use go but general patterns are welcome.

The basic high-level pattern for self-healing is a process that "watches" the status of other processes and restarts failed ones:

- a "coordinator" process that acts as a "monitor" of other work processes and does regular health check pings. If work process isn't responding, the coordinator/monitor kills/restarts it.

- the work processes are engineered to write "checkpoints" of status progress (a status file on disk, an entry in a db, etc). They can have a thread that responds to network pings from the coordinator

A lot of systems use the pattern above. E.g. Oracle RDBMS has a "PMON" process (literally "Process MONitor") to look for hung SQL query processes. Erlang/OTP has a "supervisor" that kills/restarts processes. Kubernetes orchestration has the concept of "liveness probes" on the container work processes and restarts broken ones.

That coordinator-process-and-worker-process pattern can be nested into multi-level hierarchies. Inside a single server is a coordinator-process-and-worker-process pattern -- but there's another data-center-level coordinator-process-and-worker-process pattern that watches all the servers.
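A minimal Go sketch of that single-server coordinator/worker loop, just to make the shape concrete (the ./worker binary and the /healthz port are made-up placeholders; a real monitor would also back off on repeated failures and log restarts somewhere durable):

    package main

    import (
        "log"
        "net/http"
        "os/exec"
        "time"
    )

    // startWorker launches the worker process (path is a placeholder).
    func startWorker() *exec.Cmd {
        cmd := exec.Command("./worker")
        if err := cmd.Start(); err != nil {
            log.Fatalf("cannot start worker: %v", err)
        }
        return cmd
    }

    // healthy pings the worker's (hypothetical) health endpoint.
    func healthy(url string) bool {
        client := http.Client{Timeout: 2 * time.Second}
        resp, err := client.Get(url)
        if err != nil {
            return false
        }
        defer resp.Body.Close()
        return resp.StatusCode == http.StatusOK
    }

    func main() {
        worker := startWorker()
        for range time.Tick(10 * time.Second) {
            if healthy("http://localhost:8081/healthz") {
                continue
            }
            log.Println("worker unresponsive, restarting")
            _ = worker.Process.Kill()
            _, _ = worker.Process.Wait()
            worker = startWorker()
        }
    }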

Also, "Self-Healing" is a subtopic of "Fault Tolerance" so you'd get some more hits by searching for "fault tolerance". I put some links of writings I found helpful on that: https://news.ycombinator.com/item?id=33954078


Thanks!


The three most important things are: simplicity, simplicity and simplicity. Simpler things have fewer ways they can break, and when they do break, they're easier to reason about.

Be constantly asking yourself: is this the simplest way to do this? Is there a simpler way? E.g. do you need a database? Do you need a server at all? How far can you get without these things? On the other hand, maybe you're looking at newer serverless architectures? Is it really simpler to do that than to use a VPS?

Be constantly explaining to yourself why the way you're doing it is the simplest and most robust way.


I'm going to push back (slightly).

Simplicity is great, I 100% believe avoidable complexity is the enemy of good software. But sometimes the complex solutions are complex for a reason, and "simple" has a way of working until it doesn't. My rule is always: build things the way you know how. If that's Kubernetes and a MySQL database, that's perfect. If it's AWS Lambda + DynamoDB, as long as you can support the cost model, go ahead.

At some point as a solo founder, you're going to need to bring in other people to operate the software you (and eventually your team) are building. What tends to happen everywhere is that onboarding those people into your genius "simple" solution turns it into a shitty version of the existing, more complex solutions, which you now have to support anyway.


This is great advice.

It's fascinating how far you can get without adding things most people consider "essential".

If you just work backwards from goals and do scalability based on real numbers, you'll often get away with less than you think.


Complexity is the enemy of availability.


One thing I've noticed that is sometimes forgotten, especially at earlier stages, is monitoring. You want to know how much self-healing is actually happening. Let's say you have your self-healing system in place, say some k8s pods combined into a service with a little redundancy and very little state. Pods happily crash, another one takes over while a new one spins up. All is wonderful and you don't worry about your availability anymore because everything just always works. One day you decide to look into what's happening in your containers and are shocked because one pod crashes every 0.3 seconds. It just spins up, answers one request, then dies and a new one spins up... continuously. From the outside everything looks kind of OK, but in reality you are wasting massive resources and have a nasty bug that might even be losing you data, breaking consistency, creating load, etc. Some sort of monitoring is a good idea is what I'm saying.


Monitoring is super important++

But the nice thing about using an already resilient system like K8S is that pod crashes won't cause your customers to not be able to work and you can fix the issue in the background instead of having to throw up a status page and fix the problem immediately.

It's better to have a problem that your customers don't notice because it buys you time to figure out the issue.


Nobody ever cares about monitoring, until they need it. Then the tears flow deep and salty.


That is one of the reasons why Brendan Gregg's USE [1] methodology is so great. USE stands for utilisation, saturation, errors. For every component, resource, or subsystem you should have at least one metric for each of these. Utilisation tells you how much it is used. Saturation tells you how near it is to its capacity limit, or how much it slows down because of load. Errors tell you, for example, when k8s pods restart all the time.

[1]: https://www.brendangregg.com/usemethod.html
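As a rough illustration (not part of Gregg's method itself), here's how the three USE metrics for a hypothetical worker pool could be exposed with Go's stdlib expvar package, which serves them on /debug/vars:

    package main

    import (
        "expvar"
        "net/http"
    )

    var (
        busyWorkers = expvar.NewInt("pool_busy_workers") // Utilisation: workers currently in use
        queuedJobs  = expvar.NewInt("pool_queued_jobs")  // Saturation: work waiting for a free worker
        failedJobs  = expvar.NewInt("pool_failed_jobs")  // Errors: jobs that ended in an error
    )

    // handleJob shows where the counters would be updated.
    func handleJob(job func() error) {
        queuedJobs.Add(-1) // job left the queue
        busyWorkers.Add(1)
        defer busyWorkers.Add(-1)
        if err := job(); err != nil {
            failedJobs.Add(1)
        }
    }

    func main() {
        // expvar registers itself on http.DefaultServeMux under /debug/vars.
        http.ListenAndServe(":8080", nil)
    }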


What do you recommend I use to monitor my software then? Is there a good service I should use? Inside and outside the datacenter/AWS? What metrics should I monitor on Postgresql? Hacking attempts? There's a lot to consider.


Something not mentioned yet (at time of writing) is the importance of self-healing data.

You can always just hard-restart an app every minute or so to be resilient to nearly any runtime failure condition (not that I've _ever_ done that before...), but if the data gets into an invalid state you're stuck.

The one time I needed extreme resiliency and recoverability I used an append-only DB with a materialized view which updated on change or startup, and every write in a transaction. I also tailed the DB updates to a file on disk which replicated regularly off-site. It could automatically recover from nearly anything, and was remarkably easy to set up. The hard part was the materialized view, but I "needed" that anyway as I wanted to keep a full audit log as the primary DB.

What constitutes resilient data is going to be unique to your use case of course, but consider resiliency from the DB up.
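As a very rough Go sketch of the general append-only-log idea (not the exact setup described above): append every change as a JSON line, fsync it, and rebuild the view by replaying the log on startup. File layout and types are illustrative only:

    package eventlog

    import (
        "bufio"
        "encoding/json"
        "os"
    )

    type Event struct {
        Key   string `json:"key"`
        Value string `json:"value"`
    }

    // Append durably records a change before it is applied anywhere else.
    func Append(path string, ev Event) error {
        f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return err
        }
        defer f.Close()
        b, err := json.Marshal(ev)
        if err != nil {
            return err
        }
        if _, err := f.Write(append(b, '\n')); err != nil {
            return err
        }
        return f.Sync() // flush so a crash right after still has the event
    }

    // Replay rebuilds the in-memory view from the log; safe on every startup.
    func Replay(path string) (map[string]string, error) {
        view := make(map[string]string)
        f, err := os.Open(path)
        if os.IsNotExist(err) {
            return view, nil // nothing written yet
        }
        if err != nil {
            return nil, err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            var ev Event
            if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
                continue // skip a torn final line instead of failing startup
            }
            view[ev.Key] = ev.Value
        }
        return view, sc.Err()
    }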

Also, I suggest investing heavily in grokkable and relevant runtime observability (don't just emit inline comments as log lines; put some thought into relevant data and alerts). Often you'll see a failure coming days ahead of it causing a problem, and you won't need the app to self-heal.


Yes, this is exactly it. It's also an important part of the "crash only" concept - the moment that you detect a bug or inconsistency, it's important to crash so that you discard potentially corrupt state in memory and re-load from the last known good point. That makes it harder to write out wrong state, which is much harder to recover from.

The absolute nightmare scenario is state getting corrupted causing a crash-restart loop.

(a couple of days ago we had the "everything is about state" discussion on here: this is why separating "persistent state" into a separate box and keeping a very careful eye on it is important)


I just give GPT3 access to the bash prompt, so any problems that occur on my servers will be solved eventually.


I nominate this to become one of those one-sentence horror stories


Nomination approved ;)


That's really a lot of ground to cover, but two simple recommendations would be:

- idempotency / consider all state as a cost (see the sketch below)

- start off with some chaos monkey testing - e.g. establish automatic regular restarts/re-deploys from the start

Neither is a silver bullet, but they ensure that you won't have too much anxiety and can act robustly when something goes wrong. They also have a host of positive downstream effects, such as facilitating setup of new machines and/or scaling.
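A minimal Go sketch of the idempotency idea, using a caller-supplied idempotency key; the in-memory map here is just a stand-in for a real store (e.g. a DB table with a unique constraint):

    package idempotent

    import "sync"

    // Processor runs each keyed operation at most once.
    type Processor struct {
        mu   sync.Mutex
        done map[string]string // idempotency key -> previous result
    }

    func NewProcessor() *Processor {
        return &Processor{done: make(map[string]string)}
    }

    // Do returns the cached result for repeated keys instead of redoing the work.
    func (p *Processor) Do(key string, op func() (string, error)) (string, error) {
        p.mu.Lock()
        defer p.mu.Unlock() // serialises operations; fine for a sketch
        if res, ok := p.done[key]; ok {
            return res, nil
        }
        res, err := op()
        if err != nil {
            return "", err // nothing recorded, so the caller may safely retry
        }
        p.done[key] = res
        return res, nil
    }

With that in place, a restart or retry that replays the same request simply gets the old result back instead of doing the work twice.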


I feel like idempotency is the solution to many problems.


Consider OTP with its "let it crash" design. There's a nice Go port of that: https://github.com/ergo-services


Elixir also has better ergonomics than Erlang in that regard and uses the same OTP/BEAM mechanics.


Elixir has different goals than Erlang - it aims to be general purpose, while Erlang is clearly a DSL, though the D there is kind of huge. Erlang may be a better source of knowledge of OTP because it lacks distractions and doesn't hide as much as Elixir does behind convenience modules/macros/functions - as long as you don't want to run on the BEAM but want to learn something to apply in other environments. If you think you'll be using the platform, go for Elixir :)


Here are some examples to learn from. Most are old enough that any patents have expired:

Why Do Computers Stop and What Can Be Done About It? (1985) https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Their NonStop Architecture https://www.hpl.hp.com/techreports/tandem/TR-86.2.pdf

QNX's systems are ultra-reliable, too https://cseweb.ucsd.edu/~voelker/cse221/papers/qnx-paper92.p...

OpenVMS clusters' uptime was years to decades & mixed CPU ISA's https://en.wikipedia.org/wiki/VMScluster

Systems that run forever and self-heal (Armstrong) https://www.youtube.com/watch?v=cNICGEwmXLU

Microreboots https://dslab.epfl.ch/pubs/microreboot.pdf

Minix 3: Microkernel-based, self-healing, POSIX OS https://www.youtube.com/watch?v=bx3KuE7UjGA


Although these notes from Heroku are relatively old now, following all of the Twelve-Factor rules will get your code and associated pipelines most of the way to a place where self-healing is free by design.

https://12factor.net/


Is there a newer alternative? I still use and adhere to 12factor as much as I can, since there hasn't been anything better as far as I know. Note that I'm more of a front-end developer though.


Why do you need a newer alternative? It's literally a pattern, age doesn't matter.



Are you joking?


Use managed services as much as possible: AWS Lambda, DynamoDB, S3, etc.; they practically never go down. And if they do, most of the world is down too, so you're not alone.

12 factor apps is a good starting point in general: https://12factor.net/

Crash early, and make sure your application can recover from crashes; at the very least, a crash caused by one client shouldn't affect any others.


> they practically never go down. And if they do, most of the world is down too, so you're not alone.

I agree with the second part - users are more likely to forgive an outage when Netflix is down as well. However, AWS is not as stable as I'd like. It seems the days of targeting 5 9's (99.999%) uptime are long gone. If you have high availability requirements, the cloud providers are a convenient way to get access to multiple datacenters. But, you absolutely must have cross-region failover. Amazon's products and datacenters only have "pretty good" uptime, and I've found their functional uptime is lower than advertised. When a network problem causes Lambda to become unusably impaired, Amazon still considers Lambda to be available.


"AWS Lambda, DynamoDB"

Sadly no VC money to burn :-(


They have free tiers and are pay by use.

Take 30 minutes and get out a spreadsheet. What's your best and worst case expected needs in the next 12 months? Compare those needs to the expected costs on various cloud offerings.

Your time is the most expensive thing you have, because you could be earning $10k+/month making godawful widgets for BoringCorp. If you spend two months extra because you wanted to save $1000 because "no VC money", then you wasted a lot of money.


I've been running a small personal project (chat bot with multiple supported chat services) purely using the free tiers of both for a good 3-4 years now.

The good and bad thing about Lambda and DynamoDB is that they scale as far as your wallet does, but they also have good tools to manage the costs.


Thanks!


I've been working on an application with such properties. Here are some patterns I have employed:

Break it up into many completely independent components. In my case, this is in the form of scripts which read from and write to queues. Each script reads from one queue and writes to another.

A lot of redundancy. From one queue to the next, there are at least two pathways which the data can take.

Lots of sanity checks. Wherever you are taking input, check that it has the expected format, shape, and content before processing it.

More redundancy. Write two versions of the same script in two different languages and make the system run them side by side and compare the outputs. If the outputs differ, there is a problem, and you should switch to the script which produces the correct expected output (and alert the operator.)

Avoid doing dangerous things. For example, querying the database using freeform strings is dangerous. So only query the database using sanity-checked identifiers which contain a predefined list of allowed characters which do not include quotes or anything else weird. Running scripts as a direct result of user request is dangerous, so serve only static HTML as much as possible. And so on.
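For the identifier check, a tiny Go sketch (the allowed alphabet and length cap are just illustrative choices):

    package sanity

    import (
        "fmt"
        "regexp"
    )

    // Only letters, digits, underscore and dash, max 64 chars: no quotes,
    // no whitespace, nothing "weird".
    var identPattern = regexp.MustCompile(`^[A-Za-z0-9_-]{1,64}$`)

    // CheckIdentifier fails loudly on anything outside the whitelist.
    func CheckIdentifier(s string) error {
        if !identPattern.MatchString(s) {
            return fmt.Errorf("invalid identifier: %q", s)
        }
        return nil
    }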


1. Use managed cloud based APIs. S3 for object store, k/v distributed store for more high performance/smaller records, authenticating APIs (from any cloud vendor, like Auth0), etc.

2. Use a managed cloud orchestration system. Autoscaling groups, AWS ECS Fargate, App runner frameworks, Serverless, Managed K8s.

3. Run operations from chatops, gitops, or a web UI. By making operations work over a remotely accessible communications tool, you can make changes from anywhere, anytime, and never have to deal with local environment setup or resource constraints.

4. Do development in the cloud (Cloud Shell, Codespaces, DevSpace, etc). Same rationale as #3.

5. Make everything as immutable as possible. If something fails, throw it away and replace it with a known good artifact.

If you can't run it from an API and pay someone to manage it, it's a waste of your time and money and not reliable. Don't be your own car mechanic for your business; lease a truck. If it ever takes off, you can easily give others access. And you can use this pattern for anything.


This is very much the opposite of the advice I would give: Keep it simple and rely on proven methods, keeping in mind what might go wrong and how to fix it. Just because you don't have to care for kernel updates or broken disks does not mean nothing can go wrong. I don't think there are many solo founders that know all the possible failure modes of these complex services you mention...


Those things I mention are simpler than what you suggest. If they fail, the solution is "wait", because the cloud vendor maintains it. But they generally won't fail, because they're designed to be HA. Worst case you open a support ticket.


This is bad advice for a couple of reasons:

1. It is expensive.

2. It moves complexity away from you and onto your providers, so it doesn't really solve the problem, only hides it from you (at a price).

3. The overall cost (energy, person-hours, material) of even the smallest project grows a lot with this approach. Even if you have the money to pay for it, you are wasting a bunch of resources around the world just for an illusion of peace of mind.

4. Most importantly, it will still fail (as all systems eventually do) and then you have no idea where it failed or how to fix it. All you can do is file some support tickets at the big-corp support center and watch for updates on their twitter feed.

A lot of people complain here on HN about the sad, over-complicated, state of software-engineering, the need to know more and more concepts and to manage more and more tech "stacks" just to accomplish boring, formerly simple, tasks. One reason for this sad state is the philosophy expressed in the parent comment.


First step is redundancy: having backups, failover, overprovisioning. Essentially prepared "plan Bs".

Next step is introspection: aggregate monitoring and enough detail to figure out if there are issues.

Next step is being notified when things break. I.e. anomaly detection and alerting.

Then, debuggability. Enough detail to solve issues. Disaster recovery testing is part of ensuring you actually have this, and not just believe you do.

Aside from that, there's CI/CD, automated scaling, automated isolation of bad actors. There are so many things one could do, but this also depends on how large the team is. I'll argue that this type of automation isn't that important if it's just one person.

The SRE book(s) [1] contain many of these high-level ideas. Don't try to do them all at once. :) (Bias: Niall, one of the editors, was my manager when I joined Google SRE.)

[1] https://sre.google/books/


Crash-only software, defined as software that is built from the ground up to expect crashing, is very useful for stability.

If you write software from the beginning as if the only way to exit it is with SIGKILL, and if you make the application crash itself on any sign of fatal error, you get a reliable system.


Why is crashing useful for stability? You say that it's good that a program crashes on a fatal error, but that's sort of backwards: any error that crashes the program is fatal by definition (that's what fatal means). I'm not sure what your point is.

I know Erlang/OTP as a successful instance of fault-tolerant programming. Its actor model means that on faults your process subtrees die, but the system as a whole heals itself. So I guess you might be talking about actor-level programs, not system-as-a-whole-level programs.


> Why is crashing useful for stability?

It's not crashing that's useful - it's crashing being expected that's useful. If your program expects to be crashed, it will be written in a way that allows it to continue operating no matter what happens. And if there's any error that could cause it to malfunction, it can just crash, re-start and continue.

In contrast to many programs that crash, then refuse to start because of corrupted data or what not.

> any error that crashes the program is fatal by definition

But not every error that is fatal crashes the program :)


Understand and document your termination and startup signals. Any self-healing setup must:

1. Identify the state of your app (starting, running, ready, etc).

2. Stop traffic once the server goes down.

Very often you'll have state confusion (SIGTERM triggered, but the server kept accepting requests). Make sure your signal handling works well across both your ingress and the server.
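For a Go HTTP server, a minimal sketch of that signal handling looks roughly like this (the address and drain timeout are arbitrary):

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080"} // your mux/handlers go here

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("server error: %v", err)
            }
        }()

        // Wait for the termination signal from systemd/k8s/your ingress layer.
        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
        <-stop

        // Stop accepting new connections and drain in-flight requests.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
    }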

A nice and easy hack is to have a /status endpoint in all of your apps (see the sketch after this list) that returns:

1. the current commit deployed

2. the availability of dependencies (db reachable? db connected? any missing environment variables?).

3. Which instance/pod/server is serving this request. (Just returning hostname typically works)
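A rough Go sketch of such an endpoint; how the commit gets into the binary (env var, -ldflags, whatever) is up to you, and GIT_COMMIT here is just a placeholder:

    package app

    import (
        "database/sql"
        "encoding/json"
        "net/http"
        "os"
    )

    // statusHandler reports the deployed commit, dependency health and the
    // host serving the request.
    func statusHandler(db *sql.DB) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            hostname, _ := os.Hostname()
            status := map[string]any{
                "commit":   os.Getenv("GIT_COMMIT"), // placeholder: injected at deploy time
                "hostname": hostname,
                "db_ok":    db != nil && db.PingContext(r.Context()) == nil,
            }
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(status)
        }
    }

Wire it up with http.HandleFunc("/status", statusHandler(db)) and you can curl any instance to see what it thinks the world looks like.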


A few patterns I used in Syncplicity (a major Dropbox/OneDrive competitor):

At a high level, every operation had a try-catch that caught all exceptions. (This is similar to panic/recover in Go, but the semantics are very different.) A lot of operations fail on edge cases that you can never fully anticipate. (We had to deal with oddball network errors that wouldn't reproduce in our development/test environment, oddball errors caused by 3rd-party applications that we didn't have, etc., etc.) It's important to have a good failure model...

... Which comes to exponential retry: Basically, for operations that could fail on corner cases, but essentially "had" to work, we'd retry with an exponentially-increasing delay. First retry after 1 second, then 2 seconds, then 4 seconds... Eventually we'd cap the delay at 15 minutes. This is important because sometimes a bug or other unpredictable situation will prevent an operation from completing, but we don't want to spam a server or gobble up CPU.
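In Go, the core of that retry loop is small (sketch only; real code would also add jitter and an alerting path for operations that never succeed):

    package retry

    import "time"

    const maxDelay = 15 * time.Minute

    // Forever keeps retrying op, doubling the delay after each failure and
    // capping it so we never spam a server or gobble up CPU.
    func Forever(op func() error) {
        delay := time.Second
        for {
            if err := op(); err == nil {
                return
            }
            time.Sleep(delay)
            if delay *= 2; delay > maxDelay {
                delay = maxDelay
            }
        }
    }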

Try to make almost all operations transactional. (Either they succeed or fail, but they never happen in an incomplete / corrupted manner.) You can get that "for free" when you use a SQL database: We used SQLite for local state and almost exclusively used SQL in the server. For files that stored human-readable (QA, debugging) XML/JSON, we wrote them in a transactional manner: Rename the existing file, write the new version, delete the old version. When reading, if the old version existed, discard the new version and read the old one. We also implemented transactional memory techniques so that code wouldn't see failed operations.
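A Go sketch of that rename/write/delete file scheme as described (write-to-temp-then-rename is the other common variant); error handling is minimal for brevity:

    package txfile

    import "os"

    func WriteFileTx(path string, data []byte) error {
        old := path + ".old"
        // Step 1: set the current version aside (first write: nothing to move).
        if err := os.Rename(path, old); err != nil && !os.IsNotExist(err) {
            return err
        }
        // Step 2: write the new version.
        if err := os.WriteFile(path, data, 0o644); err != nil {
            return err
        }
        // Step 3: only now discard the old version.
        if err := os.Remove(old); err != nil && !os.IsNotExist(err) {
            return err
        }
        return nil
    }

    func ReadFileTx(path string) ([]byte, error) {
        old := path + ".old"
        if _, err := os.Stat(old); err == nil {
            // A previous write was interrupted: trust the old version.
            return os.ReadFile(old)
        }
        return os.ReadFile(path)
    }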

Finally: Concurrency (threading) bugs are very hard to find, because they tend to pop up randomly and aren't easily reproducible. The best way to do concurrency is to not do it at all. If you can make your whole application single-threaded and queue operations, you won't have concurrency bugs. If you have to do concurrency, make sure you understand techniques like immutability, read/write locking, lock ordering, and reducing the total number of things you need to lock on. Techniques like compare-exchange allow you to write multithreaded code that doesn't lock/deadlock. Immutability allows you to have non-blocking readers, if readers can tolerate stale state.
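A small Go sketch of the "queue operations instead of locking" approach: one goroutine owns the state, everything else submits closures through a channel, so there is nothing to lock:

    package serial

    type op func()

    // Loop owns the state; operations run one at a time, in order.
    type Loop struct {
        ops chan op
    }

    func NewLoop() *Loop {
        l := &Loop{ops: make(chan op, 64)}
        go l.run()
        return l
    }

    func (l *Loop) run() {
        for f := range l.ops {
            f() // only this goroutine ever touches the state captured by f
        }
    }

    // Do queues an operation; callers never share state directly.
    func (l *Loop) Do(f func()) {
        l.ops <- f
    }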


I think it’s important to review well established frameworks for fault tolerance. Things like Erlang’s OTP come to mind which demonstrates concepts like supervision trees.

I don’t have a great resource on these frameworks or patterns, but my approach has been to learn from the characteristics of historically successful systems whose capabilities I want (if I can’t use the framework directly).


One thing I learned from high-frequency trading programming is to, where possible, restart the servers/software daily.


Great point, I'll add a goroutine that sends a signal every few hours to make systemd restart the app as an experiment.
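Something like this would do it (sketch only; the interval and jitter are arbitrary). Sending ourselves SIGTERM exercises the graceful-shutdown path and lets systemd's Restart= policy bring the process back up:

    package restart

    import (
        "math/rand"
        "os"
        "syscall"
        "time"
    )

    // After schedules a self-restart: SIGTERM to ourselves after base plus some
    // jitter, so a fleet doesn't restart in lockstep.
    func After(base time.Duration) {
        jitter := time.Duration(rand.Int63n(int64(30 * time.Minute)))
        time.AfterFunc(base+jitter, func() {
            p, _ := os.FindProcess(os.Getpid())
            _ = p.Signal(syscall.SIGTERM) // graceful shutdown runs, systemd restarts us
        })
    }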


systemd unit with

    Restart=always
    RestartSec=30
is enough for 99% of the small apps

If you can get away with VPS(es) and a cloud-hosted DB, that will be by far the simplest solution to manage. Or "serverless" if the service is small enough and doesn't need to be persistently running.

Once it starts bringing in actual money you might then start thinking about Docker, k8s and other fancy stuff


jasode's answer is spot on. At work, we run a digital preservation system that ingests large file sets into S3, Glacier and Wasabi. The files pass through a pipeline of microservices to verify integrity, identify file formats, and extract other metadata.

We use AWS Elastic Container Service (ECS) to ensure all of the services are running and to scale when necessary. We use NSQ to make sure tasks are sent and re-sent to workers until they complete the entire pipeline. And we use Redis to store interim processing data from each worker.

Keeping that interim data (state) in a place where all workers can access it is key. ECS kills workers when it wants, and workers occasionally die due to out-of-memory exceptions. With the state info stored in Redis, a new worker can pick up any failed task where the last worker left off.

This system has been running well in production under heavy load. It replaces an older server-based system that used different technologies to handle the same responsibilities: supervisord to keep processes running and BoltDB to store interim processing data. That system worked, but could not scale horizontally because BoltDB is a local, disk-based store. Distributed workers need a shared, network-accessible store to share state info.

You'll find a detailed overview with diagrams of the new system at https://aptrust.github.io/preserv-docs/overview/

There's a shorter write-up of the goals and how we achieved them at https://sheprador.com/2022/12/architecting-for-the-cloud/

This stuff isn't too hard, as long as you get the pieces right. Try to stick to fredley's advice. Keep things simple and you'll save yourself many headaches going forward. Also, be sure your workers handle SIGTERM explicitly (SIGKILL can't be caught), cleaning up their work as much as possible before they die.



