I'm working on embedded systems and I've seen and heard some horror stories just...

AlotOfReading · 2025-03-14T21:26:58 1741987618

That's a strong start, but be careful if your system ever evolves beyond a single logical processor. You'll need additional orchestration to have reliable updates in a distributed system with semi-independent processors. The update on one might succeed, while another fails. Depending on when the old images were produced, the new images might not be able to talk to them. Depending on their relative roles in the system (e.g. one sets up the power supply or network for the other, or acts as the time master to do certificate validation), this may or may not be an easily fixable issue, even if each system locally thinks it's okay.

This sort of functional interdependency has become increasingly common in embedded systems with heterogeneous SoCs.

One thing I've seen before is to separate downloading from rebooting, broadcast the manifest for the updates to all the independent processors (all updates need a declarative manifest for so, so many reasons) so each can check it locally, and only proceed when they all agree. Rollbacks are initiated if they can't see everyone with their expected versions afterwards.
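To make that concrete, here is a rough sketch of the agree-then-reboot logic in C. Everything in it is an assumption made for illustration: the `manifest` layout, the fixed `NUM_NODES`, the `vote` enum, and the function names are hypothetical, and the transport that actually carries the manifest and the votes between processors is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical system size: one slot per semi-independent processor. */
#define NUM_NODES 3

/* Reserved value meaning "could not reach this node after the reboot". */
#define VERSION_UNREACHABLE UINT32_MAX

/* Hypothetical declarative manifest: the version every node should run. */
struct manifest {
    uint32_t release_id;                  /* identifies the update set */
    uint32_t expected_version[NUM_NODES]; /* expected version per node */
};

enum vote { VOTE_MISSING = 0, VOTE_ACCEPT, VOTE_REJECT };

/* Local check on one node: does the image it has downloaded and staged
 * match what the manifest expects for that node? */
enum vote check_manifest_locally(const struct manifest *m, size_t node,
                                 uint32_t staged_version)
{
    if (node >= NUM_NODES) {
        return VOTE_REJECT;
    }
    return (m->expected_version[node] == staged_version) ? VOTE_ACCEPT
                                                         : VOTE_REJECT;
}

/* Only proceed to the reboot when every node accepted. A missing vote
 * is treated the same as a rejection. */
bool all_nodes_agree(const enum vote votes[NUM_NODES])
{
    for (size_t i = 0; i < NUM_NODES; i++) {
        if (votes[i] != VOTE_ACCEPT) {
            return false;
        }
    }
    return true;
}

/* After the reboot: roll back if any node is unreachable or reports a
 * version other than the one in the manifest. */
bool rollback_needed(const struct manifest *m,
                     const uint32_t observed_version[NUM_NODES])
{
    for (size_t i = 0; i < NUM_NODES; i++) {
        if (observed_version[i] != m->expected_version[i]) {
            return true;
        }
    }
    return false;
}
```

Treating a missing vote as a rejection is a deliberate choice in this sketch: a node that never answers is indistinguishable from one that failed its download, so the safe default is not to reboot anyone.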

Still isn't perfect either.

boricj · 2025-03-14T22:45:39 1741992339

Fortunately, it's a single no-frills MCU running the Zephyr RTOS. It does communicate with another system, but they are so loosely coupled that we really don't care what is running on the other side.

I won't get into details, but in some of the horror stories I've heard, the distributed system happened to be entirely software in nature. There are plenty of creative ways to mess up an upgrade on a uniprocessor system.

fragmede · 2025-03-14T20:12:58 1741983178

Add a watchdog timer to reboot automatically on a failed upgrade as well.

boricj · 2025-03-14T20:22:48 1741983768

We already have a watchdog timer. We could automatically trigger a factory reset after N bootloops following an upgrade, but it's up to the end-user to decide to flip the switch so we won't go there.
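For anyone curious what the automatic variant would look like (the one we are not doing), here is a minimal sketch in C. It assumes a small `boot_state` record kept somewhere that survives a watchdog reset; the struct, the value of `MAX_BOOT_ATTEMPTS`, and the function names are all hypothetical and not taken from Zephyr or from our firmware.

```c
#include <stdbool.h>
#include <stdint.h>

/* The "N" in "N bootloops"; the value here is an arbitrary assumption. */
#define MAX_BOOT_ATTEMPTS 3u

/* Hypothetical record kept in storage that survives a reset. */
struct boot_state {
    bool upgrade_pending_confirm; /* set at install, cleared once healthy */
    uint8_t boot_attempts;        /* boots started since the upgrade */
};

enum boot_action { BOOT_NORMAL, BOOT_FACTORY_RESET };

/* Call early on every boot, before the application starts. */
enum boot_action on_boot(struct boot_state *s)
{
    if (!s->upgrade_pending_confirm) {
        return BOOT_NORMAL; /* no unconfirmed upgrade, nothing to count */
    }
    if (s->boot_attempts >= MAX_BOOT_ATTEMPTS) {
        /* N boots were started and none was confirmed: give up. */
        s->upgrade_pending_confirm = false;
        s->boot_attempts = 0;
        return BOOT_FACTORY_RESET;
    }
    s->boot_attempts++;
    return BOOT_NORMAL;
}

/* Call once the application considers itself healthy after an upgrade. */
void confirm_boot(struct boot_state *s)
{
    s->upgrade_pending_confirm = false;
    s->boot_attempts = 0;
}
```

The watchdog does the rebooting; the counter only decides what the next boot does. If the application never reaches `confirm_boot`, each watchdog reset burns one attempt until `on_boot` returns `BOOT_FACTORY_RESET`.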

I kept the summary short and simple, partly because that product isn't out yet, and partly because I don't want to bury the lede under a lot of extraneous details that we do take into consideration but that are irrelevant to the big-picture idea: an upgrade method that factory resets the card and restores its state through a code path shared with the end-user save/reset and configuration mechanisms.