I work on embedded systems and I've seen and heard some horror stories just on the device side: piles and piles of pre- and post-reboot shell scripts, filled with race conditions against the system's services and against each other. When these break, a factory reset is enough to fix the system if you're lucky; if you're unlucky, the devices become bricks in the field.
I'm trying to buck the trend though, and on the new embedded system I'm working on I've specifically designed the upgrade system to be as reliable as I can make it. It goes something like this:
- The new firmware is downloaded to the secondary application slot.
- Just prior to rebooting, the entire state data of the system is serialized as a document and stored on a flash partition.
- The upgrade flag is set, the system reboots and MCUboot does its thing.
- The new firmware finds out an upgrade happened, clears out all the data partitions, restores from the document and then clears out the document's partition.
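To make the ordering of that last step concrete, here is a minimal host-side model in Python. It is a sketch only: the `flash` dict stands in for the partitions, and the `upgrade_pending` marker, the key names and the function name are all hypothetical, since the post doesn't say how the firmware detects the upgrade. The point it illustrates is that the saved document is only erased after the restore, so an interrupted restore can simply be repeated on the next boot.

```python
def post_upgrade_boot(flash):
    """Model of the first boot after an upgrade.

    `flash` is a plain dict standing in for the partitions (hypothetical keys):
      "upgrade_pending": marker telling the firmware an upgrade happened
      "state_doc":       the serialized state document saved before the reboot
      "data":            the live data partitions
    """
    if not flash.get("upgrade_pending"):
        return "normal"

    doc = flash.get("state_doc")

    # 1. Sanitize: wipe every data partition, whatever it contained.
    flash["data"] = {}

    # 2. Restore from the document (same path as an end-user restore).
    if doc is not None:
        flash["data"] = dict(doc)

    # 3. Only now drop the document and the marker. If power is lost
    #    before this point, the next boot repeats steps 1 and 2.
    flash["state_doc"] = None
    flash["upgrade_pending"] = False
    return "restored"
```

Calling it a second time on the same `flash` returns `"normal"`, because the marker is cleared last.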
The system is basically sanitized and restored after each upgrade. It's also the same codepath that handles saving and restoring the system's configuration by the end-user as well as settings management. If the document schema is for an older version, run the N-to-N+1 schema upgraders on it prior to applying instead of trying to patch the system in-place. If something goes horribly wrong, flip a jumper to trigger the heavy-duty sanitization that nukes the entire external flash (internal flash only contains the bootloader, primary application slot and factory parameters so it's essentially read-only once the application boots).
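The N-to-N+1 upgrader chain can be sketched like this, again as a Python model rather than the actual firmware code. The schema versions, field names and the specific changes made by each upgrader are made up for illustration; only the idea of stepping the document forward one version at a time before applying it comes from the description above.

```python
CURRENT_SCHEMA = 3  # hypothetical: the schema version this firmware understands


def v1_to_v2(doc):
    # Assumed change: a flat "ip" key moves under a "network" section.
    out = dict(doc)
    out["network"] = {"ip": out.pop("ip", "0.0.0.0")}
    out["schema"] = 2
    return out


def v2_to_v3(doc):
    # Assumed change: a new setting is introduced with a default value.
    out = dict(doc)
    out["network"] = dict(out["network"], dhcp=False)
    out["schema"] = 3
    return out


# One upgrader per source version; each one only knows about N -> N+1.
UPGRADERS = {1: v1_to_v2, 2: v2_to_v3}


def migrate(doc):
    """Step the document forward until it reaches CURRENT_SCHEMA."""
    version = doc["schema"]
    if version > CURRENT_SCHEMA:
        raise ValueError("document is newer than this firmware")
    while version < CURRENT_SCHEMA:
        doc = UPGRADERS[version](doc)
        version = doc["schema"]
    return doc
```

Each new firmware release only has to add one upgrader, and a document saved by any older version still reaches the current schema by running the chain in order.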
It might be hubris, but I hope it's good enough that I'll never see a bricked card that can't be resurrected by a factory reset with this project (assuming no hardware damage, no internal flash corruption and no bricking firmware getting signed with production keys seeping through the cracks despite all the checks in place).
That's a strong start, but be careful if your system ever evolves beyond a single logical processor. You'll need additional orchestration to have reliable updates in a distributed system with semi-independent processors. The update on one might succeed, while another fails. Depending on when the old images were produced, the new images might not be able to talk to each other. Depending on their relative roles in the system (e.g. one sets up the power supply or network for the other, or acts as the time master to do certificate validation) this may or may not be an easily fixable issue even if each system locally thinks it's okay.
This sort of functional interdependency has become increasingly common in embedded with heterogeneous SoCs.
One thing I've seen before is to separate downloading from rebooting, broadcast the manifest for the updates between all the independent processors (all updates need a declarative manifest for so, so many reasons) to check locally, and only proceed when they all agree. Rollbacks are initiated if they can't see everyone with their expected versions afterwards.
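A rough Python model of that agreement scheme, under stated assumptions: the manifest layout (a shared `protocol` number plus an `images` map of processor name to version), the processor names and all function names here are hypothetical, and real transport, timeouts and signature checks are left out entirely.

```python
def node_accepts(manifest, node_name, staged_version, supported_protocols):
    """Local check one processor runs against the broadcast manifest."""
    return (
        manifest["images"].get(node_name) == staged_version
        and manifest["protocol"] in supported_protocols
    )


def decide_commit(manifest, votes):
    """Proceed to reboot only if every processor in the manifest voted yes.

    A missing vote counts as a refusal.
    """
    return all(votes.get(name) is True for name in manifest["images"])


def needs_rollback(manifest, reported_versions):
    """After the reboot: roll back unless everyone reports the expected version."""
    return any(
        reported_versions.get(name) != version
        for name, version in manifest["images"].items()
    )
```

For example, with `manifest = {"protocol": 2, "images": {"app": "1.4.0", "radio": "2.1.0"}}`, the update proceeds only when both `app` and `radio` have voted yes, and a processor that never reports back after the reboot is treated the same as one reporting the wrong version.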
Fortunately, it's a single no-frills MCU running the Zephyr RTOS. It does communicate with another system, but they are so loosely coupled that we really don't care what is running on the other side.
I won't get into details, but in some of the horror stories I've heard the distributed system happened to be entirely software in nature. There are plenty of creative ways to mess up an upgrade on a uniprocessor system.
We already have a watchdog timer. We could automatically trigger a factory reset after N bootloops following an upgrade, but it's up to the end-user to decide to flip the switch so we won't go there.
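For reference, the automatic policy being declined here would look roughly like the following Python model. Everything in it is an assumption for illustration: the value of N, the `state` record (a stand-in for a small persistent counter that survives watchdog resets) and the function names.

```python
MAX_BOOT_ATTEMPTS = 3  # the "N" above; the value is an arbitrary assumption


def on_boot(state):
    """Run early at every boot. `state` models a small persistent record.

    Returns "run" to start the application normally, or "factory_reset"
    once the post-upgrade firmware has failed to come up N times.
    """
    if not state["pending"]:
        return "run"  # no upgrade on trial, nothing to count
    state["attempts"] += 1
    if state["attempts"] > MAX_BOOT_ATTEMPTS:
        state["pending"] = False
        state["attempts"] = 0
        return "factory_reset"
    return "run"


def on_healthy(state):
    """Called once the application considers itself healthy after an upgrade."""
    state["pending"] = False
    state["attempts"] = 0
```

If the watchdog keeps firing before `on_healthy` is reached, the first three boots after the upgrade return `"run"` and the fourth returns `"factory_reset"`.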
I kept the summary short and simple, partly because that product isn't out yet, and partly because I don't want to bury the lede under a lot of extraneous details. We do take them into consideration, but they are irrelevant to the big-picture idea: an upgrade method that factory resets the card and restores its state, with a codepath shared with the end-user save/reset and configuration mechanisms.