This kind of system will suffer from the ratchet problem. A single bug that negatively impacts state is no longer fixable by rebooting. Instead, you have to format/reinstall. Unintended state becomes permanent, and upgrade paths become constrained to the point where you must do intentional damage. I'd also be leery of fragmentation.
The project is falling into the "simplification trap", where all it's doing is moving inherent complexity around (which eventually requires users to do a bunch of work-arounds) rather than eliminating it (because inherent complexity can't be eliminated; only needless complications can be eliminated).
In theory, this sort of setup would be nice in a perfect world, but in the real world of buggy software and faulty hardware and cosmic rays and failing network connections, it's a disaster waiting to happen.
Well said. The first thing I thought when I saw this project was: sounds like cache-invalidation hell.
Now a second and genuine follow up and wildly generalized question:
What if, given that information isn't lost (no-hiding theorem), death in the biological platforms is a reset for consciousness to continue its evolution after a renewal of any state pollution?
These conservation laws apply to open systems (which biological platforms are). Already at the cell level the primary concern is how to "efficiently" do energy cascades, which sometimes have unintended (irreversible) side-effects. So cells get "garbage collected". There are other, more system-level irreversible changes, some intended (memory) and some unintended, which I guess are indeed ultimately unsalvageable, which is why reproduction is a thing (pristine copy of genome in new cell).
Even in the case of reproduction, you don't end up with a pristine copy in a new cell.
The study of epigenetics is the realization that cells don't have exec(), only fork(). Most state transfers to the child cells and isn't wiped during reproduction.
You could base such a system on software transactional memory and persistent data structures (like in Clojure) and in case of a bug just revert the state to any particular point in time by changing one pointer.
Your app crashed? No problem - Rewind 5 minutes earlier and don't do the thing that crashed it.
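To make the "revert by changing one pointer" idea concrete, here is a minimal Python sketch. It is not how Phantom or Clojure implement it: the VersionedState class and its method names are made up for illustration, and it copies the whole dict on every commit, whereas real persistent data structures share unchanged parts between versions.

```python
from types import MappingProxyType


class VersionedState:
    """Hypothetical store that keeps every committed state as a read-only snapshot."""

    def __init__(self, initial):
        self._history = [MappingProxyType(dict(initial))]
        self._head = 0  # the "one pointer" that selects the current state

    @property
    def current(self):
        return self._history[self._head]

    def commit(self, **changes):
        # Build a new snapshot instead of mutating the old one.
        new_state = MappingProxyType({**self.current, **changes})
        # Discard snapshots that were rewound past, then append the new one.
        del self._history[self._head + 1:]
        self._history.append(new_state)
        self._head += 1

    def rewind(self, steps=1):
        # Reverting only moves the pointer; nothing is copied back.
        self._head = max(0, self._head - steps)


state = VersionedState({"doc": "", "cursor": 0})
state.commit(doc="hello", cursor=5)
state.commit(doc="hello world", cursor=11)
state.rewind(1)  # back to the state before the last commit
```

Note that this only rewinds one object's state. It says nothing about what happens to other programs that already observed the newer state.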
There's a whole unexplored universe of possibilities when we do away with traditional OS design. I'm especially interested in how it would work with Intel Optane.
Rebooting does not fix any bugs, it just resets the state. I'm not sure this is really the better solution because many bugs never get fixed because "Have you tried turning it off and on, again?"
It is the better solution because it delineates the difference between important data that must be preserved, and not important data that can be thrown out and regenerated (turning it off and on again).
Systems that eliminate this difference massively increase the damage surface of your data by forcing both kinds to be treated equally and even intermix. And since software will always have bugs, your fallout damage increases exponentially.
It's a lot like naive state saving code that just dumps an in-memory struct to disk: The moment that struct changes (adding/removing/changing a type or size), your load code breaks.
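As a rough illustration of the alternative to a raw struct dump, the sketch below tags each saved blob with a format version and migrates old layouts on load. The layouts and field names (x, y, zoom) are invented for the example.

```python
import struct

# Hypothetical on-disk layouts: v1 stored (x, y); v2 added a float "zoom".
HEADER = struct.Struct("<H")   # format version number
V1 = struct.Struct("<ii")
V2 = struct.Struct("<iif")


def save(x, y, zoom):
    # Always write the newest layout, prefixed with its version.
    return HEADER.pack(2) + V2.pack(x, y, zoom)


def load(blob):
    (version,) = HEADER.unpack_from(blob, 0)
    body = blob[HEADER.size:]
    if version == 1:
        x, y = V1.unpack(body)
        return {"x": x, "y": y, "zoom": 1.0}  # migrate: fill in a default
    if version == 2:
        x, y, zoom = V2.unpack(body)
        return {"x": x, "y": y, "zoom": zoom}
    raise ValueError(f"unknown state version {version}")


# A blob written by the older version of the program still loads.
old_blob = HEADER.pack(1) + V1.pack(10, 20)
```

The point is that an explicit load path gives you a place to put the migration. With transparently persisted memory there is no such seam unless the system provides one.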
Rebooting (or "restarting the process" in a more limited case) does not fix the bugs, but is a very effective workaround in a lot of cases.
Yes, a lot of bugs are never going to be fixed, and we'll have to restart the machine occasionally... but Phantom's alternative is forcing machine reformat on any bug.
You can do this today: just add a crash handler that will erase your entire hard disk. Do you think this will make software less buggy?
Resetting the state is a necessary step when the state is corrupted. You can’t work around this with “don’t corrupt the state”. So many developers are running the other way: isolate state; reduce dependence on state; eliminate side effects; crash only software; etc. The state doesn’t need to be preserved because it simply isn’t that valuable.
That's likely already baked in, but then you end up with an infrequently run bootstrap code path, and "resetting" an app basically means nuking all of its data (some of which might be important to you).
This is why we have a clear delineation between persistent and ephemeral state.
This is what happened to my iOS install. Due to some problem (is it an exploit hidden somewhere among incoming messages?) my Messages app takes 5+ seconds to open every time, and there is absolutely nothing I can do, except setting up a whole new iPhone from scratch (and hoping the problematic message won't get synced from iCloud anyway).
On rooted Android I could clear corrupted app data.
Yes, I have a similar feeling in Smalltalk (i.e. Squeak or Cincom)
A corrupted image is difficult to fix, and you end up saving the code on a "parcel"/package and reload it on a clean image.
But now since there's no boundary between persistent and ephemeral state, reverting to a snapshot will damage all of your other processes and lose data in unpredictable ways due to the halting problem. The cure could end up worse than the disease.
The distinction between persistent and ephemeral state may be too coarse-grained. In many applications like preference systems and object persistence more fine-grained distinctions are needed. As a typical example, miscellaneous data like window positions, sizes, and settings can and should be reset under certain conditions in desktop applications, but they still need to be persistent. There is a need for systems of default values & conditions that regulate when resets from corrupted states are allowed and when resets are executed for certain types of data (but not for others).
I've never seen a framework or OS support such fine-grained persistence and failure management. It's strange because almost every application needs something like this, even if it's just to reset a faulty preferences file. I've always thought that these kinds of features should be provided by the OS, together with indexing and better guarantees for file integrity (e.g. ACID-compliant atomic file operations).
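A minimal sketch of what per-key defaults and reset conditions could look like at the application level, assuming preferences are stored as JSON. The schema, key names and validators are all hypothetical. Only the keys that fail validation are reset, and the rest of the persisted data survives.

```python
import json

# Hypothetical preference schema: key -> (default, validator).
SCHEMA = {
    "window_width": (800, lambda v: isinstance(v, int) and 100 <= v <= 10000),
    "window_height": (600, lambda v: isinstance(v, int) and 100 <= v <= 10000),
    "theme": ("light", lambda v: v in ("light", "dark")),
}


def load_prefs(text):
    """Return (prefs, reset_keys); invalid or missing keys fall back to defaults."""
    try:
        raw = json.loads(text)
        if not isinstance(raw, dict):
            raw = {}
    except json.JSONDecodeError:
        raw = {}  # the whole file is unreadable, so every key resets

    prefs, reset_keys = {}, []
    for key, (default, is_valid) in SCHEMA.items():
        if key in raw and is_valid(raw[key]):
            prefs[key] = raw[key]
        else:
            prefs[key] = default
            reset_keys.append(key)
    return prefs, reset_keys


# A corrupted width is reset, but the valid theme setting is kept.
prefs, reset_keys = load_prefs('{"window_width": -5, "theme": "dark"}')
```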
I think the idealism lies in the ability to reset/reboot. How much system behavior is undefined? From a security perspective, it's nearly impossible to tell exactly what malware did when so much data is ephemeral.
Isn't that what snapshots are for? If rebooting breaks then rollback to a previous snapshot of the kernel or core components but keep all the users data intact.
Now find out when your problem was introduced. You're very, very lucky if that was 30 minutes ago. But there will also be hidden issues that can lie dormant for weeks or months. Good luck using rollback for those.
In traditional OS, where data is separate from code, yes.
The whole point of Phantom OS though is that there is no separation between them. Instead of having files, you just keep the data in your program's memory, and rely on automatic system-wide persistence to keep it safe. So the only thing you can roll back is the entire system.
The system could offer functionality to make restore-points of specific program memory spaces. Programs could use an API to trigger restore-point creation.
I like the ideas in Phantom, even considering the negative comments, it is refreshing to see people having a go at OS design that isn't just blindly making yet another UNIX clone, because apparently we can't get enough of them.
Are you sure you know what 'persistent' means in this context? Because I cannot see how network booting leads to the kind of persistence we're talking about. In fact, you seem to be describing technologies that help in implementing the exact opposite.
Also, you've missed the point of the Linux comment. I'll spell it out: typically people who revere UNIX in the way described have never in their lives used a UNIX system other than Linux (unless it was MacOSX+, of course). Is that relevant? Not really, that's why it is a footnote.
I'm not sure this actually solves the real problem that people have. Sure, reboots are annoying and pretty much all current OSes could be improved in this regard. Back in the 90s, it's interesting that SunOS supported live kernel upgrades, so at the same time the kernel on disk was updated, the running system was also patched so it would act the same way as-if rebooted, but without any disruption to the live system.
However, in the current days where rebooting is seen as normal for an OS upgrade, but most of the time people just put their computer to sleep, the main problem is that all the network interfaces are effectively useless when the system wakes up because most connections will have long since timed out due to lack of connection acknowledgement packets. In such a case, systems will have to tear-down and re-create a lot of state anyway.
The actual issue this seems to solve, that of saving memory state and restoring again, is already solved for most use cases except OS upgrades by using sleep mode. But the hard case about network connections is still unsolved for this system, and by solving the problems it does solve with yet another VM-based environment it'll probably be doomed to obscurity unless common existing applications and virtual machines can easily be made to run on-top of it.
> all the network interfaces are effectively useless when the system wakes up
This is, and probably always will be, a problem with networked software. You should program as if TCP connections could drop at any time, for any reason. Browser tabs get backgrounded. Laptops go to sleep. Cell phones roam, and go into tunnels. Servers have temporary net splits.
Most high level tasks can be retried safely. Nonces and things can be used to safely retry almost anything else.
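Here is a small sketch of the nonce idea, assuming the server remembers which request keys it has already applied (often called an idempotency key). FlakyServer and deposit_with_retry are made-up names, and the "lost reply" is simulated rather than a real network failure.

```python
import uuid


class FlakyServer:
    """Hypothetical server that applies each request key at most once."""

    def __init__(self):
        self.balance = 0
        self._seen = {}   # request key -> cached response
        self._calls = 0

    def deposit(self, key, amount):
        self._calls += 1
        if key not in self._seen:
            self.balance += amount
            self._seen[key] = self.balance
        if self._calls == 1:
            # Simulate the reply being lost after the work was already done.
            raise ConnectionError("connection dropped")
        return self._seen[key]


def deposit_with_retry(server, amount, attempts=3):
    key = str(uuid.uuid4())  # one key shared by all attempts of this operation
    for _ in range(attempts):
        try:
            return server.deposit(key, amount)
        except ConnectionError:
            continue  # safe to retry: the server deduplicates on the key
    raise ConnectionError("gave up after retries")


server = FlakyServer()
result = deposit_with_retry(server, 50)
```

Without the key, the retry after the dropped reply would have deposited twice.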
I miss the simplicity of IRC. The way Slack and Discord smoothly transition between online and offline states is graceful and intuitive. That is how almost all software should behave.
It's even more essential in IoT and mobile devices, where a reboot is often necessary, sometimes just to restore a broken connection, like a modem's. Nobody works on stable software now, everyone tries to ship it as fast as possible. It's not necessarily a bad thing, but it is certainly orthogonal to the persistence.
The plan is (was?) running any JVM stuff on their VM.
Also, IIRC Dmitry was thinking about embedded systems applications at least back in 2011 when he was giving a talk about PhantomOS at HighLoad++ conference in Moscow.
Nowadays, if you can make your hobby OS run WebAssembly I think you'd be able to get a lot of functionality out of it, even if it can't run Office x86 binaries.
> Its primary goal is to provide environment for programs that survive OS reboot. Such an environment greatly simplifies software development
Simplifies? I can't even imagine such a program. To me a program is something that starts and ends and can also re-start with a fresh state in case something goes wrong.
What about some middle ground? I’d like my IDE to remain constant between reboots, my browser to some degree is already (if it crashes, it reopens all current tabs on start, which is effectively the same).
I would have thought some clever hacks on an existing kernel would be reliable enough rather than an entirely new OS just for one feature, though.
Why do we need this in a kernel? Why not just serialize the essentials of the apps states and re-load it after reboot/sign-in?
I'd prefer the desktop environment to remember which apps were running, including which documents were opened in them and their window positions, and just re-launch everything at start-up. Given how fast everything cold-boots today thanks to modern SSDs, I wouldn't even use standby/hibernate if it worked this way.
That does sound great, but realistically any application needs to be able to recover from a crash anyway - so even with this support from the OS, you just have a new code path to support. Might be nice for users, but it certainly doesn’t simplify anything for developers.
The IDE I use, QtCreator, works like this: if you reboot or it crashes you can just restore the previous session on startup and it will bring you exactly where you were.
Sort of like that but you don't have to wait until all your 16 GB of RAM are written to disk. AFAIU, Dmitry found a cheap way of making consistent snapshots of a running system.
BTW, there was a thread on HN relatively recently where someone mentioned that the hard part of hibernation is not restoring the RAM state but bringing all hardware into the right state at boot. That's why hibernation has been so problematic on Linux.
Something like “crash-only software”; programs are expected to be killed and relaunched. Software should cope with it, as there is no swap space and ideally the user shouldn’t be managing memory. An app may create to some degree the appearance of persistence but it’s not persistence. Users, also, can easily force kill apps. Ongoing background operations like downloads, uploads and some media playback are delegated to the operating system. Such operations continue when your process is killed, and you check up on them when your process starts. I hope it’s not TMI, but the standard library includes atomic file saves.
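For anyone unfamiliar with atomic file saves, the usual technique is to write to a temporary file and rename it over the target. The sketch below shows it in Python. It is a generic illustration, not the platform API mentioned above, and atomic_write is a made-up helper name.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes), all or nothing."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swaps the new file in over the old one
    except BaseException:
        # Something failed before the swap: drop the partial temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process is killed mid-write, the old file is still intact, which is exactly what crash-only software needs from its storage.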
I studied some persistent object store databases in the past - which seemed like an incredibly good idea and I think the AS400 used such a database and was very popular.
It removed the need for a filesystem or any of the usual patterns for retrieving data like SQL so a whole class of programming that people think of as "normal" today just vaporised.
Persistent programs seem like a logical-ish next step but I wish the first step could have been taken because it was very nice to program in.
Not related to Phantom OS, but the OS running the Apollo 11 guidance computers had this same feature. In fact, the famous 1202 program alarms right before landing are related: a low priority task was overwhelming the system and the OS restarted to get a clean slate, continuing the high priority tasks where they left off:
“The software rebooted and reinitialized the computer, and then restarted selected programs at a point in their execution flow near where they had been when the restart occurred.” [0]
Did it? It seems more like that guidance application was saving checkpoint data, rather than an OS feature. Though the distinction between OS and applications might be muddy here. In fact, a persistent OS design would be catastrophic, as restoring all jobs would just cause resource exhaustion again.
Well, that's what happened: "On Apollo 11, each time a 1201 or 1202 alarm appeared, the computer rebooted, restarted the important stuff, like steering the descent engine and running the DSKY to let the crew know what was going on, but did not restart all the erroneously-scheduled rendezvous radar jobs."
Resource exhaustion was not immediate after reboot because the faulted tasks were low priority and did not get started after reboot right away. I can't even imagine what would have happened if one of the high priority tasks had been the problematic one.
>but did not restart all the erroneously-scheduled rendezvous radar jobs
That's wrong. Excess CPU time was stolen by radar counter hardware, not by any software (counters worked by stopping the CPU and using its ALU). The problem arose because the main guidance routine (SERVICER job) was always scheduled every 2 seconds (by the READACCS task/interrupt), and with excess stolen time it didn't finish in time, leading to scheduling of another SERVICER before the previous instance finished. This repeated until memory ran out for stacking another SERVICER job. There were no "erroneously-scheduled rendezvous radar jobs"; the job that needed shedding was the multiple stacked instances of SERVICER itself.
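The stacking mechanism described above can be shown with a toy model. This is not a simulation of the real AGC: only the 2-second period comes from the comment, while the per-job work time, the stolen CPU percentage and the number of job slots are invented numbers chosen to show the effect.

```python
def periods_until_overflow(period_ms, work_ms, stolen_pct, slots, max_periods=1000):
    """Toy model: return the period in which job slots run out, or None."""
    # CPU time actually left for the job in each period after cycle stealing.
    available_ms = period_ms * (100 - stolen_pct) // 100
    backlog_ms = 0
    for n in range(1, max_periods + 1):
        backlog_ms += work_ms                     # a new job is scheduled
        outstanding = -(-backlog_ms // work_ms)   # ceiling division
        if outstanding > slots:
            return n                              # no room to stack another job
        backlog_ms = max(0, backlog_ms - available_ms)
    return None


# Hypothetical numbers: each job needs 1800 ms of CPU per 2000 ms period.
healthy = periods_until_overflow(2000, 1800, stolen_pct=0, slots=7)
overloaded = periods_until_overflow(2000, 1800, stolen_pct=15, slots=7)
```

With no stolen time the job always finishes inside its period and the backlog never grows. Once the stolen share pushes the job past its period, unfinished work accumulates every cycle until the slots are exhausted.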
I get similar benefits from running an application in a VM. If I don’t want to shut down the application, I just save the running state of the VM. But if I need to recover from a crash or other state error, I can still reboot the VM.
How do you handle backup and restore on a system like this? If the snapshots are at the OS level, and there is cross-app commingling of data objects, how can you restore the state of a particular application or object (formerly "file") without breaking everything? How do I restore a single document or a single contact in my address book?
I wonder how much wear on the system disk is caused unnecessarily by taking continuous snapshots.
I think that you would like to limit the actual data to persist to the data that needs to be persisted — that which can't be recreated quickly, purely from other objects.
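A small sketch of that split using Python's pickle: the essential field is persisted, while a derived cache is dropped on save and rebuilt on demand. The Document class and its word index are hypothetical examples.

```python
import pickle


class Document:
    def __init__(self, text):
        self.text = text          # essential: must be persisted
        self._word_index = None   # derived: can be rebuilt from text

    @property
    def word_index(self):
        # Rebuild the derived index lazily the first time it is needed.
        if self._word_index is None:
            self._word_index = {}
            for pos, word in enumerate(self.text.split()):
                self._word_index.setdefault(word, []).append(pos)
        return self._word_index

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_word_index"] = None  # do not persist the derived cache
        return state


doc = Document("to be or not to be")
_ = doc.word_index                          # populate the cache
restored = pickle.loads(pickle.dumps(doc))  # the cache is not in the saved bytes
```

Besides saving disk writes, this also means a corrupted cache never survives a save/load cycle.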
Isn't the ideal of OS not to need to reboot ever? This project is about intentionally rebooting?
I just had the question of 'whatever happened to ksplice?' and immediately found my answer of 'oh oracle' no wonder nobody talks about ksplice anymore.
I'm going to self plug here on work I'm a part of that does persistent processes, although the motivation is different. I do think this paper does a good job of informing the reader on why these sorts of features would be really cool, which this project does not necessarily dive into. But it requires widening the API and allowing for developers to choose how they persist.
I remember reading about another, pretty old, microkernel OS which had persistent processes. It was probably a capability based OS. Anyone know what that was?
Ooh interesting, probably unrelated but I was thinking a GAI would use one of these at its core eg. "it can't die". For the self/state aspect.
Also a criterion is code that modifies itself without recompiling.
This also considers a power source like a nuclear battery or something that will last a long time or alternate in the low power state/ambient or whatever energy.
New operating systems are really needed for virtual and augmented reality. Especially with higher resolution, more comfortable devices that people use for work. I think the 3d file system representation will finally become normal.
It will be an exciting time for user interface development.
Possibly we could see networked collaboration/"multiplayer" at the OS level.