I recommended “Understanding Distributed Systems: What every developer should know about large distributed applications” by Roberto Vitillo to all my colleagues back when I worked on SaaS systems.
I’d recommend “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems” by Martin Kleppmann as the more advanced deep dive.
Both books provide timeless conceptual advice. Kleppmann’s description of developing a database by starting from an append-only text file really stuck with me.
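For anyone who hasn’t read it, the example goes roughly like this. Kleppmann’s original is a pair of shell one-liners; this Python adaptation is mine, not his:

```python
# A toy key-value store in the spirit of Kleppmann's append-only example.

def db_set(path, key, value):
    # Writes never touch old data: every set just appends a new record.
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(path, key):
    # Reads scan the whole file and keep the *last* value seen for the key,
    # so updates work without rewriting anything, at O(n) cost per lookup.
    value = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v
    return value
```

Everything the book builds afterwards (indexes, SSTables, LSM-trees, B-trees) is motivated as a fix for that O(n) read path.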
> Lamport's paper "Time, Clocks, and the Ordering of Events in a Distributed System"
I know this paper is a classic. I studied it in school, but I've always found it very hard to understand. Maybe I'm wrong, but I have the feeling that relatively few engineers use these formalisms as their mental models when designing distributed systems.
It was surprising that Kleppmann's book was mentioned only at the very end of the article, but at least it came with an understandable caveat. That book is incredible, although in all honesty it requires a solid foundation in distributed systems to make proper sense.
Until you have personally battled with replication lag and the real-life impacts of eventual consistency and distributed writes, Designing Data-Intensive Applications feels like a dry, theoretical read. If you come to the book with those scars and lessons, it opens the world up.
I often like to think that, at a basic level, all a [edit: indexed] db "does" is move our O(n) search of an unordered text file to the O(log n) search of a tree.
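A minimal sketch of that mental model, with Python's bisect standing in for the B-tree a real database would maintain:

```python
import bisect

# The unordered "text file": a lookup has to scan every record, O(n).
records = [("carol", "3"), ("alice", "1"), ("bob", "2")]

def scan_lookup(key):
    for k, v in records:
        if k == key:
            return v
    return None

# The "index": sort once, then binary-search the keys, O(log n) per lookup.
index = sorted(records)
keys = [k for k, _ in index]

def indexed_lookup(key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]
    return None

assert scan_lookup("bob") == indexed_lookup("bob") == "2"
```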
From a high-altitude view, that's why splitting a huge database table into smaller partitions is not an automatic performance win. If you have M partitions with N rows each, then a lookup might require O(log M) time to find a partition and O(log N) time to find a row within the partition. But O(log M + log N) = O(log MN), which is what you would get from a single big table with appropriate indexing.
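To put made-up numbers on it (1,000 partitions of a million rows each, chosen purely for illustration):

```python
import math

M, N = 1_000, 1_000_000                   # 1,000 partitions of 1M rows each
two_level = math.log2(M) + math.log2(N)   # find the partition, then the row
one_table = math.log2(M * N)              # one big indexed table of 1B rows
print(two_level, one_table)               # both ~29.9: log M + log N = log MN
```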
Of course, in the real world, constant factors and implementation details matter, so this is just a heuristic. But it seems to run contrary to a lot of novice programmers' intuition that a large DB table must automatically be a slow one.
The book Dominik Tornow is writing, “Thinking in Distributed Systems”, has been an excellent next read after DDIA for me (it’s not yet finished, I believe).
It really shows the experience of someone who understands this stuff inside and out (he was one of the main people behind Temporal).