It's possible to make a compiler backdoor that is "updatable" and therefore a lot less brittle. And yes, this does make the backdoor easier to detect, since it's now communicating over the network. But that flexibility could really future-proof the backdoor and let it evolve over time as the target language changes.
For example, you could also make a compiler compile certain other software incorrectly in order to introduce exploitable vulnerabilities into the binaries. When I was working on convincing people of the importance of reproducible builds, I used to use an example where changing a single bit in the binary could introduce a fencepost error by turning one conditional branch instruction into another. If that branch controlled a loop that overwrites memory and increments pointers (for example), the resulting binary could be exploitable even though there was no fencepost error in the original source code.
(My x86 examples involved changing JGE to JG, or JL to JLE, i.e. changing >= to > and < to <= in loop conditions.)
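To make that concrete, here's a rough sketch (a hypothetical function, not the code from my original examples). In the source, the loop exits when i >= n, which on x86 typically becomes a cmp plus JGE to the exit label. If a backdoored compiler quietly emitted JG instead, the loop would also run for i == n, as if the source had said i <= n, and the last write would land one element past the end of the buffer:

    // Hypothetical illustration only; the real attack edits the emitted machine
    // code, not the source. Raw pointers are used here to mirror what the
    // generated code does: no bounds check is left to catch the extra write.
    fn copy_prefix(src: &[u8], dst: &mut [u8], n: usize) {
        let s = src.as_ptr();
        let d = dst.as_mut_ptr();
        let mut i = 0;
        while i < n {            // compiles to roughly: cmp i, n; jge exit
            unsafe { *d.add(i) = *s.add(i) }; // flipping JGE to JG = one extra write
            i += 1;
        }
    }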
Combining this with the trusting trust attack, you could have a self-perpetuating bug in the compiler plus a bugdoor in other software. The pattern match for the other software does not necessarily have to be super-specific in that case.
I would definitely agree that this wouldn't survive many generations of software evolution without active intervention. It certainly wouldn't survive a change of programming language or target machine architecture, for example.
Who said it has to "generalize"? No virus generalizes to hack every program. That doesn't mean viruses aren't dangerous.
Also, OSS makes up most of the modern stack, so access to source code is a given. And hand-crafting a backdoor when you have the source code is trivial, because you can literally change anything you want with confidence.
I actually tried comparing 128-bit SIMD to the scalar 64-bit performance and the difference was 2x. I only published the results for the 4x comparison, but it should be pretty easy to reproduce if you change the types in the non-SIMD code[1] from i32 -> i64.
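Purely for illustration (the real kernel is in the linked code[1]; here I'm just assuming a plain reduction loop), the change is only to the element type:

    // Illustrative sketch only, assuming the non-SIMD kernel is a simple sum.
    // The 64-bit comparison just widens the element type from i32 to i64.
    fn scalar_sum(values: &[i64]) -> i64 { // previously &[i32] -> i32
        let mut acc: i64 = 0;
        for &v in values {
            acc = acc.wrapping_add(v); // wrapping to avoid overflow panics in debug builds
        }
        acc
    }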
Great questions! I'm not a database expert either, but I can try answering these:
1) I think databases like to manage pages directly because the db has more context than the OS and can therefore make more optimizations. For example, when aborting a transaction, the db knows its dirty pages should be evicted (I'm not sure if mmap offers custom eviction). Also, I believe that if the db uses mmap, it loses control over when pages are flushed to disk, and flush control is necessary for guaranteeing transaction durability (there's a small sketch of this after the list).
2) What you're describing here sounds similar to an LSM-tree database (e.g. RocksDB). They're often used for write-heavy workloads because writes are just appends, but they might not be great for read-heavy things (there's a rough sketch of the write path after the list too).
3) This reminds me of PRQL[1] (which was trending on Hacker News last week) and Spark SQL. I'm not too familiar with this area though, so I can't really say why SQL was designed this way.
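To illustrate the flush-control point from 1), here's a minimal sketch (made-up names, not any particular database's API): for a commit to be durable, the log record has to be forced to disk before the commit is acknowledged, and that ordering needs an explicit fsync. With mmap'd pages, the kernel decides when dirty pages get written back, so the db can't enforce it.

    use std::fs::File;
    use std::io::Write;

    // Hypothetical write-ahead-log commit. The point is the explicit sync:
    // the record must hit stable storage *before* the commit is acknowledged.
    fn commit_record(wal: &mut File, record: &[u8]) -> std::io::Result<()> {
        wal.write_all(record)?; // append the log record
        wal.sync_all()?;        // fsync: force it to disk
        Ok(())                  // only now is it safe to ack the commit
    }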
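And for 2), a very rough sketch of why LSM-style writes are cheap (illustrative names and format only): writes land in a sorted in-memory memtable, and flushing is one sequential append pass to a new on-disk run rather than an in-place update. Reads can suffer because they may have to check several runs.

    use std::collections::BTreeMap;
    use std::fs::File;
    use std::io::Write;

    // Illustrative-only memtable: writes are buffered in memory, sorted by key.
    struct MemTable {
        entries: BTreeMap<Vec<u8>, Vec<u8>>,
    }

    impl MemTable {
        fn new() -> Self {
            MemTable { entries: BTreeMap::new() }
        }

        fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
            self.entries.insert(key, value); // no disk I/O on the write path
        }

        // Flushing writes one sorted run sequentially (a toy "SSTable" format).
        fn flush(&self, path: &str) -> std::io::Result<()> {
            let mut out = File::create(path)?;
            for (k, v) in &self.entries {
                out.write_all(k)?;
                out.write_all(b"=")?;
                out.write_all(v)?;
                out.write_all(b"\n")?;
            }
            out.sync_all()
        }
    }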
Another thing to consider is pluggable storage (a key/value interface) and a pluggable query language (a relational algebra interface?), and how to fit the two together.
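For instance (hypothetical trait names, just to make the shape of that split concrete), the storage side could expose nothing more than an ordered key/value store, and the query side would be operators written purely against it:

    // Hypothetical interfaces to illustrate the pluggable split; not from any real engine.
    trait KvStore {
        fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
        fn put(&mut self, key: &[u8], value: &[u8]);
        // Ordered range scan, which is what index-backed query plans need.
        fn scan(&self, start: &[u8], end: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)>;
    }

    // The relational side is just operators over rows, compiled down to calls
    // on whichever KvStore implementation happens to be plugged in.
    trait RowSource {
        fn next_row(&mut self) -> Option<Vec<Vec<u8>>>; // one row as a vector of column values
    }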
Yeah, I think it's a shame that most teachers don't give assignments like this that tie the big picture together with the low-level details. After students complete a big assignment like SimpleDB, they'll have a working artifact that they can reference for the rest of their careers.
I think the main issue that universities face is time. There is only so much time in each semester and, as we all know, building and improving a database is a lifelong task.
When I was implementing SimpleDB in 2019, I believe CMU's course didn't have publicly available resources or lab assignments. Now CMU has published a full video lecture series (which MIT doesn't have) along with their labs. So if I were starting again today, I would probably go with CMU's course.