The writing is amazing. I had a turbo-XT back in the day, so parts 1 and 2 really bring you back if you were of that age.
Part 3:
"Running C code is a bit more difficult. The problem is that the code generated by any standard C compiler for the x86 CPU will heavily depend on RAM. System RAM is not enabled yet and the code to enable it is so complex that you really want to do it in C. Two solutions have been devised and both of these are in use (one or the other depending on the hardware).
Use a special C compiler (romcc) that does not make use of RAM, but keeps all data in registers. As the register set of the x86 is quite small (only 8 general purpose 32-bit registers), this severely limits the things that your C program can do. As the CALL and RET instructions cannot be used (they always use the stack in RAM), all C functions have to be inlined.
Use the CPU cache as a data RAM. This requires some special tricks to pretend that all cache lines contain valid data and to prevent them from being evicted. Which tricks are exactly required, depends on the exact model of CPU, but it can be done. The Cache As RAM trick (CAR) yields at least 16kB of usable RAM, sufficient as stack space for a simple C program. All recent hardware ports use this trick."
We run into the same problem on ARM systems, but they tend to be much more forgiving when using cache as RAM. Generally there is a bit you can flip (either in memory-mapped I/O or via a coprocessor instruction) to turn cache into RAM, which is how most ARM boot ROMs work.
Furthermore, most embedded platforms simply hardcode the timing values for the variety of RAM that they put down on the board. Since you know that the RAM, CPU, and board won't change, you can calibrate once and be done with it.
For Novena, we support swapping DDR3 modules, so we need to recalibrate on every boot. We're not so restricted, since we have a good 128k (or 256k) of on-chip cache-as-RAM, so we just do it all in C. Of course it starts out with reading the RAM configuration out of the SPD chip on the module, but timing calibration must be completed regardless. If you're curious, the code we use to do this is available at https://github.com/xobs/u-boot-novena-spl/blob/u-boot-spl-no...
I also like how bringing the RAM up works: it starts by getting the southbridge running, then queries the RAM SPD for timings (via the southbridge SMBus), and then programs the northbridge with the right timing information. Wow.
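To make the "query the SPD for timings" step concrete, here is a rough sketch of turning a few SPD bytes into clock counts. It assumes a DDR3 module and the byte offsets I remember from the JEDEC DDR3 SPD layout (byte 2 device type, bytes 10/11 medium timebase, bytes 12/16/18/20 for tCKmin/tAAmin/tRCDmin/tRPmin). The names decode_ddr3_spd and example_spd are made up for illustration; this is not the coreboot or Novena code. It also skips the supported-CAS-latency bitmap and assumes the controller runs at the module's fastest clock, which a real implementation would not do.

```python
def decode_ddr3_spd(spd: bytes) -> dict:
    """Decode a handful of timing fields from a DDR3 SPD image (illustrative only)."""
    if spd[2] != 0x0B:  # byte 2: DRAM device type, 0x0B = DDR3 SDRAM
        raise ValueError("not a DDR3 module")
    if spd[11] == 0:
        raise ValueError("invalid medium timebase divisor")

    # Bytes 10/11: medium timebase (MTB) as dividend/divisor, in nanoseconds.
    mtb_ns = spd[10] / spd[11]

    tck_mtb = spd[12]  # byte 12: minimum clock cycle time, in MTB units
    if tck_mtb == 0:
        raise ValueError("invalid tCKmin")

    def clocks(t_mtb: int) -> int:
        # Round up: a timing must never be shorter than the module asks for.
        return (t_mtb + tck_mtb - 1) // tck_mtb

    tck_ns = tck_mtb * mtb_ns
    return {
        "tck_ns": tck_ns,
        # Two transfers per clock, so MT/s = 2000 / tCK(ns).
        "data_rate_mts": round(2000 / tck_ns),
        "cl_clocks": clocks(spd[16]),    # byte 16: tAAmin (CAS latency time)
        "trcd_clocks": clocks(spd[18]),  # byte 18: tRCDmin
        "trp_clocks": clocks(spd[20]),   # byte 20: tRPmin
    }


def example_spd() -> bytes:
    """Hand-built SPD fragment resembling a DDR3-1600 CL11 module (made-up data)."""
    spd = bytearray(32)
    spd[2] = 0x0B               # DDR3
    spd[10], spd[11] = 1, 8     # MTB = 1/8 ns = 0.125 ns
    spd[12] = 10                # tCKmin = 1.25 ns
    spd[16] = 110               # tAAmin = 13.75 ns
    spd[18] = 110               # tRCDmin = 13.75 ns
    spd[20] = 110               # tRPmin = 13.75 ns
    return bytes(spd)


if __name__ == "__main__":
    print(decode_ddr3_spd(example_spd()))
```

On real hardware the bytes would come from an SMBus read of the module's SPD EEPROM, and the resulting clock counts are what get written into the memory controller's timing registers.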
Modern high-speed DRAM interfaces are very complex. Their initialization is dependent on the particular memory module in use, and some parameters are even determined at power-on via training. So there actually are no "sensible defaults" which can be expected to work reliably with all memory modules. The initialization and training routines are far too complex to justify implementing them entirely in hardware, thus the need for software hacks to get things up and running.
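For a feel of what "determined at power-on via training" means, here is a toy sketch of one common idea: sweep a delay setting, note where a test read passes, and park in the middle of the widest passing window. Everything here is hypothetical (train_delay, fake_read_ok, the 32-tap delay line); real controllers train several parameters per byte lane with hardware-specific registers, so treat this as the shape of the loop rather than any vendor's algorithm.

```python
from typing import Callable, Optional


def train_delay(read_ok: Callable[[int], bool], taps: int) -> Optional[int]:
    """Return the tap at the centre of the widest passing window, or None if nothing passes."""
    best_start, best_len = 0, 0
    run_start: Optional[int] = None

    # Iterate one step past the last tap so a window that reaches the end gets closed.
    for tap in range(taps + 1):
        ok = tap < taps and read_ok(tap)
        if ok and run_start is None:
            run_start = tap                      # a passing window opens here
        elif not ok and run_start is not None:
            run_len = tap - run_start            # window closed; measure it
            if run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None

    if best_len == 0:
        return None
    # Centre of the window gives the most margin against drift in either direction.
    return best_start + (best_len - 1) // 2


def fake_read_ok(tap: int) -> bool:
    """Stand-in for 'write a pattern, read it back, compare' at a given delay tap."""
    return 9 <= tap <= 21


if __name__ == "__main__":
    print(train_delay(fake_read_ok, 32))
```

The passing window moves from module to module and board to board, which is why a single hardcoded value only works when the RAM is soldered down and never changes.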
If you think about it, many modern CPUs have as much cache as XTs had RAM.
It's no surprise that an i7 can work perfectly with no DIMMs attached to it, and for a brief period of time the BIOS does exactly that, until it brings up the DDR memory.
Another interesting thing: no matter how many cores the CPU has, it always starts in single-core mode and you have to switch the other cores on.
Mind blown indeed. But don't these need special CPU instructions? Which x86 instructions do this? (Or is this done through a special hardware pin on the CPU?)
Part 3:
"Running C code is a bit more difficult. The problem is that the code generated by any standard C compiler for the x86 CPU will heavily depend on RAM. System RAM is not enabled yet and the code to enable it is so complex that you really want to do it in C. Two solutions have been devised and both of these are in use (one or the other depending on the hardware).
Use a special C compiler (romcc) that does not make use of RAM, but keeps all data in registers. As the register set of the x86 is quite small (only 8 general purpose 32-bit registers), this severely limits the things that your C program can do. As the CALL and RET instructions cannot be used (they always use the stack in RAM), all C functions have to be inlined.
Use the CPU cache as a data RAM. This requires some special tricks to pretend that all cache lines contain valid data and to prevent them from being evicted. Which tricks are exactly required, depends on the exact model of CPU, but it can be done. The Cache As RAM trick (CAR) yields at least 16kB of usable RAM, sufficient as stack space for a simple C program. All recent hardware ports use this trick."
Mind. Blown.