The 8088 processor in the first IBM PC had a bug that gave me some grief.
(The code below is likely to have bugs of its own - I wrote it from memory as an illustration of the CPU bug - and thanks to 'tlb' for catching an error in my first draft. I also left out the question of what data segment the various MOV instructions use for their memory references, as it isn't relevant to this CPU bug.)
If you needed to work in a different stack from the one you were currently running on, you might do something like this:
mov saveSP, sp
mov sp, mySP
...
mov sp, saveSP
This saves the original SP (Stack Pointer) register, loads it with your private value, and then restores SP when you are done.
Suppose you wanted to switch not only to your own stack pointer but also your own stack segment. With 16-bit registers you could only address 64KB at a time, and you would need to change a segment register to access memory outside that range.
So you would save, change, and restore both the SS (Stack Segment) and SP registers:
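(Reconstructed to mirror the snippet above, so the same caveats apply:)
mov saveSS, ss
mov saveSP, sp
mov ss, mySS
mov sp, mySP
...
mov ss, saveSS
mov sp, saveSP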
Now imagine that an interrupt triggered in between one of the changes to SS and the matching change to SP. The interrupt code would now be running on the new stack segment but the old stack pointer, corrupting memory and crashing.
Not to worry! Intel had your back. The documentation promised that after a MOV SS or POP SS, interrupts would automatically be disabled until the next instruction (the matching MOV SP or POP SP) completed.
But they kinda forgot to implement that feature. So if you followed the docs, you would have these very rare and intermittent crash bugs.
Word got around fairly soon, and the fix was simple enough: disable interrupts yourself around the paired instructions:
mov saveSS, ss
mov saveSP, sp
cli
mov ss, mySS
mov sp, mySP
sti
...
cli
mov ss, saveSS
mov sp, saveSP
sti
This still left you unprotected against NMI (Non-Maskable Interrupt), but by the time most of us built NMI switches for our IBM PCs, we'd also upgraded to newer CPUs with this bug fixed. It was only the earliest 8088s (and perhaps 8086s) that had the bug.
"As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently."
I think Intel was more concerned about the time it took to make a new CPU rather than the cost. At least that was my impression of it at the time.
That testing is a cost is a given. But it's a known cost, compared to what a huge batch of faulty CPUs can cost. And how do you even put a price on a ruined reputation?
I suppose Intel already uses a lot of automated testing, but given all the bugs since the change, it seems it is not enough.
It's funny to hear that the bug increases are an effect of Intel trying to compete with ARM SoCs in mobile devices, because the errata those have are much worse --- and indeed a lot of embedded stuff is like that because the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.
> the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.
Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect, error-free CPU is not as incentivized.
The fixes often come at a steep cost, though. These include things like the Opteron "Barcelona" TLB bug and the first-generation EPYC "Naples" frequency scaling bug. The fix for the former knocked 20% off the performance of that generation of parts, and the fix for the latter meant that you had to run at the base clock frequency at all times, getting neither turbo boosts nor power savings. If you apply all of the speculative execution workarounds to an older Intel part like Xeon E5 v3, you will lose something like a quarter of the performance you paid for originally.
Yeah, I was thinking more of architecture errors where the solution is “modify the compiler so that code isn’t called” - though some may allow microcode updates.
The various spec-ex workarounds actually matter more on things like cloud servers than they do on dedicated/controlled hardware.
If I'm reading that page right, the update is 2 kB but it does not contain a full set of microcode, only some sort of patch e.g. only for the instructions that need fixes.
Interesting. So a CPU at version N is updated by a sequence of N patches applied in order, where each patch is a pair of location and code-as-data, plus noise (gotta confuse the competition, and hackers I guess).
There's a base microcode in ROM, and the patches are stored in a small piece of RAM built into the CPU; the whole sequence of patches is applied on every boot. Generally the BIOS will apply all the patches it has stored, and then the OS will finish the sequence if the BIOS version wasn't already the latest.
It's a long list for some CPUs; e.g., Sandy Bridge was released in 2011 and got its most recent microcode update in 2020.
> Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect, error-free CPU is not as incentivized.
This ignores the fact that there can be security exploits.