CPU Bugs (2018)

Stratoscope · on May 24, 2021

The 8088 processor in the first IBM PC had a bug that gave me some grief.

(The code below is likely to have bugs of its own - I wrote it from memory as an illustration of the CPU bug - and thanks to 'tlb' for catching an error in my first draft. I also left out the question of what data segment the various MOV instructions use for their memory references, as it isn't relevant to this CPU bug.)

If you needed to work in a different stack from the one you were currently running on, you might do something like this:

  mov saveSP, sp
  mov sp, mySP
  ...
  mov sp, saveSP

This saves the original SP (Stack Pointer) register, loads it with your private value, and then restores SP when you are done.

Suppose you wanted to switch not only to your own stack pointer but also your own stack segment. With 16-bit registers you could only address 64KB at a time, and you would need to change a segment register to access memory outside that range.

So you would save, change, and restore both the SS (Stack Segment) and SP registers:

  mov saveSS, ss
  mov saveSP, sp
  mov ss, mySS
  mov sp, mySP
  ...
  mov ss, saveSS
  mov sp, saveSP

Now imagine that an interrupt fired between a change to SS and the matching change to SP. The CPU would push the interrupt's return state, and the handler would run, on a mismatched stack: the new stack segment with the old stack pointer (or, on the way back, the old segment with the private pointer), corrupting memory and probably crashing.
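To put numbers on that, here is a rough Python sketch (not 8088 code) of where the three words pushed by an interrupt (FLAGS, CS, IP) would land. It relies only on the real-mode rule that the physical address is segment * 16 + offset; the segment and pointer values and the function names are made up for illustration.

```python
# Real-mode 8086/8088 addressing: physical = (segment * 16 + offset) mod 1 MiB.
def physical(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF


def interrupt_frame(ss, sp):
    """Physical addresses of the three words (FLAGS, CS, IP) an interrupt pushes."""
    addrs = []
    for _ in range(3):
        sp = (sp - 2) & 0xFFFF  # a push decrements SP by 2, then stores the word
        addrs.append(physical(ss, sp))
    return addrs


# Hypothetical values, chosen only for illustration.
OLD_SS, OLD_SP = 0x2000, 0x0100
MY_SS, MY_SP = 0x5000, 0xFFFE

before = interrupt_frame(OLD_SS, OLD_SP)  # interrupt before the switch: old stack
after = interrupt_frame(MY_SS, MY_SP)     # interrupt after both MOVs: private stack
torn = interrupt_frame(MY_SS, OLD_SP)     # interrupt between MOV SS and MOV SP

print([hex(a) for a in before])
print([hex(a) for a in after])
print([hex(a) for a in torn])
```

With these made-up values the torn frame lands near the bottom of the private segment, nowhere near either stack top, so it overwrites whatever happens to live there.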

Not to worry! Intel had your back. The documentation promised that after a MOV SS or POP SS, interrupts would automatically be disabled until the next instruction (the matching MOV SP or POP SP) completed.

But they kinda forgot to implement that feature. So if you followed the docs, you would have these very rare and intermittent crash bugs.

Word got around fairly soon, and the fix was simple enough: disable interrupts yourself around the paired instructions:

  mov saveSS, ss
  mov saveSP, sp
  cli
  mov ss, mySS
  mov sp, mySP
  sti
  ...
  cli
  mov ss, saveSS
  mov sp, saveSP
  sti

This still left you unprotected against NMI (Non-Maskable Interrupt), but by the time most of us built NMI switches for our IBM PCs, we'd also upgraded to newer CPUs with this bug fixed. It was only the earliest 8088s (and perhaps 8086s) that had the bug.

tlb · on May 24, 2021

Why does the pop at the end of:

  push sp
  mov sp, myPrivateSP
  ...
  pop sp

work? Isn't it popping from the private stack, while it was pushed on the regular stack?

Stratoscope · on May 24, 2021

Oh, good catch! I was doing this from memory, and definitely have a bug there.

Updated now; hopefully this will be a more plausible example. Let me know if you spot something else! :-)

anonymousiam · on May 24, 2021

"As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently."

A most prescient remark in 2014.

Here's where they are more recently:

https://www.zdnet.com/article/intel-fixed-236-bugs-in-2019-a...

https://www.techradar.com/news/latest-intel-cpus-have-imposs...

Flow · on May 24, 2021

When this news broke I thought Intel had lost their mind.

Did they really intend to just "skip" validation or did they try to automate it further, to decrease time to produce a new chip?

hulitu · on May 24, 2021

Testing is expensive. That's why it has a great potential for savings.

Flow · on May 24, 2021

I think Intel was more concerned about the time it took to make a new CPU rather than the cost. At least that was my impression of it at the time.

That testing is a cost is a given. But it's a known cost compared to what a huge batch of faulty CPUs can cost. And what about a ruined reputation? How do you even know what that could cost you?

I suppose Intel already uses a lot of automated testing, but given all the bugs since the change, it seems that is not enough.

userbinator · on May 23, 2021

It's funny to hear that the bug increases are an effect of Intel trying to compete with ARM SoCs in mobile devices, because the errata those have are much worse --- and indeed a lot of embedded stuff is like that because the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.

amelius · on May 24, 2021

> the general line of thought there is that bugs are worked around in software and there's little expectation of being able to run existing code flawlessly, unlike with a PC.

How does that work for Apple's M1?

saagarjha · on May 24, 2021

https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad95...

bombcar · on May 23, 2021

Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect error-free CPU is not as incentivized.

jeffbee · on May 23, 2021

The updates are often fatal, though. These include things like the Opteron "Barcelona" TLB bug and the first-generation EPYC "Naples" frequency scaling bug. The fix for the former knocked 20% off the performance of that generation of parts, and the fix for the latter meant that you had to run at the base clock frequency at all times, getting neither turbo boosts nor power savings. If you apply all of the speculative execution workarounds to an older Intel part like Xeon E5 v3 you will lose something like a quarter of the performance you paid for originally.

bombcar · on May 24, 2021

Yeah, I was thinking more of architecture errors where the solution is “modify the compiler so that code isn’t called” - though some may allow microcode updates.

The various spec-ex workarounds actually matter more on things like cloud servers than they do on dedicated/controlled hardware.

mrslave · on May 24, 2021

When microcode is so small (2K according to [0]), how can it work to enable/disable specific instructions, or even change how they work?

[0] https://en.wikipedia.org/wiki/Intel_Microcode

Liquid_Fire · on May 24, 2021

If I'm reading that page right, the update is 2 kB but it does not contain a full set of microcode, only some sort of patch e.g. only for the instructions that need fixes.

mrslave · on May 24, 2021

Interesting. So a CPU at version N is updated by a sequence of N patches applied in order, where each patch is a pair of location and code-as-data, plus noise (gotta confuse the competition, and hackers I guess).

opencl · on May 24, 2021

There's a base microcode in ROM, and the patches are stored in a small piece of RAM built into the CPU, so the whole sequence of patches is applied on every boot. Generally the BIOS will apply all the patches it has stored and then the OS will finish the sequence if the BIOS version wasn't already the latest.

It's a long list for some CPUs; e.g. Sandy Bridge was released in 2011 and got its most recent microcode update in 2020.
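A toy Python model of the scheme as described in this comment, in case it helps: a fixed base in ROM, a volatile patch RAM that shadows it, and patches replayed on every boot (BIOS first, then the OS). Real microcode formats are undocumented, so every name and data structure here is hypothetical and this should not be read as how any actual CPU works.

```python
# Toy model only: real microcode formats are undocumented; all names are hypothetical.
BASE_ROM = {"op_a": "rom routine for op_a", "op_b": "rom routine for op_b"}


class ToyCpu:
    def __init__(self):
        self.patch_ram = {}  # volatile: empty again after every power-on
        self.revision = 0

    def load_patch(self, revision, entries):
        """Apply a patch only if it is newer than what is already loaded."""
        if revision <= self.revision:
            return False
        self.patch_ram.update(entries)
        self.revision = revision
        return True

    def routine_for(self, op):
        # Patched entries shadow the ROM; everything else falls through to ROM.
        return self.patch_ram.get(op, BASE_ROM[op])


def boot(bios_patches, os_patches):
    """Fresh CPU, then the BIOS patches, then whatever the OS has on top."""
    cpu = ToyCpu()
    for revision, entries in bios_patches + os_patches:
        cpu.load_patch(revision, entries)
    return cpu


bios = [(1, {"op_a": "fixed op_a v1"})]
os_updates = [(1, {"op_a": "fixed op_a v1"}), (2, {"op_b": "fixed op_b v2"})]
cpu = boot(bios, os_updates)
print(cpu.revision, cpu.routine_for("op_a"), cpu.routine_for("op_b"))
```

In this model the OS's copy of revision 1 is skipped because the BIOS already loaded it, and only revision 2 is added on top; a freshly powered-on ToyCpu goes back to the ROM routines until the patches are replayed.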

amelius · on May 24, 2021

> Nowadays there’s hardly a device that can’t easily be updated after shipment - so the cost and effort required to make a perfect error-free CPU is not as incentivized.

This ignores the fact that there can be security exploits.

formerly_proven · on May 23, 2021

Are there ARM CPUs that have upgradeable microcode?

my123 · on May 24, 2021

Yes. NVIDIA Denver/Denver2/Carmel.

Those have microcode that is more extensive than that of traditional CPUs, though.