That's… an assumption. At least 3 motherboard vendors are affected, and going by the Gigabyte/MSI workarounds at the end of the article, it looks like things need to be adjusted away from Intel defaults.
…it'll need a statement from Intel for some clarity on this…
> "Intel's default maximum TDP for the 13900K is 253 watts, though it can easily consume 300 watts or more when given a higher power limit. In our testing, manually setting the power limit to 275–300 watts and the amperage limit to 350A, proved to be perfectly stable for our 13900K. That required going into the advanced CPU settings in the BIOS to change the PL1/PL2 limits — called short and long duration power limits in our particular case. The motherboard's default "Auto" power and current limits meanwhile created instability issues — which correspond to a power limit of 4,096 watts and 4,096 amps." [0]
The motherboard manufacturers are setting default/auto power and current limits that are way outside of Intel's specs (253 W, 307 A) [1].
Drawing peak power far in excess of the TDP is what all Intel processors have been designed to do for many years now.
Some consider it cheating the benchmarks, but the justification is that TDP is the Thermal Design Power. It's about the cooling system you need, not the power delivery. If you make reasonable assumptions about the thermal inertia of the cooling system you can Turbo Boost at higher power and hope the workload is over before you are forced to throttle down again.
Any mainboard that sets power limits to the TDP would be considered wrong by both the community and Intel. This looks like a solid indication that the issue is with Intel.
> It's not exactly clear why the 13900K suffers from these instability problems, and how exactly downclocking, lowering the power/current limits, and undervolting prevent further crashes. Clearly, something is going wrong with some CPUs. Are they "defective" or merely not capable of running the out of spec settings used by many motherboards?
> Are they "defective" or merely not capable of running the out of spec settings used by many motherboards?
I'd wager good money on the latter. Why would Intel validate their CPUs against power and current limits that are outside of spec? The users reporting issues probably have CPUs that only just made it into the performance envelope to be binned as a 13900K, so running out-of-spec settings on these weaker chips results in instability.
It's cases like this where I wish Intel didn't exit the motherboard space, they were known to be reliable but typically at the cost of having a more limited feature set.
Don't guess, measure! The proper action here would be to change BIOS settings from their default / "auto" settings to per-Intel-spec safe ones. Same for RAM, and on systems with known good power supplies, CPU cooling, software installs etc. Then one of the following will happen:
a) BIOS ignores user settings & problem persists.
b) BIOS applies user settings & problem goes away.
c) BIOS applies user settings but problem persists.
Cases a & b count as "faulty BIOS" (caused by the motherboard manufacturer). Case c counts as "faulty CPU", and a replacement CPU may or may not fix that.
No need to guess. Just do the legwork on systems where problem occurs & power supply, RAM, CPU cooling & OS install can be ruled out. Sadly, no doubt there's many systems out there where that last condition doesn't hold.
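The a/b/c triage above can be written down as a small decision function. This is only a sketch: the function name and the two boolean inputs are made up for illustration, and it assumes the power supply, RAM, cooling and OS install have already been ruled out.

```python
def classify(settings_applied: bool, problem_persists: bool) -> str:
    """Map one test run onto cases a/b/c from the comment above.

    settings_applied: did the BIOS actually honor the in-spec limits?
    problem_persists: does the crash still reproduce afterwards?
    """
    if not settings_applied and problem_persists:
        return "faulty BIOS"   # case a: BIOS ignores user settings
    if settings_applied and not problem_persists:
        return "faulty BIOS"   # case b: in-spec settings fix the problem
    if settings_applied and problem_persists:
        return "faulty CPU"    # case c: in-spec settings, still unstable
    # Settings ignored but the problem went away: not one of the listed cases.
    return "inconclusive"
```

The fourth branch is not in the original list; it is there only so the function returns something for every input.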
I have a 13900K. The default BIOS settings set a maximum wattage of 4096W (!!!) that makes Prime95 fail. If I change the settings back to 253W, what Intel says is the maximum wattage, Prime95 stops failing.
Still, I don't know if I should RMA. I got the K version because I intended to overclock in the future. And all of this sounds like I won't be able to. I think increasing the voltage a little bit makes the system more stable. I have to play with it. (Really, if someone can say whether I should RMA or not, I would appreciate some input)
Edit: decided to RMA. I have no patience for a CPU that cost me +600€
I don't disagree, but I'm cautious about making a call with the current information available. For example: yes, a "4096W / 4096A" power limit sounds odd, but it's not an automatic conclusion that this limit is intended to work to protect the CPU. Instead, it is a function that allows building a system with a particular PSU dimension — it would be odd if that were overloaded to protect the chip itself. Maybe it is, maybe it isn't.
It's also very much possible that the M/B vendors altered other defaults, but… I don't see information/confirmation on that yet. It used to be that at least one of the settings is the original CPU vendor default, but last I looked at these things was >5 years ago :(.
> It's cases like this where I wish Intel didn't exit the motherboard space,
Modern CPUs have many limits to protect the CPU and alter the clock behavior. For example: clock limit, current limit (IccMax, 307 A for the 13900K), long power limit (PL1, 125 W), short power limit (PL2, 253 W), transient peak limit (PL3), overcurrent limit (PL4), thermal limit (TjMax, 100 °C), Fast Throttle threshold (aka Per-Core Thermal Limit, 107 °C), etc. They also have voltage/frequency curves (V/F curves) that map how much voltage is needed to drive a certain frequency.
The Intel 13900K has a fused V/F curve up to its maximum Turbo Boost 2.0 (5.5 GHz) on all cores, and up to its Thermal Velocity Boost (5.8 GHz) on two cores (aka favored cores). How much it boosts depends on the By Core Turbo Ratio. For a stock 13900K, this is 5.8 GHz for 2 cores, and 5.5 GHz for up to 8 cores, with E-cores capped at 4.3 GHz.
As you may have noticed, the CPU has a very coarse Turbo Ratio beyond the first 2 cores. This is to allow the clock to be regulated by one of the limits rather than a fixed number. In reality, 253W PL2 can sustain around 5.1 GHz all P-cores, and after 56 seconds it will switch to 125W PL1 which should give it around 4.7 GHz-ish (IIRC).
This is why, when a motherboard manufacturer decides to set PL1=PL2=4096 without touching the other limits, it results in higher benchmark numbers. The CPU will consume as much power as it can to boost to 5.5 GHz, until it hits one of the other limits (usually the 100 °C TjMax). This is how we ended up in this mess in the consumer market.
Xeon, on the other hand, has a very conservative and granular Turbo Ratio. My Xeon w9-3495X does have a fused All Core Boost that does not exceed PL1 (56 cores at 2.9 GHz and 350W), which makes PL2 exist only for AVX512/AVX workloads.
(Side note: I've always thought that PL1=PL2=4096W is dumb since the performance gain is marginal at best, and I always set PL1=PL2=253W on all machines that I assemble. I think even PL1=PL2=125W makes sense for most usage. I do overclock my Xeon to sustain PL1=PL2=420W though (this is around 3.6 GHz, which is enough to make it faster than the 64-core Threadripper 5995WX).)
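To make the PL1/PL2 behavior concrete, here is a deliberately simplified sketch. It follows the description above literally (PL2 for the first 56 seconds, then PL1); real hardware is understood to track a running average of power rather than a hard timer, so treat the step function as an assumption. The function name and defaults are made up for illustration.

```python
def allowed_power_w(t_s: float,
                    pl1_w: float = 125.0,
                    pl2_w: float = 253.0,
                    tau_s: float = 56.0) -> float:
    """Package power budget at t_s seconds into a sustained load.

    Simplified step model: the short-duration limit (PL2) applies for the
    first tau_s seconds, after which the long-duration limit (PL1) applies.
    """
    return pl2_w if t_s < tau_s else pl1_w


# Intel-spec limits: the budget drops once the boost window is over.
stock = [allowed_power_w(t) for t in (10, 60, 600)]

# "Auto" board defaults (PL1 = PL2 = 4096 W): the budget never drops, so
# only the other limits (temperature, current) are left to hold the CPU back.
unlimited = [allowed_power_w(t, pl1_w=4096.0, pl2_w=4096.0) for t in (10, 60, 600)]
```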
Jesus. By German electrical code, you need a 70 mm² cross-section of copper to transfer that kind of current without the cable heating up to a point that it endangers the insulation. How do mainboard manufacturers supply that kind of current without resistive loss from the traces frying everything?
The traces are extremely short. Look at a modern motherboard and you'll find a bank of capacitors and regulators about 2cm away from the CPU socket.
If you've got 4 layers of 2oz copper, and you make the positive and negative traces 10mm wide, you'll only be dissipating 28 watts when the CPU is dissipating 300 watts. And most motherboards have more than 4 layers and have space for more than 10mm of power trace width. And there's a bunch of forced air cooling, due to that 300 watts of heat the CPU is producing.
Electrical code doesn't let buildings use cables that dissipate 28 watts for 2cm of distance because it would be extremely problematic if your 3m long EV charge cable dissipated 4200 watts.
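For anyone who wants to check the order of magnitude, here is a rough sketch of the trace-loss arithmetic. The assumptions are mine, not the parent's: 350 A (the amperage figure quoted upthread), room-temperature copper resistivity, 1 oz copper being about 35 µm thick, and both the positive and the return path each getting the full 4 layers at 10 mm width. Change any of these and the number moves.

```python
COPPER_RESISTIVITY_OHM_M = 1.68e-8   # approximate value near 20 °C
THICKNESS_PER_OZ_M = 35e-6           # 1 oz/ft² copper is roughly 35 µm thick


def trace_loss_w(current_a: float, length_m: float, width_m: float,
                 layers: int, oz_per_layer: float, conductors: int = 2) -> float:
    """Resistive loss (I²R) in a flat copper power path.

    conductors=2 counts both the positive and the return path.
    """
    area_m2 = width_m * layers * oz_per_layer * THICKNESS_PER_OZ_M
    resistance_ohm = COPPER_RESISTIVITY_OHM_M * length_m / area_m2
    return conductors * current_a ** 2 * resistance_ohm


# 2 cm of 10 mm wide trace on 4 layers of 2 oz copper at 350 A.
board_loss = trace_loss_w(350, 0.02, 0.010, layers=4, oz_per_layer=2)

# Same cross-section stretched to a 3 m cable: the loss scales with length.
cable_loss = trace_loss_w(350, 3.0, 0.010, layers=4, oz_per_layer=2)
```

Under these assumptions `board_loss` comes out at roughly 29 W, the same ballpark as the 28 W figure, and the 3 m case is 150 times that, which is where the kilowatt-scale number for a long cable comes from.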
That code is for round wire (minimal surface area per volume) that can be placed inside insulation in walls.
This 350A is flat conductors (maximal surface area thus heat dissipation) and very short (not that much power to dissipate so the things it connects to have a significant effect on heat dissipation).
Bursty current spikes, short and fat traces, using the motherboard as a heat sink, active cooling, and allowing the temperature to rise quite a bit. If you look at thermal camera videos[0], it's pretty clear where all the heat is going (although a significant part of that is coming from the voltage regulators).
On the other hand, your national electrical code is going to assume you're running that 350A cable at peak capacity 24/7, right next to other similarly-loaded cables, stuffed in an insulated wall, for very long runs - and it still has to remain at acceptable temperatures during a hot summer day.
I agree. However voltage is relevant for insulation, which also affects how heat can dissipate for the wire, and might also be relevant for a failing wire, when higher and higher voltage can build up at the point of failure (not sure if it's a common engineering consideration outside of fuses, which are designed to fail).
At 300W there's only so much power that can become heat. With 80 kW flowing through the wire, if the insulation melts due to excessive wire resistance you call the fire brigade.
> The motherboard manufacturers are setting default/auto power and current limits that are way outside of Intel's specs
The CPU only draws as much power as it needs, though?
I mean, if you plug a 20 watt phone into a 60 watt USB-C power supply, or a 60 watt laptop into a 100 watt USB-C power supply the device doesn't get overloaded with power. It draws no more current than it needs.
The motherboard's power limits should state the amount of power the PCB traces and buck regulators are rated to provide to the socket - and if that's more than the processor needs that's good, as it avoids throttling.
* User is running a demanding application (e.g. game).
* CPU clock speed increases (turbo boost), as long as the CPU isn't hitting: 1) Tj_MAX (max temp before thermal throttling kicks in); 2) the power and current limits specified by the motherboard (in this case, effectively disabled by the out of spec settings).
* Weaker chips will require more power to hit or maintain a given turbo clock speed: with the power and current limits disabled, the CPU will attempt to draw out of spec power & current, causing issues for the on die fully-integrated voltage regulator (noting that there's also performance/quality variance for the FIVR), resulting in the user experiencing instability.
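A rough way to see why a weaker chip draws more power at the same clock: to first order, dynamic CMOS power scales with voltage squared times frequency. That is a textbook approximation that ignores leakage, and the function below is just an illustrative sketch with a made-up name, not anything from Intel.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float = 1.0) -> float:
    """First-order dynamic power relative to a baseline chip (P ~ V^2 * f)."""
    return voltage_scale ** 2 * frequency_scale


# A chip that needs 5% more voltage to hold the same turbo clock.
weaker_chip = relative_dynamic_power(1.05)
```

So a few percent of extra voltage turns into roughly twice that in extra power, and with the power and current limits effectively disabled nothing stops the chip from asking for it.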
Boss says, "Do the thing." Engineer says, "The thing is out of spec!" Boss says, "Competitor is doing the thing already and it works." Engineer does the thing.
Yes, but also no. The motherboard market is "timing-competitive", your product needs to be ready when the CPU launches, especially for the kind of flagship CPU that this specific issue is about. You can't wait and see what the competitors are doing.
all three motherboard vendors enabling some out-of-spec defaults wouldn't actually be surprising, though?
people forget, blowing up AM5 cpus wasn't just an Asus thing... they were just the most ham-handed with the voltages. Everyone was operating out of spec, there were chips that blew up on MSI and Gigabyte boards, and it wasn't just X3D either.
Intel is no different - nobody enforces the power limit out of the box, and XMP will happily punch voltages up to levels that result in eventual degradation/electromigration of processors (on the order of years). Every enthusiast knows that CPU failures are "rare" and yet either has had some, or knows someone who's had some in their immediate circles. Because XMP actually has caused non-trivial degradation even on most DDR4 platforms.
In fact it's entirely possible that this is an electromigration issue right here too - notice how this affects some 13700Ks and 13900Ks too? Those chips have been run for a year or two now. And if the processors were marginal to begin with, and operated at out-of-spec voltages (intentionally or not)... they could be starting to wear out a little bit under the heaviest loads. Or the memory controllers could be starting to lose stability at the highest clocks (under the heaviest loads). That's a thing that's not uncommon on 7nm and 5nm tier nodes.
This is blowing up now, but the first report of this kind of issue that reached me (I'm the current Oodle maintainer) was in spring of last year. We've been trying to track it down (and been in contact with Intel) since then. The page linked in the OP has been up since December.
Epic Games Tools is B2B and we don't generally get bug reports from end users (although later last year, we did have 2 end users write to us directly because of this problem - first time this has happened for Oodle that I can think of, and I've been working on this project since 2015). Point being, we're normally at least one level removed from end user bug reports, so add at least a few weeks while our customers get bug reports from end users but haven't seen enough of them yet to get in touch with us (this is a rare failure that only affects a small fraction of machines).
13900Ks have been out since late Oct 2022. It's possible that this doesn't show up on parts right out of the box and takes a few months. It's equally plausible that it's been happening for some people for as long as they've had those CPUs, and the first such customers just bought their new machines late 2022, maybe reported a bug around the holidays/EOY that nobody looked at until January, and then it took another 2-3 months for 3-4 other similar crashes to show up that ultimately resulted in this case getting escalated to us.