Chip aging becomes design problem (2018) (semiengineering.com)
63 points by CTOSian on Jan 12, 2022 | 38 comments



2-3 years for a consumer device and 10 for a telecommunications device both seem massively too short, by a factor of 2 or more.

Do we really want our devices to stop working physically, at the chip level, in less than a decade? Some people still play on game consoles decades old; these are most certainly consumer devices.

I’d much rather we over-design stuff to last decades, at least at the chip level where over-designing is super cheap.

Solid state electronics was supposed to mean longer life for everything. No tubes to burn out, no mechanical parts to wear out. It would suck so bad if we nickel-and-dimed what should be fundamentally, physically robust devices into lasting much less time than the complicated mechanical devices of the past.


Note that this is the minimum lifetime. Semiconductor aging is a statistical process, some devices will happily last much longer, but there is a design minimum lifetime that has to be accounted for and met.

Also, the aging calculations are done under worst possible conditions, i.e. the computation assumes 2 years of active life for a consumer device at the highest operating voltage, highest operating frequency, and highest operating temperature. Time spent powered down or sleeping doesn't age the device, so it doesn't count against the lifetime. The actual working life in your hands under regular usage will be (much) longer. There's a reason your average phone works fine for 5+ years from a hardware perspective, even if the software side sometimes has shorter planned obsolescence (lack of updates, or updates that demand more from the device than that generation of chip was capable of delivering, etc.)
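
A back-of-the-envelope version of that duty-cycle argument (a sketch with made-up but plausible numbers, not the actual qualification math, which also derates voltage and temperature separately):

    # Rough sketch: turning a worst-case active-life budget into calendar
    # life under realistic usage. All numbers are illustrative assumptions.
    design_active_life_years = 2.0  # budget at max voltage/frequency/temperature
    duty_cycle = 0.15               # fraction of the day spent fully active
    stress_factor = 0.5             # typical use ages the chip slower than
                                    # worst case (0.5 = half speed; assumed)

    calendar_life = design_active_life_years / (duty_cycle * stress_factor)
    print(f"~{calendar_life:.0f} years of calendar life")  # ~27 years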

For telecommunications, or automotive usage, and in particular for anything that might be failure critical, the testing and metrics are much more onerous, and the minimum guaranteed lifetimes higher. This partly explains why such devices, even when manufactured at advanced nodes, don't exhibit the same performance as consumer devices: some of the performance is being held in reserve to account for the natural degradation due to aging.

Source: I'm one of the people quoted in the original article.


Solid state electronics was supposed to mean longer life for everything.

I remember the same being said about LEDs (and CFLs before them) for lighting: that they would have a much longer life. In practice they didn't, precisely because making something that lasts longer means less profit in the long term.


They do seem to last longer, but there's a wide distribution of lifetimes. I've had bad batches that lasted only months, but also bulbs that have lasted years, or even over a decade of use.


In my experience it’s down to the race to the bottom: cheap power supplies, poor heat sinking, overdriving the LEDs so they can use fewer of them but keep the same lumens. People don’t seem to mind too much when replacements cost a couple of bucks.

Better-quality bulbs tend to last much longer. Personally I’m waiting for the “Dubai Lamps” to become available here, and I’ll probably slowly switch over to those (shame they are not dimmable). Although I have to admit the supermarket own-brand LED bulbs that I currently have in use have put up a much better fight than I initially gave them credit for.

https://hackaday.com/2021/01/17/leds-from-dubai-the-royal-li...

https://www.mea.lighting.philips.com/consumer/dubai-lamp


Some previous discussion on that here:

https://news.ycombinator.com/item?id=27093793

It's interesting that the Dubai Lamps only use a relatively simple capacitive dropper with a linear post-regulator --- so it's probably actually cheaper to produce than some of the other LED bulbs with complex switchmode power supplies.

The cheapest LED bulbs use a capacitive dropper too, but they drive the LEDs very hard; bigclivedotcom (same person who did the video featured in hackaday there) has a few other videos showing how they can be modified for lower power consumption and thus correspondingly long life.
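
For a sense of scale, here is a minimal sketch of how the dropper capacitor sets the LED current (component values are examples, not taken from the Dubai Lamp's actual design):

    import math

    # A series ("dropper") capacitor limits mains current by its reactance.
    # Example values only; neglects the LED string's forward voltage drop.
    v_mains = 230.0     # RMS mains voltage
    freq = 50.0         # mains frequency, Hz
    c_dropper = 330e-9  # dropper capacitance, farads

    xc = 1.0 / (2 * math.pi * freq * c_dropper)  # capacitive reactance, ohms
    i_led = v_mains / xc                         # approx. RMS string current
    print(f"~{i_led * 1000:.0f} mA")             # ~24 mA: gentle drive, long life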


> so it's probably actually cheaper to produce than some of the other LED bulbs with complex switchmode power supplies.

Nothing wrong with capacitive droppers for things like LED bulbs, my point was more about the quality of the components used in the capacitive dropper.

It’s my own anecdotal evidence, but I think I’ve had only one LED bulb where the LEDs themselves failed; the rest have been bad power supplies (well, not even “bad”, more “aged out”). Because of their construction, fixing the bulb is often more hassle than it’s worth, except for the learning experience.

If you are making such products and competing in the low-price-point market, shaving cents off the cost of your caps soon adds up. However, once you are no longer counting pennies, you don’t mind splashing out on better-quality caps.


In my experience, the LEDs themselves are rarely dead; it's usually other components that fail.


>"Do we really want our devices to stop working physically, at the chip level, in less than a decade? Some people still play on game consoles decades old; these are most certainly consumer devices."

Many components in your consumer devices age; my greatest concern is electrolytic capacitors (specifically the tantalum ones). I think many game consoles and computers with switched-mode power supplies are unlikely to last decades. If you want a simple way to get your devices to last longer, I suggest that you pick ones with external (brick) power supplies, as the SMPS is likely to be one major cause of failures.

>"I’d much rather we over-design stuff to last decades, at least at the chip level where overedesigning is super cheap."

I am not sure that people are willing to pay significantly more for these longer-lasting devices, and all that extra effort will be wasted if the devices are scrapped prematurely. It may be more (environmentally) efficient to simply replace the devices upon failure rather than over-designing them.


I've now had two phones, one with mild and one with bad water damage, whose cellular connections stopped working while WiFi and BT work just fine. Anecdotally, I'd say this may have something to do with the RFFE's analog components, and that waterproofing seems like a very worthwhile thing to aim for in handheld electronics.

I suppose what I'm saying here isn't new; the environmental factors a device is most likely to encounter are something people do account for. I bought a power supply rated for 50°C ambient because I didn't want to worry about heat, plus a waterproof phone and a better cooler for overclocking. I imagine such buying will continue to outweigh any concerns about 5nm chips not lasting long enough for some time yet. Which is not to say that semi companies don't have challenges in front of them.


> Do we really want our devices to stop working physically, at the chip level, in less than a decade?

Unfortunately, "we" aren't really considered in the decisions. The math from a company's perspective is mostly, "Will this last the expected warranty period?" And, to a lesser extent, "Will the ill will caused by these catastrophically failing right outside warranty be a problem?" See RRoD and such.

But the problem is that "solid state" does wear out - this article is a list of causes of it. It's reasonably true that the older solid state technologies lasted extremely long, but as you start to push them, they don't last as long, and even things like power transistors eventually start to fail - there are a few companies that rebuild old Tesla Roadster power conversion equipment, because the transistors wear out and fail (at least one of those companies seems to regularly burn down their shop as well).

> It would suck so bad if we nickel and dimed what should be fundamentally physically robust devices to last for much less time than complicated mechanical devices from the past.

For you, sure. For the people in business selling replacement, a widget that lasts 30 years is an annoying pain in the ass to them.

GMC used to build transit buses (beautiful aluminum chassis, slanted windows, the works), and eventually stopped. Talking to the bus mechanics when I used to drive, their theory was that GMC couldn't sell new ones, because damned near every single one they'd made was still on the road. They didn't corrode, and a properly maintained chassis would last basically forever. In the early 2000s, we had 40 year old buses that were on their 4th engine, 3rd transmission, up near a million miles. They just kept trucking along.

Now, it seems consumer stuff is lucky to last 5 years before needing major repairs.


Many of these effects are highly temperature- and voltage-dependent, so they're far more likely to show up in chips that are overvolted, overclocked, and inadequately cooled. But we've yet to see overclocking enthusiasts run into these reliability concerns, so my best guess is that the safety factors embedded in current designs are enough to make them less of an issue.


I'm an EE, and I've never had a chip fail while operating inside its recommended parameters and properly protected; I have devices in the field over 20 years old. I do see failures when poor soldering or a bad electrolytic compromises the power supply. My main worry is that some flash memory is only guaranteed for a minimum of 25 years. I'm not sure how they determined that, but I've yet to see any issues.

I would hate to be on a generational spaceship or the first settler on another planet with currently designed commercial stuff.


Flash is indeed a weak link. I used to put together embedded Linux systems out of commodity server components 15 years ago, and over the years we lost many flash chips (although they would often continue functioning in a degraded, read-only state) due to the high logging rate of Linux combined with the low write endurance of typical USB flash disks. You can now get more expensive single-level-cell (or pseudo-SLC) chips that last an extremely long time.
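
The endurance arithmetic is simple enough to sketch (all figures below are order-of-magnitude assumptions, not any specific part's datasheet):

    # Crude flash wear-out estimate, assuming perfect wear leveling.
    capacity_gb = 4          # device capacity
    pe_cycles = 3_000        # program/erase cycles per block (MLC-ish; SLC ~100k)
    write_mb_per_day = 500   # e.g. chatty syslog on a default install

    total_write_mb = capacity_gb * 1024 * pe_cycles
    life_years = total_write_mb / write_mb_per_day / 365
    print(f"~{life_years:.0f} years")  # ~67 years on paper, but without wear
    # leveling a hot log block absorbs all the writes and dies thousands of
    # times sooner; write amplification makes it worse still.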

Old Tesla Model S cars had this same problem with the infotainment system flash, but Tesla will now fix it for you for free, since the infotainment system contains some features NHTSA considers safety-essential (wipers, etc.).


I'm only buying high-endurance memory now. I've had enough cheap USB and SD cards die on me; losing my data is not worth saving $10.


Good flash at least goes read-only (SanDisk USB sticks, in my experience); bad flash, not so much. The controller probably plays a part as well.


> But we're yet to see overclocking enthusiasts run into these reliability concerns

Isn’t that also because said enthusiasts tend to use massive coolers and monitor temperatures to keep their CPUs as cool as possible?

I know that I, for one, would never let a CPU in my desktop get over 80°C, but it’s common for laptops to hit 100°C.


Probably. From a follow-on article:

"The biggest factor is heat. “Higher speeds tends to produce higher temperatures and temperature is the biggest killer,” says Rita Horner, senior product marketing manager for 3D-IC at Synopsys. “Temperature exacerbates electron migration. The expected life can exponentially change from a tiny delta in temperature.”"

https://semiengineering.com/aging-problems-at-5nm-and-below/


"overvolted, overclocked and inadequately cooled" - sounds like a default GPU setup (only half-joking). IIRC the optimal performance per watt for a modern consumer GPU is at half the stock TDP, according to MonsterLabo.


People have experimented with GPUs that were used for crypto mining (kind of a worst-case scenario) and haven't managed to find any reliability problems so far; the cards seem to work just fine.


They also undervolt their cards, because it's more efficient.


> we're yet to see overclocking enthusiasts run into these reliability concerns

Several people on Reddit reported that their Zen 2 CPUs degraded within a few months when overvolted.


They don't stop working after 2-3 years; they just don't work as fast. The electronics industry wants to promote product turnover, so they do this. Or they simply don't work on long-term durability.

My take: for as long as recycling performance is terrible, this should be a no-no. Social movements should start demanding better products.


> I’d much rather we over-design stuff to last decades, at least at the chip level where over-designing is super cheap.

Engineers are working on it.

From a follow-on article by the same author:

"An emerging alternative is to build aging sensors into the chip. “There are sensors, which usually contain a timing loop, and they will warn you when it takes longer for the electrons to go around a loop,” says Arteris IP’s Shuler. “There is also a concept called canary cells, where these are meant to die prematurely compared to a standard transistor. This can tell you that aging is impacting the chip. What you are trying to do is to get predictive information that the chip is going to die. In some cases, they are taking the information from those sensors, getting that off chip, throwing it into big database and running AI algorithms to try to do predictive work.”

https://semiengineering.com/aging-problems-at-5nm-and-below/
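
Conceptually, the timing-loop sensor boils down to something like this toy sketch (thresholds and readings are hypothetical, not any vendor's actual sensor IP):

    # Toy ring-oscillator aging monitor: compare the loop period measured
    # now against the period recorded at time zero, and raise an alarm when
    # the slowdown crosses a guard-band threshold.
    BASELINE_PERIOD_PS = 100.0  # loop delay at burn-in (hypothetical)
    GUARD_BAND = 0.05           # warn at 5% slowdown (hypothetical margin)

    def aging_alarm(measured_period_ps: float) -> bool:
        """True once the timing loop has slowed past the guard band."""
        slowdown = (measured_period_ps - BASELINE_PERIOD_PS) / BASELINE_PERIOD_PS
        return slowdown > GUARD_BAND

    for period in (100.2, 102.0, 106.5):  # readings over the chip's life
        print(period, "->", "AGING ALARM" if aging_alarm(period) else "ok")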


>Solid state electronics was supposed to mean longer life for everything.

Sorry to be cynical, but says who? The marketing department? Product lifecycle is a function of engineering design, not necessarily material choice. A Detroit Diesel engine will run basically forever, yet I would be surprised if many of the newest electronic "no moving parts" widgets last more than a few years. If fewer moving parts were inherently more reliable, then ask an EV owner why charging points are so frequently broken. After all, they only have one moving part!

The last few days before the Anti Singularity will be terrible. We will be desperately trying to complete the next generation of engineering designs, while our current systems crumble and age before our very eyes.


How long should I expect to be able to use a new 5nm CPU (at reasonable temperatures) before these issues are likely to make it fail?

All of the desktop/laptop CPUs that I currently use are 14nm, and I think the oldest is around 7 years old and still working fine. In the past I've tended to use personal machines for around a decade, and I don't really have any desire to move to a shorter cycle. Better battery life is great, but most things are already plenty thin and fast.


It's all about duty cycle. Idle browsing is not taxing.


Reminds me of

"Cores that don't count": https://research.google/pubs/pub50337/

>We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent'': the only symptom is an erroneous computation.

>We refer to a core that develops such behavior as "mercurial.'' Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem -- one that will require collaboration between hardware designers, processor vendors, and systems software architects.

>We have observed various kinds of symptoms caused by mercurial cores.

> Violations of lock semantics leading to application data corruption and crashes.

> Data corruptions exhibited by various load, store, vector, and coherence operations.

> A deterministic AES mis-computation, which was “self inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

> Corruption affecting garbage collection, in a storage system, causing live data to be lost.

> Database index corruption leading to some queries, depending on which replica (core) serves them, being non deterministically corrupted.

> Repeated bit-flips in strings, at a particular bit position (which stuck out as unlikely to be coding bugs).

> Corruption of kernel state resulting in process and kernel crashes and application malfunctions

___________

>Not all mercurial-core screening can be done before CPUs are put into service – first, because some cores only become defective after considerable time has passed,


I wondered if this might be a reason for Intel disabling AVX-512 on its most recent chips.

>Electromigration is the movement of atoms based on the flow of current through a material. If the current density is high enough, the heat dissipated within the material will repeatedly break atoms from the structure and move them. This will create both ‘vacancies’ and ‘deposits’. The vacancies can grow and eventually break circuit connections resulting in open-circuits, while the deposits can grow and eventually close circuit connections resulting in short-circuit...

In Black’s equation, which is used to compute the mean time to failure of metal lines, the temperature of the conductor appears in the exponent, i.e. it strongly affects the MTTF of the interconnect

https://www.synopsys.com/glossary/what-is-electromigration.h...
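
Black's equation itself is easy to evaluate; the sketch below shows how strongly the exponential temperature term bites (the parameter values are common textbook figures, not for any particular process):

    import math

    K_EV = 8.617e-5  # Boltzmann constant, eV/K

    def mttf_black(j, temp_k, a=1.0, n=2.0, ea=0.7):
        """Black's equation: MTTF = A * J**-n * exp(Ea / (k*T)).
        n (current-density exponent) and ea (activation energy, eV) are
        typical textbook values, not process-specific."""
        return a * j ** -n * math.exp(ea / (K_EV * temp_k))

    # Same current density, 10 degrees C hotter (85C vs. 75C):
    ratio = mttf_black(1.0, 358.15) / mttf_black(1.0, 348.15)
    print(f"{ratio:.2f}x")  # ~0.52x: 10 degrees costs roughly half the lifetime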

Intel's new chips already run hot and running AVX-512 instructions has required increasing the voltage.

>One of the big takeaways from our initial Core i7-11700K review was the power consumption under AVX-512 modes, as well as the high temperatures. Even with the latest microcode updates, both of our Core i9 parts draw lots of power. The Core i9-11900K in our test peaks up to 296 W, showing temperatures of 104ºC, before coming back down to ~230 W and dropping to 4.5 GHz.

There are a number of ways to report CPU temperature. We can either take the instantaneous value of a singular spot of the silicon while it’s currently going through a high-current density event, like compute, or we can consider the CPU as a whole with all of its thermal sensors. While the overall CPU might accept operating temperatures of 105ºC, individual elements of the core might actually reach 125ºC instantaneously. So what is the correct value, and what is safe?

https://www.anandtech.com/show/16495/intel-rocket-lake-14nm-...



Anyone else holding a Sun Ultra-5 or Ultra-10 where the MAC address has zeroed out?

Sometimes, the reason you can't bootstrap is as simple as a soldered-on battery backup.

(there's a hex boot load sequence to give your host a self-assigned MAC and get over this problem)


Semiconductor newbie here.

Reading this, I'm wondering: what failures in a transistor (field-effect transistor) can be tolerated by the circuit?

I've forgotten everything from my analog and digital circuit classes in college, but I don't recall there being any redundancy in circuit designs.

Is redundancy built into the circuit design, and therefore into the chip?


For some circuits, there can be substantial redundancy. This includes memory, disks and networks. Sometimes that redundancy is in hardware, sometimes software and sometimes a mix.

For other hardware, there is almost none of that redundancy. In those parts of the system, you depend on multiple components all working correctly, with no chance of detecting errors at the circuit level. This means that if you have 10 components with an expected life of 6-20 years (with a mean of 10), you can expect an actual system life of about 6 years, not the mean life at all. The weakest link and all that.
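
The weakest-link effect is easy to see in a quick simulation (the lifetime distribution below is an assumption chosen to roughly match the 6-20 year, mean-10 figures above):

    import random

    # Weakest-link demo: a non-redundant system dies with its first component.
    random.seed(1)

    def component_life():
        # Crude clamped normal, roughly "6-20 years with a mean around 10".
        return min(max(random.gauss(10, 3), 6), 20)

    systems = [min(component_life() for _ in range(10)) for _ in range(100_000)]
    print(f"mean system life: {sum(systems) / len(systems):.1f} years")  # ~6, not 10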


Planned obsolescence.

This may well end up with CPU vendors adopting food storage cant.

"Ignel i11-23017K, MFD: 2025.05.01, EXP: 2026.09.03, consume within 3 months of first power-on".


So, how serious is it for current and upcoming hardware, realistically?


It's a real problem for smaller nodes; 7nm and below starts really cutting into expected lifetime.


This is a big problem for automotive. The average age of cars in the US is over 12 years now. Which is a technical achievement. It was around 6 in the 1970s. Longer for trucks.

Automotive electronics design lives used to be much longer. I happen to know that the design life for the Ford EEC IV engine control unit from the 1980s was 30 years. That's been exceeded in the field; many 1980s Ford trucks with that unit are still running.

There are now cars running electronics that's way overkill in performance, probably at the cost of lifetime. Unreal Engine in the dashboard is a thing.

Realistically, you need 20 years of life in automotive electronics.


It would be fantastic, market-changing even, to know what lifetime cars or other consumer goods were actually designed for, but I don't have much hope for that.

Honestly, I wouldn't buy a car if I knew its electronics were only designed to last 20 years. I wouldn't expect it to be reliable as it approaches that age. Mind you, I drive a 1994 Toyota Corolla, which has needed about $50 of repairs in the last decade, so I have high standards.



