I provisioned, administered, and used so many Sun systems. I loved them so much that I still keep many Sun workstations and servers in my basement in case I need a nostalgia kick. I tried to use Marie Kondo's KonMari method to get rid of them, but it just didn't work... they all SPARC joy.
Not the 10000, but I admin'd a 4500 back in 1999 at Bristol-Myers Squibb at the ripe old age of 21. It was running Sun's mail server, which required constant care and feeding to serve our 30,000+ users with anything approaching reliability.
One time it just stopped responding, and my boss said "now, pay attention" and body-checked the machine as hard as he could.
It immediately started pinging again, and he refused to say anything else about it.
Amiga had a similar issue. One of the chips (Fat Agnus, IIRC?) didn't quite fit in the socket correctly, and a common fix was to pull out the drive mechanisms and drop the chassis something like a foot onto a carpeted floor.
Somewhat related: one morning I was in the office early, and an accounting person came in and asked me for help; her computer wouldn't turn on, and I was the only other one in the office. I went over, poked the power button, and nothing happened. This was a PC clone. She had a picture of her daughter on top of the computer, so I picked it up, gave the computer a good solid whack on the side, set the picture back down, and poked the power button. It came to life.
Ah, percussive maintenance! Also good for reseating disks that just don’t quite reliably get enumerated: slam the thing back in. I had to do something similar on a power supply for a V440; thankfully it was a month or so away from retirement, so I didn’t feel too bad giving it some encouragement like that. Great machines.
Throughout the late 90s, “Mail.com” provided white-label SMTP services for a lot of businesses, and was one of the early major “free email” providers. Each free user had a storage limit of something like 10MB, which was plenty in an era before HTML email and attachments were commonplace. There were racks upon racks of SCSI disks from various vendors for the backend - but the front end was all standard Sendmail, running on Solaris servers.
I worked at a competing white-label email provider in the 90s, and even then it seemed obvious that running SMTP on a Sun Enterprise was a mistake. You're not gaining anything from its multiuser single-system scalability. I guess it stands as an early example of the pets/cattle debate. My company was firmly on the cattle side.
I was just the teenage intern responsible for doing the PDU cabling every time a new rack was added, since nobody on the network or software engineering teams could fit into the crawl spaces without disassembling the entire raised floor.
I do know that scale-out and scale-up were used for different parts of the stack. The web services were all handled by standard x86 machines running Linux - and were all netbooted in some early orchestration magic, until the day the netboot server died. I think the rationale for the large Sun systems was the amount of memory that they could hold - so the user name and spammer databases could be held in-memory on each front end, allowing for a quick ACCEPT or DENY on each incoming message before saving it out to a mailbox via NFS.
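If it helps to picture it, the front-end logic was conceptually something like this - a rough Python sketch of my own, with made-up names and paths, not anything from the actual codebase:

    # Illustration only: in-memory lookups decide ACCEPT/DENY before the
    # (comparatively slow) NFS write ever happens. Names and paths are invented.
    import os

    VALID_USERS = set()      # loaded into RAM at startup from the user database
    KNOWN_SPAMMERS = set()   # likewise for the spammer database
    MAIL_ROOT = "/nfs/mailboxes"   # NFS-mounted spool on the backend storage

    def handle_message(rcpt_user, sender_addr, body):
        # Both checks are plain set lookups, so the decision costs microseconds
        # and never waits on a database round trip.
        if rcpt_user not in VALID_USERS:
            return "550 No such user"
        if sender_addr in KNOWN_SPAMMERS:
            return "554 Denied"
        # Only an accepted message touches the NFS mount.
        mbox = os.path.join(MAIL_ROOT, rcpt_user)
        with open(mbox, "a") as f:
            f.write(body + "\n")
        return "250 OK"

The point being that the only slow operation, the NFS write, happens only for messages that have already been accepted.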
Makes sense, there are a lot of reasons why having some "big iron" might have been practical in that era. x86 was not a full contender for many workloads until amd64, and a lot of the shared-nothing software approaches were not really there until later.
I used to love working with E10k/E15k boxes. I was a performance engineer for a telco software provider, and it was so much fun squeezing every single thing out of the big iron.
It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware. I saved telcos millions upon millions in my early career. I’d jump straight into it again if a job came up, so much fun.
I used to work for a telco equipment provider around the time everyone was replacing PDH with SONET. Telcos were gagging to buy our stuff, the main reason being basic hardware advances.
Telephone Exchanges / Central Offices have to be in the centre of the lines they serve, meaning some very expensive real estate, and datacenter-level HVAC in the middle of cities is very, very expensive.
They loved nothing more than to replace old 1980s switches with ones that took up a quarter to a tenth of the floorspace, used less than half the electricity, and had fabrics that could switch fibre optics directly.
> It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware.
It's hard to get as excited about performance when the typical family sedan has >250HP. Or when a Raspberry Pi 5 can outrun a maxed-out E10k on almost everything.
...(yah, less RAM, but you need fewer client connections when you can get rid of them quickly enough).
My experience was a bit different. I first saw a Starfire when we were deploying a bunch of Linux servers in the DC. The Sun machine was brilliant, fast, enormous, and far more expensive per unit of work than these little x86 boxes we were carting in.
The Starfire started at around $800K. Our Linux servers started at around $1K. The Sun box was not 800x faster at anything than a single x86 box.
It was an impressive example of what I considered the wrong road. I think history backs me on this one.
> It’s a bit sad that nobody gives a shit about performance any more.
Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG. At a certain vertical scale, it starts to look a helluva lot like horizontal scaling (“how many CPUs for this container? How many drives?”), except in a single box with finite and small limits.
> The Sun box was not 800x faster at anything than a single x86 box.
You don't buy enterprise gear because it's economical for bulk number-crunching... You buy enterprise gear when you have a critical SPOF application (typically the database) that has to be super-reliable, or that requires greater resources than you can get in commodity boxes.
RAS (reliability, availability, serviceability) is an expensive proposition. Commodity servers often don't have it, or have much less of it than enterprise gear. Proprietary Unix systems offered RAS as a major selling point. IBM mainframes still have a strong market today.
It wasn't until the late 2000s that x86 went to 64-bit, so if your application wanted to gobble more than 2GB/4GB of RAM, you had to go with something proprietary.
It was even more recently that the world collectively put in a huge amount of effort and figured out how to parallelize a large class of number-crunching problems that had previously been limited to single-threaded execution.
There have been many situations like these through the history of computing... Going commodity is always cheaper, but if you have needs commodity systems don't meet, you pay the premium for proprietary systems that do.
You didn't need imaginary 64-bit PCs, because a rack full of smaller 64-bit SPARC systems would have been much cheaper than a single E10k. Something that large in a single system was only necessary for people with irreducible memory requirements, i.e. not delivering mail.
First, yes, everything you said is true. And especially when you’re supporting an older application designed around such SPOFs, you need those to be bulletproof. That’s completely reasonable. That said, a fair chunk of my work since the 90s has been in building systems that try to avoid SPOFs in the first place. Can we use sharded databases such that upgrading one doesn’t take the others down? Shared-nothing backend servers? M-to-N meshes so we’re not shoving everything through a single load balancer or switch? Redundant data centers? The list goes on.
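To make the sharding point concrete, the routing layer can be as dumb as this - a rough Python sketch with made-up host names, not any particular product:

    # Illustration only: hash-based shard routing. Host names are invented.
    import hashlib

    SHARDS = [
        "db-shard-0.internal",
        "db-shard-1.internal",
        "db-shard-2.internal",
        "db-shard-3.internal",
    ]

    def shard_for(user_id: str) -> str:
        # Stable hash, so a given user always lands on the same shard.
        digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Taking db-shard-2 down for an upgrade only affects the ~25% of users
    # that hash to it; everyone else keeps hitting their own shard.

(In practice you'd reach for consistent hashing or a lookup table so you can add shards without reshuffling everyone, but the principle is the same.)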
I don’t think that approach is inherently better than what you described. Each has its own tradeoffs and there’s a time and place for both of them. I absolutely did see a lot of Big Iron companies marketing their giant boxes as the “real, proven” alternative to a small cluster of LAMP servers, though. I don’t blame them for wanting to be big players in that market, too, but that wasn’t a good reason to use them (unless you already had their stuff installed and wanted to add a web service next to your existing programs).
I wouldn’t run a bank on an EC2 instance, but neither would I ever buy a mainframe to host Wordpress at any scale.
As a technical nit, the 64-bit AMD Opteron was released in 2003, not late 2000s. It almost immediately took over the low- to mid-range server market and HPC market because nothing could touch its performance and scalability for the price. It was a state-of-the-art design for the time and relatively cheap, same vibes as the Apple M1 release.
People still used the big mainframe-y UNIX servers, but their usage shrank and you could see the writing on the wall. I was already replacing SPARC database servers with Opterons in 2004. The hardware wasn’t as gold-plated, but they were fast, and workloads were already outgrowing the biggest mainframe-y servers.
TBH, a lot of the gold-plated “enterprise” hardware failed far more often in practice than you would expect, including unrecoverable hard failures. That was a common enough experience that it probably detracted from the sales pitch for that extremely expensive hardware.
That’s a thing, to be sure. The calculus gets a little complicated when that developer’s pay is far more than the EC2 bill. There’s a spectrum between a small shop wasting $1000 a year hosting inefficient code and Google scale, where SRE teams would love to put “saved 0.3% on our cloud bill!” on their annual review.
That's my experience as well, at two different companies. At one we went from two E15Ks to two E25Ks because it was "cheaper" than rewriting who knows how much code, for who knows how long and at what cost.
At the other, we jumped from two E25Ks to two M9000-64s for the same reasons...
> Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG.
You might be surprised about how many companies think they're FAANG (but aren't) though.
That’s a whole other story, to be sure! “We absolutely must have multi-region simultaneous writes capable of supporting 300,000,000 simultaneous users!” “Your company sells door knobs and has 47 customers. Throw it on PostgreSQL and call it solved.”
In the end that approach to very high scale and reliability was a dead end. It’s much better and cheaper to solve these problems in software using cheap computers and fast networks.
If you have applications that run on (and rely on) z/OS, this kind of machine makes sense.
The E10k didn't have applications like that. Just about everything you could do on it could be made to work on commodity x86 with Linux (after some years, for 64-bit).
I recall that from when I was at SGI. Many of us within SGI were strongly against the move to sell this off to Sun. We blamed Bo Ewald for the disaster this was for SGI, and for his lack of strategic vision. We also blamed the idiots in SGI management for thinking that MIPS and Irix would be the only things we would be delivering.
Years later, Ewald and others had a hand in destroying the Beast and Alien CPUs in favor of the good ship Itanic (for reasons).
IMO, Ewald went from company to company, leaving behind a strategic ruin or failure. Cray to SGI to Linux Networx to ...
To this day, “Sun E10000 Starfire” is basically synonymous in my head with “top-of-the-line, bad-ass computer system.” What a damn cool name. It made a big impression on an impressionable youth, I guess!
I agree on all counts, but the installation I had at my job at the time regularly needed repairs! Hopefully that was an exceptional case, but it gave me the impression that the redundancy added too much complexity for the whole to be reliable.
ETA: particularly because the redundancy was supposed to make it super reliable
I worry about this sometimes. There is this long tail of "reliability" you can chase: redundant systems, processes, voting, failover, "shoot the other node in the head" scripts, etc. But everything adds complexity; now it has more moving parts, more things that can go wrong in weird ways. I wonder if the system would be more reliable if it were a lot simpler and stupider: a single box that can be rebooted if needed.
It reminds me of the lesson of the Apollo computers. The AGC was the more famous computer, probably rightfully so, but there were actually two computers. The other was the LVDC, made by IBM for controlling the Saturn V during launch; that one was a proper aerospace computer: redundant everything, a cannot-fail architecture, etc. In contrast, the AGC was a toy. However, this let the AGC be much faster and smaller; instead of reliability they made it reboot well, and instead of automatic redundancy they just put two of them.
We got the first E15000 outside of Sun when I was at SDSC; engineers from down the street at Towne Centre Drive came by to set it up... It was running Solaris 8 with a very specific kernel patch to make it boot, and the driver for the chassis fan control had never been completed, so the fans ran at 100% from the moment the system powered on. It was like standing next to a Harrier doing a VTOL takeoff.
Also, when the system disk on the boot drawer failed, I discovered that it wasn't a standard Sun FCAL or SCA-80 HDD, but a 68-pin SCSI drive mounted in what appeared to be a custom-made drive cage that was unlike anything else we had on the floor. It was a real factory prototype.
No, I think that was typical. Nostalgia tends to gloss over the reality of how dodgy the old unix systems were. The Sun guy had to show up at my site with system boards for the SPARCcenter pretty regularly.
Somebody gave a talk about the E10K at one of the early DefCon conferences and I was blown away. Having only worked with x86 architecture servers I couldn't believe the kind of "magic" dynamic reconfiguration enabled. I'm sad I never got to work with one.
This was one of the all-time biggest strategic mistakes SGI made - for a mere $50 million they enabled their largest competitor to rack up huge wins against them almost overnight. A friend at Sun at the time was telling me how much glee they took in sticking it to SGI with its own machines.
This was a Swan Song machine. It was instantaneously great but part of a dinosaur architecture with no future. It was released in 1997, just as the modern massively parallel datacenter paradigm was launching. By the time Web 2.0 was firing up on AWS, this kind of thing seemed ridiculous. And the world hasn't looked back, really.
It's sort of a recapitulation of the mid-'80s, when the last waves of ECL mainframes (cf. the VAX 9000) launched with jaw-dropping performance numbers and price tags, just to be buried beneath the flood of cheap CMOS workstations within the decade.
> In December 1991, Cray purchased some of the assets of Floating Point Systems, another minisuper vendor that had moved into the file server market with its SPARC-based Model 500 line.[15] These symmetric multiprocessing machines scaled up to 64 processors and ran a modified version of the Solaris operating system from Sun Microsystems. Cray set up Cray Research Superservers, Inc. (later the Cray Business Systems Division) to sell this system as the Cray S-MP, later replacing it with the Cray CS6400. In spite of these machines being some of the most powerful available when applied to appropriate workloads, Cray was never very successful in this market, possibly due to it being so foreign to its existing market niche.
Some other candidates for server and HPC expertise there (just outside of Portland proper):
FPS, which had purchased the assets of Celerity Computing in San Diego (well, La Jolla, on Towne Centre Drive), which was where much of the Sun large-systems development would occur.
Celerity had built RISC superminis out of NCR32K chips, running BSD 4.2, then got bought by FPS, then by Cray, then Sun. The Towne Centre Drive property is now Takeda Pharmaceuticals, IIRC.
Still to this day the largest single line item I’ve ever signed off on; I have an old varsity jacket in the back of my closet that was their sales swag for the E10K. Still not convinced that it was that much more cost-effective than a bunch of E6500s for our (embarrassingly parallel) workload, but it was an impressive bit of kit!
This is one of my dream machines to own. The Sun E10k was like the Gibson: it was so mythically powerful. It was a Cray inside your own server closet, and being the admin of an E10k, with root on a machine with so much power, was a real status symbol at the time.
Sun had the prettiest and fastest machines back in my young days. Their keyboards also were a work of art. I still remember the feel of their keyboards and they were bigger than what we have now. Silky smooth.
Hmmm... I wondered why the official E10K demo machine in the lobby of Sun's HQ back then had been enclosed in glass. It also might very well have just been a mockup, I suppose.