I provisioned, administered, and used so many Sun systems. I loved them so much that I still keep many Sun workstations and servers in my basement in case I need a nostalgia kick. I tried to use Marie Kondo's KonMari method to get rid of them, but it just didn't work... they all SPARC joy.
Not the 10000, but I admin'd a 4500 back in 1999 at Bristol-Myers Squibb at the ripe old age of 21. It was running Sun's mail server, which required constant care and feeding to serve our 30,000+ users with anything approaching reliability.
One time it just stopped responding, and my boss said "now, pay attention" and body-checked the machine as hard as he could.
It immediately started pinging again, and he refused to say anything else about it.
Amiga had a similar issue. One of the chips (Fat Agnus, IIRC?) didn't quite fit in the socket correctly, and a common fix was to pull out the drive mechanisms and drop the chassis something like a foot onto a carpeted floor.
Somewhat related: one morning I was in the office early, and an accounting person came in and asked me for help; her computer wouldn't turn on, and I was the only other one in the office. I went over, poked the power button, and nothing happened. This was a PC clone. She had a picture of her daughter on top of the computer, so I picked it up, gave the computer a good solid whack on the side, set the picture back down, and poked the power button. It came to life.
Ah, percussive maintenance! Also good for reseating disks that just don’t quite reliably get enumerated: slam the thing back in. I had to do something similar on a power supply for a V440; thankfully it was a month or so away from retirement, so I didn’t feel too bad giving it some encouragement like that. Great machines.
Throughout the late 90s, “Mail.com” provided white-label SMTP services for a lot of businesses, and was one of the early major “free email” providers. Each free user had a storage limit of something like 10MB, which was plenty in an era before HTML email and attachments were commonplace. There were racks upon racks of SCSI disks from various vendors for the backend - but the front end was all standard Sendmail, running on Solaris servers.
I worked at a competing white-label email provider in the 90s, and even then it seemed obvious that running SMTP on a Sun Enterprise was a mistake. You're not gaining anything from its multiuser single-system scalability. I guess it stands as an early example of the pets/cattle debate. My company was firmly on the cattle side.
I was just the teenage intern responsible for doing the PDU cabling every time a new rack was added, since nobody on the network or software engineering teams could fit into the crawl spaces without disassembling the entire raised floor.
I do know that scale-out and scale-up were used for different parts of the stack. The web services were all handled by standard x86 machines running Linux - and were all netbooted in some early orchestration magic, until the day the netboot server died. I think the rationale for the large Sun systems was the amount of memory that they could hold - so the user name and spammer databases could be held in-memory on each front end, allowing for a quick ACCEPT or DENY on each incoming message before saving it out to a mailbox via NFS.
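If it helps to picture it, the front-end logic was conceptually something like this - a rough Python sketch of my own, with made-up names and paths, not anything from the actual codebase:

    # Illustration only: in-memory lookups decide ACCEPT/DENY before the
    # (comparatively slow) NFS write ever happens. Names and paths are invented.
    import os

    VALID_USERS = set()      # loaded into RAM at startup from the user database
    KNOWN_SPAMMERS = set()   # likewise for the spammer database
    MAIL_ROOT = "/nfs/mailboxes"   # NFS-mounted spool on the backend storage

    def handle_message(rcpt_user, sender_addr, body):
        # Both checks are plain set lookups, so the decision costs microseconds
        # and never waits on a database round trip.
        if rcpt_user not in VALID_USERS:
            return "550 No such user"
        if sender_addr in KNOWN_SPAMMERS:
            return "554 Denied"
        # Only an accepted message touches the NFS mount.
        mbox = os.path.join(MAIL_ROOT, rcpt_user)
        with open(mbox, "a") as f:
            f.write(body + "\n")
        return "250 OK"

The point being that the only slow operation, the NFS write, happens only for messages that have already been accepted.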
Makes sense, there are a lot of reasons why having some "big iron" might have been practical in that era. x86 was not a full contender for many workloads until amd64, and a lot of the shared-nothing software approaches were not really there until later.
I used to love working with E10k/E15k boxes. I was a performance engineer for a telco software provider, and it was so much fun squeezing every single thing out of the big iron.
It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware. I saved telcos millions upon millions in my early career. I’d jump straight into it again if a job came up, so much fun.
I used to work for a telco equipment provider around the time everyone was replacing PDH with SONET. Telcos were gagging to buy our stuff, the main reason being basic hardware advances.
Telephone Exchanges / Central Offices have to be in the centre of the lines they serve, meaning some very expensive real estate, and datacenter-level HVAC in the middle of cities is very, very expensive.
They loved nothing more than to replace old 1980s switches with ones that took up a quarter to a tenth of the floorspace, used less than half the electricity, and had fabrics that could switch fibre optics directly.
> It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware.
It's hard to get as excited about performance when the typical family sedan has >250HP. Or when a Raspberry Pi 5 can outrun a maxed-out E10k on almost everything.
...(yah, less RAM, but you need fewer client connections when you can get rid of them quickly enough).
My experience was a bit different. I first saw a Starfire when we were deploying a bunch of Linux servers in the DC. The Sun machine was brilliant, fast, enormous, and far more expensive per unit of work than these little x86 boxes we were carting in.
The Starfire started at around $800K. Our Linux servers started at around $1K. The Sun box was not 800x faster at anything than a single x86 box.
It was an impressive example of what I considered the wrong road. I think history backs me on this one.
> It’s a bit sad that nobody gives a shit about performance any more.
Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG. At a certain vertical scale, it starts to look a helluva lot like horizontal scaling (“how many CPUs for this container? How many drives?”), except in a single box with finite and small limits.
> The Sun box was not 800x faster at anything than a single x86 box.
You don't buy enterprise gear because it's economical for bulk number-crunching... You buy enterprise gear when you have a critical SPOF application (typically the database) that has to be super-reliable, or that requires greater resources than you can get in commodity boxes.
RAS (reliability, availability, serviceability) is an expensive proposition. Commodity servers often don't have it, or have much less of it than enterprise gear. Proprietary Unix systems offered RAS as a major selling point. IBM mainframes still have a strong market today.
It wasn't until the late 2000s that x86 went to 64-bit, so if your application wanted to gobble more than 2GB/4GB of RAM, you had to go with something proprietary.
It was even more recently that the world collectively put in a huge amount of effort and figured out how to parallelize a large class of number-crunching problems that had previously been limited to single-threaded execution.
There have been many situations like these through the history of computing... Going commodity is always cheaper, but if you have needs commodity systems don't meet, you pay the premium for proprietary systems that do.
You didn't need imaginary 64-bit PCs, because a rack full of smaller 64-bit SPARC systems would have been much cheaper than a single E10k. Something that large in a single system was only necessary for people with irreducible memory requirements, i.e. not delivering mail.
First, yes, everything you said is true. And especially when you’re supporting an older application designed around such SPOFs, you need those to be bulletproof. That’s completely reasonable. That said, a fair chunk of my work since the 90s has been in building systems that try to avoid SPOFs in the first place. Can we use sharded databases such that upgrading one doesn’t take the others down? Shared-nothing backend servers? M-to-N meshes so we’re not shoving everything through a single load balancer or switch? Redundant data centers? The list goes on.
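To make the sharding point concrete, the routing layer can be as dumb as this - a rough Python sketch with made-up host names, not any particular product:

    # Illustration only: hash-based shard routing. Host names are invented.
    import hashlib

    SHARDS = [
        "db-shard-0.internal",
        "db-shard-1.internal",
        "db-shard-2.internal",
        "db-shard-3.internal",
    ]

    def shard_for(user_id: str) -> str:
        # Stable hash, so a given user always lands on the same shard.
        digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Taking db-shard-2 down for an upgrade only affects the ~25% of users
    # that hash to it; everyone else keeps hitting their own shard.

(In practice you'd reach for consistent hashing or a lookup table so you can add shards without reshuffling everyone, but the principle is the same.)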
I don’t think that approach is inherently better than what you described. Each has its own tradeoffs and there’s a time and place for both of them. I absolutely did see a lot of Big Iron companies marketing their giant boxes as the “real, proven” alternative to a small cluster of LAMP servers, though. I don’t blame them for wanting to be big players in that market, too, but that wasn’t a good reason to use them (unless you already had their stuff installed and wanted to add a web service next to your existing programs).
I wouldn’t run a bank on an EC2 instance, but neither would I ever buy a mainframe to host Wordpress at any scale.
As a technical nit, the 64-bit AMD Opteron was released in 2003, not late 2000s. It almost immediately took over the low- to mid-range server market and HPC market because nothing could touch its performance and scalability for the price. It was a state-of-the-art design for the time and relatively cheap, same vibes as the Apple M1 release.
People still used the big mainframe-y UNIX servers, but their usage shrank and you could see the writing on the wall. I was already replacing SPARC database servers with Opterons in 2004. The hardware wasn’t as gold-plated, but they were fast, and workloads were already outgrowing the biggest mainframe-y servers.
TBH, a lot of the gold-plated “enterprise” hardware failed far more often in practice than you would expect, including unrecoverable hard failures. That was a common enough experience that it probably detracted from the sales pitch for that extremely expensive hardware.
That’s a thing, to be sure. The calculus gets a little complicated when that developer’s pay is far more than the EC2 bill. There’s a spectrum between a small shop wasting $1000 a year hosting inefficient code and Google scale, where SRE teams would love to put “saved 0.3% on our cloud bill!” on their annual review.
That's my experience as well, at two different companies. At one we went from two E15Ks to two E25Ks because it was "cheaper" than rewriting who knows how much code, for who knows how long and at what cost.
At the other, we jumped from two E25Ks to two M9000-64s for the same reasons...
> Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG.
You might be surprised about how many companies think they're FAANG (but aren't) though.
That’s a whole other story, to be sure! “We absolutely must have multi-region simultaneous writes capable of supporting 300,000,000 simultaneous users!” “Your company sells door knobs and has 47 customers. Throw it on PostgreSQL and call it solved.”
In the end that approach to very high scale and reliability was a dead end. It’s much better and cheaper to solve these problems in software using cheap computers and fast networks.
If you have applications that run on (and rely on) z/OS, this kind of machine makes sense.
The E10k didn't have applications like that. Just about everything you could do on it could be made to work on commodity x86 with Linux (after some years, for 64-bit).
I recall that from when I was at SGI. Many of us within SGI were strongly against the move to sell this off to Sun. We blamed Bo Ewald for the disaster this was for SGI, and for his lack of strategic vision. We also blamed the idiots in SGI management for thinking that MIPS and Irix would be the only things we would be delivering.
Years later, Ewald and others had a hand in destroying the Beast and Alien CPUs in favor of the good ship Itanic (for reasons).
IMO, Ewald went from company to company, leaving behind a strategic ruin or failure. Cray to SGI to Linux Networx to ...
To this day, “Sun E10000 Starfire” is basically synonymous in my head with “top-of-the-line, bad-ass computer system.” What a damn cool name. It made a big impression on an impressionable youth, I guess!
I agree on all counts, but the installation I had at my job at the time regularly needed repairs! Hopefully that was an exceptional case, but it gave me the impression that the redundancy added too much complexity for the whole to be reliable.
ETA: particularly because the redundancy was supposed to make it super reliable
I worry about this sometimes. There is this long tail of "reliability" you can chase: redundant systems, processes, voting, failover, "shoot the other node in the head" scripts, etc. But everything adds complexity; now it has more moving parts, more things that can go wrong in weird ways. I wonder if the system would be more reliable if it were a lot simpler and stupider: a single box that can be rebooted if needed.
It reminds me of the lesson of the Apollo computers. The AGC was the more famous computer, probably rightfully so, but there were actually two computers. The other was the LVDC, made by IBM for controlling the Saturn V during launch; that one was a proper aerospace computer: redundant everything, a cannot-fail architecture, etc. In contrast, the AGC was a toy. However, this let the AGC be much faster and smaller; instead of reliability they made it reboot well, and instead of automatic redundancy they just put two of them.
We got the first E15000 outside of Sun when I was at SDSC; engineers from down the street at Towne Centre Drive came by to set it up... It was running Solaris 8 with a very specific kernel patch to make it boot, and the driver for the chassis fan control had never been completed, so the fans ran at 100% from the moment the system powered on. It was like standing next to a Harrier doing a VTOL takeoff.
Also, when the system disk on the boot drawer failed, I discovered that it wasn't a standard Sun FCAL or SCA-80 HDD, but a 68-pin SCSI drive mounted in what appeared to be a custom-made drive cage that was unlike anything else we had on the floor. It was a real factory prototype.
No, I think that was typical. Nostalgia tends to gloss over the reality of how dodgy the old unix systems were. The Sun guy had to show up at my site with system boards for the SPARCcenter pretty regularly.
Somebody gave a talk about the E10K at one of the early DefCon conferences and I was blown away. Having only worked with x86 architecture servers I couldn't believe the kind of "magic" dynamic reconfiguration enabled. I'm sad I never got to work with one.
This was one of the all-time biggest strategic mistakes SGI made - for a mere $50 million they enabled their largest competitor to rack up huge wins against them almost overnight. A friend at Sun at the time was telling me how much glee they took in sticking it to SGI with its own machines.
This was a Swan Song machine. It was instantaneously great but part of a dinosaur architecture with no future. It was released in 1997, just as the modern massively parallel datacenter paradigm was launching. By the time Web 2.0 was firing up on AWS, this kind of thing seemed ridiculous. And the world hasn't looked back, really.
It's sort of a recapitulation of the mid-'80s, when the last waves of ECL mainframes (cf. the VAX 9000) launched with jaw-dropping performance numbers and price tags, just to be buried beneath the flood of cheap CMOS workstations within the decade.
> In December 1991, Cray purchased some of the assets of Floating Point Systems, another minisuper vendor that had moved into the file server market with its SPARC-based Model 500 line.[15] These symmetric multiprocessing machines scaled up to 64 processors and ran a modified version of the Solaris operating system from Sun Microsystems. Cray set up Cray Research Superservers, Inc. (later the Cray Business Systems Division) to sell this system as the Cray S-MP, later replacing it with the Cray CS6400. In spite of these machines being some of the most powerful available when applied to appropriate workloads, Cray was never very successful in this market, possibly due to it being so foreign to its existing market niche.
Some other candidates for server and HPC expertise there (just outside of Portland proper):
FPS, which had purchased the assets of Celerity Computing in San Diego (well, La Jolla, on Towne Centre Drive), which was where much of the Sun large-systems development would occur.
Celerity had built RISC superminis out of NCR32K chips, running BSD 4.2, then got bought by FPS, then by Cray, then Sun. The Towne Centre Drive property is now Takeda Pharmaceuticals, IIRC.
Still to this day the largest single line item I’ve ever signed off on; I have an old varsity jacket in the back of my closet that was their sales swag for the E10K. Still not convinced that it was that much more cost-effective than a bunch of E6500s for our (embarrassingly parallel) workload, but it was an impressive bit of kit!
This is one of my dream machines to own. The Sun E10k was like the Gibson: it was so mythically powerful. It was a Cray inside your own server closet, and being the admin of an E10k, with root on a machine with so much power, was a real status symbol at the time.
Sun had the prettiest and fastest machines back in my young days. Their keyboards also were a work of art. I still remember the feel of their keyboards and they were bigger than what we have now. Silky smooth.
Hmmm... I wondered why the official E10K demo machine in the lobby of Sun's HQ back then had been enclosed in glass. It also might very well have just been a mockup, I suppose.