
Not to be dismissive, but can't anyone "build" the biggest supercomputer by reserving enough instances at AWS or GCP? I'm sure that AWS or GCP would like to encourage this competition, but it seems a bit, well, boring.



The ranking is based on the Linpack benchmark. Because Linpack is a parallel application, performance doesn't simply scale with the number of processors; the network interconnect is hugely important.

Now, although Linpack is a better evaluation metric for a supercomputer than simply totaling up the number of processors and the RAM size, it's still a very specific benchmark of questionable real-world utility; people like it because it gives you a score, and that score lets you measure dick-size, err, computing power. It also, if you're feeling unscrupulous, lets you build a big, otherwise worthless Linpack-solving machine that generates a good score but isn't as good for real use (an uncharitable person might put Roadrunner https://en.wikipedia.org/wiki/Roadrunner_(supercomputer) in this category).
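For concreteness, Linpack (HPL) just solves a dense linear system Ax = b and reports the FLOP rate, conventionally counted as 2n^3/3 + 2n^2 operations. A single-node toy version, nothing like the real blocked, pivoted, MPI-distributed HPL, fits in a few dozen lines of C:

    /* Toy single-node "Linpack": solve Ax = b by Gaussian elimination and
     * report GFLOP/s.  Illustrative only; no pivoting, no blocking, no MPI. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const int n = 1000;                              /* problem size */
        double *A = malloc((size_t)n * n * sizeof *A);
        double *b = malloc((size_t)n * sizeof *b);

        srand(42);
        for (int i = 0; i < n * n; i++) A[i] = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

        clock_t t0 = clock();

        for (int k = 0; k < n - 1; k++) {                /* forward elimination */
            for (int i = k + 1; i < n; i++) {
                double m = A[i * n + k] / A[k * n + k];
                for (int j = k; j < n; j++)
                    A[i * n + j] -= m * A[k * n + j];
                b[i] -= m * b[k];
            }
        }
        for (int i = n - 1; i >= 0; i--) {               /* back substitution */
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= A[i * n + j] * b[j];
            b[i] = s / A[i * n + i];
        }

        double secs  = (double)(clock() - t0) / CLOCKS_PER_SEC;
        double flops = 2.0 / 3.0 * n * (double)n * n + 2.0 * n * (double)n;
        printf("n=%d: %.3f s, %.2f GFLOP/s\n", n, secs, flops / secs / 1e9);

        free(A); free(b);
        return 0;
    }

Compare whatever GFLOP/s number this prints against Fugaku's Rmax, which is on the order of 4x10^8 GFLOP/s, to get a feel for the scale being ranked.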


Linpack is pretty lightweight as far as benchmarks go. You need some memory bandwidth but not much network at all, just reductions, which are pretty efficient. It's not a good proxy for the most challenging applications, but it lives on because no one has a better alternative. Basically, these are the classes of problems:

1. Compute bound, trivially parallel. Things like breaking RSA encryption, stuff that was done over the Internet 20 years ago even when links were much slower. It basically doesn't even need the proverbial Beowulf cluster. Linpack is essentially in this category, so you could, with care, build a cloud machine to run it.

2. Memory-bandwidth bound, trivially parallel. Stuff like search engine index building. Still not hard to do over distributed networks, or, yes, commodity Ethernet in a Beowulf cluster.

3. Network bound, coupled parallel. The most challenging category; it can only be done on a single-site machine with a fast interconnect. And, as noted, "fast" here has a totally different meaning from commercial networking, especially in latency. Depending on the type of network, a significant fraction of the machine's total transistors can be in the interconnect. These networks are heavily optimized for specific MPI operations, such as All-to-All across what might be a million cores. The reason is that the whole calculation, being coupled, moves only as fast as the slowest task on the slowest node. You see weird stuff like reserving an entire core just for talking to the I/O system and handling OS interrupts, because otherwise the "jitter" of nodes randomly doing stuff slows down the entire machine.
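The "just reductions" part is a single MPI collective call. A minimal sketch, assuming an MPI implementation such as Open MPI or MPICH is installed (build with mpicc, run with mpirun):

    /* Minimal MPI reduction of the kind Linpack leans on: every rank
     * contributes a partial result, all ranks get the global sum.
     * Build: mpicc reduce.c -o reduce    Run: mpirun -np 4 ./reduce */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = (double)(rank + 1);   /* pretend this was computed locally */
        double global = 0.0;

        /* The collective call; on a big machine its latency across all
         * nodes is what the custom interconnect is optimized for. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", size, global);

        MPI_Finalize();
        return 0;
    }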


I am curious to learn a bit more about how supercomputer scores track "real world performance", which is a hard thing to quantify since there are probably hundreds of different application "types" in the "real world".

Combine this with the fact that many applications are limited by network throughput rather than by CPU/SSD/RAM/PCIe, and performance becomes a hard thing to quantify even in terms of "how many ARM cores do I need to buy so the CPU isn't the bottleneck".

There are ARM benchmarks for Linux compilation and for OpenJDK performance, which are a good start, but I don't know how to compare SKUs between those ARM chips and the ones found in top500 supercomputers.


HPCG is another benchmark on the Top500 site, and it’s more of a real world benchmark. It’s of course not perfect, but maybe that’s more what you’re looking for.


AWS has made it into the top500 a few times in fact, though not that high on the list. I think the main issue would be reserving enough machines that have a high performance network between them, which is not a typical cloud need.

But the more interesting question for me is: on an embarrassingly parallel workload, how does Amazon’s full infrastructure compare to these top machines? I’d assume that Amazon keeps that a secret.


Looking into Amazon's power bill might be a useful start: Fugaku is listed as drawing 28 MW in OP. It's more power efficient than most, but to an order of magnitude that's a number we can work with. Amazon's power usage for US-East was estimated at 1.06 GW in 2017 [1] (at which time they also apparently owned about a gigawatt of renewable generating capacity [2], now closer to 2 GW [3]).

Either way you slice it, Amazon likely owns at least an order of magnitude more FLOPS than any single system on the top500. What they presumably don't have is the low latency interconnects, etc., needed for traditional supercomputing.

[1] https://datacenterfrontier.com/amazon-approaches-1-gigawatt-...

[2] https://www.eenews.net/stories/1060048034

[3] https://sustainability.aboutamazon.com/sustainable-operation...
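If you want to play with that arithmetic, here's the back-of-envelope version. Every input is a rough assumption (Fugaku-class efficiency of ~15 GFLOPS/W, a guess of ~2 GFLOPS/W for mixed commodity cloud hardware, the ~1 GW US-East estimate from [1]), and US-East is only part of Amazon's footprint:

    /* Back-of-envelope comparison from the comment above.
     * Every input is a rough assumption, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        const double fugaku_mw       = 28.0;    /* draw listed in OP */
        const double fugaku_gflops_w = 15.0;    /* ~Fugaku-class efficiency */
        const double us_east_mw      = 1060.0;  /* 2017 estimate from [1] */
        const double cloud_gflops_w  = 2.0;     /* pure guess: mixed, older hardware */

        /* MW x (GFLOPS/W) works out to PFLOPS directly. */
        double fugaku_pf = fugaku_mw * fugaku_gflops_w;
        double cloud_pf  = us_east_mw * cloud_gflops_w;

        printf("Fugaku-ish:  %.0f PFLOPS\n", fugaku_pf);
        printf("US-East-ish: %.0f PFLOPS (~%.1fx), before counting the rest of AWS\n",
               cloud_pf, cloud_pf / fugaku_pf);
        return 0;
    }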


yyyeah... no.

A major part of what makes these machines special is their interconnect. Fujitsu is running a 6D torus interconnect with latencies well into the sub-microsecond range. The special sauce is the ability of cores to interact with each other with extreme bandwidth at extremely low latency.


Thank you for this helpful info. For comparison's sake, say that you wanted to make babby's first supercomputer in your house with 2 laptops. That is to say, each laptop is a single-core x86 system with its own motherboard, RAM, and SSD, and they are connected to each other in some way (Ethernet? USB?).

What software would one use to distribute some workload between these two nodes, what would the latency and bandwidth be bottlenecked by (the network connection?), and what other key statistics would be important in measuring exactly how this cheap $400 (used) setup compares on price/watt/FLOP against top500 computers?


You could use MPI and OpenMP. I got my start building a 10-megabit Ethernet cluster of 6 machines for $15K (this would have been back in ~2000). It only scaled about 4x across the 6 machines, but that was still good enough/cheap enough to replace a bunch of more expensive SGIs, Suns, and HP machines.

Where the bottleneck is depends entirely on the details of the computations you want to run. In many cases, you can get trivial, embarrassing parallelism if you can break your problem into N parts and there doesn't need to be any real communication between the processors running the distinct parts. In that case, memory bandwidth and clock rate are the bottleneck. But if you're running something like ML training with tight coupling, then the throughput and latency of the network can definitely be a problem.
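To actually see where your two-laptop link tops out, the standard first experiment is an MPI ping-pong. A rough sketch, assuming MPI is installed on both machines and the hostnames below are placeholders for yours (small messages show you latency, large ones bandwidth):

    /* Ping-pong between two ranks: time round trips over whatever link
     * connects the two laptops.
     * Build: mpicc pingpong.c -o pingpong
     * Run:   mpirun -np 2 --host laptop1,laptop2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 100;
        const int bytes = 1 << 20;          /* 1 MiB: big enough to measure bandwidth */
        char *buf = calloc(1, bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double secs = MPI_Wtime() - t0;
        if (rank == 0) {
            /* Each iteration moves the message out and back. */
            printf("%d round trips of %d bytes: %.3f s, ~%.2f MB/s, ~%.1f us/round trip\n",
                   iters, bytes, secs,
                   2.0 * iters * bytes / secs / 1e6, secs / iters * 1e6);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }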


The thing to keep in mind about supercomputers is that they are designed for particular applications. Nuclear weapons simulation, biological analysis (can we run simulations and get a vaccine?), cryptanalysis. These applications are usually written in MPI, which is what coordinates communication between nodes.

If you want to play with it at home, connect those laptops to an ethernet network and install MPI on them both--you should be able to find tutorials with a little web searching. Then you could probably run Linpack if you felt like it, but if you wanted to learn a little more about how HPC applications actually work, you could write your own MPI application. I wrote an MPI raytracer in college; it's a relatively quick project and, again, you can probably find a tutorial for it online.
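For a flavor of what such a program looks like (a toy sketch, not the parent's actual raytracer): split the image rows across ranks, let each rank render its band, and gather the result on rank 0.

    /* Skeleton of an MPI "raytracer": each rank computes a contiguous band
     * of image rows, rank 0 gathers the full image.  render_pixel() is a
     * stand-in for real shading code. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WIDTH  640
    #define HEIGHT 480

    static float render_pixel(int x, int y)
    {
        /* Placeholder "work" -- a real raytracer traces a ray here. */
        return (float)(x * y % 255) / 255.0f;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* For simplicity, assumes HEIGHT divides evenly by the rank count. */
        int rows_per_rank = HEIGHT / size;
        int first_row = rank * rows_per_rank;

        float *my_rows = malloc((size_t)rows_per_rank * WIDTH * sizeof *my_rows);
        for (int y = 0; y < rows_per_rank; y++)
            for (int x = 0; x < WIDTH; x++)
                my_rows[y * WIDTH + x] = render_pixel(x, first_row + y);

        /* Rank 0 collects everyone's band into the full image. */
        float *image = (rank == 0)
            ? malloc((size_t)HEIGHT * WIDTH * sizeof *image) : NULL;
        MPI_Gather(my_rows, rows_per_rank * WIDTH, MPI_FLOAT,
                   image,   rows_per_rank * WIDTH, MPI_FLOAT,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("rendered %dx%d image on %d ranks\n", WIDTH, HEIGHT, size);

        free(my_rows);
        free(image);
        MPI_Finalize();
        return 0;
    }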

Edit: Your cluster is going to suck terribly in comparison to "real" supercomputers, but scientists frequently do build their own small-scale clusters for application development. The actual big machines like Sequoia are all batch-processing and must be scheduled in advance, so it's a lot easier (and cheaper; supercomputer time costs money) to test your application locally in real time.


Summit and Sierra, for instance, actually run a fair range of applications fast, though Sierra is probably targeted mainly at weapons-simulation-type tasks. A typical HPC system for research, e.g. university or regional, has to be pretty general purpose.


Step-by-step instructions for building an Arm-based MPI cluster using Raspberry Pis: https://epcced.github.io/wee_archlet/


If you want to get experience working with higher node counts without breaking the bank, people make case kits for Raspberry Pis so you can build your own cluster.

For actual computing, a modern higher-end processor/server will murder it, but it's closer to the real world of clusters than anything else (so much so that there is a company that builds 100+ node Pi clusters for supercomputing labs to test on; you obviously can't run scientific workloads on them, but it's cheaper than using the real machine).

https://www.zdnet.com/article/raspberry-pi-supercomputer-los...


If you want to understand distributed-memory parallel performance, you're probably better off with a simulator like SimGrid. I don't know what bog-standard hardware you'd need otherwise to get a typical balance between floating-point performance, memory, filesystem I/O, and general interconnect performance. No toy system is going to teach you about running a real HPC system either -- you really don't want the fastest system if it's going to fall over every few hours or basically fall apart after a year.


For software, I know https://en.wikipedia.org/wiki/HTCondor is used quite frequently in academia for distributed workloads.


As blopeur said in another reply, you need to feed data to the supercomputer, and however parallelizable your algorithm may be, some data will probably need to be shared between nodes at some point, just to name a couple of examples.

If you connect a lot of cloud instances to act as a giant distributed computing cluster, they'll receive/share/return data via ordinary network interfaces or, worse yet, the internet, which is far slower than what supercomputer interconnects do.

For many applications that setup would be more efficient than a supercomputer, but for applications that actually need a supercomputer it would be inefficient. It just depends on what you need to do, but in any case it would be a computing cluster, not a supercomputer.

(my two cents; I'm not in that field)


From https://blogs.microsoft.com/ai/openai-azure-supercomputer/:

"""The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five, Microsoft says."""

Each of the big three cloud providers could, if they chose to, build a #1 Top500 computer using what they have available (that includes the CPUs, GPUs, and interconnect). That said, it's unclear why they would: the profit would be lower than if they sold the equivalent hardware, without the low-latency interconnect, to "regular" customers. The supercomputer business isn't obscenely profitable.


I haven't checked the current list, but previous ones have been roughly half "cloud provider" systems with essentially useless interconnects for real HPC work. NCSA refused to indulge in the game with Blue Waters, notably.



