Summit Supercomputer Up and Running, Claims First Exascale Application

fermienrico · on June 12, 2018

This is a poorly written article.

Here is a better source: https://www.top500.org/news/summit-up-and-running-at-oak-rid...

Interesting bit about nVidia Tesla V100 GPUs:

Assuming all those nodes are fully equipped, the GPUs alone will provide 215 peak petaflops at double precision. Also, since each V100 also delivers 125 teraflops of mixed precision, Tensor Core operations, the system’s peak rating for deep learning performance is something on the order of 3.3 exaflops.

Those exaflops are not just theoretical either. According to ORNL director Thomas Zacharia, even before the machine was fully built, researchers had run a comparative genomics code at 1.88 exaflops using the Tensor Core capability of the GPUs. The application was rummaging through genomes looking for patterns indicative of certain conditions. “This is the first time anyone has broken the exascale barrier,” noted Zacharia.

dang · on June 12, 2018

Ok, we've changed the URL to that from mhttps://www.eetimes.com/author.asp?section_id=36&doc_id=1333...

distortedlojik · on June 12, 2018

Agreed, check out the top500 article or the direct ORNL System Overview. https://www.olcf.ornl.gov/for-users/system-user-guides/summi...

yread · on June 13, 2018

I wonder what the application was. There is not that much deep learning in genomics.

alfalfasprout · on June 13, 2018

The tensor cores are simply matrix multiplication + accumulation. They don't need to be used purely for deep learning.

dekhn · on June 13, 2018

The code is Comet https://www.sanger.ac.uk/science/tools/comet and based on the PR I think they ran it in embarassingly (or lightly) parallel mode (IE, very little if any communication between nodes except for periodic reduces.

tempdeadbeef · on June 12, 2018

What’s not inside the Summit Supercomputer speaks volumes: Intel.

Knights Landing/Hill/Mill is simply not compelling; Omni-Path was created as an infiniband knockoff that doesn’t beat Mellanox. The Cray Gemini/Aries interconnects can be found all over the top of the list (and the Intel acquisition of those interconnects happened in 2012), but you don’t see Omni-Path replacing anything.

Meanwhile, Nvidia comes out with NVLink and begins to build small clusters of GPUs connected by larger networks containing IBM and Mellanox. A vacuum was created, and IBM and Mellanox moved (back) in.

davidmr · on June 13, 2018

The DOE (and DOD, but with whom I’m less familiar) tends to spread out these purchases over multiple vendors to keep multiple US-based providers able to build and support these machines (and I imagine to keep costs competitive).

The last few acquisitions by ORNL and LANL have been Crays while ANL and LLNL were buying IBM Blue Genes. With this generation, it looks like things have switched. As another poster mentioned, it certainly seems like ANL’s next one will be Cray/Intel. It was going to be based on Knight’s Hill, but Intel cancelling that sort of put the architecture up for grabs.

rbanffy · on June 13, 2018

I would love to see Intel tweaking the Phi line with asymmetric cores like some ARMs do. Having a couple brawny proper Xeon cores and a bunch of smaller 4-thread cores, all coupled with local HBM (and maybe some dedicated HBM for each core) would make it a very versatile part that, with some tuning in number of cores, cache sizes, HBM size, etc, could cover from low-end server all the way to supercomputing.

I don't think there is much doubt core count will increase on all segments and that asymmetric core tech that's currently used in ARM is pretty cool.

gnufx · on June 13, 2018

Do HPC-oriented ARMs do that?

I don't see the advantage of mixing Phi and SKX cores. Just use an appropriate balance of different nodes (maybe not all Intel).

rbanffy · on June 14, 2018

It makes sense for multi-node machines, much like we do some tasks mostly on CPUs and others on GPUs within a single node. A processor like this makes much more sense on desktops and general-purpose servers, as most of the time my Xeon cores are doing things an Atom would be perfectly capable of doing at a fraction of the power consumed. This translates into more heat and more cooling. If you consider a Xeon Phi uses 300 Watts for 256 threads, this translates roughly to 1.2 W per thread, which is well within what I would expect from a very puny Atom core. Being able to power down most of my computer while, say, I write this, would be a very nice feature.

tempdeadbeef · on June 13, 2018

Intel knows better than anyone that they need to sell a roadmap, not a chip. Who’s going to put in a large advance order on a bunch of future silicon that may get cancelled as well?

wumpus · on June 13, 2018

Knockoff? Infinipath used Infiniband at first because we could only build one chip, not a host adapter and a switch. Turns out that Infiniband is flexible enough that we kept on doing that for several generations.

tempdeadbeef · on June 13, 2018

Omni-Path Gen1 has a Wolf River NIC and a Prairie River Switch.

wumpus · on June 13, 2018

What do the code names have to do with Infinipath being a knockoff?

sliken · on June 13, 2018

Omni-path isn't support on AMD chips, let alone Power 9. I don't agree with the decisions, but there you go. We switched from 3 generations of Pathscale IB adapters to Mellanox because of it.

jabl · on June 13, 2018

Sometimes I wonder about the longer-term future of infiniband. Originally it was a multi vendor standard, but nowadays it's a one-shop show. Squeezed from above by high end networks supporting things like adaptive non-minimal routing, and from below by ethernet (roce etc.) And squeezed from the side by Intel with their deep pockets.

And apparently mlnx is pressured by some activist investors to reduce R&D expenses and pay more dividends instead.

quadruplebond · on June 13, 2018

Yes, but rumor has it that Intel will still be the supplier for Argonne’s next computer. Which if everything goes to plan will be the first exascale machine.

rbanffy · on June 13, 2018

Even the previous generation Phi's pack about 3.x peak GFLOPS per socket. Tuning HBM size and AVX512 pipelines can certainly help increase that, even if their 10nm is proving harder than expected. It's a matter of time before they can go full 10nm.

gnufx · on June 13, 2018

You certainly see OPA replacing IB -- I personally know several examples, though some HPC vendors now seem to be making it difficult to buy. It has a fairly healthy showing at top500, considering it hasn't been widely available for long. I know the Mellanox propaganda, and have some experience of it, along with promises of help to make it work that evaporated.

I can't remember where to look for OPA's features to help MPI implementation, but someone else might be able to comment.

yaschobob · on June 13, 2018

This is very short sighted. Just wait a month or so and this machine will not look like anything special. ORNL typically doesn't use Intel.

taliesinb · on June 12, 2018

> 27,648 Nvidia Tesla V100 GPU modules

holy shit

dogma1138 · on June 12, 2018

I always wondered what’s the bulk discount on these since it’s north of $200,000,000 for GPUs alone.

gh02t · on June 13, 2018

This article suggests that Summit cost somewhere in the low $100 million range and last I heard from DoE that was accurate, so they aren't exactly paying retail.

https://www.nextplatform.com/2016/11/20/details-emerge-summi...

Price isn't really a concern with these computers, though, because the sort of experimental work they are intended to "replace" (using that loosely, since a lot of the things they simulate are impossible to actually do) is far more expensive. Leadership-grade computing is all about enabling new classes of problems to be solved.

civilitty · on June 13, 2018

Regardless of the leadership position, most of the time a bunch of money can be allocated from nuke maintenance budgets. Maintaining a massive nuclear arsenal without testing requires a ton of computing power and the cost of upgrading supercomputers a few times a decade is a rounding error.

gh02t · on June 13, 2018

Summit is an open science machine (meaning any researcher can apply for time, but it's very competitive so getting an allocation is difficult) and is not budgeted to NNSA for stockpile maintenance as far as I know. It should be running mostly non-classified scientific research, though there is a provision for running sensitive research like stuff related to operating nuclear plants. Sierra at LLNL is the one that is primarily targeted towards classified weapons-related research and NNSA operates independently within DoE.

civilitty · on June 13, 2018

It's not budgeted for it but I heard that some of the FastForward 2 program money ended up going to ORNL because NNSA needed to do some early stage verification that was prohibitively expensive for them to do on their own but trivial for Summit (since they were already getting the hardware).

AFAIK there isn't a single publicly owned supercomputer in the US that wasn't funded in small part by stockpile maintenance budgets, even if they were never used for that purpose.

gh02t · on June 14, 2018

Very likely true. I know that stockpile maintenance is at least a tangential concern for all of them, even if it's just to have something available in reserve. Budget allocations are complex beasties, especially in DoE which pursues a variety of missions. The vast majority of work on the ORNL computers is not weapons-related, however.

IncRnd · on June 13, 2018

I'm not sure you can get a discount for any purchase, since 100% will be sold anyway.

dboreham · on June 13, 2018

In high volume you cut out various middlemen so the price can be lower while the per-die revenue remains the same.

dogma1138 · on June 13, 2018

There isn’t a middleman for this configuration you can only get the NVLINK ones from NVIDIA.

dboreham · on June 14, 2018

I don't know for sure obviously, but based on my time in the semiconductor industry they will force lower volume sales through distribution (even if the part is not generally available). This is for financial risk and supply chain reasons (e.g. RMAs, shipping/customs/tax/duty).

ianai · on June 13, 2018

There’s still something to be said for inherent price insensitivity for the national labs.

tempdeadbeef · on June 12, 2018

Bulk discount when currency miners will pay full price?

quadruplebond · on June 13, 2018

Nvidia was probably under contract for this before mining blew up to where it is today. Provisioning for these machines starts 5-10 years before they ever come online.

dogma1138 · on June 13, 2018

This is a product that is only available from NVIDIA directly there is zero mining pressure on this specific product and channel.

kristianp · on June 13, 2018

Why do they need expensive Power 9s when the bulk of the processing is done by GPUs?

quadruplebond · on June 13, 2018

Partly because we don’t want to be locked into using Intel for both of DOE’s open science machines. In the lead up to the exascale or near exascale machines (2012-now) I don’t remember hearing much about AMD, so presumably Intel and IBM were the only vendors supplying general purpose CPUs.

bryanlarsen · on June 13, 2018

Because Power 9's have on-chip NVLink ports.

JohannFlobuster · on June 13, 2018

This. CPUs are hungry.

crb002 · on June 13, 2018

Because all the Blue Gene/L/P/Q code is POWER.

quadruplebond · on June 13, 2018

Sort of, but Blue Gene tech was always an Argonne thing, Oak Ridge has been the one doing GPUs for awhile now. Unless something weird happens Argonne and Oak Ridge will always be running different types of computer since we don’t want to be locked into one specific vendor or hardware configuration.

eslaught · on June 13, 2018

Aren't all the new DOE machines GPU at this point? I understand A21 and NERSC9 both are, though the vendor isn't announced yet as far as I'm aware.

code_duck · on June 13, 2018

I drive by a site with large signs declaring it to be an Exascale computing construction project at Los Alamos, but I'm not sure what they're building. I'm sure they've announced it somewhere.

agumonkey · on June 12, 2018

I think I still have my nv3 graphics card. Fun to think of how Nvidia grew.

marc1984 · on June 12, 2018

[flagged]

code_duck · on June 13, 2018

I'd go for 'imagine a Beowulf cluster of these'

newnewpdro · on June 12, 2018

"Competition remains stuff."

newnewpdro · on June 12, 2018

Is there anything preventing them from mining crypto as a burn-in to help subsidize the costs?

qop · on June 13, 2018

Yeah, common sense.

Also, actual science work that they have to do.

newnewpdro · on June 13, 2018

> Yeah, common sense

Could you explain

> Also, actual science work that they have to do.

Presumably these clusters aren't 100% utilized 100% of the time. They certainly weren't at the national lab I had access to ages ago...

My question was more inquiring if there's any red tape preventing the laboratory from paying the bills with mining, if it could do so profitably. Let's just assume there's idle time where this could hypothetically occur.

I'm not even trying to suggest that they should do that, it's just an interesting relatively new possibility and these things are quite expensive.

mabbo · on June 13, 2018

The cost of the energy to do so would not be worth the bitcoins mined. Maybe some alternative crypto-currencies would break even, but it wouldn't be easy.

Sitting idle, the machine is not using nearly the power draw of when it's running full-tilt.

jakeogh · on June 13, 2018

I'd be surprised if it sits idle at all. Someone correct me, but it's gotta have a job queue, and a priority for each. There are plenty of problems (every single one submitted) where it's energy consumption is irrelevant.

ianai · on June 13, 2018

https://slurm.schedmd.com/

It still depends on scientists to produce code that can run during off hours. (And how well are scientists known to code?)

bob_theslob646 · on June 13, 2018

>The cost of the energy to do so would not be worth the bitcoins mined. Maybe some alternative crypto-currencies would break even, but it wouldn't be easy.

Source?

ianai · on June 13, 2018

No one wants the national labs accepting payments from anyone but the US government. Especially if they’re selling “digital assets.” I’m trying very hard to not litter this reply with swear words. Concerns about national security abound.

vertexFarm · on June 13, 2018

lol yeah, they have actual productive things to work on.