
YARN is just a resource manager on top of which Hadoop jobs (e.g. Hive, Pig) are run.

It is analogous to a set of Docker containers distributed across nodes. The same methods you would use to synchronize state in that situation you could use with YARN, for example a persistent distributed system such as Hazelcast to handle system failures and checkpointing.

I am not saying this is some amazing solution to every HPC problem, only that Hadoop is far, far more flexible than many people give it credit for.




I understand your last paragraph. Looking this time at Hazelcast, what I see is layers of code to understand before being able to do something simple. It really does look like all of the technology you are pointing to is solving a different problem. It's not related to any of the HPC needs I've heard of.

Parts of my simulation are out of phase. I need some gather step to collect the data from individual nodes, when a given timestep is reached, and save the state. A simple solution is to do a barrier every ~30 minutes, send to the master node, and have it save the data.
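Roughly what I mean, sketched in MPI-flavored C (the buffer size, checkpoint interval, and file name are made up, and a fixed step count stands in for the ~30-minute timer):

    /* Sketch: periodically barrier, gather each node's state to rank 0,
       and have rank 0 write a checkpoint file. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STATE_DOUBLES    1000000   /* per-node state size (illustrative) */
    #define CHECKPOINT_EVERY 1000      /* timesteps between checkpoints */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *local_state = calloc(STATE_DOUBLES, sizeof(double));
        double *all = (rank == 0)
            ? malloc((size_t)nprocs * STATE_DOUBLES * sizeof(double)) : NULL;

        for (int step = 1; step <= 100000; step++) {
            /* ... advance this node's piece of the simulation ... */

            if (step % CHECKPOINT_EVERY != 0)
                continue;

            MPI_Barrier(MPI_COMM_WORLD);   /* wait until every node reaches here */
            MPI_Gather(local_state, STATE_DOUBLES, MPI_DOUBLE,
                       all, STATE_DOUBLES, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            if (rank == 0) {               /* master node saves the global state */
                FILE *f = fopen("checkpoint.bin", "wb");
                fwrite(all, sizeof(double), (size_t)nprocs * STATE_DOUBLES, f);
                fclose(f);
            }
        }

        free(all);
        free(local_state);
        MPI_Finalize();
        return 0;
    }

If a node dies, you kill the job, reload the last checkpoint.bin on a fresh set of nodes, and continue from there.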

When I look at Hazelcast I see what looks to be a different sort of clustering - using clusters for redundancy, and not for CPU power. E.g., I see "Hazelcast keeps the backup of each data entry on multiple nodes", and I think "I don't care." If a node goes down, the system goes down, and I restart from a checkpoint. It's much more likely that one of the 512 compute nodes will go down than some database node.

I'll withdraw the "map-reduce" part of my original statement ("A map-reduce system like Hadoop...") and say simply that "a system like Hadoop isn't a good fit for HPC problems".

Here's a lovely essay which agrees with me ;) http://glennklockwood.blogspot.com.au/2014/05/hadoops-uncomf... . It considers the questions:

> Why does Hadoop remain at the fringe of high-performance computing, and what will it take for it to be a serious solution in HPC?


Why does the system go down if a node goes down? Shouldn't the system keep chugging along at a slightly reduced capacity until the node comes back online?

Sorry, I'm not involved in HPC at all. I know a little bit about Hadoop. I'm mostly interested in building online message processing and blended real-time/historical analytics. Our problem domain wouldn't want to lose all capacity if part of the system became unavailable.


There are several different aspects which make recovery hard. HPC tries to push the edge of what's possible with hardware. It does this by throwing redundancy out the window.

First, the simulation can be set up to match the hardware. One simulation program I used expected that the nodes would be set up in a ring, so that messages between i and (i+1)%N were cheap. It ran on hardware with two network ports, one forwards and one backwards in the ring. In fact, the only way to talk between non-neighbors was to forward messages through the neighbors.
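Something like this, in plain MPI terms (a sketch only - the real program drove the dedicated ring hardware directly, and the one-double payload here is just a stand-in):

    /* Sketch: each rank talks only to (i-1)%N and (i+1)%N, i.e. its two
       ring neighbors, one "forward" port and one "backward" port. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int next = (rank + 1) % nprocs;            /* forward neighbor  */
        int prev = (rank - 1 + nprocs) % nprocs;   /* backward neighbor */

        double send = (double)rank, recv = 0.0;

        /* pass our value forward; receive the backward neighbor's value */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, next, 0,
                     &recv, 1, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %g from rank %d\n", rank, recv, prev);

        MPI_Finalize();
        return 0;
    }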

If a node goes down, then the ring is broken, and the entire system goes down.

This is very different than a cluster with point-to-point communications, where a router can redirect a message to a backup node should one of the main nodes go down.

The reason for this architecture is that there's a lot of inter-node traffic. When I was working on this topic back in the 1990s, we were network limited until we switched to fiber optic/ATM. When you read about HPC you'll hear a lot about high-speed interconnects, and using DMA-based communication instead of TCP for higher performance. All of this is to reduce network contention.

Suppose there's 1GB/s of network traffic for each node. (High-end clusters use InfiniBand to get this performance.) In order to have a backup handy, all of that data for each node needs to be replicated somewhere. That's more network traffic. Presumably there are many fewer spare nodes than real nodes, since otherwise that's a lot of expensive hardware that's only rarely used. If there are 512 real nodes and 1 backup node, then that backup node has to handle 512GB/s. Of course, the backup node can die, so you really want to have several backup nodes, each with a huge amount of bandwidth.
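To make that arithmetic explicit (same hypothetical numbers, nothing measured):

    /* Sketch: aggregate bandwidth the backup nodes must absorb if every
       compute node's traffic is replicated to them. */
    #include <stdio.h>

    int main(void)
    {
        int    compute_nodes = 512;
        double gb_per_s_each = 1.0;    /* replicated traffic per compute node */

        for (int backups = 1; backups <= 8; backups *= 2)
            printf("%d backup node(s): %.0f GB/s each\n",
                   backups, compute_nodes * gb_per_s_each / backups);
        return 0;
    }

Even with 8 backup nodes, each one has to ingest 64GB/s - far beyond what a typical node's network interface can do.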

Even then, the messages only exchange part of the state data. For example, in a spatial decomposition, each node might handle (say) 1,000 cells of a larger grid. The contents of a cell can interact with each other, and with the contents of its neighbor cells, up to some small radius away. (For simplicity, assume the radius is only one cell away, so there are 26 neighbors for each cell.)
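Where the 26 comes from, in case it isn't obvious - it's just counting offsets, nothing simulation-specific:

    /* Sketch: enumerate a cell's neighbor offsets within radius 1 on a
       3-D grid; everything except (0,0,0) counts, giving 3^3 - 1 = 26. */
    #include <stdio.h>

    int main(void)
    {
        int neighbors = 0;
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dz = -1; dz <= 1; dz++)
                    if (dx || dy || dz)      /* skip the cell itself */
                        neighbors++;
        printf("neighbors per cell: %d\n", neighbors);   /* prints 26 */
        return 0;
    }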

If one node hosts a cell and another node hosts one of its neighbors, then at each step they will have to exchange cell contents in order to compute the interactions. This adds network overhead.

On the other hand, a good spatial decomposition will minimize the amount of network traffic by putting most neighbors on the same machine. After all, memory bandwidth is higher than network bandwidth, and doesn't have the same contention issues.
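A back-of-the-envelope way to see it (block sizes here are made up; only the surface-to-volume counting matters): the bigger the block of cells a node owns, the smaller the fraction of cells whose neighbors might live on another node.

    /* Sketch: fraction of cells in a b*b*b block that sit on its surface,
       i.e. the cells whose radius-1 neighbors may be on another node. */
    #include <stdio.h>

    int main(void)
    {
        for (int b = 2; b <= 32; b *= 2) {
            long total    = (long)b * b * b;
            long interior = (long)(b - 2) * (b - 2) * (b - 2);
            long surface  = total - interior;
            printf("block %2d^3: %5ld of %6ld cells (%.0f%%) touch the boundary\n",
                   b, surface, total, 100.0 * surface / total);
        }
        return 0;
    }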

But this means that the node has mutating state which isn't easily observed by recording and replaying the network. Instead, the backup node needs to get a complete state update of the entire system.

This is a checkpoint. But notice that I used a spatial decomposition to minimize network usage precisely by not sending all of the data all of the time? Keeping a live backup throws that out the window. Now I need to checkpoint all of the time, and have the ability to replay the network requests that the node is involved in, should it go down.

This is complicated, and will likely exceed what the hardware can do, given that it's already using high-end hardware for the normal operations.


Ah, I see. I didn't realize so much of the problem domain was at the network layer.

For our domain, we'll gladly accept the increased network cost and node redundancy for durability, because most of the work ends up not involving other nodes (most of our computations can occur wherever the data is stored, and mutations, aside from appends, are infrequent).

Thank you for giving me some context.



