Nvidia launches colossal HGX-2 cloud server to power HPC and AI (techcrunch.com)
60 points by twtw on May 31, 2018 | 21 comments



We built our own 160 TFLOPS (multi-machine) GPU setup for ~20k€. Granted, it's not the same quality level as the HGX-2, but it gives a reference point for computing power costs nowadays.


With HGX you can have all the chips act as one giant 'card' with unified memory. That's something you can't really DIY. Even with Quadro you can hook up two cards at a time via NVLink, and that's about it.


Yep. This is a huge distinction for developers. Hodgepodge ("NUMA") systems can easily have the problem of "GPU 1 can quickly talk to GPU 2 but only slowly to GPU 3", which is a productivity killer on code that is already > 10X harder to write to begin with. GPU teams like ours (Graphistry) make simplifying assumptions about the hardware based on where things are going... and uniform memory access (within a cache hierarchy level) is one of them. In this case, we assumed saturating a single GPU is enough, and write kernels such that, as single-node multi-GPU boxes like this become mainstream, using them stays within reach. For initiatives like the GPU dataframe devs in the GOAI project, same thing. Today's office conversation: what will this look like for cloud multinode (e.g., rack)? :)
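
A minimal sketch of how that asymmetry shows up in practice, using the standard CUDA peer-access query (the device loop is illustrative, not anything Graphistry-specific):

    // Print which GPU pairs have a direct peer (P2P) path. On a "hodgepodge"
    // box this is a mix of yes/no; on an NVSwitch-style system every pair
    // reports yes, which is what lets code assume uniform access.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int src = 0; src < n; ++src) {
            for (int dst = 0; dst < n; ++dst) {
                if (src == dst) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, src, dst);
                printf("GPU %d -> GPU %d peer access: %s\n",
                       src, dst, canAccess ? "yes" : "no");
            }
        }
        return 0;
    }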

AND... if you like frontend js or fullstack js, and think this stuff is cool, we're actively hiring :) See https://www.graphistry.com/blog/js-gpus-ml-arrow-goai for our thoughts here.


Not just that: they showed at GTC that for small transfer sizes, access time to memory on a local card is very close to that of remote cards over NVLink, which is great.
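
A rough sketch of how one could check that with CUDA events; the transfer size, device IDs, and missing error checks are illustrative, and a real benchmark would average many iterations:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 4 * 1024;   // a "small transfer"
        void *a0 = nullptr, *b0 = nullptr, *b1 = nullptr;
        cudaSetDevice(0); cudaMalloc(&a0, bytes); cudaMalloc(&b0, bytes);
        cudaSetDevice(1); cudaMalloc(&b1, bytes);

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 reach GPU 1 directly

        cudaEvent_t start, mid, stop;
        cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpyAsync(b0, a0, bytes, cudaMemcpyDeviceToDevice, 0);  // local
        cudaEventRecord(mid);
        cudaMemcpyPeerAsync(b1, 1, a0, 0, bytes, 0);                  // remote
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float localMs = 0, remoteMs = 0;
        cudaEventElapsedTime(&localMs, start, mid);
        cudaEventElapsedTime(&remoteMs, mid, stop);
        printf("local %.3f ms, remote %.3f ms\n", localMs, remoteMs);
        return 0;
    }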


NVLink is a significant advantage.


Impressive. I've lately been wondering, though: how does one load data fast enough into such rigs to come close to saturating this kind of compute power? I felt this was an issue even with TPUs and V100s for any application that isn't thoroughly bottlenecked by compute (e.g., one could spend hours loading up a petabyte, then process it all in tens of minutes).


It is challenging. Most likely your data at rest are compressed. Go with the highest-end SSDs money can buy, RAID 0 them, and lay out your compressed data as sequentially as possible. You still need a trip through main memory to decode, though, and then the bandwidth between device and main memory becomes the bottleneck. Alternatively, you could write a decompressor that runs in device memory directly; I haven't heard of anyone doing that yet.
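
A sketch of that main-memory path, assuming a hypothetical CPU-side decode_chunk() decompressor: decode into pinned buffers, then use per-chunk streams so the host-to-device copies overlap with the GPU work. Chunk count, chunk size, and launch dimensions are illustrative.

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void process(const char *data, size_t n) {
        // Stand-in for the real GPU work on a decoded chunk.
        (void)data; (void)n;
    }

    // Hypothetical stand-in for a real CPU-side decompressor.
    void decode_chunk(int i, char *dst, size_t n) { memset(dst, i, n); }

    int main() {
        const int    kChunks    = 4;
        const size_t kChunkSize = 64 << 20;   // 64 MiB per chunk

        char        *host[kChunks], *dev[kChunks];
        cudaStream_t stream[kChunks];
        for (int i = 0; i < kChunks; ++i) {
            cudaMallocHost((void **)&host[i], kChunkSize);  // pinned => async copies
            cudaMalloc((void **)&dev[i], kChunkSize);
            cudaStreamCreate(&stream[i]);
        }

        for (int i = 0; i < kChunks; ++i) {
            decode_chunk(i, host[i], kChunkSize);                  // CPU decode
            cudaMemcpyAsync(dev[i], host[i], kChunkSize,
                            cudaMemcpyHostToDevice, stream[i]);    // overlapped H2D
            process<<<256, 256, 0, stream[i]>>>(dev[i], kChunkSize);
        }
        cudaDeviceSynchronize();
        return 0;
    }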


I don't think it's supposed to be saturated by a single user. Looks like it's meant for cloud providers to farm out to their clients:

> Nvidia is distributing them strictly to resellers, who will likely package these babies up and sell them to hyperscale data centers and cloud providers.


Considering that its main advantage is 16 GPUs addressable as a single GPU through NVLink, which is a memory-coherent fabric, it's not designed to be a multi-user system; quite the contrary, it's designed to run a single application very, very quickly. Cloud providers are buying it because most users can't afford it.


It depends on the task. Some tasks require a bunch of GPU processing where it churns for a while before getting new data, while others process data streaming in. The DGX-2 has eight 100 Gbps NICs, so you can load data very fast.


Layman question: I know that this is not the use case for these and would be way overkill, but what would happen if you tried to run a game on one of these things? Do they still have the ability to render things, or are they just for CUDA core / Tensor Core computing?


They are not graphics cards: no video outputs (HDMI, DisplayPort) and no SLI support. So for gaming, the best you can get is a Titan V.


If I understand correctly, games have to specifically opt in to SLI (the ability to use two consumer graphics cards in parallel/concert), so I would imagine it's a similar situation here: by default, the game would render on just one of the 16 GPUs.


A bit off topic: lately there have been a few cryptocurrency attacks where the attacker has taken control of more than 50% of the network's computing power. Many cryptocurrencies are only mineable with GPUs. Could these servers be rented to launch this kind of attack?


This server is not the best way to get maximum FLOPS. It provides the best available GPU to GPU transfer bandwidth and latency. Mining doesn't need that, so you could get much better performance for much cheaper by renting or buying individual, disconnected GPUs.


Yes, but at a huge premium to the hash power one could generate with a DIY consumer GPU rig. Crypto mining with these would be like shopping at Home Depot in a Ferrari.


Indeed. Miners don't care about interconnect bandwidth or peering between cards. One popular rig I've seen is an 8x PCIe ATX motherboard with eight GPUs, but only a Celeron for the CPU. Each of the PCIe slots has an x16 form factor but only x1 throughput.


Still weird to me that they continue to give each GPU both low-precision and high-precision silicon. Why not just make two parts, one high-precision and one low-precision?


Most ALUs can reuse their components and be reconfigured between low and high precision relatively easily. The overhead cost tends to be in the single-digit percent range, which is generally considered worth it for the added flexibility.


I remember hearing that V100 FP32 cores can be used as pairs of FP16 cores for specific use cases; however, I don't think the FP64 cores are decomposable in the same way.
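
A minimal sketch of that packed-FP16 idea: one half2 FMA operates on two FP16 values at once, which is how an FP32-width datapath gets reused for double-rate FP16 on parts that support it. This assumes a GPU with native FP16 arithmetic (sm_53 or newer); values and sizes are illustrative.

    #include <cstdio>
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>

    __global__ void init(half2 *a, half2 *b, half2 *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            a[i] = __float2half2_rn(1.5f);   // both FP16 lanes set to 1.5
            b[i] = __float2half2_rn(2.0f);
            c[i] = __float2half2_rn(0.5f);
        }
    }

    __global__ void fma_half2(const half2 *a, const half2 *b, const half2 *c,
                              float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            half2 r = __hfma2(a[i], b[i], c[i]);   // two FP16 FMAs per instruction
            out[i] = __low2float(r);               // unpack one lane to inspect
        }
    }

    int main() {
        const int n = 1024;
        half2 *a, *b, *c;
        float *out;
        cudaMalloc((void **)&a, n * sizeof(half2));
        cudaMalloc((void **)&b, n * sizeof(half2));
        cudaMalloc((void **)&c, n * sizeof(half2));
        cudaMallocManaged((void **)&out, n * sizeof(float));

        init<<<(n + 255) / 256, 256>>>(a, b, c, n);
        fma_half2<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
        cudaDeviceSynchronize();
        printf("out[0] = %f\n", out[0]);   // expect 1.5 * 2.0 + 0.5 = 3.5
        return 0;
    }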


price?



