What's particularly interesting here is that the Fiji card they propose is a very different beast than any of the NVIDIA offerings.
The MI8 card's HBM gives it a big power and performance advantage (512 GB/s peak bandwidth) even though it's on 28 nm. NVIDIA has nothing with even remotely comparable bandwidth in this price/perf/TDP regime.
None of the NVIDIA GP10[24] Teslas have GDDR5X -- not too surprising given that it was rushed to market, riddled with issues, and barely faster than GDDR5. Hence, the P4 has only 192 GB/s peak BW; while the P40 does have 346 GB/s peak, it has a far higher TDP and a different form factor, and is not intended for cramming into custom servers.
[I don't work in the field, but] To the best of my knowledge inference is often memory bound (AFAIK GEMV-intensive, so low flops/byte), so the Fiji card should be pretty good at inference. In such use cases the GP102 can't compete in bandwidth. So compared to the P4, the MI8, with ~1.5x the flop rate, ~2.5x the bandwidth, and likely ~2x the TDP (possibly configurable like the P4), offers an interesting architectural balance that might well be appealing for certain memory-bound use cases -- unless of course those same cases also need large memory.
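A minimal back-of-the-envelope sketch of that roofline argument (the peak fp32 and bandwidth figures are the public spec-sheet numbers; the ~0.5 flop/byte intensity is the textbook figure for fp32 GEMV, which does ~2*M*N flops while streaming ~4*M*N bytes of matrix):

    # Attainable GFLOP/s = min(compute roof, bandwidth * arithmetic intensity)
    def roofline_gflops(peak_gflops, peak_gbps, flops_per_byte=0.5):
        return min(peak_gflops, peak_gbps * flops_per_byte)

    cards = {  # (peak fp32 GFLOP/s, peak GB/s), per spec sheets
        "MI8 (Fiji)":  (8200, 512),
        "P4 (GP104)":  (5500, 192),
        "P40 (GP102)": (12000, 346),
    }
    for name, (gflops, gbps) in cards.items():
        print(f"{name}: GEMV-bound at ~{roofline_gflops(gflops, gbps):.0f} GFLOP/s")

At 0.5 flop/byte all three cards come out squarely bandwidth-bound (~256 vs ~96 vs ~173 GFLOP/s), which is why the raw flop numbers don't tell the story here.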
Update: I should have looked closer at the benchmarks in the announcement, in particular the MIOpen benchmarks [1]. The MI8 clearly beating even the Titan X Pascal (which has higher BW than the P40) indicates that this card will be pretty good for latency-sensitive inference, as long as things fit in 4 GB.
> Not sure if AMD is going to go all HBM on all its high-performance GPUs in 2017, or only offer one or two models with it.
It would make perfect sense to have some GDDR5X-based mid-range GPUs. HBM2 will be expensive -- too expensive even for the top of the mid-range (and the same applies to NVIDIA). GDDR5X has plenty of room for improvement over GDDR5, and by next year they should have it better figured out.
From the photos and perf numbers it looks like the lineup is the RX 480, R9 Nano, and whatever Vega 10 gets called, minus some of the connectors and passively cooled.
The PCIe NVIDIA P100 has a peak memory bandwidth of 730 GB/s at 250 W, which is almost exactly the same BW/watt as the MI8 (well, there are two PCIe P100s; I mean the better one).
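For what it's worth, the arithmetic on that checks out (taking the MI8's quoted ~175 W board power; both are peak, not measured, numbers):

    cards = {
        "P100 PCIe 16GB": (730, 250),  # the better of the two PCIe P100s
        "MI8":            (512, 175),
    }
    for name, (gbps, watts) in cards.items():
        print(f"{name}: {gbps / watts:.2f} GB/s per watt")
    # -> 2.92 vs 2.93 GB/s per watt, essentially identical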
Those MIOpen benchmarks are a bit dubious, since MIOpen is AMD's own deep learning library. It's unlikely that code written by AMD is optimal for the NVIDIA hardware.
To be realistic you need to compare AMD hardware running MIOpen to NV hardware running a framework backed by cuDNN.
It's clearly indicated on the slide that those are DeepBench [1] GEMM and GEMM-convolution numbers. The data for the M40, TITAN Maxwell/Pascal, and Intel KNL is actually provided by Baidu in their GitHub repo.
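For anyone unfamiliar with DeepBench: each GEMM entry just times a matrix multiply at a fixed workload shape and reports the effective flop rate. A rough CPU-side sketch of the methodology (the real benchmark calls the vendor BLAS on the GPU; the shape below is meant to be representative of the inference-style sizes in the repo, not an exact entry):

    import time
    import numpy as np

    M, N, K = 1760, 128, 1760  # illustrative DeepBench-style shape
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    t0 = time.perf_counter()
    C = A @ B  # the timed GEMM
    t = time.perf_counter() - t0
    print(f"{2 * M * N * K / t / 1e9:.1f} effective GFLOP/s")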
[1] http://images.anandtech.com/doci/10905/AMD%20Radeon%20Instin...