Hacker News

Anyone have an idea how this would compare with current GPU performance? My impression is that GPUs are currently way ahead of CPUs in floating-point performance (though maybe not for 64-bit?).

EDIT: To make this question a bit more specific, say I wanted to develop a really fast neural net implementation, which basically reduces to matrix-vector multiplication and function interpolation. Would I be better off doing this with a GPU or an FPGA, given the current state of both technologies?

From what little experience I've had with GPUs, I think bandwidth to the device might be a limiting factor, but I'm guessing this would affect either type of co-processor.
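For concreteness, the workload being asked about is essentially a dense layer applied to an input vector. A minimal NumPy sketch (the shapes and the tanh nonlinearity here are my own illustration, not from the question):

```python
import numpy as np

# Hypothetical dense layer: y = f(W @ x), where f is an element-wise
# nonlinearity that could also be implemented by table interpolation.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 784))   # weight matrix
x = rng.standard_normal(784)          # input vector
y = np.tanh(W @ x)                    # matrix-vector product + activation

print(y.shape)  # (256,)
```

The matrix-vector product dominates the arithmetic; the element-wise activation is cheap by comparison, which is why the GPU-vs-FPGA question mostly comes down to how fast each can stream the weight matrix.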




In my experience, it's not bandwidth that is the limiting factor, but latency. You'll hit the same problem with FPGAs if you're using them as a co-processor, as they are typically connected to the motherboard over PCI Express. If the vectors you're using are small (where "small" means small enough to easily fit into an L1 cache on a processor), then you probably won't see any performance improvement by offloading the computation to an accelerator.

I say this because in a matrix-vector multiplication, only the vector has data reuse. You do a single pass over the matrix. I wrote a paper where latency killed any performance benefit from using a GPU, because the computation we performed did only a single pass over the data (http://people.cs.vt.edu/~scschnei/papers/debs2010.pdf). If you're doing a matrix-matrix multiplication, then that's a different story, because each element in each matrix will be reused.
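The single-pass point can be made concrete by counting flops per byte transferred. A back-of-the-envelope sketch (assuming 8-byte doubles and that each operand crosses the bus exactly once; the function names are mine):

```python
# Rough arithmetic intensity (flops per byte moved) for an n x n problem.

def intensity_matvec(n):
    flops = 2 * n * n                   # n^2 multiply-adds
    bytes_moved = 8 * (n * n + 2 * n)   # matrix once, vector in, vector out
    return flops / bytes_moved

def intensity_matmul(n):
    flops = 2 * n ** 3                  # n^3 multiply-adds
    bytes_moved = 8 * 3 * n * n         # each of the three matrices once
    return flops / bytes_moved

print(intensity_matvec(4096))  # ~0.25 flops/byte, regardless of n
print(intensity_matmul(4096))  # ~341 flops/byte, grows with n
```

Matrix-vector stays at about a quarter of a flop per byte no matter how big the problem gets, so it's bound by the bus, while matrix-matrix intensity grows linearly with n, which is what lets an accelerator actually earn its transfer cost.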


GPUs have on-board memory bandwidth in the hundreds of GB/s and 1.5-6+ GB of on-board RAM. The GFLOP performance varies significantly, though.

The Radeon 7970 has 947 GFLOPS double precision, but Nvidia cripples its GeForce series to around 100 GFLOPS to force people to pay for a Quadro 6000, which has 515.2 GFLOPS double precision. Though if it's a large project, paying for some Quadros is probably worth the cost for the better software support and more RAM, IMO.

The problem with FPGAs is that they cost about as much but take a lot more effort to get anywhere close to those performance numbers. However, they are great if you have some very odd, specific needs and plan on moving to custom chips in the future. E.g., you want to build a custom video encoder and plan on mass-producing your own chips, so you already need to develop at a really low level.



