Field programmable gate array that's 4.2x faster than a 16 core CPU (hpcwire.com)
36 points by ColinWright on April 19, 2012 | 21 comments



The title doesn't mean much if you don't specify at what. It's only about floating point, and the comparison is with a purely theoretical CPU.


Also, two of the article's four authors work for Xilinx, which makes the FPGA in question.

Does anyone know how much these things cost? A quick google yielded nothing, but I may be using the wrong terms.


Check farnell.com. They don't seem to stock the Virtex-7 family yet, but the price range of the Virtex-6 family might provide some guidance (£350-£850).

Edit: This is for the naked chip, not a board. There's quite a discrepancy with int19h's findings; I don't know if that can all be down to the board?


Boards for high-end FPGAs like that are not cheap to make yourself either (you need something like 6 layers or more, because they have so many densely packed pins, plus one or two layers just to feed their gratuitous hunger for power), so you'll probably end up having to buy a ready-to-use solution, which probably quadruples that price...


I found one board, EK-V7-VC707-CES-G, for USD3.5k

The only time I was ever "in the market" for a board was in 2007, and back then it was a struggle getting parts; this was in the days of FX/LX/SX availability, and I wanted an SX part (with many DSP units) but was basically told no unless I wanted to buy over 50 units. So I had to settle for the bog-standard FX. I paid USD 5k for it (a premium for InfiniBand connectivity).


I do not think they crippled that theoretical CPU, though:

  "The floating point performance for the reference microprocessor
   is calculated by multiplying the number of floating point
   function units on each core by the number of cores and by the
   clock frequency.
   ...
   this article series has been using a normalized value of 2.5 GHz
   clock frequency."
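
To make that concrete, here's a quick sketch of the arithmetic they describe (the 4 FLOPs/cycle/core figure below is an assumption for illustration, not a number from the article):

    /* Back-of-the-envelope peak-FLOPS calculation as described above.
       flops_per_cycle_core = 4 is an assumed figure, not from the article. */
    #include <stdio.h>

    int main(void) {
        double cores                = 16;
        double clock_ghz            = 2.5; /* normalized clock from the article */
        double flops_per_cycle_core = 4;   /* assumption: e.g. 2 FP units x 2-wide ops */

        printf("theoretical peak: %.0f DP GFLOPS\n",
               cores * clock_ghz * flops_per_cycle_core); /* prints 160 */
        return 0;
    }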


"Field programmable gate array that's 4.2x faster than a 16 core CPU", theoretically and only in regards to 64-bit floating point arithmetic.

What's with the link-baity titles lately?


I'm quoting directly from the article:

    Comparing theoretical peaks for 64-bit floating point
    arithmetic, the current generation of Xilinx’s Virtex-7
    FPGAs is about 4.2 times faster than a 16-core microprocessor. 
And with regard to your question about titles "lately", I'd be interested to see what other submissions I've made that you think are "link-baity".

Thanks.


I think it was in reference to other titles submitted lately in general, not necessarily by you.


Sorry, I didn't mean that you have submitted link-baity titles lately. I don't think I've read any of your submissions and if I did I wouldn't know it :)

The thing is, you left out some information that transforms the way the title reads to people who haven't read the article. People will think, as I did, "Wow, they've made improvements to FPGAs and got them way faster than CPUs", click through, and find out that the performance gains are currently only theoretical, not empirical, and also that the 4.2x number is only for a very specific type of problem.

Whether intentional or not, the title implies something greater than the article reports. That's annoying; I like article titles to be informative, not inflationary.


I think that title is a fair summary. Any title has to be taken with a grain of salt until one reads the article, anyway.


Anyone have an idea of how this would compare with current GPU performance? My impression is that GPUs are currently way ahead of CPUs in floating point performance (though maybe not for 64-bit?).

EDIT: To make this question a bit more specific, say I wanted to develop a really fast neural net implementation, which basically reduces to matrix-vector multiplication and function interpolation. Would I be better off looking to do this with a GPU or an FPGA given the current state of both technologies? (There's a rough sketch of the kernel I have in mind below.)

From what little experience I've had with GPUs, I think bandwidth to the device might be a limiting factor, but I'm guessing this would affect either type of co-processor.
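
To be concrete, the kernel I have in mind for each layer is essentially the one below; the sizes and the activation function are just placeholders:

    /* One neural-net layer: y = f(W*x), i.e. a matrix-vector product followed
       by an element-wise activation. Sizes and activation are placeholders. */
    #include <math.h>

    #define N_IN  1024
    #define N_OUT 1024

    void layer(const float W[N_OUT][N_IN], const float x[N_IN], float y[N_OUT]) {
        for (int i = 0; i < N_OUT; i++) {
            float acc = 0.0f;
            for (int j = 0; j < N_IN; j++)
                acc += W[i][j] * x[j];   /* each W[i][j] is read exactly once */
            y[i] = tanhf(acc);           /* activation / interpolation step */
        }
    }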


In my experience, it's not bandwidth that is the limiting factor, but latency. You'll hit the same problem with FPGAs if you're using one as a co-processor, as they are typically connected to the motherboard over PCI Express. If the vectors you're using are small (where "small" means small enough to easily fit into an L1 cache on a processor), then you probably won't see any performance improvement by offloading the computation to an accelerator.

I say this because in a matrix-vector multiplication, only the vector has data-reuse. You do a single pass over the matrix. I wrote a paper where latency killed any performance benefit from using a GPU, because the computation we performed did only a single pass over the data: http://people.cs.vt.edu/~scschnei/papers/debs2010.pdf If you're doing a matrix-matrix multiplication, then that's a different story because each element in each matrix will be reused.
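
A back-of-the-envelope way to see that data-reuse difference (the matrix size here is purely illustrative):

    /* Rough arithmetic intensity (flops per byte of data moved) for
       double-precision NxN operands, assuming each operand is streamed
       from memory once. Illustrative numbers only. */
    #include <stdio.h>

    int main(void) {
        double n = 4096.0;                          /* illustrative size */
        double matvec = (2*n*n) / (8*n*n);          /* ~0.25 flops/byte  */
        double matmat = (2*n*n*n) / (3 * 8*n*n);    /* ~n/12 flops/byte  */

        printf("matrix-vector: %.2f flops/byte\n", matvec);
        printf("matrix-matrix: %.0f flops/byte\n", matmat);
        return 0;
    }

With so little reuse, the single pass over the matrix is dominated by data movement, which is why the transfer cost to an accelerator is hard to amortize.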


GPUs have multi-gigabit bandwidth and 1.5-6+ GB of on-board RAM. GFLOP performance varies significantly, though.

The Radeon 7970 has 947 GFLOPS double precision, but Nvidia cripples its GeForce series to ~100 GFLOPS to force people to pay for a Quadro 6000, which has 515.2 GFLOPS double precision. Though if it's a large project, paying for some Quadros is probably worth the cost for better software support and more RAM, IMO.

The problem with FPGAs is that they cost about as much but take a lot more effort to get anywhere close to those performance numbers. However, they are great if you have some very odd, specific needs and plan on moving to custom chips in the future; e.g., you want to build a custom video encoder and plan on mass-producing your own chips, so you already need to develop at really low levels.


What's a good text for learning to program these? (Or perhaps series of texts, as my knowledge of electronics and computational hardware is very superficial.)


In grad school, I took a configurable computing course in the ECE department. I'm a CS guy - I had never done any hardware design before. You may benefit from reading over my short writeups of the assignments: http://people.cs.vt.edu/~scschnei/ece5530/

I recall that in trying to describe the impact of the web to typical business folks, Douglas Adams compared it to trying to explain the ocean to a river: first, you have to understand that river rules no longer apply. Hardware is similar. First, you have to understand that software rules no longer apply. If you dive into this even a little, I predict you will be shocked (much as I was) how much of your concept of "computation" is tied up in sequential, memory-hierarchy based processors.


  > If you dive into this even a little, I predict you will be shocked 
Thanks, sounds like my kind of ride.


A good way to start is to learn one of the hardware description languages. I liked the book by Pong P. Chu, "FPGA Prototyping by VHDL Examples: Xilinx Spartan-3 Version". The same book is also available for Verilog, which is another HDL. Later on you can take a look at higher-level HDLs, since creating hardware in VHDL and Verilog is tedious.


It might be tedious, but as an EE, I've never seen or heard of anyone using anything besides VHDL and Verilog to describe digital hardware designs. What sort of high level HDLs do people typically use, and for what purpose?


Check out SystemC (http://en.wikipedia.org/wiki/SystemC).


The article mentioned AutoESL, which compiles C, C++, or SystemC to Verilog/VHDL. This allows you to focus on the algorithmic, or behavioral, level. The advantages are plenty, but the main drawback is that it is one more level abstracted away from the hardware.
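
For a sense of what working at the behavioral level looks like, the input to such a tool is just ordinary C; a kernel like the sketch below (any tool-specific pragmas or interface directives are omitted here, since those vary by tool):

    /* A plain C kernel of the kind an HLS tool can turn into an RTL datapath.
       Tool-specific pragmas/directives (pipelining, interfaces) are omitted. */
    #define TAPS 8
    #define LEN  256

    void fir(const int coeff[TAPS], const int x[LEN], int y[LEN]) {
        for (int n = TAPS - 1; n < LEN; n++) {
            int acc = 0;
            for (int k = 0; k < TAPS; k++)
                acc += coeff[k] * x[n - k];   /* multiply-accumulate -> DSP slices */
            y[n] = acc;
        }
    }

The tool then schedules and pipelines those loops into a datapath, which is what the generated Verilog/VHDL describes.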



