
If the focus of your work is to replace discrete embedded processor blocks with FPGAs, sure, copy, paste and include might get you pretty far. That is not the case for all applications, not by a long shot. For example, I had to build a DDR memory controller from scratch in order to squeeze the last clock cycle of performance out of the device. Off-the-shelf cores are often very --very-- general purpose, badly written and poorly documented. The same can be true of real-time image processing constructs, where something like a hand-coded polyphase FIR filter can easily run twice as fast as the plug-and-play modules floating about.

Then there's the element of surprise. If, for example, I was developing an FPGA-based board for a drone or a medical device, I would, more than likely, require that 100% of the design be done in house (or that crazy-extensive testing be done on outside modules).

Anyone in software has had the experience of using some open-source module to save time, only to end up paying for it dearly when something doesn't work correctly and help isn't forthcoming. If the software you are working on is for a life-support device, it is very likely that taking this approach is actually prohibited, and for good reason.

While I fully understand your point of view, this is one that reduces software and hardware development to simply wiring together a bunch of includes. In my experience this isn't even reality in the most trivial of non-trivial real-world projects.

FPGAs are not software.

I see these "FPGAs for the masses" articles pop up every so often. Here's what's interesting to me. If you are an engineer schooled in digital circuit design, developing with FPGAs is a piece of cake. There's nothing difficult about it at all, particularly when compared to the old days of wire-wrapping prototypes out of discrete chips. Sure, there can be a bit of tedium and repetition here and there. At the same time, one person can be fully responsible for a ten-million-logic-element design...which was impossible just a couple of decades ago.

If you don't understand logic circuits, FPGAs are voodoo. Guess what? A carburetor is voodoo too if you don't understand it.

Let's invert the roles: ask a seasoned FPGA engineer without (or with only superficial) web coding experience to code a website --server and client side-- using JS, jQuery, HTML5, CSS3, PHP, Zend and MySQL. Right.

Then let's write an article about how difficult web programming is and how it ought to be available to the masses. Then let's further suggest that you can do nearly everything in web development via freely available includes.

I happen to be equally at home with hardware and software (web, embedded, system, whatever) and I can't see that scenario (development-by-includes) playing out in any of these domains.




At the moment, I am writing some computer vision code in VHDL. Part of the circuit will perform connected component labeling (CCL) on incoming images, because I want to extract some features from objects in the images. CCL is essentially a union-find algorithm. The algorithm can be written in a normal programming language like Racket or even Java in a couple of hours. However, the same algorithm will take me weeks to work out and test in VHDL! I have done some nontrivial work with FPGAs, and every single time it was hard, because every low-level detail has to be considered. Maybe it is so hard because on FPGAs you are forced to optimize right from the start, whereas in programming languages you can develop a prototype quickly and then improve upon it? How is your experience with developing stuff on FPGAs?
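
For reference, the software version really is tiny. A rough union-find sketch in C (illustrative only, not my actual code):

    /* Minimal union-find with path compression, i.e. the merge machinery
       behind connected component labeling. Illustrative sketch only. */
    #include <stdint.h>

    #define MAX_LABELS 4096

    static uint16_t parent[MAX_LABELS];

    static void uf_init(void) {
        for (uint32_t i = 0; i < MAX_LABELS; i++)
            parent[i] = (uint16_t)i;        /* every label starts as its own root */
    }

    static uint16_t uf_find(uint16_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  /* path halving */
            x = parent[x];
        }
        return x;
    }

    static void uf_union(uint16_t a, uint16_t b) {
        uint16_t ra = uf_find(a), rb = uf_find(b);
        if (ra != rb)
            parent[ra > rb ? ra : rb] = (ra > rb ? rb : ra);  /* keep the smaller label as root */
    }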


I'd have to know more specifics to be able to comment beyond a certain level.

In general terms, yes, FPGA work can and does usually take longer than the equivalent work in the software domain. It doesn't have to be that way though.

For me it starts with language choices. I suppose that if you work in VHDL all the time you probably rock. I have an intense dislike for VHDL. I don't see a reason to type twice as much to do the same thing. Fifteen years ago VHDL had advantages with such constructs as "generate"; this is no longer the case. I realize that this can easily turn into an argument of a religious nature, so we'll have to leave it at that.

One approach that I have used with great success on complex modules is to write them in software first and then port to the FPGA. Going between C and Verilog is very natural.

The key is to write C code keeping in mind that you are describing hardware all along. Don't do anything that you would not be able to easily replicate on the FPGA. You are, effectively, authoring a simulation of what you might implement in the FPGA. The beauty of this approach is that you get the advantage of immediate execution and visualization in software. Debug initial structures and assumptions this way to save tons of time.
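
To make that concrete, here's a hypothetical fragment of what I mean by hardware-minded C (names and widths are made up for illustration): one module becomes one function, called once per clock, with explicit registered state and fixed-width types.

    /* Sketch only: one "module" = one function invoked once per simulated
       clock, old register state in, new register state out. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint16_t acc;    /* maps to a 16-bit register in the FPGA */
        bool     valid;  /* maps to a 1-bit flag register         */
    } accum_state_t;

    /* Behaves like a clocked process: nothing here that doesn't translate
       directly to registers and a small adder. */
    static accum_state_t accum_clock(accum_state_t s, uint8_t din, bool din_valid) {
        accum_state_t next = s;
        if (din_valid) {
            next.acc   = (uint16_t)(s.acc + din);  /* wraps, just like the hardware will */
            next.valid = true;
        }
        return next;
    }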

Maybe the best way to put it is that I try not to use the FPGA HDL coding stage to experiment and create but rather to simply enter the implementation. Then my goal is to go through as few ModelSim simulation passes as possible to verify operation.

If you've done non-trivial FPGA work you have probably experienced the agony of waiting an hour and a half for a design to compile and another N hours for it to simulate before discovering problems. The write-compile-simulate-evaluate-modify-repeat loop in FPGA work takes orders of magnitude longer than with software. I've had projects where you can only reasonably make one to half-a-dozen code changes per 18-hour day. That's the way it goes.

This is why I've resorted to extensive software-based validation before HDL coding. I've done this with, for example, challenging custom high-performance DDR memory controllers where there was a need to fiddle with a number of parameters and be able to visualize such conditions as FIFO fill/drain levels, etc. A nice GUI on top of the simulation made a huge difference. The final implementation took far less time to code in HDL and worked as required from the very start.
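
To give a feel for the kind of model I mean (not the actual controller, just the shape of the idea, with made-up numbers): step a FIFO model once per simulated clock, track the fill level and the high-water mark, and flag an overflow.

    /* Toy fill/drain model: producer and consumer rates are illustrative
       parameters, not real controller numbers. */
    #include <stdio.h>

    int main(void) {
        int depth = 512, fill = 0, peak = 0;
        for (long clk = 0; clk < 1000000; clk++) {
            if (clk % 3 != 0) fill++;              /* producer writes 2 of every 3 clocks */
            if (clk % 2 == 0 && fill > 0) fill--;  /* consumer reads every other clock    */
            if (fill > peak) peak = fill;
            if (fill > depth) { printf("overflow at clk %ld\n", clk); return 1; }
        }
        printf("peak fill: %d of %d\n", peak, depth);
        return 0;
    }

With those toy rates the producer outruns the consumer and the model flags the overflow within a few thousand simulated clocks. That's exactly the sort of thing you want to discover in milliseconds of software run time rather than after an hour of synthesis.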

Another general comment: when it comes to image processing in FPGAs you don't really pay a penalty for modularizing your code to a relatively fine-grained degree. This is because module interfaces don't necessarily create any overhead (the best example of this being interconnect wires). In that sense FPGAs are vastly different from software, where function or class+method interfaces generally come at a price.

Modularization can produce benefits during synthesis and placement. If you can pre-place portions of your design and do your floor planning in advance you can save tons of time. Incremental compilation has been around for a while. Still, nothing beats getting into the chip and locking down structures when it makes sense.

To circle back to the recurring theme of "FPGA for the masses" that pops up every so often: I maintain that FPGAs are, fundamentally, still about electrical engineering and not about software development. These, at certain levels, become vastly different disciplines. Once FPGA compilers become 100 to 1,000 times faster and FPGAs come with 100 to 1,000 times more resources for the money, the two worlds will probably blur into one very quickly for most applications.


Thanks for your insights, there is a lot of value for me in your post.

> I have an intense dislike for VHDL.

I have yet to meet an engineer who likes it! I hate it with passion, but it lets me write circuits in the way I want. Luckily, Emacs VHDL mode makes me type less.

> If you've done non-trivial FPGA work you have probably experienced the agony of waiting an hour and a half for a design to compile and another N hours for it to simulate before discovering problems.

My simulations never took hours. I use GHDL (an open source tool that converts VHDL into C++) to simulate my code, which is much slower than running ModelSim in a virtual machine. So I guess that you are working on much larger problems than I do.

I have tried using a high-level language before writing my circuits in VHDL. But the results were not very good, apart from learning a lot more about the actual algorithm/circuit.

Either I coded at too high a level, which would be impossible to implement in an FPGA (e.g., accessing a true dual-port block RAM at 3 different addresses in a single clock cycle), or I ended up simulating a lot of hardware just to make sure that it would work.

But the point is, no matter which approach I tried, it was painful, so I ended up choosing the workflow that is less painful.

> I'd have to know more specifics to be able to comment beyond a certain level.

I am developing a marker detection system that runs at 100fps, with 640x480 8-bit grayscale images. First I am doing CCL to find anything in the image that could be a marker. At the same time, some features are accumulated for each detected component (potential marker).

Then the features are used to find which component is a real marker and what its ID is. And finally, the markers carry some spatial information that allows me to find out the position and orientation of the camera.

Even though the FPGA that I use is the largest of all Cyclone II FPGAs with 70k LEs, I have to juggle registers and block RAM because it's too small to store all data in the registers, and using up too many registers substantially increases the time to place&route the design.

> I maintain that FPGAs are, fundamentally, still about electrical engineering and not about software development. These, at certain levels, become vastly different disciplines. Once FPGA compilers become 100 to 1,000 times faster and FPGAs come with 100 to 1,000 times more resources for the money, the two worlds will probably blur into one very quickly for most applications.

I agree, and I would add that the compilers need to be smarter about parallelizing the code. So while they can perform better than the alternatives, FPGAs are still a pain to develop for. Even if the compilers get faster and FPGAs get bigger, writing code for FPGAs still feels more like writing assembly code than code that is easily accessible "for the masses". But I would be happy if the compilers became just 10x faster!


> I hate it with passion, but it lets me write circuits in the way I want.

Can you explain what you are doing? I am wondering if you might be making your work more difficult by not taking advantage of inference. Are you doing logic-element-level hardware description? In other words, are you wiring the circuits by hand, if you will, by describing everything in VHDL?

I've done that of course, but I don't think it's necessary unless you really have to squeeze a lot out of a design. Where it works well is in doing your own hand-placement and hand-routing through switch boxes, etc. to get a super-tight design that runs like hell. I've done that mostly with adders and multipliers in the context of filter structures.

My guess is that you have set up several delay lines in order to process a kernel of NxM pixels at a time?

It's been a while, but I recall doing a fairly complex shallow-diagonal edge detector that had to look at 16 x 16 pixel blocks in order to do its job. This ended up taking the form of using internal storage in a large FPGA to build a 16-line FIFO with output taps every line. Now you could read a full 16-line vertical chunk-o-pixels into the shallow edge processor and let it do its thing.
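
In software-model form the idea is just a chain of line buffers with a tap on each. A hypothetical sketch (sizes are illustrative):

    /* 15 line buffers plus the live pixel give a 16-pixel vertical slice
       every pixel clock; the arrays map onto block-RAM line FIFOs. */
    #include <stdint.h>

    #define LINE_LEN 640
    #define WIN_H    16

    static uint8_t lines[WIN_H - 1][LINE_LEN];

    /* Feed one incoming pixel at column x; column[] comes back holding the
       vertical slice: column[0] is the live pixel, column[r] the pixel at
       the same x from r lines earlier. */
    static void push_pixel(int x, uint8_t pix, uint8_t column[WIN_H]) {
        column[0] = pix;
        for (int r = 1; r < WIN_H; r++)
            column[r] = lines[r - 1][x];
        for (int r = WIN_H - 2; r > 0; r--)   /* shift the delay chain down one line */
            lines[r][x] = lines[r - 1][x];
        lines[0][x] = pix;
    }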

The fact that you are working on a 70k LE Cyclone imposes certain limits, not the least of which is internal memory availability. I haven't used a Cyclone in a long time, I'd have to look and see what resources you might have. That could very well be the source of much of your pain. Don't know.


6+ hours to compile was the longest I've seen/worked with. The problem, IIRC, was the large FPGAs, not so much the large designs.


With dense designs you can easily run into what feels like O(n!) time, which is probably close to how complex the problem might actually become.


I would talk to these guys (unless you are one of them) working on extending their results

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6...

The Wikipedia entry also has a link to a parallelizable algo for CCL from 20+ years ago. FPGAs certainly parallelize pretty easily. I wonder if your simplified optimum solution is to calculate one cell and replicate it into a 20x20 matrix or whatever you can fit on your FPGA, and then have a higher-level CPU sling work units and stitch overlapping parts together.

More practically, I'd suggest your quick prototype would be to slap a SoC on an FPGA that does it in your favorite low-ish level code, since that only takes hours, then very methodically and smoothly create an acceleration peripheral that begins to do the grunt-iest of the grunt work, one little step at a time.

So let's start with just: are there any connections at all? That seems a blindingly simple optimization. Well, that's a bitwise comparison, so replace that in your code with a hardware detection and flag. Next thing you know you've got a counter in hardware that automatically skips past all blank space to the first possible pixel... But that's an optimization, maybe not the best place to start.

Next, I suppose if you're doing 4-connected you have some kind of inner loop that looks a lot like the Wikipedia list of 4 possible conditions. Now rather than having the on-FPGA CPU compare whether you're in the same region one direction at a time, do all four directions at once in parallel in VHDL and output the result in hardware to your code, and your code reads it all in and decides which step (if any) was the lowest/first success.

The next step is obviously to move the "what's the first step to succeed?" question out of the software and into the VHDL, so the embedded proc just reads one register to see if it's connected and, if so, in which direction.
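
In C terms, something like this (purely illustrative; the register layout is made up): all four comparisons happen in parallel, and a priority encoder picks the first hit.

    /* Pack the four 4-connected neighbour checks into one "register":
       low nibble = hit flags, high nibble = first matching direction. */
    #include <stdint.h>

    enum { DIR_N = 0, DIR_W = 1, DIR_E = 2, DIR_S = 3, DIR_NONE = 0xF };

    static uint8_t neighbour_reg(uint8_t n, uint8_t w, uint8_t e, uint8_t s,
                                 uint8_t cur) {
        uint8_t hits = (uint8_t)(((n == cur) << DIR_N) | ((w == cur) << DIR_W) |
                                 ((e == cur) << DIR_E) | ((s == cur) << DIR_S));
        uint8_t first = DIR_NONE;          /* no neighbour matched */
        for (int d = 0; d < 4; d++)        /* priority encoder */
            if (hits & (1u << d)) { first = (uint8_t)d; break; }
        return (uint8_t)((first << 4) | hits);
    }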

Then you start feeding in a stream and setting up a (probably painful) pipeline.

This is a solid bottom-up approach. One painful low-level detail at a time, only one at a time, never more than one at a time. Often this is a method to find a local maximum; it's never going to improve the algo (although it'll make it faster...)

"because on FPGAs you are forced to optimize right from the start" Don't do that. Emulate something that works from the start, then create an acceleration peripheral to simplify your SoC code. Eventually remove your onboard FPGA cpu if you're going to interface externally to something big, once the "accelerator" is accelerating enough.

Imagine building your own floating-point multiplier instead of using an off-the-shelf one... You don't write the control blocks and control code in VHDL and do the adders later; your first step should be writing the adder itself, only later replacing control code and simulated pipelining with VHDL code. You write the full adder first, not the fast carry, or whatever.
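
In other words, the very first model is as plain as this (an illustrative C sketch of the bit-level datapath, before any carry tricks or pipelining):

    /* Full adder and a plain ripple-carry add, one full adder per bit.
       The fast-carry and pipelining refinements come later. */
    #include <stdint.h>

    static void full_adder(uint8_t a, uint8_t b, uint8_t cin,
                           uint8_t *sum, uint8_t *cout) {
        *sum  = a ^ b ^ cin;
        *cout = (uint8_t)((a & b) | (cin & (a ^ b)));
    }

    static uint16_t ripple_add8(uint8_t x, uint8_t y) {
        uint8_t carry = 0, s, c;
        uint16_t result = 0;
        for (int i = 0; i < 8; i++) {
            full_adder((x >> i) & 1u, (y >> i) & 1u, carry, &s, &c);
            result |= (uint16_t)(s << i);
            carry = c;
        }
        return (uint16_t)(result | ((uint16_t)carry << 8));
    }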


No, I am not one of them :) Thanks for the reference! I am drawing my inspiration from Bailey, and more recently Ma et al. They label an image line by line and merge the labels during the blanking period. If you start merging labels while the image is still being processed, data might get lost if a merged label occurs again after the merge.

The paper that you reference divides the image into regions, so that the merging can start earlier, because labels used in one region are independent of the other regions. If it starts earlier, it also ends earlier, so that new data can be processed.

In my case, there is no need for such high performance, just a real-time requirement of 100fps for 640x480 images, where CCL is used for feature extraction. The work by Bailey and his group is good enough, and the approach from your reference can be adopted in the future if there is a need for more throughput!

My workflow is a lot different from the one that you describe. I don't use any soft cores and write everything in VHDL! I have used soft cores before, but they were not really to my liking; I missed the short feedback loop (my PC is a Mac and the synthesis tools run in a VM).

After trying out a couple of environments, I ended up using open source tools---GHDL for VHDL->C++ compilation and simulation, and GTKWave for waveform inspection.

Usually, I start with a testbench that instantiates my empty design under test. The testbench reads some test image that I draw in Photoshop. It prints some debugging values, and the wave inspection helps to figure out what's going on.

If it works in the simulator, it usually works on the FPGA! But the biggest advantage is that it takes just a few seconds to do all that.

I will give the softcore approach another chance once my deadline is over!


One quick note. Sometimes in image processing you can gain advantages by frame-buffering (to external SDR or DDR memory, not internal resources) and then operating on the data at many times the native video clock rate.

If your data is coming in at 13.5MHz and you can run your internal evaluation core at 500MHz (roughly 37 core clocks per pixel), there's a lot you can do that, all of a sudden, appears "magical".


While eating lunch I was thinking about your CCL, and a simple 4-way CCL reminds me of the old "put the game-of-life on an FPGA" deal. So what if you model each pixel as a cell, and if you're set "on" then either propagate a GUID to the southeast cells, or, if you got a GUID from the northwest cells, propagate that GUID instead of your own? If you're off, propagate a zero to the southeast? What's a good GUID? Probably some combo of your pixel's X/Y coord and/or just a (very large) random number.
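
A rough C model of one synchronous step of that idea, just to show how little there is to each cell (toy sizes; labels derived from coordinates rather than true GUIDs, and propagation only from the northwest, per the description above):

    /* label 0 = "off"; an "on" cell starts with label = y*W + x + 1.
       Each step, an on cell adopts the smallest label offered by its
       north/west neighbours; iterate until nothing changes. */
    #include <stdint.h>
    #include <stdbool.h>

    #define W 64
    #define H 64

    static bool ca_step(uint32_t label[H][W]) {
        uint32_t next[H][W];
        bool changed = false;
        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
                uint32_t l = label[y][x];
                if (l != 0) {
                    if (y > 0 && label[y - 1][x] != 0 && label[y - 1][x] < l)
                        l = label[y - 1][x];   /* take the north neighbour's label */
                    if (x > 0 && label[y][x - 1] != 0 && label[y][x - 1] < l)
                        l = label[y][x - 1];   /* take the west neighbour's label  */
                    if (l != label[y][x]) changed = true;
                }
                next[y][x] = l;
            }
        }
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                label[y][x] = next[y][x];
        return changed;   /* caller repeats ca_step() until it returns false */
    }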

FPGAs do cellular automata pretty well because you can create an ever-larger matrix of them until you run into some hardware limit.

This is not exactly what you're trying to do, but it sure is simple and a possible start. I'm guessing when you're done you'll end up with a really smart peripheral that looks like a CA accelerator.


That's perfectly possible, but only the newer FPGAs are big enough to store the whole image in the registers. If I had a bigger FPGA, I would not bother doing all this memory juggling that I am doing now and place all my data into the registers. And then wait for 10 hours for the software to produce the bitstream!

> Probably some combo of your pixel's X/Y coord and/or just a (very large) random number.

I would go with X/Y because it requires less memory than a random number. Besides, random numbers on FPGAs need extra (though not much!) logic to produce, typically an LFSR.
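
For reference, a 16-bit Fibonacci LFSR really is tiny (illustrative C model; taps at bits 16, 14, 13 and 11, a maximal-length polynomial):

    /* One step of a 16-bit Fibonacci LFSR: a few XORs and a shift, which
       is why the hardware cost is almost nothing. Seed with any nonzero
       value. */
    #include <stdint.h>

    static uint16_t lfsr16_next(uint16_t state) {
        uint16_t bit = (uint16_t)(((state >> 0) ^ (state >> 2) ^
                                   (state >> 3) ^ (state >> 5)) & 1u);
        return (uint16_t)((state >> 1) | (bit << 15));
    }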


I agree with your design comments WRT the approximately 0.1% (or less) of truly exotic embedded work, like military aerospace. That does not invalidate the approach for the other 99.9% of embedded devs.

To a greater or lesser extent it's just fear of the unknown. I could subject your post to a copy-and-paste conversion and it would ring true for the conversion from mechanical timers to electronic control, or discretes to ICs, or microcode-based CPUs, or SBC microprocessors to single-chip SoC microcontrollers, etc. The industry will adjust, over time.

"Anyone in software has had the experience of using some open-source module to save time only to end-up paying for it dearly when something doesn't work correctly and help isn't forthcoming."

LOL, writing it yourself merely means you reinvent the wheel, complete with having to discover and patch all the obvious bugs first, before you even begin to catch up to the hard bugs.

"simply wiring together a bunch of includes"

What I'm getting at is, much as no one would be crazy enough to write their own homemade Perl database driver instead of using the world's universal standard to do the job, no one in FPGA land is crazy enough to write their own Z80 core when the T80 core at OpenCores has about a decade of R+D, and more importantly debugging, behind it. Plus or minus crazy regulatory/licensing requirements, of course.

"If you don't understand logic circuits FPGA's are voodoo" Yes insane race conditions and clocking issues are "fun". Fast digital is very much analog that is hidden behind the curtain... "...ignore the man behind the curtain..." Then again my car transmission, my wife's coffee maker, my clothes dryer, my microwave, and my dishwasher will never, ever test the boundaries of modern digital logic speeds so we're back at the 99.9% vs 0.01% argument again.

"you can do nearly everything in web development via freely available includes."

Well, yeah. What I am getting at is that writing your own homemade clone of script.aculo.us, or buying an expensive clone of it, would be a complete disaster compared to just including "the real thing" and using it.


You've mentioned using processor cores a few times now. I'll assume, perhaps wrongly, that this might be a representation of your world when it comes to using FPGAs.

That is not my world at all. My applications have never had the luxury of being able to simply import an 8-bit processor core and a few peripherals and off we go into software land. Nearly everything I've done has been in two domains: real-time image processing in hardware or beam-forming applications. In all cases virtually no use of canned modules could be made or justified. Sure, there are the SPIs and I2Cs and a few other knick-knacks, but that's about it.

In fact, most of the applications I've done would end up with external physical embedded processors because the high-speed FPGA resources could not be spared for low-speed "command and control" work.

Maybe I'm living in that 0.1% you referred to?

Surely there are people doing more mundane things such as motor control, glue logic or battery chargers who might benefit from wiring together a complete custom embedded system within a single Spartan FPGA, and life is great. I can see that being a possibility. It just hasn't been part of my reality, for better or worse.





