The Wikipedia entry also has a link to a parallelizable algorithm from 20+ years ago for CCL. FPGAs certainly parallelize pretty easily. I wonder if your simplified optimum solution is to calculate one cell, replicate it into a 20x20 matrix or whatever you can fit on your FPGA, and then have a higher-level CPU sling work units and stitch the overlapping parts together.
More practically, I'd suggest your quick prototype would be to slap a SoC on an FPGA that does it in your favorite low-ish level code, since that only takes hours, then very methodically and smoothly create an acceleration peripheral that takes over the grunt-iest of the grunt work one little step at a time.
So let's start with just: are there any connections at all? That seems a blindingly simple optimization. Well, that's a bitwise comparison, so replace it in your code with a hardware detection and flag. Next thing you know you've got a counter that, in hardware, automatically skips past all the blank space to the first possible pixel... But that's an optimization, maybe not the best place to start.
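A minimal sketch of that "anything here at all?" check (my Python model, not anyone's actual code): the hardware version is just a wide OR-reduction per packed word, and the skip counter advances past words whose OR is zero.

```python
# Software model of the "skip blank space" flag: each entry is a packed word
# of pixels; a nonzero word means at least one pixel is set. In hardware this
# is a single OR-reduction per word, feeding a skip counter.

def first_nonblank_word(packed_words):
    """Return the index of the first word containing a set pixel, or None."""
    for i, word in enumerate(packed_words):
        if word != 0:          # hardware: OR-reduction flag
            return i
    return None

row = [0x0, 0x0, 0x8, 0x1]     # first content appears in word 2
assert first_nonblank_word(row) == 2
```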
Next, I suppose if you're doing 4-connected you have some kind of inner loop that looks a lot like the Wikipedia list of four possible conditions. Rather than having the on-FPGA CPU check whether you're in the same region one direction at a time, do all four directions at once in parallel in VHDL and output the result in hardware to your code, and your code reads it all in and decides which step (if any) was the lowest/first success.
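Roughly like this (a Python sketch of the split; the function names and bit assignments are mine): hardware evaluates all four neighbour comparisons at once and hands software a single flags word, and software just picks the lowest set bit.

```python
# Model of the proposed hardware/software split: neighbour_flags is what the
# parallel VHDL comparators would produce in one cycle; first_direction is the
# software side reading the result register.

def neighbour_flags(img, x, y):
    """Bitmask of same-valued 4-neighbours: bit0=N, bit1=W, bit2=E, bit3=S."""
    h, w = len(img), len(img[0])
    v = img[y][x]
    flags = 0
    if y > 0     and img[y-1][x] == v: flags |= 1   # north
    if x > 0     and img[y][x-1] == v: flags |= 2   # west
    if x < w - 1 and img[y][x+1] == v: flags |= 4   # east
    if y < h - 1 and img[y+1][x] == v: flags |= 8   # south
    return flags

def first_direction(flags):
    """Software side: lowest set bit = first successful direction, else None."""
    return (flags & -flags).bit_length() - 1 if flags else None
```

Moving `first_direction` into the VHDL too is exactly the "read one register" step described next.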
The next step is obviously to move the "what's the first step to succeed?" question out of the software and into the VHDL, so the embedded processor just reads one register to see whether it's connected and, if so, in which direction.
Then you start feeding in a stream and setting up a (probably painful) pipeline.
This is a solid bottom-up approach: one painful low-level detail at a time, only one at a time, never more than one at a time. Often this method finds a local maximum; it's never going to improve the algorithm (although it'll make it faster...).
"because on FPGAs you are forced to optimize right from the start" Don't do that. Emulate something that works from the start, then create an acceleration peripheral to simplify your SoC code. Eventually, once the "accelerator" is accelerating enough, remove your on-board FPGA CPU if you're going to interface externally to something big.
Imagine building your own floating-point multiplier instead of using an off-the-shelf one... you don't write the control blocks and control code in VHDL and do the adders later. Your first step should be writing the adder itself, only later replacing control code and simulated pipelining with VHDL. You write the full adder first, not the fast-carry logic, or whatever.
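To make the "full adder first" point concrete, here's a toy software model (mine, purely illustrative): a 1-bit full adder chained into a ripple-carry adder. A later fast-carry (carry-lookahead) upgrade would replace only the carry chain; the full adder cell itself never changes.

```python
# 1-bit full adder: the primitive you write first.
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

# Ripple-carry adder built from it; the serial carry chain is the slow part
# that carry-lookahead would later replace.
def ripple_add(x, y, width=8):
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result & ((1 << width) - 1)   # wraps modulo 2^width
```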
No, I am not one of them :) Thanks for the reference! I am drawing my inspiration from Bailey, and more recently Ma et al.
They label an image line by line and merge the labels during the blanking period.
If you start merging labels while the image is still being processed, data might get lost when a pixel carrying the merged-away label shows up after the merge has already happened.
The paper that you reference divides the image into regions, so that the merging can start earlier, because labels used in one region are independent of the other regions.
If it starts earlier, it also ends earlier, so that new data can be processed.
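For readers unfamiliar with the scheme: here is a rough software model of line-by-line labeling with deferred merging (my simplification, not code from the Bailey or Ma et al. papers). Provisional labels are assigned in raster order, equivalences are only recorded into a merge table, and the table is resolved afterwards, i.e. during what would be the blanking period.

```python
# Two-phase CCL sketch: raster-scan provisional labeling plus an equivalence
# (merge) table that is only flattened "during blanking".

def label_image(img):
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}                                # merge (equivalence) table

    def find(l):                               # table lookup; chased at merge
        while parent[l] != l:                  # time and during "blanking"
            l = parent[l]
        return l

    nxt = 1
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            up   = labels[y-1][x] if y else 0
            left = labels[y][x-1] if x else 0
            if up and left:
                a, b = find(up), find(left)
                if a != b:
                    parent[max(a, b)] = min(a, b)   # record the merge only
                labels[y][x] = min(a, b)
            elif up or left:
                labels[y][x] = up or left
            else:
                labels[y][x] = nxt                  # new provisional label
                parent[nxt] = nxt
                nxt += 1
    # "blanking period": resolve the merge table and relabel
    return [[find(l) if l else 0 for l in row] for row in labels]
```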
In my case, there is no need for such high performance, just a real time requirement of 100fps for 640x480 images, where CCL is used for feature extraction.
The work by Bailey and his group is good enough, and the approach from that reference can be adopted in the future if there is a need for more throughput!
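For scale, the stated requirement works out to a quite modest pixel rate (plain arithmetic, assuming one pixel per clock and ignoring blanking intervals):

```python
# Back-of-the-envelope pixel rate for 640x480 at 100 fps.
pixels_per_frame = 640 * 480          # 307,200 pixels
pixel_rate = pixels_per_frame * 100   # 30,720,000 px/s, i.e. ~30.7 MHz
```

Well within reach of even small FPGAs, which is consistent with the point that the higher-throughput schemes aren't needed here.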
My workflow is a lot different from the one that you describe.
I don't use any soft cores, and write everything in VHDL!
I have used soft cores before, but they were not to my liking: I missed the short feedback loop (my PC is a Mac and the synthesis tools run in a VM).
After trying out a couple of environments, I ended up using open source tools---GHDL for VHDL->C++ compilation and simulation, and GTKwave for waveform inspection.
Usually, I start with a testbench that instantiates my empty design under test. The testbench reads a test image that I draw in Photoshop.
It prints some debugging values, and the wave inspection helps to figure out what's going on.
If it works in the simulator, it usually works on the FPGA!
But the biggest advantage is that it takes just some seconds to do all that.
I will give the softcore approach another chance once my deadline is over!
One quick note. Sometimes in image processing you can gain advantages by frame-buffering (to external SDR or DDR memory, not internal resources) and then operating on the data at many times the native video clock rate.
If your data is coming in at 13.5MHz and you can run your internal evaluation core at 500MHz there's a lot you can do that, all of a sudden, appears "magical".
While eating lunch I was thinking about your CCL, and a simple 4-way CCL reminds me of the old "put the Game of Life on an FPGA" deal. So what if you model each pixel as a cell: if you're set "on", either propagate a GUID to the southeast cells, or, if you got a GUID from the northwest cells, propagate that GUID instead of your own? If you're off, propagate a zero to the southeast. What's a good GUID? Probably some combo of your pixel's X/Y coord and/or just a (very large) random number.
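A toy software model of that cellular-automaton idea (my sketch, simplified to symmetric min-propagation rather than strictly northwest-to-southeast): every "on" pixel starts with a unique ID derived from its X/Y coordinate, and each step every cell adopts the minimum ID among itself and its "on" 4-neighbours. Iterating to a fixed point labels the connected components.

```python
# CA-style CCL: one pass of the loop corresponds to one synchronous update of
# the whole cell matrix on the FPGA.

def ca_label(img):
    h, w = len(img), len(img[0])
    # Unique per-pixel ID from the X/Y coordinate (linear index + 1).
    ids = [[y * w + x + 1 if img[y][x] else 0 for x in range(w)]
           for y in range(h)]
    changed = True
    while changed:                        # iterate CA steps to a fixed point
        changed = False
        for y in range(h):
            for x in range(w):
                if not ids[y][x]:
                    continue              # "off" cells propagate nothing
                best = ids[y][x]
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and ids[ny][nx]:
                        best = min(best, ids[ny][nx])
                if best != ids[y][x]:
                    ids[y][x], changed = best, True
    return ids
```

The number of update steps needed grows with the longest path inside a component, which is the usual trade-off with this approach versus the raster-scan + merge-table schemes.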
FPGAs do cellular automata pretty well because you can create an ever-larger matrix of them until you run into some hardware limit.
This is not exactly what you're trying to do, but it sure is simple and a possible start. I'm guessing when you're done you'll end up with a really smart peripheral that looks like a CA accelerator.
That's perfectly possible, but only the newer FPGAs are big enough to store the whole image in the registers.
If I had a bigger FPGA, I would not bother doing all this memory juggling that I am doing now and place all my data into the registers.
And then wait for 10 hours for the software to produce the bitstream!
"Probably some combo of your pixel's X/Y coord and/or just a (very large) random number."
I would go with X/Y because it requires less memory than a random number.
Besides, random numbers on FPGAs need extra (though not much!) logic to produce them in LFSRs.
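For reference, that "extra but small" logic is tiny indeed: an LFSR is just a shift register with a few XOR taps. A software model of a 16-bit Fibonacci LFSR (the width and the 16/14/13/11 maximal-length tap set are my choice for illustration):

```python
# One step of a 16-bit Fibonacci LFSR: shift right, feed the XOR of the tap
# bits back into the top bit. State must be nonzero.

def lfsr16(state):
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

s = 0xACE1                 # any nonzero seed
for _ in range(8):         # a few pseudo-random values
    s = lfsr16(s)
```

In hardware this is one flip-flop per bit plus a handful of XOR gates, which supports the point that it costs little, while the X/Y coordinate costs nothing at all since it's already implied by the pixel's position.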
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6...