Respected members, I want to use a sorting technique which sorts N numbers using Verilog taking minimum clock cycles(less Time Complexity) as possible.
Thus, I want to get some help regarding the methodology and the Type of sorting technique I should follow.
Regarding the application, it is somewhat similar to shuffle image pixels for e.g. I want to sort 64 image pixels for 256X256 image extracted at a time which equals 1024 times. So sorting of 64 8-bit data 1024 times which is the requirement.
Lastly, If I use radix sort, will be fruitful in order to achieve O(n) time complexity (for N Keys N clock Cycles)?
-
\$\begingroup\$ So you are sorting, what, a series of registers? How many? \$\endgroup\$user57037– user570372017年01月12日 07:21:52 +00:00Commented Jan 12, 2017 at 7:21
-
\$\begingroup\$ Sir, I've N keys (or positive integers) which I need to sort in descending or ascending order \$\endgroup\$Raj– Raj2017年01月12日 07:25:49 +00:00Commented Jan 12, 2017 at 7:25
-
\$\begingroup\$ hackaday.com/2016/01/20/… \$\endgroup\$Claudio Avi Chami– Claudio Avi Chami2017年01月12日 08:42:51 +00:00Commented Jan 12, 2017 at 8:42
-
\$\begingroup\$ @ClaudioAviChami.. it is a 3-sorter implemented on FPGA but using it to N numbers in itself is a bit complex as mentioned by the author. Thank you for your time \$\endgroup\$Raj– Raj2017年01月12日 09:21:16 +00:00Commented Jan 12, 2017 at 9:21
-
1\$\begingroup\$ ... Just as a couple of examples, I work in real-time HD video. In one place, I need to sort nine 10-bit numbers (median filter), with a complete result available on each clock cycle. In other words, throughput is paramount, and latency is less important. In another place, I need to sort a thousand 24-bit numbers (the output of a histogram), where space efficiency is more important than time, but the latency still needs to be bounded. The algorithms and implementations are very different. \$\endgroup\$Dave Tweed– Dave Tweed2017年01月12日 14:08:50 +00:00Commented Jan 12, 2017 at 14:08
2 Answers 2
First off don't think too much about "Big O". You have a small fixed size input.
Secondly think about whether you are more concerned about throughput or latency.
Thirdly note that HDL design is always about space-time tradeoffs, do you really need it to be "as fast as possible"? or do you have space constraints too.
If the number of possible values is small then the radix sort can have significantly lower worst-case algorithmic complexity (an in real-time systems worse case is what you care about) than any "comparision sort" but it's tricky to paralellise.
The proposal at http://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/ has much lower latency than dave's radix sort proposal. It's probablly simpler to implement too since daves has loads of corner-cases to deal with regarding fifo's and paralleising the middle stage.
Why? it's not because the algorithm requires less operations, indeed it requires considerablly more. However those operations are highly paralellised. It's basically an insertion sort, but rather than checking possible insertion places one at a time and then shifting values one at a time if a match is found it performs those checks and shifts in paralell.
The downside is it's going to eat up more logic. Rather than locations in blockram you have a relatively complex peice of logic for each value in the sorter.
Yes, a radix sort would make a lot of sense for 8-bit data. It would operate in three phases and require two small blocks of memory.
The first block of memory would have 256 locations. It would be used to count how many times each data value appears in the group of pixels being sorted. If you're sorting 64 pixels at a time, then this memory would have to be 7 bits wide. (Actually, for continuous operation, you'll want this memory to be double-buffered, so 2×256 locations.)
The second block of memory is operated as a FIFO. Each entry in this FIFO is a value:count pair, where the value is 8 bits wide and the count is 7 bits wide, for a total width of 15 bits. This FIFO needs to be at least 64 words deep, in order to handle the case where each input value only appeared once.
Start by clearing the counters for the first group of pixels. After that, the three phases are:
As each pixel arrives, increment the corresponding counter.
Scan through the counters and whenever you have a nonzero count, transfer the value:count pair to the FIFO. Also, clear the counters at this time for the next pass. Note that this is where you choose the direction of the sort, by which way you scan through the counters.
Read the value:count pairs out of the FIFO, and generate count copies of each value at the output.
I just realized that there's a problem here in terms of timing: the first and third phases require 64 clocks each, while the middle phase requires 256 clocks. This means that for continuous operation, the first memory will need to be 4 separate memories with 256 locations each, you'll need to have four interleaved copies of the middle phase, each feeding its own FIFO, and the third phase will have to switch sequentially among the FIFO outputs. (Note that this problem does not arise if the number of pixels in each group is at least as large as the number of values.)
This implementation would be able to process input and output pixels continuously, and the total latency from input to output would be 64 + 256 = 320 clocks.