Sorting using verilog

Question 1

Respected members, I want to use a sorting technique which sorts N numbers using Verilog taking minimum clock cycles(less Time Complexity) as possible.

Thus, I want to get some help regarding the methodology and the Type of sorting technique I should follow.

Regarding the application, it is somewhat similar to shuffle image pixels for e.g. I want to sort 64 image pixels for 256X256 image extracted at a time which equals 1024 times. So sorting of 64 8-bit data 1024 times which is the requirement.

Lastly, If I use radix sort, will be fruitful in order to achieve O(n) time complexity (for N Keys N clock Cycles)?

Question 2

So you are sorting, what, a series of registers? How many?

Question 3

Sir, I've N keys (or positive integers) which I need to sort in descending or ascending order

Question 4

hackaday.com/2016/01/20/…

Question 5

@ClaudioAviChami.. it is a 3-sorter implemented on FPGA but using it to N numbers in itself is a bit complex as mentioned by the author. Thank you for your time

Question 6

... Just as a couple of examples, I work in real-time HD video. In one place, I need to sort nine 10-bit numbers (median filter), with a complete result available on each clock cycle. In other words, throughput is paramount, and latency is less important. In another place, I need to sort a thousand 24-bit numbers (the output of a histogram), where space efficiency is more important than time, but the latency still needs to be bounded. The algorithms and implementations are very different.

Question 7

First off don't think too much about "Big O". You have a small fixed size input.

Secondly think about whether you are more concerned about throughput or latency.

Thirdly note that HDL design is always about space-time tradeoffs, do you really need it to be "as fast as possible"? or do you have space constraints too.

If the number of possible values is small then the radix sort can have significantly lower worst-case algorithmic complexity (an in real-time systems worse case is what you care about) than any "comparision sort" but it's tricky to paralellise.

The proposal at http://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/ has much lower latency than dave's radix sort proposal. It's probablly simpler to implement too since daves has loads of corner-cases to deal with regarding fifo's and paralleising the middle stage.

Why? it's not because the algorithm requires less operations, indeed it requires considerablly more. However those operations are highly paralellised. It's basically an insertion sort, but rather than checking possible insertion places one at a time and then shifting values one at a time if a match is found it performs those checks and shifts in paralell.

The downside is it's going to eat up more logic. Rather than locations in blockram you have a relatively complex peice of logic for each value in the sorter.

Question 8

Yes, a radix sort would make a lot of sense for 8-bit data. It would operate in three phases and require two small blocks of memory.

The first block of memory would have 256 locations. It would be used to count how many times each data value appears in the group of pixels being sorted. If you're sorting 64 pixels at a time, then this memory would have to be 7 bits wide. (Actually, for continuous operation, you'll want this memory to be double-buffered, so 2×256 locations.)

The second block of memory is operated as a FIFO. Each entry in this FIFO is a value:count pair, where the value is 8 bits wide and the count is 7 bits wide, for a total width of 15 bits. This FIFO needs to be at least 64 words deep, in order to handle the case where each input value only appeared once.

Start by clearing the counters for the first group of pixels. After that, the three phases are:

As each pixel arrives, increment the corresponding counter.
Scan through the counters and whenever you have a nonzero count, transfer the value:count pair to the FIFO. Also, clear the counters at this time for the next pass. Note that this is where you choose the direction of the sort, by which way you scan through the counters.
Read the value:count pairs out of the FIFO, and generate count copies of each value at the output.

I just realized that there's a problem here in terms of timing: the first and third phases require 64 clocks each, while the middle phase requires 256 clocks. This means that for continuous operation, the first memory will need to be 4 separate memories with 256 locations each, you'll need to have four interleaved copies of the middle phase, each feeding its own FIFO, and the third phase will have to switch sequentially among the FIFO outputs. (Note that this problem does not arise if the number of pixels in each group is at least as large as the number of values.)

This implementation would be able to process input and output pixels continuously, and the total latency from input to output would be 64 + 256 = 320 clocks.

Peter Green Peter Green 23.7k1 gold badge42 silver badges87 bronze badges · Accepted Answer · 2017-01-13 14:55:33Z

First off don't think too much about "Big O". You have a small fixed size input.

Secondly think about whether you are more concerned about throughput or latency.

Thirdly note that HDL design is always about space-time tradeoffs, do you really need it to be "as fast as possible"? or do you have space constraints too.

If the number of possible values is small then the radix sort can have significantly lower worst-case algorithmic complexity (an in real-time systems worse case is what you care about) than any "comparision sort" but it's tricky to paralellise.

The proposal at http://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/ has much lower latency than dave's radix sort proposal. It's probablly simpler to implement too since daves has loads of corner-cases to deal with regarding fifo's and paralleising the middle stage.

Why? it's not because the algorithm requires less operations, indeed it requires considerablly more. However those operations are highly paralellised. It's basically an insertion sort, but rather than checking possible insertion places one at a time and then shifting values one at a time if a match is found it performs those checks and shifts in paralell.

The downside is it's going to eat up more logic. Rather than locations in blockram you have a relatively complex peice of logic for each value in the sorter.

Stack Exchange Network

Sorting using verilog

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Sorting using verilog

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions