Designing a sorting network with an FSM as opposed to combinational logic

Question 1

During this semester I have been spending a bit of time learning how to use Verilog. I took on a project to develop a sorting network for 1024 32-bit numbers and tried to develop the circuit by developing "modules" that represent each part of the sorting network. I based my module on the picture at the bottom of this wikipedia page: https://en.wikipedia.org/wiki/Bitonic_sorter

My module description is below, where the parameters represent each of the "blocks" and "stages" from the figure shown in wikipedia.

module layerSort #(parameter power = 5, parameter POWER = 10)(input [32767:0] in, output reg [32767:0] out);
 /* power parameter represents 
 the power of 2 our starting width is. E.g
 this module starts by breadth-sorting w/ size 
 2^2 = 4. */
 reg[32*2**POWER-1:0] layer[power-1:0];
 generate
 genvar i,j,k;
 /* 0th layer is a 'breadth' swap. The kth last number of the jth block is compared with the kth first
 of the jth block. */
 /* Values assigned to layer 0 */ 
 for(j=0; j<2**(POWER-power); j=j+1) begin : first_j
 /* The jth block 'starts' at index 1023-j*2^{p-1}*/
 /* The jth block 'ends' at index (j+1)*2^p-1 + 1 */ 
 localparam jStart = 32*(2**(POWER)-j*2**(power))-1;
 localparam jEnd = 32*(2**(POWER)-(j+1)*2**(power));
 for(k=0; k<2**(power-1); k=k+1) begin : first_k
 always @(in)begin
 if(in[jStart-32*k:jStart-32*(k+1)+1] <= in[jEnd+32*(k+1)-1:jEnd+32*k]) begin 
 layer[0][jStart-32*k:jStart-32*(k+1)+1] <= in[jStart-32*k:jStart-32*(k+1)+1];
 layer[0][jEnd+32*(k+1)-1:jEnd+32*k] <= in[jEnd+32*(k+1)-1:jEnd+32*k];
 end else begin 
 layer[0][jEnd+32*(k+1)-1:jEnd+32*k] <= in[jStart-32*k:jStart-32*(k+1)+1];
 layer[0][jStart-32*k:jStart-32*(k+1)+1] <= in[jEnd+32*(k+1)-1:jEnd+32*k];
 end
 end
 end
 end
 /* I believe the modules all match up here...? */
 for(i=1; i < power; i=i+1) begin : i_loop
 for (j=0; j<2**(POWER-power+i); j=j+1) begin : j_loop
 /* If the module is of 'width' of size 2^(power-i) 
 Then the hopscotch jumps are of size 2^(power-i-1) 
 that is to say, half the width! */ 
 localparam FirstStart = 32*(2**(POWER)-j*2**(power-i))-1; 
 localparam SecondStart = FirstStart-32*(2**(power-i-1));
 for (k=0; k < 2**(power-i-1); k=k+1) begin : k_loop
 always@(layer[i-1])begin
 if(layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1]) begin 
 layer[i][FirstStart-32*k:FirstStart-32*(k+1)+1]<= layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+11];
 layer[i][SecondStart-32*k:SecondStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1];
 end else begin
 layer[i][SecondStart-32*k:SecondStart-32*(k+1)+1]<= layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+11];
 layer[i][FirstStart-32*k:FirstStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1];
 end
 end
 end
 end
 end
 /* Assign final output layer as out */ 
 always @(layer[power-1]) begin
 out <= layer[power-1];
 end 
 endgenerate
endmodule

The bad news is that the network never actually outputs anything. None of the inbetween layers (except the first one) update their values in my circuit when I try to testbench it. I have a feeling that I am doing something fundamentally wrong in my design -- and have a feeling that all my combinational logic is my downfall here, but my background is a bit lacking. That is to say, I have a feeling that my first mistake was trying to make this module a big combo of genvars in the first place.

After some thinking, I imagined that I could instead focus on turning this sorting network into a sort of state machine with inspiration from https://github.com/mcjtag/bitonic_sorter/tree/master -- because I feel like the use of a clock to update the values in the circuit is important. Why does a clock need to be used here? Why can't you just "wire" everything together and get one output?

Question 2

Start with a network that just passes data unchanged. One layer. Then add another layer, also identity. Then several. Then make one of those layers do something trivial like inverting every other bit in the 32kbit word. Then slowly add actual sorting to the mix. After each small change test to make sure it still works, i.e. data flows from input to output and the values are as expected given the logic.

Question 3

Why does a clock need to be used here?

The Github link states it is "fully pipelined". A pipelined design employs a clock signal. One benefit of a pipelined design is that it can generally achieve higher speeds than using combinational logic alone. If the combinational logic is large, then the critical timing path through the design can be large, thereby limiting the speed.

I'm unfamiliar with the algorithm, but the wiki link describes it as parallel, which lends itself to pipelining.

Question 4

A pipelined design generally has higher throughput, but worse latency than a combinatorial design. The critical timing path through the combinatorial design will always be shorter than the equivalent latency through the pipeline. When designing combinatorial logic, pipelining can be achieved without a clock, in asynchronous fashion. Instead of flip-flops between the stages, there are delay elements that align the outputs of each stage in time, so that they enter the following stage at the same time, and so on. The only FFs are needed on the front and back of the network then.

Question 5

That is not too easy to do in FPGAs since closed source tools have poor-to-none support of this, and open source tools would need modifications, perhaps significant. In an ASIC or on a breadboard it's no big deal to do though.

toolic toolic 10.8k11 gold badges31 silver badges35 bronze badges · Accepted Answer · 2023-11-05 03:14:53Z

0

\$\begingroup\$

Why does a clock need to be used here?

The Github link states it is "fully pipelined". A pipelined design employs a clock signal. One benefit of a pipelined design is that it can generally achieve higher speeds than using combinational logic alone. If the combinational logic is large, then the critical timing path through the design can be large, thereby limiting the speed.

I'm unfamiliar with the algorithm, but the wiki link describes it as parallel, which lends itself to pipelining.

Share

edited Nov 5, 2023 at 3:27

answered Nov 5, 2023 at 3:14

toolic's user avatar

toolic toolic

10.8k11 gold badges31 silver badges35 bronze badges

\$\endgroup\$

2

\$\begingroup\$ A pipelined design generally has higher throughput, but worse latency than a combinatorial design. The critical timing path through the combinatorial design will always be shorter than the equivalent latency through the pipeline. When designing combinatorial logic, pipelining can be achieved without a clock, in asynchronous fashion. Instead of flip-flops between the stages, there are delay elements that align the outputs of each stage in time, so that they enter the following stage at the same time, and so on. The only FFs are needed on the front and back of the network then. \$\endgroup\$

Kuba hasn't forgotten Monica
– Kuba hasn't forgotten Monica

2024年01月14日 17:58:03 +00:00
Commented Jan 14, 2024 at 17:58
\$\begingroup\$ That is not too easy to do in FPGAs since closed source tools have poor-to-none support of this, and open source tools would need modifications, perhaps significant. In an ASIC or on a breadboard it's no big deal to do though. \$\endgroup\$

Kuba hasn't forgotten Monica
– Kuba hasn't forgotten Monica

2024年01月14日 17:58:46 +00:00
Commented Jan 14, 2024 at 17:58

Add a comment |

Stack Exchange Network

Designing a sorting network with an FSM as opposed to combinational logic

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Designing a sorting network with an FSM as opposed to combinational logic

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions