During this semester I have been spending a bit of time learning how to use Verilog. I took on a project to develop a sorting network for 1024 32-bit numbers and tried to develop the circuit by developing "modules" that represent each part of the sorting network. I based my module on the picture at the bottom of this wikipedia page: https://en.wikipedia.org/wiki/Bitonic_sorter
My module description is below, where the parameters represent each of the "blocks" and "stages" from the figure shown in wikipedia.
module layerSort #(parameter power = 5, parameter POWER = 10)(input [32767:0] in, output reg [32767:0] out);
/* power parameter represents
the power of 2 our starting width is. E.g
this module starts by breadth-sorting w/ size
2^2 = 4. */
reg[32*2**POWER-1:0] layer[power-1:0];
generate
genvar i,j,k;
/* 0th layer is a 'breadth' swap. The kth last number of the jth block is compared with the kth first
of the jth block. */
/* Values assigned to layer 0 */
for(j=0; j<2**(POWER-power); j=j+1) begin : first_j
/* The jth block 'starts' at index 1023-j*2^{p-1}*/
/* The jth block 'ends' at index (j+1)*2^p-1 + 1 */
localparam jStart = 32*(2**(POWER)-j*2**(power))-1;
localparam jEnd = 32*(2**(POWER)-(j+1)*2**(power));
for(k=0; k<2**(power-1); k=k+1) begin : first_k
always @(in)begin
if(in[jStart-32*k:jStart-32*(k+1)+1] <= in[jEnd+32*(k+1)-1:jEnd+32*k]) begin
layer[0][jStart-32*k:jStart-32*(k+1)+1] <= in[jStart-32*k:jStart-32*(k+1)+1];
layer[0][jEnd+32*(k+1)-1:jEnd+32*k] <= in[jEnd+32*(k+1)-1:jEnd+32*k];
end else begin
layer[0][jEnd+32*(k+1)-1:jEnd+32*k] <= in[jStart-32*k:jStart-32*(k+1)+1];
layer[0][jStart-32*k:jStart-32*(k+1)+1] <= in[jEnd+32*(k+1)-1:jEnd+32*k];
end
end
end
end
/* I believe the modules all match up here...? */
for(i=1; i < power; i=i+1) begin : i_loop
for (j=0; j<2**(POWER-power+i); j=j+1) begin : j_loop
/* If the module is of 'width' of size 2^(power-i)
Then the hopscotch jumps are of size 2^(power-i-1)
that is to say, half the width! */
localparam FirstStart = 32*(2**(POWER)-j*2**(power-i))-1;
localparam SecondStart = FirstStart-32*(2**(power-i-1));
for (k=0; k < 2**(power-i-1); k=k+1) begin : k_loop
always@(layer[i-1])begin
if(layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1]) begin
layer[i][FirstStart-32*k:FirstStart-32*(k+1)+1]<= layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+11];
layer[i][SecondStart-32*k:SecondStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1];
end else begin
layer[i][SecondStart-32*k:SecondStart-32*(k+1)+1]<= layer[i-1][FirstStart-32*k:FirstStart-32*(k+1)+11];
layer[i][FirstStart-32*k:FirstStart-32*(k+1)+1] <= layer[i-1][SecondStart-32*k:SecondStart-32*(k+1)+1];
end
end
end
end
end
/* Assign final output layer as out */
always @(layer[power-1]) begin
out <= layer[power-1];
end
endgenerate
endmodule
The bad news is that the network never actually outputs anything. None of the inbetween layers (except the first one) update their values in my circuit when I try to testbench it. I have a feeling that I am doing something fundamentally wrong in my design -- and have a feeling that all my combinational logic is my downfall here, but my background is a bit lacking. That is to say, I have a feeling that my first mistake was trying to make this module a big combo of genvars in the first place.
After some thinking, I imagined that I could instead focus on turning this sorting network into a sort of state machine with inspiration from https://github.com/mcjtag/bitonic_sorter/tree/master -- because I feel like the use of a clock to update the values in the circuit is important. Why does a clock need to be used here? Why can't you just "wire" everything together and get one output?
-
\$\begingroup\$ Start with a network that just passes data unchanged. One layer. Then add another layer, also identity. Then several. Then make one of those layers do something trivial like inverting every other bit in the 32kbit word. Then slowly add actual sorting to the mix. After each small change test to make sure it still works, i.e. data flows from input to output and the values are as expected given the logic. \$\endgroup\$Kuba hasn't forgotten Monica– Kuba hasn't forgotten Monica2024年01月14日 18:00:45 +00:00Commented Jan 14, 2024 at 18:00
1 Answer 1
Why does a clock need to be used here?
The Github link states it is "fully pipelined". A pipelined design employs a clock signal. One benefit of a pipelined design is that it can generally achieve higher speeds than using combinational logic alone. If the combinational logic is large, then the critical timing path through the design can be large, thereby limiting the speed.
I'm unfamiliar with the algorithm, but the wiki link describes it as parallel, which lends itself to pipelining.
-
\$\begingroup\$ A pipelined design generally has higher throughput, but worse latency than a combinatorial design. The critical timing path through the combinatorial design will always be shorter than the equivalent latency through the pipeline. When designing combinatorial logic, pipelining can be achieved without a clock, in asynchronous fashion. Instead of flip-flops between the stages, there are delay elements that align the outputs of each stage in time, so that they enter the following stage at the same time, and so on. The only FFs are needed on the front and back of the network then. \$\endgroup\$Kuba hasn't forgotten Monica– Kuba hasn't forgotten Monica2024年01月14日 17:58:03 +00:00Commented Jan 14, 2024 at 17:58
-
\$\begingroup\$ That is not too easy to do in FPGAs since closed source tools have poor-to-none support of this, and open source tools would need modifications, perhaps significant. In an ASIC or on a breadboard it's no big deal to do though. \$\endgroup\$Kuba hasn't forgotten Monica– Kuba hasn't forgotten Monica2024年01月14日 17:58:46 +00:00Commented Jan 14, 2024 at 17:58
Explore related questions
See similar questions with these tags.