I am trying to implement the following functionality in SystemVerilog, but I can't think of an efficient way to do it:
The picture shows two 6-bit input packets that arrive one after the other (each triggered by a clock edge). From these I need to deliver three 4-bit output packets.
The output packets need to be sent one after the other. From the first input packet I send out one 4-bit output packet, store the last 2 bits of that packet, and concatenate them with the first 2 bits of the next packet. Finally, the last 4 bits are sent as one more 4-bit packet.
N.B.: This is only an example. The number of input packets can go up to about 1000, and the packet width can change, which means the 'residue' (the leftover bits) would change too.
If anyone can guide me on how to approach this problem, I'll be really grateful.
Edit: I think I've confused everyone. The problem is quite complex, so I didn't include everything for the sake of simplicity. Here it goes:
The module I'm trying to make has a higher input parallelism than output parallelism, which means that more data goes into the module than comes out of it per transaction. Data therefore accumulates inside the module with every transaction.
The input packets arrive continuously, and each input packet lasts one clock cycle. The requirement is to push output packets (smaller than the input packets) continuously as well, and each output packet should also last one clock cycle. The N.B. sentence above is about parameterization: I want to be able to choose an arbitrary input and output parallelism. The 6-bit input and 4-bit output packets are just an illustration; it could just as well be a 16-bit input and a 10-bit output.
This would presumably be done with a state machine, but I don't know how to concatenate the bits left over from previous packets with the incoming packets while maintaining a continuous flow of output packets on every clock cycle. Hopefully this clears everything up.
-
How fast are the data clocks? What is the relationship between the clock that loads the input packets vs the clock that updates the output packets? Are input packets coming in continuously? Does the packet size change during operation or is it fixed at design time? What is receiving the packets that are output by this block? – Elliot Alderson, Aug 9, 2019 at 16:22
-
The system is synchronous (same clock for everything). Yes, input packets are coming in continuously. Packet size is fixed during operation. – Ashhad Khan, Aug 9, 2019 at 16:27
-
Gonna need some more concrete details. This is on the bit level? The packets are an arbitrary number of bits in length? What are the input and output interfaces? If the output is narrower than the input, is the output clock fast enough to transfer one per cycle, or do you need to transition to a faster clock? – alex.forencich, Aug 9, 2019 at 17:15
-
How can you take two packets in and write three packets out on the same clock? Your description does not make sense. – Elliot Alderson, Aug 9, 2019 at 17:33
-
Please elaborate on the "its width can change" section. Apart from that it is a trivial problem. – Oldfart, Aug 9, 2019 at 20:11
1 Answer
What you need is a FIFO with different input and output bit widths. This can be implemented with a bit array, two index pointers, and a register that keeps track of the number of stored bits. The pointers wrap around. The bit width of the storage array needs to be a common multiple of the input and output widths, so that no read or write straddles the wrap-around point.
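As a rough sketch of that sizing rule: the smallest usable storage width is the least common multiple of the two widths, which can be computed at elaboration time. The module name, the parameters `IN_W` and `OUT_W`, and the `gcd` helper are all hypothetical names introduced for illustration; any larger multiple of the LCM would also work and would give more buffering.

```systemverilog
// Sizing sketch only: declarations, no data path.
// IN_W, OUT_W, gcd() and the module name are illustrative assumptions.
module store_size_sketch #(
    parameter int IN_W  = 6,   // input packet width
    parameter int OUT_W = 4    // output packet width
);
    // Greatest common divisor (Euclid), usable as a constant function.
    function automatic int gcd(input int a, input int b);
        int t;
        while (b != 0) begin
            t = b;
            b = a % b;
            a = t;
        end
        return a;
    endfunction

    // Least common multiple of the two widths: 12 for 6/4, 80 for 16/10.
    localparam int STORE_BITS = (IN_W / gcd(IN_W, OUT_W)) * OUT_W;

    // Pointers index 0 .. STORE_BITS-1; the counter must also hold STORE_BITS.
    localparam int IDX_W = $clog2(STORE_BITS);
    localparam int CNT_W = $clog2(STORE_BITS + 1);

    logic [STORE_BITS-1:0] store;            // bit storage
    logic [IDX_W-1:0]      in_idx, out_idx;  // write / read pointers
    logic [CNT_W-1:0]      bitcnt;           // number of stored bits
endmodule
```

Because both pointers only ever advance in steps of their own packet width, and `STORE_BITS` is a multiple of both, every packet-sized slice starts and ends inside the array.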
Here is some SystemVerilog code to get you started. Not fully tested or optimized, and I omitted the logic for empty / full / error (overflow) for you to figure out.
    always_ff @(posedge clk) begin
        if (!rst_n) begin
            in_idx  <= '0;   // input pointer
            out_idx <= '0;   // output pointer
            bitcnt  <= '0;   // number of stored bits
            out_vld <= 1'b0;
        end
        else begin
            bitcnt <= next_bitcnt;
            if (in_vld) begin
                store[ in_idx +: 6] <= in;
                in_idx <= (in_idx + 6) % STORE_BITS;
            end
            out_vld <= (bitcnt >= 4 || in_vld);
            if (out_vld) begin
                out_idx <= (out_idx + 4) % STORE_BITS;
            end
        end
    end

    always_comb begin
        next_bitcnt = bitcnt;
        if (in_vld) begin
            next_bitcnt += 6;
        end
        if (bitcnt >= 4 || in_vld) begin
            next_bitcnt -= 4;
        end
    end

    assign out = store[ out_idx +: 4 ];
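To try the snippet in a simulator, it needs declarations and a stimulus. The following is a minimal self-checking sketch, not a full verification environment: the module names, the signal widths, `STORE_BITS = 12`, and the two test packets are assumptions added for illustration. It sends the two 6-bit packets from the question and checks for three 4-bit outputs, packed LSB-first as the indexed part-selects imply.

```systemverilog
// The logic above wrapped in a module; all declarations are assumed.
module repack_6to4 (
    input  logic       clk, rst_n, in_vld,
    input  logic [5:0] in,
    output logic       out_vld,
    output logic [3:0] out
);
    localparam int STORE_BITS = 12;          // LCM of 6 and 4 (assumed size)
    logic [STORE_BITS-1:0] store;
    logic [3:0] in_idx, out_idx, bitcnt, next_bitcnt;

    always_ff @(posedge clk) begin
        if (!rst_n) begin
            in_idx <= '0; out_idx <= '0; bitcnt <= '0; out_vld <= 1'b0;
        end
        else begin
            bitcnt <= next_bitcnt;
            if (in_vld) begin
                store[in_idx +: 6] <= in;
                in_idx <= (in_idx + 6) % STORE_BITS;
            end
            out_vld <= (bitcnt >= 4 || in_vld);
            if (out_vld) out_idx <= (out_idx + 4) % STORE_BITS;
        end
    end

    always_comb begin
        next_bitcnt = bitcnt;
        if (in_vld) next_bitcnt += 6;
        if (bitcnt >= 4 || in_vld) next_bitcnt -= 4;
    end

    assign out = store[out_idx +: 4];
endmodule

module repack_6to4_tb;
    localparam logic [5:0] IN0 = 6'b10_1101;   // first test packet (arbitrary)
    localparam logic [5:0] IN1 = 6'b01_1011;   // second test packet (arbitrary)

    logic clk = 0, rst_n, in_vld, out_vld;
    logic [5:0] in;
    logic [3:0] out;

    repack_6to4 dut (.*);
    always #5 clk = ~clk;

    // Compare the current output against the expected nibble.
    function automatic void check(input logic [3:0] want);
        if (out_vld !== 1'b1 || out !== want)
            $error("expected %b, got %b (out_vld=%b)", want, out, out_vld);
        else
            $display("ok: out = %b", out);
    endfunction

    initial begin
        rst_n = 0; in_vld = 0; in = '0;
        @(negedge clk);                        // reset sampled on first posedge
        rst_n = 1; in_vld = 1; in = IN0;
        @(negedge clk); check(IN0[3:0]);               // low 4 bits of packet 0
        in = IN1;
        @(negedge clk); check({IN1[1:0], IN0[5:4]});   // residue + next packet
        in_vld = 0;
        @(negedge clk); check(IN1[5:2]);               // remaining 4 bits
        @(negedge clk);
        if (out_vld !== 1'b0) $error("unexpected extra output");
        $finish;
    end
endmodule
```

Inputs change on the falling edge so that they are stable when the design samples them on the rising edge. The test stops after two packets on purpose: with an input on every cycle, 6 bits arrive for every 4 that leave, so the omitted full/overflow handling would be needed.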
-
+1, also if the buffer is very large you might get better area/performance with a shift register shifting by 4b each time to avoid a big combinational mux? The 6b input could probably still use bit indexing, though the index increment would be different. – Justin, Aug 20, 2019 at 20:52
-
As I said, it is not optimized. Since I suspect this is a class assignment, I'm intentionally leaving certain things out. I tried to synthesize my code with Yosys (v0.3.0) on EDA Playground, which couldn't optimize `store[ in_idx +: 6] <= in;` (it generated logic for `in_idx` equal to 0,1,2,3,4,... even though only 0,6,12,... are reachable). I had to add a for-loop and an if-condition to get better synthesis results. – Greg, Aug 21, 2019 at 21:23
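The exact rewrite is not shown in the comment, so the following is only a guess at what a for-loop plus if-condition version of the write could look like. The module name, the port list, and the parameters `IN_W` and `STORE_BITS` are hypothetical. The idea is to enumerate only the slot-aligned offsets, so each unrolled iteration has a constant base address.

```systemverilog
// Guessed sketch of a slot-aligned write; not the commenter's actual code.
module store_write_sketch #(
    parameter int IN_W       = 6,    // input packet width
    parameter int STORE_BITS = 12    // assumed to be a multiple of IN_W
) (
    input  logic                          clk,
    input  logic                          in_vld,
    input  logic [IN_W-1:0]               in,
    input  logic [$clog2(STORE_BITS)-1:0] in_idx,
    output logic [STORE_BITS-1:0]         store
);
    always_ff @(posedge clk) begin
        // Only offsets 0, IN_W, 2*IN_W, ... are visited, so after unrolling
        // there is one write enable per slot instead of one per bit offset.
        for (int i = 0; i < STORE_BITS; i += IN_W) begin
            if (in_vld && in_idx == i) begin
                store[i +: IN_W] <= in;
            end
        end
    end
endmodule
```

For the 6-bit input and a 12-bit store this unrolls to two guarded writes, at offsets 0 and 6.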