\$\begingroup\$

LatticeMico32 (LM32) is a royalty-free CPU that I use to study how a pipelined in-order CPU may be implemented.

One point I have trouble with is how the register file is implemented. On a pipelined CPU, you will normally have at least three accesses to the register file on a given clock cycle:

  • 2 reads for both operands for the execution units.
  • 1 write from writeback stage

LM32 provides three ways to implement the register file:

  • Block RAM inference where reads/writes have extra logic to avoid parallel reads/writes.
  • Block RAM inference with out-of-phase clocks, which doesn't require extra logic.
  • Distributed RAM inference.
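
For contrast with the two block RAM options, the distributed RAM style is usually written with asynchronous (combinational) reads. This is a minimal sketch assuming a 32 x 32-bit array. The module and port names are hypothetical and are not taken from lm32:

```verilog
// Hypothetical 2-read/1-write register file with asynchronous reads.
// All names here are illustrative; this is not the lm32 source.
module regfile_async (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr,
    input  wire [31:0] wdata,
    input  wire [4:0]  raddr0,
    input  wire [4:0]  raddr1,
    output wire [31:0] rdata0,
    output wire [31:0] rdata1
);
    reg [31:0] mem [0:31];

    // Synchronous write on the rising edge.
    always @(posedge clk)
        if (we)
            mem[waddr] <= wdata;

    // Combinational reads: the outputs follow the addresses directly.
    assign rdata0 = mem[raddr0];
    assign rdata1 = mem[raddr1];
endmodule
```

Because the reads are combinational, a value written on one clock edge is visible at the read ports right after that edge, with no bypass logic. Block RAM on devices such as iCE40 has synchronous reads, so a synthesizer can presumably only map this shape to block RAM when it finds a register on the address or data path to absorb.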

In practice, even with distributed RAM inference, I have seen both Xilinx ISE and Yosys infer a block RAM with in-phase read and write clocks. In addition, I've seen both synthesizers infer at least part of the extra logic that lm32 explicitly includes for a positive-edge block RAM register file.

The inferred extra logic enables transparent reads. I have pasted the code here for lm32's explicit implementation, but I know from experimentation that yosys generates effectively the same code to place the register file in block RAM on iCE40:

// Register file
`ifdef CFG_EBR_POSEDGE_REGISTER_FILE
 /*----------------------------------------------------------------------
 Register File is implemented using EBRs. There can be three accesses to
 the register file in each cycle: two reads and one write. On-chip block
 RAM has two read/write ports. To accommodate three accesses, two on-chip
 block RAMs are used (each register file "write" is made to both block
 RAMs).
 One limitation of the on-chip block RAMs is that one cannot perform a
 read and write to same location in a cycle (if this is done, then the
 data read out is indeterminate).
 ----------------------------------------------------------------------*/
 wire [31:0] regfile_data_0, regfile_data_1;
 reg [31:0] w_result_d;
 reg regfile_raw_0, regfile_raw_0_nxt;
 reg regfile_raw_1, regfile_raw_1_nxt;
 /*----------------------------------------------------------------------
 Check if read and write is being performed to same register in current
 cycle? This is done by comparing the read and write IDXs.
 ----------------------------------------------------------------------*/
 always @(reg_write_enable_q_w or write_idx_w or instruction_f)
 begin
     if (reg_write_enable_q_w
         && (write_idx_w == instruction_f[25:21]))
         regfile_raw_0_nxt = 1'b1;
     else
         regfile_raw_0_nxt = 1'b0;
     if (reg_write_enable_q_w
         && (write_idx_w == instruction_f[20:16]))
         regfile_raw_1_nxt = 1'b1;
     else
         regfile_raw_1_nxt = 1'b0;
 end
 /*----------------------------------------------------------------------
 Select latched (delayed) write value or data from register file. If
 read in previous cycle was performed to register written to in same
 cycle, then latched (delayed) write value is selected.
 ----------------------------------------------------------------------*/
 always @(regfile_raw_0 or w_result_d or regfile_data_0)
     if (regfile_raw_0)
         reg_data_live_0 = w_result_d;
     else
         reg_data_live_0 = regfile_data_0;
 /*----------------------------------------------------------------------
 Select latched (delayed) write value or data from register file. If
 read in previous cycle was performed to register written to in same
 cycle, then latched (delayed) write value is selected.
 ----------------------------------------------------------------------*/
 always @(regfile_raw_1 or w_result_d or regfile_data_1)
     if (regfile_raw_1)
         reg_data_live_1 = w_result_d;
     else
         reg_data_live_1 = regfile_data_1;
 /*----------------------------------------------------------------------
 Latch value written to register file
 ----------------------------------------------------------------------*/
 always @(posedge clk_i `CFG_RESET_SENSITIVITY)
     if (rst_i == `TRUE)
     begin
         regfile_raw_0 <= 1'b0;
         regfile_raw_1 <= 1'b0;
         w_result_d <= 32'b0;
     end
     else
     begin
         regfile_raw_0 <= regfile_raw_0_nxt;
         regfile_raw_1 <= regfile_raw_1_nxt;
         w_result_d <= w_result;
     end
// Two Block RAM instantiations follow to get 2 read/1 write port.

Transparent reads ensure that writes to the same address as a read from another port appear at the read port on the same clock edge as well (assume the read and write clocks are synchronous). The lm32 pipeline relies on the read ports immediately reflecting the written-back register value.
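
The transparent-read idiom described above can be sketched in isolation as a single read port plus one write port on one clock. This is a simplified illustration under my own assumptions, not the lm32 source: the module name `regfile_port` and all port names are hypothetical, while `collide_q` and `wdata_q` play the roles of `regfile_raw_0` and `w_result_d` in the listing above.

```verilog
// Hypothetical single-clock RAM port pair with a transparent-read bypass.
// Names are illustrative only.
module regfile_port #(
    parameter AW = 5,
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [DW-1:0] wdata,
    input  wire [AW-1:0] raddr,
    output wire [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    reg [DW-1:0] mem_q;     // registered RAM output (old data on a collision)
    reg [DW-1:0] wdata_q;   // write data delayed by one cycle
    reg          collide_q; // read and write hit the same address on the last edge

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        mem_q     <= mem[raddr];
        wdata_q   <= wdata;
        collide_q <= we && (waddr == raddr);
    end

    // On a collision, forward the delayed write data instead of the RAM output.
    assign rdata = collide_q ? wdata_q : mem_q;
endmodule
```

Note that the bypass only covers a write that lands on the same edge as the read. It says nothing about what happens if the read address moves on while the consumer of `rdata` is stalled.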

However, there is extra glue logic for dealing with a stall of the pipeline and I'm not certain what this code accomplishes, even after studying the CPU implementation in detail. I have commented the code below for convenience:

`ifdef CFG_EBR_POSEDGE_REGISTER_FILE
 // Buffer data read from register file, in case a stall occurs, and watch for
 // any writes to the modified registers
 always @(posedge clk_i `CFG_RESET_SENSITIVITY)
 begin
     if (rst_i == `TRUE)
     begin
         use_buf <= `FALSE;
         reg_data_buf_0 <= {`LM32_WORD_WIDTH{1'b0}};
         reg_data_buf_1 <= {`LM32_WORD_WIDTH{1'b0}};
     end
     else
     begin
         if (stall_d == `FALSE)
             use_buf <= `FALSE;
         else if (use_buf == `FALSE)
         begin
             // If we stall in the decode stage, unconditionally
             // buffer the register file values from the read ports.
             // They will be used instead when the stall ends.
             reg_data_buf_0 <= reg_data_live_0;
             reg_data_buf_1 <= reg_data_live_1;
             use_buf <= `TRUE;
         end
         if (reg_write_enable_q_w == `TRUE)
         // If either register's address matches the register
         // to be written back, replace the buffered read values.
         begin
             if (write_idx_w == read_idx_0_d)
                 reg_data_buf_0 <= w_result;
             if (write_idx_w == read_idx_1_d)
                 reg_data_buf_1 <= w_result;
         end
     end
 end
`endif

Why is this logic required, and why only for in-phase read/write clocks? Is this code similar to any other common idiom for reading the correct data from block RAM as implemented on FPGAs (i.e. similar to how synthesizers infer transparent read/write logic)?

I would have figured that during a stall of the decode stage of a RISC CPU, logic that ensures transparent reads would be enough to make sure the read ports have the correct data output when the stall ends. By the time a full clock cycle has passed after a simultaneous read/write has occurred to the same address on different ports, shouldn't the read ports' data output(s) have settled to the new value, so we only need to buffer the most immediate data written to the write port?

I've synthesized this CPU many times using the distributed RAM inference alone (inferred as block RAM), so either this logic is not required, or ISE and Yosys are capable of inferring the extra glue logic required.

Paebbels
asked Jan 3, 2018 at 1:06
\$\endgroup\$

1 Answer

\$\begingroup\$

This has been unanswered for a day, and I think I know why. Once Verilog code grows larger and more complex, it becomes very difficult to see all the temporal relations. Even if the author adds lots of comments (you said you added the comments yourself, so I assume that was not the case here), you often have to run a simulation to see how it all hangs together.
To find out why that code is needed, remove it and see where things go wrong.

Having said that, I can think of a possible scenario.

  • If the register file is a synchronous memory, the data output lags the address by one cycle.
  • The addresses to the register file are not stopped immediately on a decoder stall.
  • The data coming out is lost during the stall, so it must be captured.

This is not easy to describe in words, so here is a timing diagram of that possible scenario:

[Figure: timing diagram of the stall scenario, cycles 1 to 5 (image not available)]

In cycle 2 the need for a stall is detected. For some reason the addresses cannot be stopped.
Cycle 3 is our extra stall cycle. By now the stall has reached the address logic, so the addresses stop.
In cycle 4 we want to continue, but the data 'M1' is lost unless we store it during the stall and use it in cycle 4. From cycle 5 on, all is OK again.

Note that with an asynchronous register file the problem does not occur.
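
The capture step in that scenario can be sketched as a small hold register on the output of a synchronous RAM. This is only an illustration of the idea under my own assumptions: the module name `stall_hold` and its ports are hypothetical, and it leaves out the part of the code in the question that overwrites the buffer when a write-back hits a buffered register.

```verilog
// Hypothetical hold register for a synchronous RAM output.
// Captures the RAM data on the first stalled cycle and presents the
// captured copy while the stall lasts and on the cycle it is released.
module stall_hold #(
    parameter DW = 32
) (
    input  wire          clk,
    input  wire          rst,
    input  wire          stall,  // consumer is not accepting data this cycle
    input  wire [DW-1:0] ram_q,  // registered output of the RAM
    output wire [DW-1:0] data
);
    reg          use_buf;
    reg [DW-1:0] buf_q;

    always @(posedge clk) begin
        if (rst) begin
            use_buf <= 1'b0;
            buf_q   <= {DW{1'b0}};
        end else if (!stall) begin
            use_buf <= 1'b0;
        end else if (!use_buf) begin
            buf_q   <= ram_q;  // capture only on the first stalled cycle
            use_buf <= 1'b1;
        end
    end

    // While a captured value exists, it overrides whatever the RAM shows now.
    assign data = use_buf ? buf_q : ram_q;
endmodule
```

With an asynchronous register file, `ram_q` would follow the held address directly, so nothing would need to be captured.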


As a side note: I don't agree with your comment "unconditionally buffer the register file values". It is not unconditional, because the code that follows, "if (reg_write_enable_q_w ...", takes precedence. That means there is an implicit "if no write is happening" condition.

answered Jan 4, 2018 at 11:36
\$\endgroup\$
  • \$\begingroup\$ I haven't had time to examine in appreciable detail, or create a test case with equivalent behavior and inferred block RAM, but your answer is correct in this case (except M0 needs to be preserved during the stall; it's not possible to stop Adrs from being updated to A1 even when the stall signal is received). Accepting and will add my own answer elaborating when I get the chance. \$\endgroup\$ Commented Jan 9, 2018 at 0:28
