
I'm trying to learn FPGA programming; my test project is a 5-stage pipelined MIPS CPU, which works.

Up until now I have been optimising for area utilisation; however, this has resulted in a very slow clock speed (~50 MHz).

I have been looking at the post-map static timing report generated by ISE, but can't make much sense of it. Below is the section for a single path (the slowest); I can't understand why this path would be so slow.

My questions:

1) If the timing delay is 80% routing (as this report seems to indicate), can I improve this? If so, how?

2) How can I reduce the logic component of the timing?

3) What is meant by "source" and "destination"? In the example below, opcode_out[1] is the source and finished[0] is the destination; however, in my design these are never directly connected. One is set on the negative edge in the decode stage, the other on the positive edge in the execute stage.

4) In some places I have played with using non-blocking assignments, though this is not possible everywhere. What performance effects does this have? I've found mixed reports on this.

5) Finally, what is the likelihood of getting my clock speed to 200 MHz, given that it is currently struggling to reach 50 MHz?

Paths for end point XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (OLOGIC_X2Y2.D1), 131 paths 
 -------------------------------------------------------------------------------- 
 Slack (setup path): -6.906ns (requirement - (data path - clock path skew + uncertainty)) 
 Source: XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 (FF) 
 Destination: XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 (FF) 
 Requirement: 5.000ns 
 Data Path Delay: 11.871ns (Levels of Logic = 6) 
 Clock Path Skew: 0.000ns 
 Source Clock: XLXN_200 falling at 5.000ns 
 Destination Clock: XLXN_200 rising at 10.000ns 
 Clock Uncertainty: 0.035ns 
 Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE 
 Total System Jitter (TSJ): 0.070ns 
 Total Input Jitter (TIJ): 0.000ns 
 Discrete Jitter (DJ): 0.000ns 
 Phase Error (PE): 0.000ns 
 Maximum Data Path at Slow Process Corner: XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 to XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
 Location Delay type Delay(ns) Physical Resource 
 Logical Resource(s) 
 ------------------------------------------------- ------------------- 
 SLICE_X40Y70.DQ Tcko 0.408 XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2> 
 XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out_2 
 SLICE_X41Y70.D5 net (fanout=8) e 0.759 XLXI_30/XLXI_5/XLXI_3/decode_inst/opcode_out<2> 
 SLICE_X41Y70.DMUX Tilo 0.313 XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0402<5>1_FRB 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22_SW0 
 SLICE_X39Y70.B5 net (fanout=1) e 0.377 N118 
 SLICE_X39Y70.B Tilo 0.259 XLXI_30/XLXI_5/XLXI_3/execute_inst/rd_value_out_wire<31> 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22 
 SLICE_X41Y72.A6 net (fanout=3) e 0.520 XLXI_30/XLXI_5/XLXI_3/execute_inst/Mmux_func_in[5]_PWR_69_o_mux_83_OUT22 
 SLICE_X41Y72.A Tilo 0.259 XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4> 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1 
 SLICE_X41Y72.C5 net (fanout=19) e 0.547 XLXI_30/XLXI_5/XLXI_3/execute_inst/GND_74_o_GND_74_o_equal_146_o<5>1 
 SLICE_X41Y72.C Tilo 0.259 XLXI_30/XLXI_5/XLXI_3/decode_inst/immediate_out<4> 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o<5>1 
 SLICE_X31Y70.A3 net (fanout=23) e 0.934 XLXI_30/XLXI_5/XLXI_3/execute_inst/func_in[5]_PWR_69_o_equal_125_o 
 SLICE_X31Y70.A Tilo 0.259 XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3> 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453 
 SLICE_X31Y70.B5 net (fanout=1) e 0.359 XLXI_30/XLXI_5/XLXI_3/execute_inst/_n0453 
 SLICE_X31Y70.B Tilo 0.259 XLXI_30/XLXI_5/XLXI_3/decode_inst/rd_out<3> 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set 
 OLOGIC_X2Y2.D1 net (fanout=2) e 5.556 XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_glue_set 
 OLOGIC_X2Y2.CLK0 Todck 0.803 XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
 XLXI_30/XLXI_5/XLXI_3/execute_inst/finished_1 
 ------------------------------------------------- --------------------------- 
 Total 11.871ns (2.819ns logic, 9.052ns route) 
 (23.7% logic, 76.3% route) 
asked Jun 25, 2015 at 15:54
  • Don't use the schematic editor. We're living in the 21st century... Commented Jun 26, 2015 at 1:11
  • Making a complex design run at 200 MHz is going to be tough. You might be able to do it if you pipeline your data, create clear boundaries between blocks and modules, lower your logic levels to 4 or 5, and use the register duplication and retiming options of the synthesis tool (it helps a little, but not by the factor of 4 you want to achieve). Commented Jun 26, 2015 at 9:18

2 Answers


1) Routing is always the dominant factor limiting timing. That is why a carry-lookahead adder is not really faster in an FPGA: the larger adder needs more routing, which partly cancels out the advantage. Your path has 6 levels of logic, which is OK; it would be very hard to get all paths below 6. However, some nets have a high fanout, which yields longer delays. You can try duplicating some registers to cut the fanout (see the sketch below), or try the Xilinx options.
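A rough Verilog sketch of the register-duplication idea — the names and the use of the KEEP attribute here are illustrative, not taken from your design:

```verilog
module opcode_dup (
    input  wire       clk,
    input  wire [5:0] opcode_next,
    output wire [5:0] opcode_out_a,  // e.g. feeds the ALU control
    output wire [5:0] opcode_out_b   // e.g. feeds hazard/forwarding logic
);
    // KEEP (where the tool supports it) asks the synthesizer not to merge
    // the two copies back into a single flip-flop.
    (* KEEP = "TRUE" *) reg [5:0] opcode_a;
    (* KEEP = "TRUE" *) reg [5:0] opcode_b;

    always @(negedge clk) begin
        opcode_a <= opcode_next;  // both copies always hold the same value,
        opcode_b <= opcode_next;  // but each now drives a smaller fanout
    end

    assign opcode_out_a = opcode_a;
    assign opcode_out_b = opcode_b;
endmodule
```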

2) By changing the logic equations... The delay through a slice is only affected by the path the signal takes through it, not by the logic operation; of course, the path it takes is dictated by the logic operation. You can see in the timing report that it takes around 0.250 to 0.300 ns per slice (plus the routing delays...).

3) Source and destination are exactly what they say: there is a path between opcode and finished. At the falling edge of clk, opcode becomes valid and its new value propagates through your circuit. The path ends at the finished register, and the propagation has to settle before the rising edge of clk to meet timing.

4) It has no influence if you use them appropriately. It should be a coding choice made to keep the code easy to read and understand, and the two can describe the same circuit (see the sketch below). Problems arise when people use them without understanding the impact.
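A minimal sketch of that coding choice (module and signal names are made up): non-blocking assignments in the clocked block and blocking assignments in the combinational block together describe one registered adder, exactly as separate statements would.

```verilog
// Both always blocks together describe one adder followed by one register;
// the choice of assignment operator follows the block type, not performance.
module style_example (
    input  wire       clk,
    input  wire [7:0] a, b,
    output reg  [7:0] sum_reg
);
    reg [7:0] sum_comb;

    always @(*)
        sum_comb = a + b;          // combinational logic: blocking (=)

    always @(posedge clk)
        sum_reg <= sum_comb;       // registered output: non-blocking (<=)
endmodule
```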

5) What's your FPGA? If it's a 7 series part, it will be hard but possible; otherwise, no way. Also, don't raise the clock frequency target drastically in one step. When the constraints are too tight, Xilinx freaks out and the results are untrustworthy. A design that works at 10 ns with a slack of 1 ns may fail with -2 ns at a clock of 11 ns. There is a breaking point where the synthesizer tries too hard to meet timing and then fails badly when it has to place and route the bigger design.

I would also suggest you remove the DDR clocking. There is no reason to have DDR logic in a processor; use a clock that is twice as fast instead. Having DDR adds unnecessary constraints on which slices can contain which logic, and probably inflates your routing delays. By using a single clock edge, the placer will (hopefully) pick the optimal slice for every register. A rough sketch of the idea follows.
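A hedged Verilog sketch of what that change looks like — clk2x, phase_en and the register names are hypothetical, and this only shows the general idea, not your code:

```verilog
// Instead of launching one stage on negedge clk and capturing the next on
// posedge clk, clock every pipeline register on the rising edge of a clock
// running at twice the frequency. All paths then get a full clk2x period.
module single_edge_stages (
    input  wire        clk2x,        // hypothetical 2x clock
    input  wire        phase_en,     // toggles each cycle to keep the old half-rate behaviour
    input  wire [31:0] decode_next,
    input  wire [31:0] execute_next,
    output reg  [31:0] decode_reg,   // was: always @(negedge clk)
    output reg  [31:0] execute_reg   // was: always @(posedge clk)
);
    always @(posedge clk2x) begin
        if (phase_en) decode_reg  <= decode_next;   // updates on "even" clk2x cycles
        else          execute_reg <= execute_next;  // updates on "odd" clk2x cycles
    end
endmodule
```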

answered Jun 25, 2015 at 18:11
  • Switching to an SDR clock from a DDR clock would also remove any duty cycle dependence. Commented Jun 25, 2015 at 22:14
  • Thanks for the detailed answer; a couple of small follow-up questions/answers: I'm using a Spartan-6. I got the timing down from 23 ns to 16 ns by refactoring the execute module. There are only about 12 paths remaining that have poor timing now, all between decode/execute and cache/cache. What is "high fanout"? You mention register duplication to fix this; can you provide an example? What suggests I have a DDR clock in the design? I do, but only in the interface between the FPGA and main memory (it's a requirement); it shouldn't appear at all in the "CPU". Commented Jun 26, 2015 at 17:52
  • In the report, it says "Source clock falling at 5ns" and "Destination clock rising at 10ns", thus defining a DDR path. An SDR path would only have rising to rising. The fanout of a net is how many connections that net has. If your fanout is 20, you can duplicate the driving register to get two nets with a fanout of 10 each. This can be difficult to achieve since the tool often removes duplicate registers, or does the duplication automatically. When doing timing work, it's important to know whether it's worth your time: if 2 paths are problematic, yes; if 20 paths are problematic, you will have a much harder time improving things. Commented Jun 26, 2015 at 18:08
  • In regards to DDR, doesn't it just mean that a timing constraint is failing between a falling and rising edge, as opposed to a timing constraint that fails between a rising and falling edge? Commented Jun 26, 2015 at 18:20
  • SDR doesn't have paths between rising and falling edges (or vice versa). SDR only has paths between successive rising edges or falling edges, for example "Source clock rising at 0ns" and "Destination clock rising at 10ns". Whenever you see rising to falling or the other way around, it's a DDR path. Commented Jun 26, 2015 at 18:23

This is only a partial answer, expanding on some of the points made by Jonathan Drolet.

3) What is meant by "source" and "destination"? In the example below, opcode_out[1] is the source and finished[0] is the destination; however, in my design these are never directly connected. One is set on the negative edge in the decode stage, the other on the positive edge in the execute stage.

Each registered signal in a synchronous design is generated by exactly one flip-flop, so the convention is to label each FF with the signal that it generates. The timing analyzer checks every pair of FFs that has a logic path between them. So, in this particular case, it's talking about the path from the output of the FF that generates opcode_out[1] to the input of the FF that generates finished[0], as in the reduced example below.
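A stripped-down illustration of such a source-to-destination path (the opcode value and names are invented, not taken from your CPU):

```verilog
// The timing analyzer measures from the clock pin of the "source" FF, through
// whatever combinational decoding sits in between, to the D input of the
// "destination" FF. Here the launch is on the falling edge and the capture on
// the rising edge, so the path only gets half a clock period to settle.
module path_example (
    input  wire       clk,
    input  wire [5:0] opcode_next,
    output reg  [5:0] opcode_out,   // source FF (launches on negedge)
    output reg        finished      // destination FF (captures on posedge)
);
    always @(negedge clk)
        opcode_out <= opcode_next;

    wire is_done_op = (opcode_out == 6'h3F);   // stand-in for several levels of decode

    always @(posedge clk)
        finished <= is_done_op;
endmodule
```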

It can get confusing in complex designs, because sometimes the synthesis tool will replicate FFs or logic in order to deal with fanout issues, and this creates signals with names that don't exactly match the names in the source code. As you get deeper into this, you'll see what I mean.

4) In some places I have played with using non-blocking assignments, though this is not possible everywhere. What performance effects does this have? I've found mixed reports on this.

The type of assignment has no effect on the timing of the synthesized logic, which only depends on how the assignments are triggered by events such as clock edges. However, the type of assignment can change the functionality of the code, so be sure it's doing what you want in every case.
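For example, here is a minimal sketch (not from the question's code) of how the assignment type can change what gets synthesized inside a clocked block:

```verilog
// With non-blocking assignments, q2 receives the previous value of q1, so the
// tool builds a 2-stage shift register. With blocking assignments, q1 updates
// before q2 is assigned, so the whole block collapses into a single register.
module shift_nonblocking (input wire clk, input wire d, output reg q2);
    reg q1;
    always @(posedge clk) begin
        q1 <= d;
        q2 <= q1;   // old q1: two flip-flops in series
    end
endmodule

module shift_blocking (input wire clk, input wire d, output reg q2);
    reg q1;
    always @(posedge clk) begin
        q1 = d;
        q2 = q1;    // new q1: effectively q2 <= d, one flip-flop
    end
endmodule
```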

5) Finally, what is the likelihood of getting my clock speed to 200 MHz, given that it is currently struggling to reach 50 MHz?

The only way to get a 4× speedup is to carefully rearchitect your implementation, using many more pipeline stages and doing a lot less logic per stage.

In the example you show, you have six levels of logic in the path being analyzed. You may need to limit that to two or three levels of logic per path, but this has major implications for how the design works overall. I once had to implement a long binary adder so that it had no more than 3 levels of logic per pipeline stage. It's doable, but you really need to pay attention to the details, as in the sketch below.
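As a hedged illustration of that kind of restructuring (a generic 64-bit adder, not the adder from that project), splitting the carry chain across two pipeline stages roughly halves the combinational depth per clock at the cost of one extra cycle of latency:

```verilog
module pipelined_add64 (
    input  wire        clk,
    input  wire [63:0] a, b,
    output reg  [63:0] sum            // valid two cycles after a/b are presented
);
    reg [31:0] sum_lo;
    reg        carry_lo;
    reg [31:0] a_hi_d, b_hi_d;

    // Stage 1: add the low halves, just register the high halves.
    always @(posedge clk) begin
        {carry_lo, sum_lo} <= a[31:0] + b[31:0];
        a_hi_d             <= a[63:32];
        b_hi_d             <= b[63:32];
    end

    // Stage 2: add the high halves plus the registered carry from stage 1.
    always @(posedge clk)
        sum <= {a_hi_d + b_hi_d + carry_lo, sum_lo};
endmodule
```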

answered Jun 26, 2015 at 14:15
  • Thanks for the answer - I've made some changes and now my design is faster (from 23 ns to 16 ns). The new "slowest path" has 12 logic levels - so I guess I know where I need to focus my attention. Commented Jun 26, 2015 at 17:54
