
Enhancing RISC-V Embedded Processor Performance through Advanced Instruction Fusion

Carlos Basto, Revi Ofir

Oct 28, 2025 / 6 min read

Synopsys IP
Technical Bulletin

In-depth technical articles, white papers, videos, webinars, product announcements and more.

Introduction

Advanced Instruction Fusion in Synopsys ARC-V processors is a novel mechanism for fusing common pairs of RISC-V instructions. It aims to improve processor pipeline efficiency, particularly in resource-constrained embedded processors. It extends a single-issue in-order processor to support dual instruction issue by fusing instructions that target different functional units. It introduces no new instructions, so it maintains full RISC-V compatibility and is software-agnostic, integrating with existing software and hardware environments. By reducing pipeline overhead and simplifying instruction handling, Advanced Instruction Fusion delivers significant efficiency improvements for embedded processors. The approach also provides adaptable design principles and can be extended from dual-instruction to multi-instruction fusion to benefit RISC-V processor implementations across the ecosystem.

Challenges in embedded RISC-V design

As embedded systems continue to evolve, designers face the growing challenge of balancing tight power and cost constraints with the need for higher performance and increasingly heterogeneous processing architectures. This shift is driven in part by the rapid expansion of edge AI, where more workloads are being pushed closer to the data source, demanding smarter, more capable embedded solutions. At the same time, the open-standard RISC-V architecture is gaining momentum, particularly in microcontroller units (MCUs), which are leading in adoption and shipment volumes. These processors must meet extreme power efficiency, safety and reliability standards while supporting complex workloads at the edge.

The RISC-V ISA was designed to be simple and modular, utilizing many simple instructions to minimize CPU power consumption and area. However, this instruction verbosity can limit performance, because complex operations take more instructions, and therefore more cycles, to execute.

While techniques like dual-issue, multi-issue and out-of-order execution can boost Instructions per Cycle (IPC) and performance, they often increase area requirements, posing challenges for resource-constrained embedded processors.

Instruction fusion is a well-known technique that exploits available hardware resources to increase instruction level parallelism (ILP) [1], [2]. Instruction fusion offers a way to enhance ILP and CPU performance with minimal area overhead, making it particularly beneficial for improving performance density in small, in-order processors.

This article describes a novel Advanced Instruction Fusion technique for fusing pairs of RISC-V instructions at the micro-architectural level. This technique captures the main efficiency benefits of a dual-issue processor, while maintaining RISC-V compatibility and avoiding the need for a separate pipeline.

What is Advanced Instruction Fusion?

Architectural fusion vs. micro-architectural fusion

Some ISAs fuse instructions at the architectural level, while others leave fusion to the micro-architectural (implementation) level. Typical examples are load/store pair and load/store with auto-increment. In some ISAs (e.g., ARM and ARC) these are fused at the architectural level, i.e., each is performed by a single instruction. Other ISAs (e.g., RISC-V) keep architectural instructions simple and delegate fusion to the implementation at the micro-architecture level.

The main advantages of microarchitectural fusion compared to architectural fusion are:

  • Microarchitectural fusion enables more aggressive optimizations, such as fusing load pairs even when their memory addresses are non-contiguous.
  • Binary compatibility across different ISA implementations is simplified, since simple or small processors are not required to implement instruction fusion.

Implementing fusion at the micro-architecture level requires the processor to have sufficient instruction fetch bandwidth. A simple RISC ISA (e.g., RISC-V) is very verbose and therefore consumes more instruction fetch bandwidth than ISAs that perform instruction fusion at the architectural level.

Simple in-order, single-issue processors usually have an instruction fetch bandwidth no greater than 4 bytes per cycle. This imposes a severe limitation on micro-architectural fusion: most fusion pairs would need to consist of two 16-bit compressed instructions.

Therefore, the first step in exploiting micro-architectural fusion in resource-constrained embedded processors is to increase their instruction fetch bandwidth.
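The fetch-bandwidth limit can be made concrete with a minimal sketch. It assumes a simplified model in which both instructions of a pair must arrive in the same fetch, with no alignment restrictions and no fetch buffering. The function name `pairs_that_fit` and the 8-byte fetch width are hypothetical and chosen only for illustration.

```python
from itertools import product

# Encoding sizes in bytes: base RISC-V instructions are 4 bytes,
# compressed (C extension) instructions are 2 bytes.
ENCODING_BYTES = {"32-bit": 4, "16-bit": 2}


def pairs_that_fit(fetch_bytes_per_cycle):
    """List the (first, second) encoding combinations that fit in one fetch.

    Hypothetical simplification: both instructions must arrive in the same
    fetch, with no alignment restrictions and no fetch buffering.
    """
    return [
        (first, second)
        for first, second in product(ENCODING_BYTES, repeat=2)
        if ENCODING_BYTES[first] + ENCODING_BYTES[second] <= fetch_bytes_per_cycle
    ]


if __name__ == "__main__":
    for width in (4, 8):
        print(width, pairs_that_fit(width))
```

Under this model a 4-byte fetch admits only the 16-bit plus 16-bit combination, while a wider fetch also admits pairs that include 32-bit instructions.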

Implementation Details

Advanced Instruction Fusion in resource constrained RISC-V designs

A traditional fusion pair does not require additional read or write register file bandwidth. Just like any other RISC-V instruction, a fusion pair would read at most two source operands from the register file and produce at most one result. There are, however, fusion pair candidates that break this rule:

  • Load-double: When two loads are fused, two register-file write ports are needed
  • Store-double: When two stores are fused, three register-file read ports are needed (the stores have a common base address, but each store needs its own store data operand)
  • MAC: When a multiply and add are fused, three register-file read ports are needed

Taking advantage of these advanced fused pairs (load-double, store-double, and MAC) requires additional hardware resources. More specifically, the register file must be able to provide three source operands and must have a second write port.
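The port counts listed above can be checked with a small sketch. The helper `ports_needed`, the register choices, and the concrete instruction sequences in the comments are illustrative assumptions, not the ARC-V implementation. In particular, the sketch assumes that a value produced by the first instruction and consumed by the second is forwarded inside the fused operation rather than read from the register file.

```python
def ports_needed(first, second):
    """Return (read_ports, write_ports) for a fused pair.

    Each instruction is a (sources, destinations) tuple of register names.
    Assumption: a value produced by the first instruction and consumed by
    the second is forwarded inside the fused operation, so it is not read
    from the register file.
    """
    first_srcs, first_dsts = first
    second_srcs, second_dsts = second
    reads = set(first_srcs) | (set(second_srcs) - set(first_dsts))
    writes = set(first_dsts) | set(second_dsts)
    return len(reads), len(writes)


# Traditional pair for comparison: slli t0, a0, 2 ; add t0, t0, a1
INDEXED_ADDRESS = ((["a0"], ["t0"]), (["t0", "a1"], ["t0"]))
# Load-double: lw a0, 0(a2) ; lw a1, 4(a2)
LOAD_DOUBLE = ((["a2"], ["a0"]), (["a2"], ["a1"]))
# Store-double: sw a0, 0(a2) ; sw a1, 4(a2)
STORE_DOUBLE = ((["a2", "a0"], []), (["a2", "a1"], []))
# MAC: mul t0, a0, a1 ; add t0, t0, a2
MAC = ((["a0", "a1"], ["t0"]), (["t0", "a2"], ["t0"]))

if __name__ == "__main__":
    for name, pair in [("indexed", INDEXED_ADDRESS), ("load-double", LOAD_DOUBLE),
                       ("store-double", STORE_DOUBLE), ("mac", MAC)]:
        print(name, ports_needed(*pair))
```

The traditional pair stays within two reads and one write, while the three advanced pairs each exceed one of those limits, matching the list above.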

The Advanced Instruction Fusion technique adds these hardware resources and also increases their utilization. It does so by leveraging the micro-architecture fusion framework to enable limited dual-issue capabilities on an in-order processor. With this approach, any two independent instructions that map to different functional units, together require at most three source operands, and write at most two destination registers can be considered candidates for advanced fusion (dual-issue).
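The candidate conditions can be expressed as a short predicate. This is a sketch under stated assumptions rather than the actual ARC-V logic: the `Instr` type, the unit names, and the function `can_dual_issue` are hypothetical. "Independent" is taken to mean no read-after-write or write-after-write hazard between the two instructions, and source operands are counted as distinct registers.

```python
from dataclasses import dataclass

MAX_SOURCE_OPERANDS = 3   # register-file read ports, per the text
MAX_DESTINATIONS = 2      # register-file write ports, per the text


@dataclass(frozen=True)
class Instr:
    unit: str                      # functional unit, e.g. "LSU", "ALU", "BR"
    srcs: frozenset = frozenset()  # source register names
    dsts: frozenset = frozenset()  # destination register names


def can_dual_issue(first, second):
    """Check the candidate conditions for an advanced fused pair."""
    if first.unit == second.unit:
        return False  # must map to different functional units
    # Assumption: "independent" means no read-after-write and no
    # write-after-write hazard between the two instructions.
    if first.dsts & second.srcs or first.dsts & second.dsts:
        return False
    if len(first.srcs | second.srcs) > MAX_SOURCE_OPERANDS:
        return False
    return len(first.dsts | second.dsts) <= MAX_DESTINATIONS


lw = Instr("LSU", frozenset({"a1"}), frozenset({"a0"}))              # lw a0, 0(a1)
sw = Instr("LSU", frozenset({"a0", "a1"}))                           # sw a0, 0(a1)
add = Instr("ALU", frozenset({"a3", "a4"}), frozenset({"a2"}))       # add a2, a3, a4
add_dep = Instr("ALU", frozenset({"a0", "a3"}), frozenset({"a2"}))   # add a2, a0, a3
sub = Instr("ALU", frozenset({"a5", "a6"}), frozenset({"a7"}))       # sub a7, a5, a6

if __name__ == "__main__":
    print(can_dual_issue(lw, add))      # independent LOAD+ALU pair
    print(can_dual_issue(lw, add_dep))  # add consumes the loaded value
    print(can_dual_issue(add, sub))     # both need the ALU
    print(can_dual_issue(sw, add))      # four distinct source registers
```

In this model the independent LOAD+ALU pair qualifies, while the dependent pair, the same-unit pair, and the pair needing four source registers do not.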

The instructions are fused in the front-end with pre-decoded information about opcode and register operand identifiers. The pre-decoded register operand identifiers are used to detect the absence of data dependencies between a pair of advanced fused instructions. The decoder is augmented to receive additional information about the fused instruction, but it is not duplicated. Each instruction of the fused pair is dispatched to its respective functional unit. The back-end of the processor is mostly agnostic to instruction fusion, except for the increment of the architectural PC and handling of exceptions triggered by fused instructions. There is no need to introduce a separate pipeline.
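One back-end detail mentioned above, the increment of the architectural PC, can be sketched as follows. The function names are hypothetical, and the sketch covers only the sequential case in which neither instruction of the pair redirects control flow or raises an exception. The length rule is the standard RISC-V encoding convention for 16-bit and 32-bit instructions.

```python
def instr_length(low_halfword):
    """Length in bytes of a RISC-V instruction from its lowest 16 bits.

    In the standard 16/32-bit encodings, the two lowest bits are 0b11 for a
    32-bit instruction and anything else for a 16-bit compressed one.
    """
    return 4 if (low_halfword & 0b11) == 0b11 else 2


def next_pc_after_fused_pair(pc, first_low_halfword, second_low_halfword):
    """Sequential next PC once both instructions of a fused pair retire.

    Assumption: neither instruction redirects control flow (no taken branch,
    no exception), so the PC simply steps over both encodings.
    """
    return pc + instr_length(first_low_halfword) + instr_length(second_low_halfword)


if __name__ == "__main__":
    LW_A0_0_A1 = 0x0005A503   # lw a0, 0(a1), a 32-bit encoding
    C_ADDI_A0_1 = 0x0505      # c.addi a0, 1, a 16-bit encoding
    pc = next_pc_after_fused_pair(0x1000, LW_A0_0_A1 & 0xFFFF, C_ADDI_A0_1)
    print(hex(pc))
```

A fused pair therefore advances the PC by 4, 6, or 8 bytes in the sequential case, depending on the mix of compressed and uncompressed encodings.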

Figure 1 illustrates a typical implementation of a RISC-V processor front end with Advanced Instruction Fusion. Examples of instruction pairs that can be fused this way include LOAD+ALU, LOAD+BR, LOAD+MPY, ST+BR, and ST+ALU.

Performance Results

Pollack's Rule states that performance improvements from microarchitectural enhancements generally scale with the square root of the increase in complexity. Figure 2 shows that for the Synopsys ARC-V RMX, a RISC-V embedded processor with Advanced Instruction Fusion, measured CoreMark/MHz scales linearly with silicon area, a greater performance benefit than Pollack's Rule would predict. Performance density improvements may be even more substantial, as Advanced Instruction Fusion incurs only a fixed area overhead.
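The two scaling behaviors can be written side by side. The symbols P (performance) and A (silicon area) are introduced here for illustration, and the sketch assumes that complexity is measured by area:

$$
P_{\text{Pollack}} \propto \sqrt{A}, \qquad P_{\text{linear}} \propto A
$$

As an arithmetic illustration only, not measured data: doubling the area would give roughly a 1.41x gain under Pollack's Rule ($\sqrt{2} \approx 1.41$) and a 2x gain under linear scaling.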

Conclusion

Advanced Instruction Fusion presents an effective approach for enhancing processor pipeline efficiency in resource-constrained embedded systems by fusing common pairs of RISC-V instructions. By enabling dual instruction issue on single-issue in-order processors through the fusion of instructions from different functional units, this technique achieves notable performance gains without introducing new instructions or requiring software modifications, thus maintaining full RISC-V compatibility. The reduction in pipeline overhead and simplification of instruction handling result in significant efficiency improvements, while the adaptable design allows for future extensions to multi-instruction fusion. Overall, Advanced Instruction Fusion offers a practical and scalable solution to efficiently improve performance in RISC-V processor implementations throughout the ecosystem.

To learn about Synopsys ARC-V Processor IP, please visit our ARC-V Processor IP webpage.

Key Takeaways

  • Ideal for Embedded Systems: The technique is tailored for small, in-order processors where area and power constraints limit traditional multi-issue or out-of-order designs.
  • Performance Boost Without ISA Changes: Advanced Instruction Fusion improves IPC and pipeline efficiency without introducing new instructions or breaking RISC-V compatibility.
  • Microarchitectural Fusion Advantage: Fusion at the microarchitecture level allows more flexible and aggressive optimizations compared to architectural fusion.
  • Hardware-Efficient Dual Issue: Enables dual instruction issue by fusing instructions from different functional units, requiring modest hardware enhancements like additional register file ports.
  • Scalable Design: The fusion framework is adaptable and can be extended to support multi-instruction fusion, paving the way for broader adoption across the RISC-V ecosystem.

References

[1] Exploring Instruction Fusion Opportunities in General Purpose Processors; Sawan Singh, Arthur Perais, Alexandra Jimborean, Alberto Ros

[2] The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V; Christopher Celio, Daniel Dabbelt, David A. Patterson, Krste Asanović
