fpgacpu.org - Floating Point

Floating Point

Subject: Re: Floating point on fpga, and serial FP adders
Newsgroups: comp.arch.fpga
Date: Sat, 3 Jul 1999 10:43:33 -0700
Roland Paterson-Jones wrote in message <377dc508.d5f1d048@bigfoot.com>...
>It has been variously stated that fpga's are no good for floating point
>operations. Why? As I see it, floating point operations are typically
>just shifted integer operations. Is the bit-width problematic?
For 16-bit floats with (say) 10-bit mantissas, FPGAs should be *great* for
floating point. The problems start at wider bit-widths. The
area-expensive (and worse than linear scaling) FP components are the barrel
shifters needed for pre-add mantissa operand alignment and post-add
normalization in the FP adder, and of course the FP multiplier array.
The FCCM papers on this subject include:

Ligon et al., "A Re-evaluation of the Practicality of Floating-Point
Operations on FPGAs," FCCM 1998

Louca et al., "Implementation of IEEE Single Precision Floating Point
Addition and Multiplication on FPGAs," FCCM 1996

Shirazi et al., "Quantitative Analysis of Floating Point Arithmetic on
FPGA Based Custom Computing Machines," FCCM 1995

and the neat Leong paper on avoiding the problem entirely:

Leong et al., "Automating Floating to Fixed Point Translation and its
Application to Post-Rendering 3D Warping," FCCM 1999
See the Ligon paper for a nice presentation of speed-area tradeoffs of
various implementation choices. Ligon estimates their single-precision FP
adder resource use at between 563 and 629 LUTs -- 36-40% of an XC4020E. Note
this group used a synthesis tool; a hand-mapped design could be smaller.
Put another way, that single precision FP adder is almost twice the area of
a pipelined 32-bit RISC datapath. Ouch.
The rest of this article explores ideas for slower-but-smaller FP adders.
The two FP add barrel shifters are the problem. They each need many LUTs
and much interconnect. For example, a w-bit-wide barrel shifter is often
implemented as lg w stages of w-bit 2-1 muxes, optionally pipelined.
Example 1: single-precision in << s, w=24

  m0 = s[0] ? in[22:0] << 1 : in;
  m1 = s[1] ? m0[21:0] << 2 : m0;
  m2 = s[2] ? m1[19:0] << 4 : m1;
  m3 = s[3] ? m2[15:0] << 8 : m2;   // 16 wires 8 high
  out = s[4] ? m3[7:0] << 16 : m3;  // 8 wires 16 high
  ----
  5*24 2-1 muxes = 120 LUTs

Example 2: double-precision in << s, w=53

  m0 = s[0] ? in[51:0] << 1 : in;
  m1 = s[1] ? m0[50:0] << 2 : m0;
  m2 = s[2] ? m1[48:0] << 4 : m1;
  m3 = s[3] ? m2[44:0] << 8 : m2;   // 45 wires 8 high
  m4 = s[4] ? m3[36:0] << 16 : m3;  // 37 wires 16 high
  out = s[5] ? m4[20:0] << 32 : m4; // 21 wires 32 high
  ----
  6*53 2-1 muxes = 318 LUTs

In a horizontally oriented datapath, the last few mux stages have many
vertical wires, each many LUTs high. This is more vertical interconnect than
is available in one column of LUTs/CLBs, so the actual area can be much worse
than the LUT count indicates.

BUT we can of course avoid the barrel shifters, and do FP
denormalization/renormalization iteratively.

Idea #1: Replace the barrel shifters with early-out iterative shifters. For
example, build a registered 4-1 mux: w = mux(in, w<<1, w<<3, w<<7). Then an
arbitrary 24-bit shift can be done in 5 cycles or less in ~1/3 of the area.
For double precision, make it something like w = mux(in, w<<1, w<<4, w<<12),
giving an arbitrary 53-bit shift in 8 cycles.

Idea #2: (half baked and sketchy) Do FP addition in a bit- or nibble-serial
fashion. To add A+B, you
1) compare exponents A.exp and B.exp;
2) serialize A.mant and B.mant, LSB first;
3) swap (using 2 2-1 muxes) lsb-serial(A.mant) and lsb-serial(B.mant) if
   A.exp < B.exp;
4) delay lsb-serial(A.mant) in a w-bit FIFO for abs(A.exp-B.exp) cycles;
5) bit-serial-add delay(lsb-serial(A.mant)) + lsb-serial(B.mant) for w cycles;
6) collect in a "sum.mant" shift register;
7) shift up to w-1 cycles (until result mantissa is normalized).

It may be that steps 4 and 6 are quite cheap, using Virtex 4-LUTs in shift
register mode -- they're variable tap, right?
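The staged barrel shifter and the early-out iterative shifter of Idea #1 can
both be modeled in software to check the claimed cycle counts. A minimal
Python sketch (greedy step scheduling; the step sets {1,3,7} and {1,4,12} are
from the text, the function names and everything else are my assumptions):

```python
def barrel_shift(x, s, w=24):
    """Model a lg(w)-stage mux barrel shifter: fixed latency, one pass.
    Each stage is a row of w 2-1 muxes selecting shifted or unshifted."""
    stage = 1
    while stage < w:
        if s & stage:
            x = (x << stage) & ((1 << w) - 1)  # high bits fall off the top
        stage <<= 1
    return x

def iterative_shift_cycles(s, steps=(7, 3, 1)):
    """Cycles for an early-out iterative shifter built from a registered
    4-1 mux offering shift-by-0/1/3/7 per cycle, scheduled greedily."""
    cycles = 0
    while s > 0:
        s -= max(k for k in steps if k <= s)
        cycles += 1
    return cycles

# Worst case over all single-precision shift amounts (0..23): 5 cycles.
worst24 = max(iterative_shift_cycles(s, (7, 3, 1)) for s in range(24))
# Worst case for double precision with steps {1,4,12} (0..52): 8 cycles.
worst53 = max(iterative_shift_cycles(s, (12, 4, 1)) for s in range(53))
print(worst24, worst53)  # prints: 5 8
```

Both worst cases match the cycle counts claimed above for Idea #1.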
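The seven steps of Idea #2 can be sanity-checked with a small software model.
A Python sketch under simplifying assumptions of my own: unsigned mantissas
only, truncation instead of rounding, and the exponent taken as the weight of
the mantissa's LSB (all names are mine, not from the original):

```python
def serial_fp_add(a_exp, a_mant, b_exp, b_mant, w=24):
    """Model of the bit-serial FP add: align via FIFO delay, add LSB-first,
    collect, then renormalize. Value represented is mant * 2**exp."""
    # Steps 1 & 3: compare exponents; swap so A has the larger exponent.
    if a_exp < b_exp:
        a_exp, a_mant, b_exp, b_mant = b_exp, b_mant, a_exp, a_mant
    d = a_exp - b_exp                      # step 4: FIFO delay, in cycles

    # Steps 2, 4, 5, 6: serialize LSB first; delaying B's bit stream by d
    # cycles is the same as reading bit i+d of b_mant on cycle i.
    carry, sum_mant = 0, 0
    for i in range(w):                     # w bit-serial add cycles
        a_bit = (a_mant >> i) & 1
        b_bit = (b_mant >> (i + d)) & 1 if i + d < w else 0
        s = a_bit + b_bit + carry
        sum_mant |= (s & 1) << i           # step 6: collect in "sum.mant"
        carry = s >> 1
    sum_mant |= carry << w                 # possible carry-out bit

    # Step 7: renormalize so the w-bit mantissa has its MSB set again,
    # adjusting the exponent to match. Low bits of the delayed operand
    # were simply dropped -- a real design needs guard/round bits here.
    exp = a_exp
    if sum_mant >> w:                      # sum overflowed into bit w
        sum_mant >>= 1
        exp += 1
    while sum_mant and not (sum_mant >> (w - 1)) & 1:
        sum_mant <<= 1
        exp -= 1
    return exp, sum_mant
```

With w=4 for readability, adding 8*2^2 and 8*2^0 gives mantissa 10 at
exponent 2 (value 40), matching the exact sum.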
It is interesting to consider eliminating steps 2, 6, and 7 by keeping your
FP mantissa values in the serialized representation between operations,
counting clocks since the last sum-1-bit seen, and then normalizing (exponent
adjustment only) and aligning *both* operands (via swap/delay) on input to
the next FP operation. A big chained data computation might exploit many
serially interconnected serial FP adders and serial FP multipliers...

Is this approach better (throughput/area) than a traditional pipelined
word-oriented FP datapath? Probably not, but I don't know. If your FP needs
are modest (Mflops, not 100 Mflops), though, this approach should permit
quite compact FP hardware.

(Philip Freidin and I discussed this at FCCM99. Thanks, Philip.)

Jan Gray

Copyright © 2000, Gray Research LLC. All rights reserved.
Last updated: Feb 03 2001
