fpgacpu.org - Inner Loop Custom Datapaths

Inner Loop Custom Datapaths

Subject: Re: FPGA multiprocessors
Date: 07 Oct 1997 00:00:00 GMT
Newsgroups: comp.arch.fpga
Charles Sweeney <CharlesSweeney-@compuserve.com> wrote in article
<3438A7D6.2431@compuserve.com>...
> Jan Gray wrote:
>> Assuming careful floorplanning, it should be possible to place six 32-bit
>> processor tiles, or twelve 16-bit processor tiles, in a single 56x56
>> XC4085XL with space left over for interprocessor interconnect. Also the
>> number of processor tiles can be doubled if we eschew the I-cache and
>> simplify the microarchitecture -- though performance would greatly suffer.
>
> It's good to see you planning to take advantage of the parallelism
> offered by FPGAs, but why constrain your software to have to run in a
> particular microprocessor architecture? why not go further and compile
> your programs directly into the hardware of the FPGA, Handel-C does
> exactly that, please see our web site below.

Good question.
The trite answer is that, since designing processor ISAs and microarchitectures
for FPGA implementations is my research interest, that's my hammer
in search of nails. FPGA multiprocessors are now possible -- but it
remains to be seen if they are actually useful!

The other answer is that I don't preclude a modest custom
datapath per processor (and such datapaths could be designed
from source code by tools such as Handel-C). So I think an FPGA
multiprocessor is the preferred solution for problems which:
1. are amenable to n-way "outer loop" parallelism and
2. involve too much irregular computation for a custom datapath alone and
3. involve enough regular inner loop computation that an FPGA
custom datapath is faster/cheaper than a general purpose processor
or a multiprocessor built of the same.
(Whether such problems exist and are important remains to be seen.)

As for your question "why not go further and compile your
programs directly into the hardware of the FPGA?":

There will always be very regular signal processing applications,
regular in computation, regular in operand fetch and result store,
and relatively simple in the computation kernel, for which a custom
datapath compiled to an FPGA is a good solution.

But there are also other computations which are either
too irregular or too large to practically implement in an FPGA
datapath, even in a time-multiplexed (reconfiguration) manner.
The "outer loops" and "outer function calls" of these
computations are best done in a general purpose processor,
even as you move the inner loop(s) to a custom datapath.
Indeed, the inner loops may constitute only a few percent
of the total text of the source code of the computation.
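
One rough way to see what moving only the inner loops buys is Amdahl's law. The sketch below is an illustration and not part of the original posting. It assumes the inner loops, although a small share of the source text, account for a large share of the run time; the function name `overall_speedup` and every figure in it are hypothetical.

```python
def overall_speedup(inner_fraction, datapath_speedup):
    """Amdahl's law: overall speedup when only the inner loop is accelerated.

    inner_fraction   -- share of run time spent in the inner loop (0..1)
    datapath_speedup -- how much faster the custom datapath runs that loop
    """
    if not 0.0 <= inner_fraction <= 1.0:
        raise ValueError("inner_fraction must be between 0 and 1")
    if datapath_speedup <= 0:
        raise ValueError("datapath_speedup must be positive")
    # The outer loops stay on the general purpose processor, unaccelerated.
    outer_fraction = 1.0 - inner_fraction
    return 1.0 / (outer_fraction + inner_fraction / datapath_speedup)


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    for fraction in (0.50, 0.90, 0.99):
        speedup = overall_speedup(fraction, 20.0)
        print(f"{fraction:.0%} of run time in the inner loop, "
              f"20x datapath: {speedup:.2f}x overall")
```

The same formula shows the limit of the approach: whatever share of run time remains in the irregular outer code bounds the overall gain, no matter how fast the datapath is.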
To help these large "dusty deck" applications take advantage
of custom datapaths, it must be extremely convenient to
interface the custom stuff to the general purpose processor.

For some problems where even the irregular computation
is a critical path, especially those involving floating-point,
it probably makes sense to choose a fast, cheap
commercial off-the-shelf microprocessor.
Of course there are penalties here: the cost of the processor,
less integration, board real-estate costs, "representation
domain crossing" costs, relatively slow communication
between processor and FPGA, and the cost of FPGA resources
spent interfacing to the processor.

But for problems where the irregular computation is
not the critical path, the now modest overhead (10-20%)
of an embedded general purpose CPU enables an
interesting integrated "system on chip" hybrid:
embedded processor, on-chip bus, on-chip custom
datapaths and peripherals.

In theory, you could compile your dusty deck C, C++,
Java, FORTRAN, Scheme, etc. and run it immediately
on your FPGA CPU. Then, automatically (profile-driven)
or through explicit directives, you can compile the inner
loops to a custom datapath. This can be manifest either
as an on-chip command-oriented coprocessor or, in some
cases, as new instructions. The latter has the potential
advantage of very high custom operation issue rates
(today, 66 MHz) and access to the processor register
file, etc.
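
The difference between the two attachment styles can be sketched with a toy software model that does nothing but count instructions issued by the processor. Everything here is an assumption made for illustration: the `ToyCpu` class, the coprocessor register names, the four-step command sequence, and the placeholder `datapath` function (an average) are hypothetical and do not describe any particular FPGA CPU.

```python
def datapath(a, b):
    """Placeholder for whatever the custom datapath computes (here: an average)."""
    return (a + b) >> 1


class ToyCpu:
    """Toy model that only counts instructions issued by the processor."""

    def __init__(self):
        self.regs = [0] * 16                      # processor register file
        self.issued = 0                           # instructions issued so far
        self.copro = {"a": 0, "b": 0, "result": 0}  # coprocessor registers

    # --- Option 1: command-oriented coprocessor on the on-chip bus ---
    def _store_copro(self, name, reg):
        self.issued += 1                          # one store per operand
        self.copro[name] = self.regs[reg]

    def _start_copro(self):
        self.issued += 1                          # one store to issue the command
        self.copro["result"] = datapath(self.copro["a"], self.copro["b"])

    def _load_copro(self, name, reg):
        self.issued += 1                          # one load to fetch the result
        self.regs[reg] = self.copro[name]

    def op_via_coprocessor(self, rd, ra, rb):
        self._store_copro("a", ra)
        self._store_copro("b", rb)
        self._start_copro()
        self._load_copro("result", rd)

    # --- Option 2: the datapath exposed as a new instruction ---
    def op_via_instruction(self, rd, ra, rb):
        self.issued += 1                          # operands come straight from registers
        self.regs[rd] = datapath(self.regs[ra], self.regs[rb])


if __name__ == "__main__":
    for method in ("op_via_coprocessor", "op_via_instruction"):
        cpu = ToyCpu()
        cpu.regs[1], cpu.regs[2] = 10, 20
        getattr(cpu, method)(3, 1, 2)
        print(method, "result:", cpu.regs[3], "instructions issued:", cpu.issued)
```

In this model the new-instruction form issues one instruction per custom operation because its operands already sit in the register file, while the coprocessor form spends extra instructions moving operands and results across the bus.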
Given this approach, even if your dusty deck app stores
its data in such advanced data structures (sarcasm)
as a linked list (/sarcasm), it can still potentially take
advantage of a custom datapath. This is much less
feasible if your registers or operand(s) are microseconds
away on the non-embedded host processor.
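
A back-of-the-envelope sketch of the linked-list case follows. The pointer chasing stays in software and each node's value is handed to a stand-in for the custom operation. The 66 MHz figure comes from the text above; reading "microseconds away" as a 1 microsecond round trip, and one custom operation per clock, are assumptions, as are all names in the code. The estimate counts only the cost of reaching the datapath, not the list traversal itself.

```python
from dataclasses import dataclass
from typing import Optional

CLOCK_HZ = 66e6                      # issue rate quoted in the text
ON_CHIP_OP_S = 1.0 / CLOCK_HZ        # assumption: one custom op per clock
OFF_CHIP_ROUND_TRIP_S = 1e-6         # assumption: "microseconds away" taken as 1 us


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def build_list(values):
    """Build a singly linked list holding the values in order."""
    head = None
    for v in reversed(list(values)):
        head = Node(v, head)
    return head


def walk(head, custom_op):
    """Chase pointers in software; custom_op stands in for the custom datapath."""
    total, count = 0, 0
    node = head
    while node is not None:
        total = custom_op(total, node.value)
        count += 1
        node = node.next
    return total, count


def datapath_time(count, per_op_seconds):
    """Time spent just reaching the datapath, once per node."""
    return count * per_op_seconds


if __name__ == "__main__":
    head = build_list(range(1000))
    _, n = walk(head, lambda acc, v: acc + v)
    on_chip = datapath_time(n, ON_CHIP_OP_S)
    off_chip = datapath_time(n, OFF_CHIP_ROUND_TRIP_S)
    print(f"{n} nodes: on-chip {on_chip * 1e6:.1f} us, "
          f"off-chip {off_chip * 1e6:.1f} us, ratio {off_chip / on_chip:.0f}x")
```

Under these assumed numbers the per-node cost of reaching an off-chip datapath is dozens of times the cost of an on-chip custom instruction, which is why fine-grained use from an irregular data structure favors the embedded arrangement.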
For example, the unused logic in
 //www3.sympatico.ca/jsgray/sld021.htm
was reserved for the Gouraud rendering instructions described
in the last paragraph of:
 //www3.sympatico.ca/jsgray/render.txt

Of course, embedded processor in programmable logic is just
one point on the CPU/custom datapath spectrum. See also
the BRASS research
 //http.cs.berkeley.edu/Research/Projects/brass
and my old essay on FPGA PC coprocessors
 //www3.sympatico.ca/jsgray/coproc.txt

Jan Gray