I have the following VHDL function that multiplies a given m×n matrix a by an n×1 vector b:
    function matrix_multiply_by_vector(a: integer_matrix; b: integer_vector; m: integer; n: integer)
    return integer_vector is
        variable c : integer_vector(m-1 downto 0) := (others => 0);
    begin
        for i in 0 to m-1 loop
            for j in 0 to n-1 loop
                c(i) := c(i) + (a(i,j) * b(j));
            end loop;
        end loop;
        return c;
    end matrix_multiply_by_vector;
It works well, but what does this actually implement in hardware? Specifically, I want to know whether the synthesis tool is smart enough to realize that it can parallelize the inner for loop, essentially computing a dot product for each row of the matrix. If not, what is the simplest way (i.e. with nice syntax) to parallelize matrix-vector multiplication?
Comment: If it wasn't, you would have to have some kind of memory and serially load all of the values and "execute" them pipeline style. (Voltage Spike, Jun 1, 2018 at 17:41)
2 Answers
In 'hardware' (synthesized VHDL or Verilog), all loops are unrolled and executed in parallel.
Thus not only your inner loop but also your outer loop is unrolled.
That is also the reason why the loop size must be known at compile time. When the loop length is unknown the synthesis tool will complain.
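To make the unrolling concrete, here is a minimal sketch of what the loop version amounts to once both loops are expanded for m = 2 and n = 2. The package name `matvec_pkg`, the function name `matvec_2x2` and the declaration of `integer_matrix` are assumptions made for illustration (the question does not show how `integer_matrix` is declared), and `integer_vector` is taken to be the predefined VHDL-2008 type.

```vhdl
package matvec_pkg is
  -- Assumed declaration; the question does not show its own integer_matrix type.
  type integer_matrix is array (natural range <>, natural range <>) of integer;

  -- Hand-unrolled equivalent of the loop version for m = 2, n = 2.
  function matvec_2x2(a : integer_matrix(0 to 1, 0 to 1);
                      b : integer_vector(0 to 1)) return integer_vector;
end package matvec_pkg;

package body matvec_pkg is
  function matvec_2x2(a : integer_matrix(0 to 1, 0 to 1);
                      b : integer_vector(0 to 1)) return integer_vector is
    variable c : integer_vector(1 downto 0);
  begin
    -- Each product gets its own multiplier; each row gets its own adder.
    c(0) := a(0,0)*b(0) + a(0,1)*b(1);
    c(1) := a(1,0)*b(0) + a(1,1)*b(1);
    return c;
  end function matvec_2x2;
end package body matvec_pkg;
```

For this size that means four multipliers and two adders, all purely combinational. Within a row the additions still depend on each other, so for larger n they form a chain (or a tree, if the tool rebalances it) rather than finishing in a single adder delay.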
It is a well-known trap for beginners coming from a software language. They try to convert:
    int a,b,c;
    c = 0;
    while (a--)
        c += b;
to VHDL/Verilog hardware. The problem is that it all works fine in simulation, but the synthesis tool needs to generate adders: c = b+b+b+b...b;
For that the tool needs to know how many adders to make. If a is a constant, fine! (Even if it is 4,000,000. It will run out of gates, but it will try!)
But if a is a variable, the tool is lost.
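As a rough VHDL counterpart of the C snippet, the sketch below makes the loop count a generic, so it is fixed at elaboration time and the tool knows how many adders to build. The entity name `repeat_add`, the generic `A` and the 0 to 255 input range are all assumptions chosen for illustration.

```vhdl
entity repeat_add is
  generic (A : positive := 4);  -- loop count, fixed at elaboration (hypothetical name)
  port (b : in  integer range 0 to 255;
        c : out integer range 0 to 255 * A);
end entity repeat_add;

architecture rtl of repeat_add is
begin
  process (b)
    variable acc : integer range 0 to 255 * A;
  begin
    acc := 0;
    -- The bound is a generic, so the loop unrolls into A chained adders.
    for k in 1 to A loop
      acc := acc + b;
    end loop;
    c <= acc;
  end process;
end architecture rtl;
```

If `A` were an input port instead of a generic, the loop bound would no longer be static, which is the case synthesis tools typically reject.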
Comment: In this case it's just multiplication, so a could just be the multiplicand and therefore be variable... (Harry Svensson, Jun 1, 2018 at 16:54)
This code will parallelize both loops, since you haven't defined an event to control any subset of the processing. Loops just generate as much hardware as is needed to implement the function; to spread the work over time you need a PROCESS.
A process has a sensitivity list that tells VHDL (or the synthesizer) that the process is not invoked unless one of the signals in the list changes. This can be used to synthesize latches and registers, and so to expand beyond the realm of purely combinational implementation.
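As one possible illustration of using a process to control part of the processing, the sketch below keeps the row loop unrolled but handles one column per clock cycle, so it needs M multipliers instead of M×N. Everything here is an assumption for illustration rather than part of the original code: the `matvec_types` package, the entity name `matvec_serial`, the `start`/`done` handshake, and the requirement that `a` and `b` are held stable while the computation runs.

```vhdl
package matvec_types is
  -- Assumed declaration; the question does not show its own integer_matrix type.
  type integer_matrix is array (natural range <>, natural range <>) of integer;
end package matvec_types;

library ieee;
use ieee.std_logic_1164.all;
use work.matvec_types.all;

entity matvec_serial is
  generic (M : positive := 2; N : positive := 2);
  port (clk   : in  std_logic;
        start : in  std_logic;
        a     : in  integer_matrix(0 to M-1, 0 to N-1);
        b     : in  integer_vector(0 to N-1);
        c     : out integer_vector(0 to M-1);
        done  : out std_logic);
end entity matvec_serial;

architecture rtl of matvec_serial is
  signal j    : natural range 0 to N-1 := 0;   -- current column
  signal busy : std_logic := '0';
  signal acc  : integer_vector(0 to M-1) := (others => 0);
begin
  c <= acc;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if start = '1' and busy = '0' then
        acc  <= (others => 0);
        j    <= 0;
        busy <= '1';
      elsif busy = '1' then
        -- Rows are still unrolled (M multipliers); columns take one clock each.
        for i in 0 to M-1 loop
          acc(i) <= acc(i) + a(i, j) * b(j);
        end loop;
        if j = N-1 then
          busy <= '0';
          done <= '1';   -- c holds the result in the cycle where done is '1'
        else
          j <= j + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

The trade-off is area against latency: the result takes N clock cycles after `start` instead of appearing combinationally.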