All routines except SYMV call the GEMV kernel in the same fashion.
None of these routines can reduce the $O(N^2)$ cost of loading $A$, but
they can reduce the memory access of both $X$ and $Y$ from $O(N^2)$ to
$O(N)$. In general, the $Y$ access is reduced by register blocking in
the GEMV kernel. Therefore, the higher level routines block $X$ such
that it is reused from L1 across kernel invocations (if you write an
axpy-based notranspose GEMV kernel, $Y$ is blocked instead of $X$).
What this amounts to is partitioning $X$ via:
\begin{displaymath}
   N_p = \frac{S_1 - R_y}{R_y+1},
\end{displaymath}
where
$S_1$ is the size, in elements,
of the Level 1 cache,
$N_p$ is the partition size used for $X$, and $R_y$ is the
Yunroll of your input file. The equation is actually a little
more complicated than this, as ATLAS may want to use less than the full
$S_1$ to avoid cache thrashing and throwing useful sections away between
kernel calls. However, this gives the user some idea of the importance of
these parameters. In particular, it shows that
Yunroll should not
be allowed to grow too large, for fear of causing the
$X$ loop to be
too short to support good optimization.
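As a purely illustrative example (these numbers are not ATLAS defaults),
take an L1 cache holding $S_1 = 8192$ double precision elements (64KB).
With $R_y = 4$ the partition is $N_p = \frac{8192-4}{5} \approx 1637$,
while with $R_y = 16$ it drops to $N_p = \frac{8192-16}{17} \approx 480$,
so an overly aggressive Yunroll noticeably shortens the $X$ loop each
kernel call can optimize over.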
Also, note that after the first invocation of the kernel, $X$ will come
from L1, leaving $A$ as the dominant data cost.
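To make the blocking concrete, here is a minimal sketch of the kind of
driver loop described above. It is not ATLAS source: the routine
gemvN_driver, the kernel gemvN_kernel, its calling sequence, and the row
blocking factor MB are all invented for illustration, and the real code
adjusts $N_p$ as discussed above.
\begin{verbatim}
/* Hypothetical generated NoTranspose GEMV kernel: Y += A*X (column-major A) */
void gemvN_kernel(int M, int N, const double *A, int lda,
                  const double *X, double *Y);

/* Sketch of a driver that partitions X so each partition stays in L1 */
void gemvN_driver(int M, int N, const double *A, int lda,
                  const double *X, double *Y, int S1, int Ry)
{
   int Np = (S1 - Ry) / (Ry + 1);      /* X partition that fits in L1 */
   int i, j, mb, nb, MB = 512;         /* MB: hypothetical row blocking */

   if (Np < 1) Np = 1;
   for (j=0; j < N; j += Np)           /* loop over partitions of X */
   {
      nb = (N-j < Np) ? N-j : Np;
      for (i=0; i < M; i += MB)        /* repeated kernel calls reuse X */
      {
         mb = (M-i < MB) ? M-i : MB;
         /* after the first call, X+j comes from L1; A streams from memory */
         gemvN_kernel(mb, nb, A+i+j*lda, lda, X+j, Y+i);
      }
   }
}
\end{verbatim}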
At present, SYMV uses a different scheme that blocks both $X$ and $Y$
(all other routines block only one dimension), so that $A$ is reused
between calls to the Transpose and NoTranspose kernels. This may
eventually change as greater sophistication is achieved (as you might
imagine, you get two very different GEMV kernels if one expects
$A$ to come from main memory and the other expects $A$ to come from L1, as
in this case; this means we may at some time generate a specialized
L1-contained GEMV kernel).
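The SYMV reuse pattern might be sketched roughly as follows (hypothetical
kernel names, a lower-stored column-major $A$, and omitted diagonal-block
handling; this is not the ATLAS source): each off-diagonal block of $A$
feeds both the NoTranspose and the Transpose kernel before the routine
moves on, so the second call finds the block already in L1.
\begin{verbatim}
/* Sketch of off-diagonal block processing for SYMV (hypothetical code) */
void symv_offdiag_sketch(int N, int NB, const double *A, int lda,
                         const double *X, double *Y)
{
   int i, j, mb, nb;
   for (j=0; j < N; j += NB)
   {
      nb = (N-j < NB) ? N-j : NB;
      /* diagonal block A(j:j+nb,j:j+nb) handled by separate code (not shown) */
      for (i=j+nb; i < N; i += NB)
      {
         mb = (N-i < NB) ? N-i : NB;
         /* NoTranspose: Y(i:i+mb) += A(i:i+mb,j:j+nb) * X(j:j+nb) */
         gemvN_kernel(mb, nb, A+i+j*lda, lda, X+j, Y+i);
         /* Transpose: same block of A, now expected to come from L1 */
         /* Y(j:j+nb) += A(i:i+mb,j:j+nb)^T * X(i:i+mb) */
         gemvT_kernel(mb, nb, A+i+j*lda, lda, X+i, Y+j);
      }
   }
}
\end{verbatim}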
Note that the $\beta = 0$ case must not read $Y$, since the memory may
legally be uninitialized.
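A sketch of what this implies for the $\beta$ handling (illustrative code
with an invented helper name, not ATLAS source):
\begin{verbatim}
/* Hypothetical helper: prepare Y for accumulation without reading it
   when beta == 0, since Y may legally be uninitialized in that case. */
void scale_y(int M, double beta, double *Y)
{
   int i;
   if (beta == 0.0)                  /* write only; never read Y */
      for (i=0; i < M; i++) Y[i] = 0.0;
   else if (beta != 1.0)             /* Y holds valid data: safe to scale */
      for (i=0; i < M; i++) Y[i] *= beta;
   /* the GEMV kernel then accumulates alpha*A*X into Y */
}
\end{verbatim}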
Clint Whaley
2012-07-10