
I am asking whether it is possible to considerably improve integer matrix multiplication with bitwise operations. The matrices are small, and the elements are small nonnegative integers (small means at most 20).

To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0<=x<15.

The following naive C++ implementation, executed a million times, runs in around 1s, measured with Linux time.

#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> distr(0, 15);

    int A[3][3];
    int B[3][3];
    int C[3][3];

    for (int trials = 0; trials <= 1000000; trials++) {
        // Set up A[] and B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}

Notes:

  1. The matrices are not necessarily sparse.
  2. Strassen-like comments do not help here.
  3. Let's try not to use the circumstantial observation that in this specific problem the matrices A[] and B[] can each be encoded as a single 64-bit integer. Think of what would happen for slightly larger matrices.
  4. Computation is single-threaded.

Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?

asked May 8, 2016 at 10:36
  • What proportion of the above is actually attributable to the multiply, as opposed to the randomisation? Commented May 8, 2016 at 10:41
  • Fair comment. Changing the number of trials gives you a fair answer. Commented May 8, 2016 at 10:41
    But back to the actual question. I think the first (and probably only major) improvement would be to structure this to ensure that the compiler can vectorise it (alternatively, do explicit vectorisation). Commented May 8, 2016 at 10:44
  • Look at using the hardware to do the work for you: MMX instructions, and be careful about caching. This can make more of a difference than the algorithm choice. Commented May 8, 2016 at 10:48
    How did you compile the code? The level of optimization employed by the compiler (and linker) can make a pretty damn big difference. Commented Apr 20, 2024 at 2:19

3 Answers


The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.

For adding 2-bit elements, it might be plausible (and faster than unpacking) to build the addition basically from scratch: XOR for a carryless add, then generate the carry with AND and a shift, and mask off any carry that would cross element boundaries.

A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulate even a 3-bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For addition, you might try using a normal add and then undoing the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
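
As a rough sketch of that last idea (shown here for 4-bit fields, matching the question's element size; the helper name is made up for illustration):

#include <cstdint>

// Add packed 4-bit fields without letting carries spill into neighbouring
// fields; each field wraps around modulo 16.
uint64_t add_packed_4bit(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8888888888888888ULL;   // high bit of every 4-bit field
    // Add the low 3 bits of each field with one ordinary 64-bit add (no field
    // can carry into its neighbour because the high bits are masked off),
    // then fold the high bits back in with a carryless XOR.
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}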


packed vs. unpacked-to-bytes storage formats

If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each element lands in a separate byte element of a vector.

Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16-bit or 32-bit elements if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4 bits per element for storage in an array.
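
For example, a plain scalar sketch of the packed 4-bit format (hypothetical helpers, assuming one possible layout: nine nibbles in the low 36 bits of a uint64_t, lowest nibble first):

#include <cstdint>

// Unpack nine 4-bit matrix elements into one byte per element.
void unpack_nibbles_3x3(uint64_t packed, uint8_t out[9]) {
    for (int i = 0; i < 9; ++i)
        out[i] = (packed >> (4 * i)) & 0xF;
}

// Pack them back for storage.
uint64_t pack_nibbles_3x3(const uint8_t in[9]) {
    uint64_t packed = 0;
    for (int i = 0; i < 9; ++i)
        packed |= uint64_t(in[i] & 0xF) << (4 * i);
    return packed;
}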


Compilers suck at this with uint8_t in scalar C code for x86. See comments on @Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage it leaves above the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.


You're probably going to get the best results from some kind of SIMD.

Intel SSSE3 has a vector byte multiply (pmaddubsw), but only with adding of adjacent pairs. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.
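
As an illustration of that pshufb trick (a hypothetical sketch, assuming the nine byte elements sit contiguously in the low bytes of a register, row-major):

#include <tmmintrin.h>   // SSSE3

// Spread a row-major 3x3 byte matrix into three 4-byte groups with a zero
// pad after each row; an index byte with its high bit set makes pshufb
// write a zero to that position.
__m128i pad_rows_3x3(__m128i packed9) {
    const __m128i idx = _mm_setr_epi8(0, 1, 2, (char)0x80,
                                      3, 4, 5, (char)0x80,
                                      6, 7, 8, (char)0x80,
                                      (char)0x80, (char)0x80, (char)0x80, (char)0x80);
    return _mm_shuffle_epi8(packed9, idx);
}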

More likely to be useful is SSE2 PMADDWD, if you unpack so that each matrix element sits in a separate 16-bit vector element. Then, given a row in one vector and a transposed column in another, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].

Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
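
A minimal, hypothetical sketch of that dot product for a single C[i][j], assuming the row and the transposed column have already been unpacked to 16-bit elements with a zero in the fourth lane:

#include <emmintrin.h>   // SSE2
#include <cstdint>

// 3-element dot product via PMADDWD: the multiply-add produces two partial
// sums, and one more scalar add finishes the job.
int dot3_pmaddwd(const int16_t row[4], const int16_t col[4]) {
    __m128i a = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row));
    __m128i b = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(col));
    __m128i sums = _mm_madd_epi16(a, b);   // {r0*c0 + r1*c1, r2*c2 + 0, 0, 0}
    return _mm_cvtsi128_si32(sums)
         + _mm_cvtsi128_si32(_mm_srli_si128(sums, 4));
}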

answered May 10, 2016 at 0:06



You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices:

#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

// Assumes the caller has zero-initialised C[][].
void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}

The Pentium can move and zero- or sign-extend an 8-bit value in one instruction (movzx / movsx).
This also means you're getting 4 times as many matrices per cache line (9 bytes per 3x3 matrix instead of 36).

UPDATE: curiosity piqued, I wrote a test:

#include <random>
#include <utility>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <typeinfo>
#include <vector>
#include <cstdint>
#include <cstddef>
#include <initializer_list>

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
    : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
    : matrix(std::make_index_sequence<size()>(), li)
    {}

    matrix()
    : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:

    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
    : _data { (void(Is), dist(eng))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
    : _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}

    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;

template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
    v.clear();
    v.reserve(test_size);
    std::generate_n(std::back_inserter(v), test_size,
                    [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
    // note: uniform_int_distribution<uint8_t> is not formally supported by the
    // standard, but common implementations accept it
    std::uniform_int_distribution<T> distr(0, 15);

    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto& m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;
}

int main() {
    // Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}

Example output (recent MacBook Pro, 64-bit, compiled with -O3; y, j, t, h are the implementation's mangled names for uint64_t, uint32_t, uint16_t and uint8_t respectively):

for type y time is 32787 μs
for type j time is 15323 μs
for type t time is 14347 μs
for type h time is 31550 μs

Summary:
on this platform, int32 and int16 proved to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).

Conclusion:
As ever, express intent to the compiler and let the optimiser do its thing. If the program is running too slowly in production, take measurements and optimise the worst offenders.

answered May 8, 2016 at 11:17

18 Comments

Given that these are small matrices (thus no individual matrix exceeds the cache, thus each element is loaded precisely once), this improvement is really just down to raw memory bandwidth, rather than matrices per cache line.
@OliverCharlesworth fair point. However, I think the wording works either way. Memory is burst-loaded in cache-line-sized blocks, and indeed is linearly prefetched (on most architectures). The performance increase is likely to be considerable. BTW I should have used an unsigned int8 in case of overflow; will update.
Well, this is interesting. I removed the random stuff and set A[] and B[] to some deterministic value (the same value for all trials), and this new type actually decreased performance: execution time rose from 0.150s to 0.175s.
@Matsmath it's going to depend on whether you're doing the operation a million times on the same matrix, or once for each of a million matrices. Because of optimisation and (more so) memory caching, performing the same operation on one matrix over and over will not give a useful benchmark.
True. Perhaps I should precompute the data set in advance.

Yes, you can, with a bit of knowledge of linear algebra and integer representation, and with the help of the hardware's 64-bit integer multiplier.

If you want some better background for this method, please refer to a previous answer of mine which solves the exact same problem only over boolean algebra, and serves as a good starting point for your question:

Fast multiplication of k x k boolean matrices, where 8 <= k <= 16

In your case, instead of representing one 8x8 matrix of Booleans per 64-bit word, you can pack at most a 2x2 matrix of small positive ints of up to 7 bits each, once you account for the zero padding necessary to avoid carry spills.

I won't go into further detail, as the theory necessary to understand the bit trickery is extensively covered in every major textbook on the topic.

The function looks as follows:

#include <cstdint>

uint64_t uint7matmul2x2x2(uint64_t A, uint64_t B) {
    const uint64_t ROW = 0x00000000007F007F;  // two 7-bit fields, 16 bits apart
    const uint64_t COL = 0x0000007F0000007F;  // two 7-bit fields, 32 bits apart
    uint64_t C = 0;
    for (int i = 0; i < 2; ++i) {
        uint64_t p = COL & (A >> i*16);  // pick two elements of A
        uint64_t r = ROW & (B >> i*32);  // pick two elements of B
        C += p * r;                      // one 64-bit multiply yields 4 element products
    }
    return C;
}

In this case we pack 2x2 = 4 element products and two element additions per 64-bit multiply, so the full 2x2x2 matrix-product sub-block takes just two 64-bit multiplies and one 64-bit addition.

The packing and unpacking parts of the algorithm are a little more involved than in the boolean case, but relatively straightforward to figure out and implement if you understand how the core of the algorithm works.
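
As a hypothetical illustration, one packing consistent with the masks above stores both operands row-major, one element per 16-bit field (lowest field first); the result C then comes back in the same layout with 16-bit sums:

#include <cstdint>

// Pack a 2x2 matrix of small ints (up to 7 bits each) row-major into a
// uint64_t, one element per 16-bit field.
uint64_t pack2x2(const unsigned m[2][2]) {
    return  (uint64_t)m[0][0]
         | ((uint64_t)m[0][1] << 16)
         | ((uint64_t)m[1][0] << 32)
         | ((uint64_t)m[1][1] << 48);
}

// Unpack a packed result back into a 2x2 matrix of 16-bit sums.
void unpack2x2(uint64_t c, unsigned m[2][2]) {
    m[0][0] = (unsigned)( c        & 0xFFFF);
    m[0][1] = (unsigned)((c >> 16) & 0xFFFF);
    m[1][0] = (unsigned)((c >> 32) & 0xFFFF);
    m[1][1] = (unsigned)((c >> 48) & 0xFFFF);
}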

Additionally, if your integers are, as you said, smaller than 32, you can accumulate multiple results of the partial matrix products by just adding up to 64 of them in the uint64 representation before having to unpack the partial results.
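
A hypothetical sketch of that accumulation, reusing uint7matmul2x2x2 from above (the caller must ensure no 16-bit field overflows):

#include <cstdint>
#include <cstddef>

uint64_t uint7matmul2x2x2(uint64_t A, uint64_t B);  // defined above

// Sum several packed 2x2 partial products before unpacking anything.
uint64_t accumulate_blocks(const uint64_t* Apacked, const uint64_t* Bpacked,
                           std::size_t nblocks) {
    uint64_t Cpacked = 0;
    for (std::size_t i = 0; i < nblocks; ++i)
        Cpacked += uint7matmul2x2x2(Apacked[i], Bpacked[i]);
    return Cpacked;   // unpack once, at the end
}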

It might be possible to pack 2x3 = 6 multiplications per product by extending the same basic analysis a bit further, and perhaps even 3x3 = 9 or more by use of group theory, but the transformations between representations would likely be far too costly to make the gains in the product stage worthwhile, as well as taking much more time and effort than I can spend at the moment.

answered Apr 7, 2024 at 2:32

2 Comments

With some work it should be possible to calculate ROW and COL at compile time given the matrix dimensions and the size in bits of each element, as well as the number of loop iterations you need (a compile-time sketch follows after these comments). But the hard part would then be packing/unpacking the values. Thanks to AI, nowadays you can get CPUs with dedicated matrix multiplication instructions that work with 8-bit integers. So that might be faster for matrices that no longer fit into 64 bits.
@G.Sliepen Indeed. The "best" algorithm of this form is highly dependent on the value constraints. Either way, it's possible to use a more sophisticated form of encoding akin to NTT and other fast convolution techniques to compact the representation further, thus squeezing more products into the same multiplication. The immediate drawback is that these kinds of encodings require extra multiplications and modulo divisions of their own, which may tank the overall performance. However, for large enough matrices, it may be worth paying the O(N^2) price if it reduces the O(N^3) prefactor.
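
Regarding the first comment's idea of computing ROW and COL at compile time, a hypothetical constexpr sketch (helper name made up; requires C++14):

#include <cstdint>

// Build a mask with an `ebits`-wide field repeated `count` times, one field
// every `stride` bits.
constexpr uint64_t field_mask(unsigned ebits, unsigned stride, unsigned count) {
    uint64_t m = 0;
    for (unsigned i = 0; i < count; ++i)
        m |= ((uint64_t(1) << ebits) - 1) << (i * stride);
    return m;
}

static_assert(field_mask(7, 16, 2) == 0x00000000007F007Full, "ROW");
static_assert(field_mask(7, 32, 2) == 0x0000007F0000007Full, "COL");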
