Multi-threaded integer matrix multiplication in NumPy/SciPy

Question 1

Doing something like

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

uses multiple cores, and it runs nicely.

The elements in a, though, are 64-bit floats (or 32-bit in 32-bit platforms?), and I'd like to multiply 8-bit integer arrays. Trying the following, though:

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running ~1000x slower on my PC.

array: np.random.randint(2, size=shape).astype(dtype)
dtype shape %time (average)
float32 (2000, 2000) 62.5 ms
float32 (3000, 3000) 219 ms
float32 (4000, 4000) 328 ms
float32 (10000, 10000) 4.09 s
int8 (2000, 2000) 13 seconds
int8 (3000, 3000) 3min 26s
int8 (4000, 4000) 12min 20s
int8 (10000, 10000) It didn't finish in 6 hours
float16 (2000, 2000) 2min 25s
float16 (3000, 3000) Not tested
float16 (4000, 4000) Not tested
float16 (10000, 10000) Not tested

I understand NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, ie.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

the computation is multi-threaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32's, but for non-floats it converts everything to float32 and outputs floats, which is something np.dot doesn't do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue).

So, if I want to do integer matrix multiplication, I have to do one of the following:

Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
Use SciPy's sgemm and use up 4x memory.
Use Numpy's np.float16 and only use up 2x memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, more so than int8.
Find an optimized library for multi-threaded integer matrix multiplication (actually, Mathematica does this, but I'd prefer a Python solution), ideally supporting 1-bit arrays, although 8-bit arrays is also fine... (I'm actually aiming to do multiplication of matrices over the finite field Z/2Z, and I know I can do this with Sage, which is quite Pythonic, but, again, is there something strictly Python?)

Can I follow option 4? Does such a library exist?

Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilly NumPy, with similar results.

Question 2

About your option n°4, maybe you could have a look on PyCuda or on Theano ? They allow large operations to be done on the GPU (with an easy interface with numpy) an quite good performances.

Question 3

As a possible answer to option 4, bitbucket.org/malb/m4ri looks interesting. "M4RI is a library for fast arithmetic with dense matrices over F2." I guess that's what Sage is already using, but I don't see any reason why you couldn't use it directly from Python, with a suitable Cython wrapper. (In fact, you might be able to find such a wrapper already in the Sage sources.)

Question 4

Nobody mentioned numpy.einsum yet, but that might be a good option 5.

Question 5

Note that you will need to cast the result to something bigger if you want to avoid integer overflow. If each element is either 0 or 1, you need an integer format that can hold values up to at least n in order to guarantee no overflow. For your example where n=10000, (u)int16 ought to be enough. Are your real matrices sparse, by any chance? If so, you would be much better off using scipy.sparse.csr_matrix.

Question 6

Could you give some more context for the overall problem you are trying to solve? Multiplying big integer matrices together is a rather unusual thing to do. It would be particularly useful to know more about the properties of these matrices. Are the values always either 0 or 1? If they can be larger then you may well find yourself ultimately constrained by the largest integer that can be represented using uint64. How are the matrices generated? Do they have any special structure (e.g. symmetry, blocks, bands etc.)?

Question 7

Note that while this answer gets old numpy might gain optimized integer support. Please verify if this answer still works faster on your setup.

Option 5 - Roll a custom solution: Partition the matrix product in a few sub-products and perform these in parallel. This can be relatively easy implemented with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time
def blockshaped(arr, nrows, ncols):
 """
 Return an array of shape (nrows, ncols, n, m) where
 n * nrows, m * ncols = arr.shape.
 This should be a view of the original array.
 """
 h, w = arr.shape
 n, m = h // nrows, w // ncols
 return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)
def do_dot(a, b, out):
 #np.dot(a, b, out) # does not work. maybe because out is not C-contiguous?
 out[:] = np.dot(a, b) # less efficient because the output is stored in a temporary array?
def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
 """
 Return the matrix product a * b.
 The product is split into nblocks * mblocks partitions that are performed
 in parallel threads.
 """
 n_jobs = nblocks * mblocks
 print('running {} jobs in parallel'.format(n_jobs))
 out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
 out_blocks = blockshaped(out, nblocks, mblocks)
 a_blocks = blockshaped(a, nblocks, 1)
 b_blocks = blockshaped(b, 1, mblocks)
 threads = []
 for i in range(nblocks):
 for j in range(mblocks):
 th = threading.Thread(target=dot_func, 
 args=(a_blocks[i, 0, :, :], 
 b_blocks[0, j, :, :], 
 out_blocks[i, j, :, :]))
 th.start()
 threads.append(th)
 for th in threads:
 th.join()
 return out
if __name__ == '__main__':
 a = np.ones((4, 3), dtype=int)
 b = np.arange(18, dtype=int).reshape(3, 6)
 assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))
 a = np.random.randn(1500, 1500).astype(int)
 start = time()
 pardot(a, a, 2, 4)
 time_par = time() - start
 print('pardot: {:.2f} seconds taken'.format(time_par))
 start = time()
 np.dot(a, a)
 time_dot = time() - start
 print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately x4, which is the physical number of cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken

Question 8

It works! This is the O(n**3) matrix product, which does exactly n**2 dot products, correct?

Question 9

It splits the Matrix product into a number of smaller Matrix products. In the extreme case this can be vector dot products.

Question 10

pardot is slower than np.dot when the types are float: running 4 jobs in parallel running 8 jobs in parallel pardot: 0.13 seconds taken np.dot: 0.07 seconds taken

Question 11

even worse when the dataset is 10x the size: pardot: 1212.89 seconds taken np.dot: 73.11 seconds taken

Question 12

@kory This is to be expected. Please use np.dot for floating point multiplication.

Question 13

"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: First, the CPUs have high-throughput floating point pipelines. Second, BLAS has no integer-type.

Workaround: Converting the matrices to float32 values gets big speedups. How's 90x speedup on a 2015 MacBook Pro? (Using float64 is half as good.)

import numpy as np
import time
def timeit(callable):
 start = time.time()
 callable()
 end = time.time()
 return end - start
a = np.random.random_integers(0, 9, size=(1000, 1000)).astype(np.int8)
timeit(lambda: a.dot(a)) # ≈0.9 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8) ) # ≈0.01 sec

MB-F 23.8k5 gold badges71 silver badges127 bronze badges · Accepted Answer · 2016-02-17 12:31:51Z

Note that while this answer gets old numpy might gain optimized integer support. Please verify if this answer still works faster on your setup.

Option 5 - Roll a custom solution: Partition the matrix product in a few sub-products and perform these in parallel. This can be relatively easy implemented with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time
def blockshaped(arr, nrows, ncols):
 """
 Return an array of shape (nrows, ncols, n, m) where
 n * nrows, m * ncols = arr.shape.
 This should be a view of the original array.
 """
 h, w = arr.shape
 n, m = h // nrows, w // ncols
 return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)
def do_dot(a, b, out):
 #np.dot(a, b, out) # does not work. maybe because out is not C-contiguous?
 out[:] = np.dot(a, b) # less efficient because the output is stored in a temporary array?
def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
 """
 Return the matrix product a * b.
 The product is split into nblocks * mblocks partitions that are performed
 in parallel threads.
 """
 n_jobs = nblocks * mblocks
 print('running {} jobs in parallel'.format(n_jobs))
 out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
 out_blocks = blockshaped(out, nblocks, mblocks)
 a_blocks = blockshaped(a, nblocks, 1)
 b_blocks = blockshaped(b, 1, mblocks)
 threads = []
 for i in range(nblocks):
 for j in range(mblocks):
 th = threading.Thread(target=dot_func, 
 args=(a_blocks[i, 0, :, :], 
 b_blocks[0, j, :, :], 
 out_blocks[i, j, :, :]))
 th.start()
 threads.append(th)
 for th in threads:
 th.join()
 return out
if __name__ == '__main__':
 a = np.ones((4, 3), dtype=int)
 b = np.arange(18, dtype=int).reshape(3, 6)
 assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))
 a = np.random.randn(1500, 1500).astype(int)
 start = time()
 pardot(a, a, 2, 4)
 time_par = time() - start
 print('pardot: {:.2f} seconds taken'.format(time_par))
 start = time()
 np.dot(a, a)
 time_dot = time() - start
 print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately x4, which is the physical number of cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken

It works! This is the O(n**3) matrix product, which does exactly n**2 dot products, correct?
It splits the Matrix product into a number of smaller Matrix products. In the extreme case this can be vector dot products.
pardot is slower than np.dot when the types are float: running 4 jobs in parallel running 8 jobs in parallel pardot: 0.13 seconds taken np.dot: 0.07 seconds taken
even worse when the dataset is 10x the size: pardot: 1212.89 seconds taken np.dot: 73.11 seconds taken
@kory This is to be expected. Please use np.dot for floating point multiplication.

CollectivesTM on Stack Overflow

Multi-threaded integer matrix multiplication in NumPy/SciPy

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related