I have binary files containing sparse matrices. Their format is:

number of rows (int)
per row: length of the row (int)
per entry in that row: column index (int), value (float)
Reading each row with a single struct call, instead of looping over its entries with individual struct calls, gave me roughly a 2-fold speedup. I'm parsing matrices around 1 GB in size and would like to speed this process up even further.
from scipy.sparse import coo_matrix
import struct


def read_sparse_matrix(handle):
    cols = []
    rows = []
    weights = []
    numrows = struct.unpack('i', handle.read(4))[0]
    shape = numrows
    for rownum in range(numrows):
        rowlen = struct.unpack('i', handle.read(4))[0]
        # Unpack the whole row (alternating column index, value) in one call.
        row = list(struct.unpack("if" * rowlen, handle.read(8 * rowlen)))
        cols += row[::2]
        weights += row[1::2]
        rows += [rownum] * rowlen
    return coo_matrix((weights, (rows, cols)), shape=(shape, shape))
A file contains several of these matrices, as well as other information, so the size of the file is not informative about the structure of the matrix.
1 Answer
This question is about elapsed run times. Please include cProfile observations as part of the question.
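For instance, a minimal way to collect them (a sketch; it assumes read_sparse_matrix is importable in the current module and that matrix.bin is a placeholder for one of your files):

import cProfile
import pstats

# Profile a single parse; 'matrix.bin' stands in for one of your real files.
with open('matrix.bin', 'rb') as handle:
    cProfile.runctx('read_sparse_matrix(handle)', globals(), {'handle': handle},
                    'parse.prof')

# Show the ten most expensive calls by cumulative time.
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)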
The irregular on-disk structure of the data isn't doing you any favors, as it complicates any approach that wants to process bigger chunks at a time. Barring Cython or a Numba JIT, the current code looks like it's about as fast as it's going to get.
The slicing with a stride of 2, for cols and weights, is very nice.
Once we've written a giant file, it's unclear how many times it will be read. You might care to reformat the data to support multiple re-reads.
Consider changing the on-disk format using savez_compressed so that at read time you can take advantage of a rapid load. (The parquet compressed format is also fairly attractive.)
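A minimal sketch of that round trip, assuming the matrix is already in memory as a coo_matrix (the file path is a placeholder):

import numpy as np
from scipy.sparse import coo_matrix

def save_coo(path, m):
    # One compressed .npz archive holding the three parallel arrays plus the shape.
    np.savez_compressed(path, row=m.row, col=m.col, data=m.data,
                        shape=np.array(m.shape))

def load_coo(path):
    with np.load(path) as npz:
        return coo_matrix((npz['data'], (npz['row'], npz['col'])),
                          shape=tuple(npz['shape']))

scipy.sparse.save_npz and load_npz wrap essentially this pattern if you would rather not roll it yourself.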
Comment: Have you considered mmap(ping) the entire file into a buffer, and using struct.unpack_from to decode the data?
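Along the lines of that comment, here is a rough sketch (an assumption, not tested against your files) that memory-maps the file and decodes it with struct.unpack_from. It reads a single matrix from the start of the file, so with several matrices per file you would carry the offset between calls:

import mmap
import struct
from scipy.sparse import coo_matrix

def read_sparse_matrix_mmap(path):
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        cols, rows, weights = [], [], []
        offset = 0
        numrows = struct.unpack_from('i', buf, offset)[0]
        offset += 4
        for rownum in range(numrows):
            rowlen = struct.unpack_from('i', buf, offset)[0]
            offset += 4
            # Decode the whole row from the mapped buffer without copying it first.
            row = struct.unpack_from('if' * rowlen, buf, offset)
            offset += 8 * rowlen
            cols += row[::2]
            weights += row[1::2]
            rows += [rownum] * rowlen
        return coo_matrix((weights, (rows, cols)), shape=(numrows, numrows))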