I have binary files containing sparse matrices. Their format is:

number of rows (int)
per row: length of the row (int)
per entry in that row: column index (int), value (float)
Reading each row with a single struct call, instead of looping over its entries with individual struct calls, gave me roughly a 2-fold speedup. I'm parsing matrices around 1 GB in size and would like to speed this process up even further.
from scipy.sparse import coo_matrix
import struct


def read_sparse_matrix(handle):
    cols = []
    rows = []
    weights = []
    numrows = struct.unpack('i', handle.read(4))[0]
    shape = numrows
    for rownum in range(numrows):
        rowlen = struct.unpack('i', handle.read(4))[0]
        # Unpack the whole row (alternating column index, value) in one call.
        row = list(struct.unpack("if" * rowlen, handle.read(8 * rowlen)))
        cols += row[::2]
        weights += row[1::2]
        rows += [rownum] * rowlen
    return coo_matrix((weights, (rows, cols)), shape=(shape, shape))
A file contains several of these matrices, as well as other information, so the size of the file is not informative about the structure of the matrix.
1 Answer
This question is about elapsed run times. Please include cProfile observations as part of the question.
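For instance, a minimal way to collect them (a sketch; it assumes read_sparse_matrix is importable in the current module and that matrix.bin is a placeholder for one of your files):

import cProfile
import pstats

# Profile a single parse; 'matrix.bin' stands in for one of your real files.
with open('matrix.bin', 'rb') as handle:
    cProfile.runctx('read_sparse_matrix(handle)', globals(), {'handle': handle},
                    'parse.prof')

# Show the ten most expensive calls by cumulative time.
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)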
The irregular on-disk structure of the data isn't doing you any favors, as it complicates any approach that wants to process bigger chunks at a time. Barring Cython or a Numba JIT, the current code looks like it's about as fast as it's going to get.
The slicing with a stride of 2, for cols and weights, is very nice.
Once we've written a giant file, it's unclear how many times it will be read. You might care to reformat the data to support multiple re-reads.
Consider changing the on-disk format using savez_compressed so that at read time you can take advantage of a rapid load. (The parquet compressed format is also fairly attractive.)
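A minimal sketch of that round trip, assuming the matrix is already in memory as a coo_matrix (the file path is a placeholder):

import numpy as np
from scipy.sparse import coo_matrix

def save_coo(path, m):
    # One compressed .npz archive holding the three parallel arrays plus the shape.
    np.savez_compressed(path, row=m.row, col=m.col, data=m.data,
                        shape=np.array(m.shape))

def load_coo(path):
    with np.load(path) as npz:
        return coo_matrix((npz['data'], (npz['row'], npz['col'])),
                          shape=tuple(npz['shape']))

scipy.sparse.save_npz and load_npz wrap essentially this pattern if you would rather not roll it yourself.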
Comment: Have you considered mmap(ping) the entire file into a buffer, and using struct.unpack_from to decode the data?
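Along the lines of that comment, here is a rough sketch (an assumption, not tested against your files) that memory-maps the file and decodes it with struct.unpack_from. It reads a single matrix from the start of the file, so with several matrices per file you would carry the offset between calls:

import mmap
import struct
from scipy.sparse import coo_matrix

def read_sparse_matrix_mmap(path):
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        cols, rows, weights = [], [], []
        offset = 0
        numrows = struct.unpack_from('i', buf, offset)[0]
        offset += 4
        for rownum in range(numrows):
            rowlen = struct.unpack_from('i', buf, offset)[0]
            offset += 4
            # Decode the whole row from the mapped buffer without copying it first.
            row = struct.unpack_from('if' * rowlen, buf, offset)
            offset += 8 * rowlen
            cols += row[::2]
            weights += row[1::2]
            rows += [rownum] * rowlen
        return coo_matrix((weights, (rows, cols)), shape=(numrows, numrows))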