
I have binary files containing sparse matrices. Their format is:

number of rows        int
    length of a row   int
        column index  int
        value         float

Reading each row with a single struct call, instead of unpacking each entry with its own struct call, gave me roughly a 2-fold speedup. I'm parsing matrices around 1 GB in size and I would like to speed this process up even further.

from scipy.sparse import coo_matrix
import struct

def read_sparse_matrix(handle):
    cols = []
    rows = []
    weights = []
    # Header: total number of rows (the matrix is square).
    numrows = struct.unpack('i', handle.read(4))[0]
    shape = numrows
    for rownum in range(numrows):
        # Each row starts with its entry count, followed by
        # (column index, value) pairs.
        rowlen = struct.unpack('i', handle.read(4))[0]
        row = list(struct.unpack("if" * rowlen, handle.read(8 * rowlen)))
        cols += row[::2]
        weights += row[1::2]
        rows += [rownum] * rowlen
    return coo_matrix((weights, (rows, cols)), shape=(shape, shape))

A file contains multiple of these matrices, and other information, so the size of the file is not informative about the structure of the matrix.

asked Mar 8, 2019 at 9:15
Comments:

  • Can you change the format? If so (and you don't need it to be human readable), you should probably just serialize the data. That should be faster. Commented Mar 12, 2019 at 15:07
  • Have you tried reading (or mmapping) the entire file into a buffer, and using struct.unpack_from to decode the data? Commented Mar 12, 2019 at 22:58
  • Can you link a sample file? Commented Sep 16, 2019 at 18:27
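As a rough illustration of the mmapping suggestion in the second comment, here is a minimal sketch, assuming handle is an ordinary binary file object positioned at the start of a matrix. It maps the file once and decodes it with struct.unpack_from, avoiding one handle.read() call per row; the unpacking logic is otherwise the same as the original function.

import mmap
import struct

from scipy.sparse import coo_matrix

def read_sparse_matrix_buffered(handle):
    # Map the whole file read-only, then walk it with an explicit offset.
    buf = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    offset = handle.tell()
    (numrows,) = struct.unpack_from('i', buf, offset)
    offset += 4
    rows, cols, weights = [], [], []
    for rownum in range(numrows):
        (rowlen,) = struct.unpack_from('i', buf, offset)
        offset += 4
        row = struct.unpack_from('if' * rowlen, buf, offset)
        offset += 8 * rowlen
        cols += row[::2]
        weights += row[1::2]
        rows += [rownum] * rowlen
    handle.seek(offset)  # leave the handle positioned after this matrix
    return coo_matrix((weights, (rows, cols)), shape=(numrows, numrows))

Whether this beats the per-row read() approach would need to be measured on the actual files; it mainly trades system-call overhead for a single mapping.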

1 Answer


This question is about elapsed run times. Please include cProfile observations as part of the question.
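For concreteness, a minimal profiling sketch (the file name 'matrix.bin' is a placeholder) might look like this:

import cProfile
import pstats

profiler = cProfile.Profile()
with open('matrix.bin', 'rb') as handle:  # placeholder input path
    matrix = profiler.runcall(read_sparse_matrix, handle)
profiler.dump_stats('read_stats.prof')

# Show the ten most expensive calls by cumulative time.
pstats.Stats('read_stats.prof').sort_stats('cumulative').print_stats(10)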


The irregular on-disk structure of the data isn't doing you any favors, as it complicates any approach that wants to process bigger chunks at a time. Barring cython or numba JIT, the current code looks like it's about as fast as it's going to get.

The slicing with a stride of 2 for cols and weights is very nice.


Once we've written a giant file, it's unclear how many times it will be read. You might care to reformat the data to support multiple re-reads.

Consider changing the on-disk format using numpy's savez_compressed so that at read time you can take advantage of a rapid load. (The compressed Parquet format is also fairly attractive.)
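A minimal sketch of that one-time conversion, using scipy's save_npz/load_npz (which wrap numpy's compressed .npz format for sparse matrices); the file names are placeholders:

from scipy import sparse

# One-time conversion: parse the custom binary format once, then persist
# the matrix in a format scipy can load quickly on later runs.
with open('matrix.bin', 'rb') as handle:      # placeholder input path
    matrix = read_sparse_matrix(handle)

sparse.save_npz('matrix.npz', matrix, compressed=True)

# Subsequent reads skip the struct parsing entirely.
matrix = sparse.load_npz('matrix.npz')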

answered Jan 6, 2023 at 22:05
