Efficiently processing large binary files in Python

Question 1

I'm currently reading binary files that are 150,000 kB each. They contain roughly 3,000 structured binary messages, and I'm trying to figure out the quickest way to process them. Out of each message, I only need to read about 30 lines of data. The messages have headers that let me jump to specific portions of the message and find the data I need.

Is it more efficient to unpack the entire message (50 kB each) and pull my data from the resulting tuple, which includes a lot of data I don't need, or to use seek to go to each line of data I need in every message and unpack each of those 30 lines individually? Alternatively, is this something better suited to mmap?

Question 2

What do you mean by 30 "lines"? The data is binary, so lines don't make much sense. Can you put that in terms of a percentage of each message? Also, unless the percentage is near 100% or 0%, you'll probably have to profile to get a useful answer.

Question 3

Sorry, you're right, that wasn't clear at all. Thirty 8-byte segments of binary data.

Question 4

And how are they distributed throughout the message? Are they randomly placed, or all in one region, or something in between?

Question 5

They follow a set structure. The messages are consistently sized within a file, but the size may vary from file to file. My plan had been to read the message headers to determine the size, build a format string to unpack the entire message, and then pull the data from the tuple. Alternatively, I can use the message headers to find out how many bytes I need to skip to reach the part of the message I want to read, and then unpack that single piece of binary data to retrieve the variables.

Question 6

I'm just not sure if skipping through the message to unpack 30 integers will be slower than a single unpack operation unpacking several hundred integers.

Answer

Seeking, possibly several times, within just 50 kB is probably not worthwhile: system calls are expensive. Instead, read each message into a single bytes object and use slicing to "seek" to the offsets you need and get the right amount of data.
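As a rough sketch of the read-then-slice approach, assuming fixed-size messages within one file and 8-byte little-endian signed integers: MESSAGE_SIZE, FIELD_OFFSETS, and read_fields are hypothetical names chosen for illustration, and in the real files the size and offsets would come from the message headers.

```python
import io
import struct

# Hypothetical layout for illustration only; real values would come from
# the message headers described in the question.
MESSAGE_SIZE = 64             # the real messages are about 50 kB
FIELD_OFFSETS = (8, 24, 40)   # byte offsets of the wanted 8-byte segments
FIELD = struct.Struct("<q")   # assumed: 8-byte little-endian signed integer


def read_fields(stream, message_size=MESSAGE_SIZE, offsets=FIELD_OFFSETS):
    """Return one tuple of field values per complete message in the stream."""
    rows = []
    while True:
        message = stream.read(message_size)   # one read call per message
        if len(message) < message_size:       # EOF or truncated last message
            break
        # Slicing the bytes object stands in for seeking within the message.
        rows.append(tuple(
            FIELD.unpack(message[off:off + FIELD.size])[0] for off in offsets
        ))
    return rows


# Tiny in-memory demo: two zero-filled messages.
demo = io.BytesIO(bytes(2 * MESSAGE_SIZE))
print(read_fields(demo))  # [(0, 0, 0), (0, 0, 0)]
```

With a real file, the same function would be called as read_fields(f) inside a with open(path, "rb") as f: block. Each slice copies only 8 bytes, so the copying cost is likely small next to the cost of the read itself.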

It may be beneficial to wrap the bytes in a memoryview to avoid copying, but for small individual reads it probably doesn’t matter much. If you can use a memoryview, definitely try using mmap, which exposes a similar interface over the whole file. If you’re using struct, its unpack_from can already seek within a bytes or an mmap without wrapping or copying.

Davis Herring · Accepted Answer · 2018-10-11 00:32:03Z
