I'm currently reading binary files that are 150,000 kB each. They contain roughly 3,000 structured binary messages and I'm trying to figure out the quickest way to process them. Out of each message, I only need to actually read about 30 lines of data. These messages have headers that allow me to jump to specific portions of the message and find the data I need.
I'm trying to figure out whether it's more efficient to unpack the entire message (50 kB each) and pull my data from the resulting tuple, which includes a lot of data I don't actually need, or to use seek to jump to each line of data I need in every message and unpack each of those 30 lines individually. Alternatively, is this something better suited to mmap?
-
What do you mean 30 "lines"? The data is binary, so lines don't make much sense. Can you put that in terms of a percentage of each message? Also, unless the percentage is near 100% or 0%, you'll probably have to profile to get a useful answer. – bnaecker, Mar 5, 2018 at 21:57
-
Sorry, you're right, that wasn't clear at all. Thirty 8-byte segments of binary. – AEvers, Mar 15, 2018 at 12:11
-
And how are they distributed throughout the message? Are they randomly placed, or all in one region, or something in between? – bnaecker, Mar 15, 2018 at 14:50
-
They follow a set structure. Messages are consistently sized within a file, but the size may vary from file to file. My plan had been to read the message headers to determine the size, build a format string to unpack the entire message, and then pull the data from the tuple. Alternatively, I can use the message headers to find out how many bytes I need to skip to reach the part of the message I want to read, and then unpack that single piece of binary data to retrieve the variables. – AEvers, Mar 16, 2018 at 12:51
-
I'm just not sure if skipping through the message to unpack 30 integers will be slower than a single unpack operation unpacking several hundred integers. – AEvers, Mar 16, 2018 at 12:53
1 Answer
Seeking, possibly several times, within just 50 kB is probably not worthwhile: system calls are expensive. Instead, read each message into a single bytes object with one read call, then use slicing to "seek" to the offsets you need and pull out the right number of bytes.
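As a rough illustration of the read-then-slice approach, here is a minimal sketch. The message size, the field offsets, the little-endian signed 8-byte layout, and the names `read_fields` and `make_message` are all assumptions made up for the demo; in practice the size and offsets would come from your message headers.

```python
import io
import struct

# Hypothetical layout, for illustration only: the real messages are ~50 kB
# and their sizes and offsets would be derived from the message headers.
MESSAGE_SIZE = 64
FIELD_OFFSETS = (8, 24, 40)  # byte offsets of the 8-byte fields we want


def read_fields(stream, message_size=MESSAGE_SIZE, offsets=FIELD_OFFSETS):
    """Yield one tuple of 8-byte little-endian integers per message."""
    while True:
        message = stream.read(message_size)  # a single read per message
        if len(message) < message_size:
            break  # end of stream (a trailing partial message is ignored)
        # Slicing the bytes object stands in for seeking within the message.
        yield tuple(
            struct.unpack("<q", message[off:off + 8])[0] for off in offsets
        )


def make_message(seed):
    """Build a fake 64-byte message holding eight consecutive integers."""
    return struct.pack("<8q", *(seed * 10 + i for i in range(8)))


data = b"".join(make_message(seed) for seed in range(3))
rows = list(read_fields(io.BytesIO(data)))
print(rows)
```

Each slice copies only 8 bytes, so the cost per message is one read plus a handful of small copies, regardless of how much of the message goes unused.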
It may be beneficial to wrap the bytes in a memoryview to avoid copying when slicing, but for small individual reads it probably doesn't matter much. If a memoryview works for you, definitely try mmap, which exposes a similar interface over the whole file. If you're using struct, its unpack_from takes an offset argument, so it can already "seek" within a bytes object or an mmap without wrapping or copying.
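A sketch of the mmap plus unpack_from variant might look like the following. As before, the fixed message size, the offsets, the `<q` field format, and the helper name `fields_from_file` are assumptions for the demo rather than anything taken from your file format; only `mmap.mmap`, `struct.Struct`, and `unpack_from` are the standard-library calls under discussion.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout, for illustration only.
MESSAGE_SIZE = 64
FIELD_OFFSETS = (0, 16, 56)   # byte offsets of the wanted fields per message
FIELD = struct.Struct("<q")   # one little-endian signed 8-byte integer


def fields_from_file(path, message_size=MESSAGE_SIZE, offsets=FIELD_OFFSETS):
    """Return one tuple of integers per complete message in the file."""
    rows = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Step over the mapped file one message at a time.
        for start in range(0, len(mm) - message_size + 1, message_size):
            # unpack_from reads at an absolute offset in the map, with no
            # intermediate slice or copy of the message.
            rows.append(
                tuple(FIELD.unpack_from(mm, start + off)[0] for off in offsets)
            )
    return rows


# Demo: write three fake messages to a temporary file, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "messages.bin")
    with open(path, "wb") as f:
        for seed in range(3):
            f.write(struct.pack("<8q", *(seed * 100 + i for i in range(8))))
    mapped_rows = fields_from_file(path)

print(mapped_rows)
```

Whether this beats reading whole messages depends on the access pattern and the operating system's page cache, so it is worth profiling both versions on real files.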