Searching/reading binary data in Python

Question 1

I'm reading in a binary file (a jpg in this case) and need to find some values in it. Specifically, I'm attempting to pick out the image dimensions by looking for the binary structure as detailed here.

I need to find FFC0 in the binary data, skip ahead some number of bytes, and then read 4 bytes (this should give me the image dimensions).

What's a good way of searching for the value in the binary data? Is there an equivalent of 'find', or something like re?

Question 2

Have you ever looked into imagick? IIRC there is also a Python library for it.

Question 3

I have, and it works great, but it's quite heavy for just finding the dimensions of the file.

Question 4

You should use a module appropriate for something like this: snippets.dzone.com/posts/show/1021

Question 5

You could actually load the file into a string and search that string for the byte sequence 0xffc0 using the str.find() method. It works for any byte sequence.

The code to do this depends on a couple things. If you open the file in binary mode and you're using Python 3 (both of which are probably best practice for this scenario), you'll need to search for a byte string (as opposed to a character string), which means you have to prefix the string with b.

with open(filename, 'rb') as f:
    s = f.read()
s.find(b'\xff\xc0')

If you open the file in text mode in Python 3, you'd have to search for a character string:

with open(filename, 'r') as f:
    s = f.read()
s.find('\xff\xc0')

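though there's no particular reason to do this. It doesn't get you any advantage over the previous way, and it is likely to cause problems: text mode in Python 3 decodes the bytes using a text encoding, which can fail on or alter arbitrary binary data, and on a platform that treats binary files and text files differently (e.g. Windows), newline translation can change the bytes as well.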

Python 2 doesn't make the distinction between byte strings and character strings, so if you're using that version, it doesn't matter whether you include or exclude the b in b'\xff\xc0'. And if your platform treats binary files and text files identically (e.g. Mac or Linux), it doesn't matter whether you use 'r' or 'rb' as the file mode either. But I'd still recommend using something like the first code sample above just for forward compatibility - in case you ever do switch to Python 3, it's one less thing to fix.
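
To connect this back to the question (find FFC0, skip ahead, read 4 bytes), here is a minimal sketch that assumes a baseline JPEG whose first FFC0 is the real frame header. The function name jpeg_size_from_sof0 and the synthetic sample bytes are made up for illustration. A plain byte search can also hit FFC0 inside other data (an embedded thumbnail, for instance), so treat the result as a best guess.

```python
import struct

def jpeg_size_from_sof0(data):
    """Return (width, height) read after the first FFC0 in data, or None."""
    pos = data.find(b'\xff\xc0')
    if pos == -1 or pos + 9 > len(data):
        return None
    # Layout after the 2-byte marker: 2-byte segment length, 1-byte sample
    # precision, then height and width as 16-bit big-endian integers.
    height, width = struct.unpack('>HH', data[pos + 5:pos + 9])
    return width, height

# Synthetic bytes standing in for the start of a real file (hypothetical data):
# SOI, SOF0 marker, length, precision, height=480, width=640, component count.
sample = (b'\xff\xd8' + b'\xff\xc0' + b'\x00\x11' + b'\x08'
          + b'\x01\xe0' + b'\x02\x80' + b'\x03')
print(jpeg_size_from_sof0(sample))
```

With a real file you would pass in the bytes returned by f.read() from a file opened in 'rb' mode.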

Question 6

If it's a really big file, it's not such a good idea to read it into a string all at once.

Question 7

I doubt it's so big it's going to be a problem.

Question 8

Since I'm only looking for the first frame I'll likely be able to read some small part of the file and process that instead of reading the whole file.

Question 9

@icktoofay: good point, but I would point out that you can do exactly what Parand is saying, just read the first N bytes and search those. If you did have to search all of a large file for a byte sequence, it could be done iteratively so you wouldn't have to keep the whole thing in memory at once, but the code would be a little more involved, and I didn't think it'd be necessary to get into that here.
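
For reference, the iterative approach mentioned here might look roughly like the sketch below. The name find_in_stream and the default chunk size are assumptions for illustration, not anything from the thread. The idea is to carry the last len(needle) - 1 bytes of each chunk over to the next read, so a match that straddles a chunk boundary is still found.

```python
import io

def find_in_stream(f, needle, chunk_size=64 * 1024):
    """Return the offset of the first needle in binary stream f, or -1.

    Offsets are relative to the position where reading began.
    """
    overlap = len(needle) - 1
    tail = b''
    offset = 0  # stream offset of the first byte of `tail`
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        idx = window.find(needle)
        if idx != -1:
            return offset + idx
        # Keep only the bytes that could start a match spanning two chunks.
        keep = window[max(0, len(window) - overlap):] if overlap else b''
        offset += len(window) - len(keep)
        tail = keep

# Small in-memory demo; a tiny chunk size forces the match across a boundary.
demo = io.BytesIO(b'a' * 10 + b'\xff\xc0' + b'b' * 10)
print(find_in_stream(demo, b'\xff\xc0', chunk_size=11))
```

Only one chunk plus a few carried-over bytes are held in memory at any time, whatever the file size.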

Question 10

@JannePaalijarvi Yes, if you have a different problem than the OP, then the solution which works for the OP may not work for you. My comment is relevant to the problem as described, not yours.

Question 11

Instead of reading the entire file into memory, searching it and then writing a new file out to disk you can use the mmap module for this. mmap will not store the entire file in memory and it allows for in-place modification.

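#!/usr/bin/python
import mmap

# 'r+b' opens the existing file for reading and writing in binary mode
with open("hugefile", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
    print(mm.find(b'\x00\x09\x03\x03'))  # offset of the first match, or -1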
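
As a self-contained Python 3 sketch of the search plus in-place modification described above: it writes a small throwaway file first so it can run anywhere, and the file contents and replacement bytes are invented for the demo. Note that slice assignment on an mmap object requires the replacement to have the same length as the slice.

```python
import mmap
import os
import tempfile

# Create a small throwaway file so the example does not need "hugefile".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'header' + b'\x00\x09\x03\x03' + b'trailer')
    path = tmp.name

with open(path, 'r+b') as f:
    # Length 0 maps the whole file; the context manager closes the map.
    with mmap.mmap(f.fileno(), 0) as mm:
        pos = mm.find(b'\x00\x09\x03\x03')
        if pos != -1:
            # Overwrite the 4 matched bytes in place (same length required).
            mm[pos:pos + 4] = b'\xde\xad\xbe\xef'
            mm.flush()

with open(path, 'rb') as f:
    patched = f.read()
os.remove(path)
print(pos, patched)
```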

Question 12

The bitstring module was designed for pretty much this purpose. For your case the following code (which I haven't tested) should help illustrate:

from bitstring import ConstBitStream
# Can initialise from files, bytes, etc.
s = ConstBitStream(filename='your_file')
# Search to Start of Frame 0 code on byte boundary
found = s.find('0xffc0', bytealigned=True)
if found:
    # find() returns the bit position in a tuple, so divide by 8 for bytes
    print("Found start code at byte offset %d." % (found[0] // 8))
    s0f0, length, bitdepth, height, width = s.readlist(
        'hex:16, uint:16, uint:8, 2*uint:16')
    print("Width %d, Height %d" % (width, height))

Question 13

So Bits.find returns just a boolean and sets the Bits.bytepos attribute? Perhaps in the module documentation you should warn that bitstring is not thread-safe (not that it matters in this answer, of course).

Question 14

@ΤΖΩΤΖΙΟΥ: Yes you have a good point. I don't find it surprising that mutating methods or reading methods aren't thread safe, but using 'find' on a bit-wise immutable object could reasonably be expected to be. To be honest it's never cropped up before but it is something to think about...

Question 15

Just an idea: find could return an object with all necessary information, à la re.match and re.search. You could have this "BitMatch" class be a subclass of bool, for backwards compatibility.

Question 16

@ΤΖΩΤΖΙΟΥ: Thanks, that's a reasonable idea although I'm in a good position to break backward compatibility slightly and maybe just have it return the bit position as a single item tuple if found or an empty tuple if not found. I guess anything's better than returning -1 if not found :)

Question 17

In Python 3.x you can search a byte string by another byte string like this:

>>> byte_array = b'this is a byte array\r\n\r\nXYZ\x80\x04\x95 \x00\x00\x00\x00\x00'
>>> byte_array.find('\r\n\r\n'.encode())
20
>>>

Question 18

The re module does work with both string and binary data (str in Python 2 and bytes in Python 3), so you can use it as well as str.find for your task.
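
A minimal sketch of the re approach on bytes, assuming Python 3 and a baseline JPEG frame header. The pattern name SOF0 and the sample bytes are invented for the example. The pattern must itself be a bytes object, and re.DOTALL is needed so that '.' also matches the newline byte 0x0A.

```python
import re

# FFC0 marker, 2 length bytes and 1 precision byte skipped,
# then height and width captured as 2 bytes each.
SOF0 = re.compile(rb'\xff\xc0...(..)(..)', re.DOTALL)

# Synthetic bytes standing in for file contents (hypothetical data).
data = b'\xff\xd8\xff\xc0\x00\x11\x08\x01\xe0\x02\x80\x03'

m = SOF0.search(data)
if m:
    height = int.from_bytes(m.group(1), 'big')
    width = int.from_bytes(m.group(2), 'big')
    print(m.start(), width, height)
```

Unlike find(), the match object gives both the offset (m.start()) and the captured fields in one step.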

Question 19

Use the find() method only if you need to know the position of the subsequence. If you only need to know whether it is present, you can use the in operator, for example:

with open("foo.bin", 'rb') as f:
    if b'\x00' in f.read():
        print('The file is binary!')
    else:
        print('The file is not binary!')

Question 20

This did it for me - I was trying to compare a string to a byte string. All I had to do was put the b in front of my search term and it was found within the byte string.
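
A short Python 3 illustration of that mismatch, using made-up sample bytes. It also shows a related pitfall: encoding a str needle with the default UTF-8 codec does not give the raw bytes for code points above 0x7F, whereas 'latin-1' maps each code point below 256 to the same byte value.

```python
data = b'\x00\x01\xff\xc0\x02'

# Searching bytes with a str needle raises TypeError in Python 3.
try:
    data.find('\xff\xc0')
except TypeError as exc:
    print('TypeError:', exc)

# A bytes needle works and returns the offset of the match.
offset = data.find(b'\xff\xc0')
print(offset)

# If the needle starts out as a str, the codec matters for non-ASCII values.
utf8_needle = '\xff\xc0'.encode()            # default UTF-8, 4 bytes
latin1_needle = '\xff\xc0'.encode('latin-1')  # the 2 raw bytes
print(utf8_needle, latin1_needle)
```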

Question 21

Well, obviously there is PIL. The Image module has size as an attribute. If you want to get the size exactly how you suggest and without loading the file, you are going to have to go through it line by line. Not the nicest way to do it, but it would work.
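
Going through the file by hand could look something like this sketch, which walks the JPEG segments one marker at a time instead of searching for raw bytes. The names jpeg_size and STANDALONE are made up for the example, and it only handles the baseline SOF0 (FFC0) header; progressive files use FFC2 and would need an extra case. Walking segments avoids false matches on FFC0 bytes that happen to sit inside another segment's payload.

```python
import struct

# Markers with no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01} | set(range(0xD0, 0xDA))

def jpeg_size(data):
    """Return (width, height) from the first SOF0 segment, or None."""
    if data[:2] != b'\xff\xd8':          # must start with SOI
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None                   # not positioned on a marker
        marker = data[pos + 1]
        if marker == 0xFF:                # fill byte before a marker
            pos += 1
            continue
        if marker in STANDALONE:
            pos += 2
            continue
        (length,) = struct.unpack('>H', data[pos + 2:pos + 4])
        if marker == 0xC0:                # SOF0: precision, height, width
            if pos + 9 > len(data):
                return None
            height, width = struct.unpack('>HH', data[pos + 5:pos + 9])
            return width, height
        if marker == 0xDA:                # SOS: compressed data follows
            return None
        pos += 2 + length                 # length includes its own 2 bytes
    return None

# Hypothetical data: an APP0 segment whose payload contains FFC0 bytes,
# followed by the real SOF0 segment (height 480, width 640).
app0 = b'\xff\xe0' + struct.pack('>H', 11) + b'\xff\xc0' + b'\x00' * 7
sof0 = b'\xff\xc0\x00\x11\x08' + struct.pack('>HH', 480, 640) + b'\x03'
demo = b'\xff\xd8' + app0 + sof0
print(jpeg_size(demo))
```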

Question 22

For Python >=3.2:

import re

with open("filename.jpg", "rb") as f:
    data = f.read()

# SOI, then the first FFC0 (non-greedy), 3 skipped bytes, height, width, then EOI
match = re.match(b'\xff\xd8.*?\xff\xc0...(..)(..).*\xff\xd9', data, re.DOTALL)
if match:
    # https://stackoverflow.com/q/444591
    print(int.from_bytes(match.group(1), 'big'))  # height
    print(int.from_bytes(match.group(2), 'big'))  # width

David Z · Accepted Answer · 2010-07-10 00:48:58Z

