Here is a program that I wrote to extract JPEGs from a file. It reads a file that contains the image data and separates it into individual images.
import hashlib
inputfile = 'data.txt'
marker = chr(0xFF)+chr(0xD8)
# Input data
imagedump = file(inputfile, "rb").read()
imagedump = imagedump.split(marker)
count=0
for photo in imagedump:
    name = hashlib.sha256(photo).hexdigest()[0:16]+".jpg"
    file(name, "wb").write(marker+photo)
    count=count+1
print count
The script names the identified images with their SHA256 digest, and all of the photos that it finds will be dumped into the current directory.
Here's how I test the script to see if it is working correctly:
- Change to the images directory
cd ~/images/
- Create the test folder
mkdir test
- Dump some JPEGs into a single file in that folder
cat *.jpg > ./test/data.txt
cd test
- Put the script into the current directory and run it
python extract.py
The JPEGs will be dumped into the current folder.
How can I improve my script's performance? Are there any problems with its operation? The script does not know when an image ends; just when a new one starts. Could this cause problems?
- Using only 64 bits of SHA256 is a waste. It's not cryptographically sound anyway, so you might as well use MD5 or SHA1 for slightly faster performance. (200_success, Oct 21, 2013 at 7:55)
- Can you explain more about the background to this problem? Why are the JPEG files simply concatenated together? Why are they not stored in a proper archive format like UStar or ZIP? (Gareth Rees, Oct 22, 2013 at 14:16)
1 Answer
There is an end-of-image (EOI) marker, FF D9, but you can't scan for it blindly, because those bytes can also appear within a JPEG image. For example, if the JPEG contains a thumbnail, then FF D9 could mark the end of the thumbnail rather than of the whole image. In fact, the FF D8 start-of-image marker can also appear within a JPEG image for the same reason. Therefore, your technique is invalid.
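To make the problem concrete, here is a small sketch (Python 3, not your Python 2) that builds synthetic byte strings laid out like JPEGs. The `fake_jpeg` helper is hypothetical and its output is not a decodable image; it only places a thumbnail inside an APP1 segment so that a second FF D8 appears in the middle of one file.

```python
# Synthetic data only: these are byte layouts, not decodable JPEGs.
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker

def fake_jpeg(thumbnail=b""):
    """Build SOI + one APP1 segment (optionally holding a thumbnail) + EOI."""
    payload = b"Exif\x00\x00" + thumbnail
    # The 2-byte big-endian length field counts its own two bytes.
    length = (len(payload) + 2).to_bytes(2, "big")
    return SOI + b"\xff\xe1" + length + payload + EOI

thumb = fake_jpeg()                      # a tiny "image" used as a thumbnail
dump = fake_jpeg(thumb) + fake_jpeg()    # two top-level images concatenated

# Same approach as the script: split on the start-of-image marker.
pieces = [p for p in dump.split(SOI) if p]  # drop the empty chunk before the first marker
print(len(pieces))  # 3 pieces, although only 2 images were concatenated
```

Note also that `split` returns an empty first element when the data begins with the marker, so without the filter the script would write one bogus file as well.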
To do a proper job, you must look for JPEG markers, most of which are followed by two bytes indicating the payload size, and advance the indicated number of bytes, until you hit the FF D9 marker. It might even be faster, since you can advance in chunks rather than scanning every byte sequentially.
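A minimal sketch of that marker walk, written for Python 3, might look like the following. The function names (`find_jpeg_end`, `skip_scan`, `split_jpegs`) are my own, and the code assumes well-formed input. One detail the description above glosses over: the compressed data that follows a start-of-scan (FF DA) header has no length field, so that stretch still has to be scanned for the next real marker, skipping stuffed FF 00 bytes and restart markers FF D0 to FF D7.

```python
import struct

SOI, EOI, SOS = 0xD8, 0xD9, 0xDA
# Markers that carry no length field: TEM, RST0..RST7, SOI, EOI.
STANDALONE = {0x01, SOI, EOI} | set(range(0xD0, 0xD8))

def skip_scan(data, pos):
    """Skip entropy-coded data; return the offset of the next real marker."""
    while True:
        pos = data.find(b"\xff", pos)
        if pos < 0 or pos + 1 >= len(data):
            return len(data)            # ran off the end of the data
        nxt = data[pos + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2                    # stuffed byte or restart marker
        elif nxt == 0xFF:
            pos += 1                    # fill byte
        else:
            return pos                  # a genuine marker ends the scan

def find_jpeg_end(data, start=0):
    """Return the offset just past the EOI of the JPEG beginning at start."""
    if data[start:start + 2] != b"\xff\xd8":
        raise ValueError("no SOI marker at offset %d" % start)
    pos = start + 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        code = data[pos + 1]
        if code == 0xFF:                # fill byte before a marker
            pos += 1
            continue
        pos += 2
        if code == EOI:
            return pos
        if code in STANDALONE:
            continue
        if pos + 2 > len(data):
            break
        # Big-endian segment length; it includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        pos += length                   # jump over the whole segment
        if code == SOS:
            pos = skip_scan(data, pos)
    raise ValueError("truncated JPEG: no EOI found")

def split_jpegs(data):
    """Split concatenated JPEGs into a list of byte strings."""
    images, pos = [], 0
    while True:
        pos = data.find(b"\xff\xd8", pos)
        if pos < 0:
            return images
        end = find_jpeg_end(data, pos)
        images.append(data[pos:end])
        pos = end
```

With something like this, the loop in your script could iterate over `split_jpegs(open('data.txt', 'rb').read())` and write each element unchanged, since every element already starts with FF D8 and ends with FF D9. An embedded thumbnail sits inside an APPn segment and is skipped in one jump by the length field, so it is never mistaken for a separate image.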