JPEG extraction script

Question 1

Here is a program that I've written to extract JPEGs from a file. It reads a file that contains the image data and separates it into individual images.

import hashlib

inputfile = 'data.txt'
marker = chr(0xFF) + chr(0xD8)

# Input data
imagedump = file(inputfile, "rb").read()
imagedump = imagedump.split(marker)

count = 0
for photo in imagedump:
    name = hashlib.sha256(photo).hexdigest()[0:16] + ".jpg"
    file(name, "wb").write(marker + photo)
    count = count + 1
    print count

The script names the identified images with their SHA256 digest, and all of the photos that it finds will be dumped into the current directory.
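The listing above is Python 2 (it relies on file() and the print statement). For readers on Python 3, the same split-on-marker approach could be sketched roughly as follows. The names split_on_soi and dump_images are hypothetical, and the sketch differs from the original in two labeled ways: it skips whatever precedes the first marker instead of writing it out as an image, and it hashes the bytes that are actually written, marker included.

```python
import hashlib
import os

SOI = b"\xff\xd8"  # JPEG start-of-image marker


def split_on_soi(data):
    """Split a byte string at every SOI marker, keeping the marker on each piece."""
    pieces = data.split(SOI)
    # pieces[0] is whatever precedes the first marker (often empty),
    # so it is skipped here; the original script writes it out as well.
    return [SOI + piece for piece in pieces[1:]]


def dump_images(data, directory="."):
    """Write each piece to the directory, named by a truncated SHA-256 digest."""
    names = []
    for image in split_on_soi(data):
        # Assumption: hash the bytes as written (marker included).
        name = hashlib.sha256(image).hexdigest()[:16] + ".jpg"
        with open(os.path.join(directory, name), "wb") as out:
            out.write(image)
        names.append(name)
    return names
```

Calling dump_images(open("data.txt", "rb").read()) would mirror the original run. It still splits on every FF D8 pair, so it inherits the same limitation of not knowing where an image ends.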

Here's how I test the script to see if it is working correctly:

Change to the image directory: cd ~/images/
Create a test folder: mkdir test
Dump some JPEGs into a single file in that folder: cat *.jpg > ./test/data.txt
cd test, and put the script into that directory
Run the script with python extract.py; the JPEGs will be dumped into the current folder

How can I improve my script's performance? Are there any problems with its operation? The script does not know when an image ends; just when a new one starts. Could this cause problems?

Question 2

Using only 64 bits of SHA256 is a waste. It's not cryptographically sound anyway, so you might as well use MD5 or SHA1 for slightly faster performance.
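As a rough illustration of that suggestion, the naming step could take the hash algorithm as a parameter. The function image_name and its parameters are hypothetical, and no timing has been measured here, so whether the speed difference matters in practice is an open assumption.

```python
import hashlib


def image_name(data, algorithm="md5", hex_digits=16):
    """Return a file name built from the first hex digits of the chosen digest."""
    # hashlib.new accepts the algorithm name and the initial data.
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest[:hex_digits] + ".jpg"
```

Passing hex_digits=32 with MD5 would keep the whole digest rather than truncating it.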

Question 3

Can you explain more about the background to this problem? Why are the JPEG files simply concatenated together? Why are they not stored in a proper archive format like UStar or ZIP?

Answer 1 · 200_success · 2013-10-21 03:20:28Z

There is an end-of-image (EOI) marker, FF D9, but you can't scan for it blindly, because those bytes can also appear within a JPEG image. For example, if the JPEG contains a thumbnail, then FF D9 could mark the end of the thumbnail rather than of the whole image. In fact, the FF D8 start-of-image (SOI) marker can also appear within a JPEG image for the same reason. Therefore, your technique is invalid.

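To do a proper job, you must walk the JPEG markers. Most of them are followed by a two-byte big-endian length (which counts the two length bytes themselves), so you can advance by the indicated number of bytes and repeat until you hit the FF D9 marker. The compressed data that follows a start-of-scan (FF DA) segment is the exception: it carries no length, so that stretch still has to be scanned for the next marker. It might even be faster, since you can advance in chunks rather than scanning every byte sequentially.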
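A minimal Python 3 sketch of that marker walk might look like the following. The function names (find_jpeg_end, skip_entropy_data, extract_jpegs) are made up for illustration, and the sketch assumes the usual JPEG segment layout: standalone markers (TEM and RST0 to RST7) carry no length, every other segment starts with a two-byte big-endian length, and inside compressed scan data a literal FF is stuffed as FF 00. It has not been tried against real-world files, so treat it as a starting point rather than a finished parser.

```python
import struct

SOI = b"\xff\xd8"  # start of image
SOI_CODE, EOI_CODE, SOS_CODE = 0xD8, 0xD9, 0xDA
# Markers with no length field: TEM (0x01) and RST0..RST7 (0xD0..0xD7).
STANDALONE = {0x01} | set(range(0xD0, 0xD8))


def skip_entropy_data(data, pos):
    """Return the index of the next real marker after scan data, or None."""
    while True:
        pos = data.find(b"\xff", pos)
        if pos == -1 or pos + 1 >= len(data):
            return None
        nxt = data[pos + 1]
        # FF 00 is a stuffed data byte; RSTn markers sit inside the scan.
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2
            continue
        return pos


def find_jpeg_end(data, start=0):
    """Return the index just past the EOI of the JPEG at start, or None."""
    if data[start:start + 2] != SOI:
        return None
    pos, n = start + 2, len(data)
    while pos + 1 < n:
        if data[pos] != 0xFF:
            return None  # a marker was expected here
        code = data[pos + 1]
        if code == 0xFF:
            pos += 1  # fill byte before the marker
            continue
        pos += 2
        if code == EOI_CODE:
            return pos
        if code == SOI_CODE:
            return None  # a second SOI at the top level is unexpected
        if code in STANDALONE:
            continue
        if pos + 2 > n:
            return None
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 2:
            return None
        pos += length  # skips the payload, embedded thumbnails included
        if code == SOS_CODE:
            pos = skip_entropy_data(data, pos)
            if pos is None:
                return None
    return None  # ran out of data before EOI


def extract_jpegs(data):
    """Return every complete JPEG found in a concatenated dump."""
    images = []
    pos = data.find(SOI)
    while pos != -1:
        end = find_jpeg_end(data, pos)
        if end is None:
            pos = data.find(SOI, pos + 2)  # not a complete image; keep looking
        else:
            images.append(data[pos:end])
            pos = data.find(SOI, end)
    return images
```

Because an embedded thumbnail lives inside an application segment that is skipped by its length, its own FF D8 and FF D9 never reach the top-level loop. The compressed scan data is still searched byte by byte, though bytes.find keeps that search out of the Python loop.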
