I have a bunch of photos in a folder with various EXIF data, and I'd like to output various parts of it to Excel. I'm learning Python (currently using 2.7) and thought this would be a fun task for me to try out, as it incorporates functions, loops, and two libraries (I'm using PIL and openpyxl).
The code currently works fine! I'm able to get data for about 650 images in under three seconds.
Mainly, I'm trying to learn how to better structure the project. My main "concerns" are with how I'm calling my functions. For example, right now, I want to get the Latitude, Longitude, and DateTime the photo was taken. But say I add another function (e.g. get_Exposure()), I'd like to see if I can better write the writeToFile() function to handle that. Coming from a VBA background, I'm thinking I could loop a single line like ws1.cell(column=[first variable], row=row, value=[first variable value]) somehow.
Finally, am I "calling" all of these functions properly? The whole declaring of variables before the for root, dirs, ... line seems out of place to me for some reason. (FWIW, I am mainly acquainted with VBA, so my thinking is all coming from how one does things in that...)
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS
import os, sys
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.utils import get_column_letter

def _get_if_exist(data, key):
    if key in data:
        return data[key]

    return None

def get_exif_data(fn):
    """Returns a dictionary from the exif data of an PIL Image item. Also converts the GPS Tags"""
    image = Image.open(fn)
    exif_data = {}
    info = image._getexif()
    if info:
        for tag, value in info.items():
            decoded = TAGS.get(tag, tag)
            if decoded == "GPSInfo":
                gps_data = {}
                for t in value:
                    sub_decoded = GPSTAGS.get(t, t)
                    gps_data[sub_decoded] = value[t]

                exif_data[decoded] = gps_data
            else:
                exif_data[decoded] = value

    return exif_data

def _convert_to_degrees(value):
    """Helper function to convert the GPS coordinates stored in the EXIF to degrees in float format"""
    d0 = value[0][0]
    d1 = value[0][1]
    d = float(d0) / float(d1)

    m0 = value[1][0]
    m1 = value[1][1]
    m = float(m0) / float(m1)

    s0 = value[2][0]
    s1 = value[2][1]
    s = float(s0) / float(s1)

    return d + (m / 60.0) + (s / 3600.0)

def get_time_taken(exif_data):
    timeTaken = None
    if "DateTimeOriginal" in exif_data:
        timeTaken = exif_data["DateTimeOriginal"]
    return timeTaken

def get_lat(exif_data):
    lat = None
    if "GPSInfo" in exif_data:
        gps_info = exif_data["GPSInfo"]
        gps_latitude = _get_if_exist(gps_info, "GPSLatitude")
        gps_latitude_ref = _get_if_exist(gps_info, 'GPSLatitudeRef')
        if gps_latitude and gps_latitude_ref:
            lat = _convert_to_degrees(gps_latitude)
            if gps_latitude_ref != "N":
                lat = 0 - lat
    return lat

def get_lon(exif_data):
    lon = None
    if "GPSInfo" in exif_data:
        gps_info = exif_data["GPSInfo"]
        gps_longitude = _get_if_exist(gps_info, "GPSLongitude")
        gps_longitude_ref = _get_if_exist(gps_info, 'GPSLongitudeRef')
        if gps_longitude and gps_longitude_ref:
            lon = _convert_to_degrees(gps_longitude)
            if gps_longitude_ref != "E":
                # assignment, not a bare expression, so western longitudes become negative
                lon = 0 - lon
    return lon

def writeToFile(imageName, lat, lon, row, timeTaken, ws1):
    ws1.cell(column=1, row=row, value=imageName)
    ws1.cell(column=2, row=row, value=lat)
    ws1.cell(column=3, row=row, value=lon)
    ws1.cell(column=4, row=row, value=timeTaken)

def saveFile(wb, xlFile):
    wb.save(filename=xlFile)

row = 1
wb = Workbook()
ws1 = wb.active
ws1.title = "GPS Coords"
xlFile = "D:\\myUser\\Pictures\\Digital Pictures\\GPSCoords.xlsx"

for root, dirs, filenames in os.walk("D:\\myUser\\Pictures\\Digital Pictures\\"):
    for imageName in filenames:
        if imageName[-4:] == ".jpg":
            fn = "D:\\myUser\\Pictures\\Digital Pictures\\" + imageName
            exif_data = get_exif_data(fn)
            get_exif_data(fn)
            lat = str(get_lat(exif_data))
            lon = str(get_lon(exif_data))
            timeTaken = str(get_time_taken(exif_data))
            print imageName + ": " + lat + ", " + lon + "; " + timeTaken
            writeToFile(imageName, lat, lon, row, timeTaken, ws1)
            row += 1

saveFile(wb, xlFile)
2 Answers
I would change from Python 2 to Python 3. Python 3 brings many good changes, and the most important one for this problem, Unicode handling, makes the switch worth it.
For the EXIF data, Pillow should be a simple replacement for PIL.
general remarks
def _get_if_exist(data, key)
Python dicts have a get() method with a default argument. Instead of writing your own function, you can simply do d.get(key, None).
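To illustrate the point about get(), here is a minimal sketch. The gps_info dictionary and its contents are made up for the example; they are not taken from a real image.

```python
# Made-up sample data standing in for the decoded GPS tags.
gps_info = {"GPSLatitudeRef": "N"}

# dict.get returns the stored value when the key exists ...
lat_ref = gps_info.get("GPSLatitudeRef")
# ... and the default (None unless you pass something else) when it does not.
lon_ref = gps_info.get("GPSLongitudeRef")
lon_ref_or_east = gps_info.get("GPSLongitudeRef", "E")

print(lat_ref, lon_ref, lon_ref_or_east)
```

So every call to _get_if_exist(gps_info, key) could be written as gps_info.get(key) with the same result.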
Separation of functions
Right now you loop over the files, check whether each one is an image, and process it, all in one loop. I suggest using one function to find all images, a second function to extract all EXIF information, a third function to extract the important information, and then a function to bring it all together.
My attempt
find all images
def find_images(image_dir, extensions=None):
    default_extensions = ('jpg', 'jpeg')
    if extensions is None:
        extensions = default_extensions
    elif isinstance(extensions, str):
        extensions = (extensions,)
    for root, dirs, filenames in os.walk(image_dir):
        for filename in filenames:
            # print(filename, filename.rsplit('.', 1))
            # rsplit takes the part after the LAST dot, so 'trip.2017.jpg' still matches
            if filename.rsplit('.', 1)[-1].lower() in extensions:
                yield os.path.join(root, filename)
This takes a starting directory and a collection of extensions. It uses str.rsplit('.', 1) to get the extension (the part after the last dot), instead of the arbitrary [-4:].
This is a generator, which yields the path to an image every iteration. You could make the output more sophisticated by using yield filename, os.path.join(root, filename) or by yielding a pathlib.Path instead of a str.
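The pathlib idea mentioned above could look roughly like this. It is only a sketch, assuming Python 3; the name find_image_paths is hypothetical, and unlike find_images it expects the extensions with a leading dot because that is what Path.suffix returns.

```python
import os
import tempfile
from pathlib import Path


def find_image_paths(image_dir, extensions=('.jpg', '.jpeg')):
    """Yield a pathlib.Path for every file under image_dir with a matching suffix."""
    for root, dirs, filenames in os.walk(str(image_dir)):
        for filename in filenames:
            path = Path(root) / filename
            # Path.suffix is the last extension including the dot, e.g. '.jpg'
            if path.suffix.lower() in extensions:
                yield path


# Small self-check on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.jpg', 'b.JPEG', 'notes.txt', 'trip.2017.jpg'):
        (Path(tmp) / name).touch()
    found = sorted(p.name for p in find_image_paths(tmp))
    print(found)
```

A Path gives you both the file name (path.name) and the full location in one object, so there is no need to yield two values.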
Getting all exif data
def process_exif_data(image):
    decoded_exif = {}
    with Image.open(image) as image_file:
        exif_data = image_file._getexif()
        if exif_data is None:
            return None
        for tag, value in exif_data.items():
            decoded = TAGS.get(tag, tag)
            if decoded == "GPSInfo":
                decoded_exif.update(decode_gps_data(value))
            else:
                decoded_exif[decoded] = value

    # This could be done with a dict comprehension and a ternary expression too
    return decoded_exif
This is pretty much your solution, except that I put the GPSInfo entries into the dict with all the EXIF info instead of nesting them a level deeper. I also process the GPS data here already instead of later on.
process the GPS-data
def decode_gps_data(info):
    # translate the numeric GPS tag ids into readable names
    gps_tags = {GPSTAGS.get(k, k): v for k, v in info.items()}
    lat, lon = get_coordinates(gps_tags)
    gps_tags['lat'] = lat
    gps_tags['lon'] = lon
    return gps_tags
This should speak for itself.
get the coordinates
def get_coordinates(gps_tags):
    # a tuple of pairs keeps the order fixed (latitude first, then longitude),
    # which a plain dict does not guarantee on older Python versions
    coords = (('Latitude', 'N'), ('Longitude', 'E'))
    for coord, nominal_ref in coords:
        c = gps_tags.get("GPS%s" % coord, None)
        c_ref = gps_tags.get("GPS%sRef" % coord, None)
        if c and c_ref:
            yield _convert_to_degrees(c, c_ref, nominal_ref)
        else:
            yield None
The code to get the latitude and the longitude is the same. The only differences are the nominal reference ('N' or 'E') and the tag, so I abstracted this.
def _convert_to_degrees(value, ref, nominal_ref=None):
    if nominal_ref is None:
        nominal_ref = ('N', 'E',)
    elif isinstance(nominal_ref, str):
        nominal_ref = (nominal_ref, )
    # +1 for the nominal hemisphere (N or E), -1 otherwise
    ref = 1 if ref in nominal_ref else -1
    return ref * sum(float(v[0]) / float(v[1]) / 60 ** i for i, v in enumerate(value))
Instead of 0 - calculated_degrees like you do, I multiply by 1 or -1 depending on the reference. The calculation itself uses tuple unpacking and enumerate: degrees, minutes, and seconds are divided by 60 ** 0, 60 ** 1, and 60 ** 2 respectively and then summed. Since I don't have images with GPS data, I have nothing to check it with, but it should do the same as your get_lat and get_lon.
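As a worked example of the conversion described above, here is a small standalone sketch. The name dms_to_decimal and the sample coordinate are made up, and the input shape (three numerator/denominator pairs) is an assumption based on how the code in this thread indexes the values; it is worth checking what your Pillow version actually returns.

```python
def dms_to_decimal(value, ref, positive_refs=('N', 'E')):
    """Convert three (numerator, denominator) pairs for degrees, minutes
    and seconds into signed decimal degrees."""
    sign = 1 if ref in positive_refs else -1
    # i = 0, 1, 2 divides degrees, minutes, seconds by 1, 60, 3600
    return sign * sum(float(num) / float(den) / 60 ** i
                      for i, (num, den) in enumerate(value))


# 40 degrees, 26 minutes, 46 seconds (made-up sample value)
sample = ((40, 1), (26, 1), (46, 1))
north = dms_to_decimal(sample, 'N')   # 40 + 26/60 + 46/3600
south = dms_to_decimal(sample, 'S')   # same magnitude, negative sign
print(north, south)
```

The result for 'N' is roughly 40.4461, and the 'S' result is the same number with a minus sign.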
Extract the important data
def extract_important_data(image_data, important_datalabels=('lat', 'lon', 'DateTimeOriginal')):
    if image_data is None:
        return None
    return {key: image_data.get(key, None) for key in important_datalabels}
This just returns a selection of the dict of all exif_data. You can specify which tags are important to you, so you can easily expand the needed information later.
Bringing it together
import PIL
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS
import pandas as pd
import os
The imports. I use pandas instead of openpyxl directly, since that allows me a lot more freedom to do smaller data processing afterwards.
def extract_info(images, important_datalabels=('lat', 'lon', 'DateTimeOriginal')):
    for image_path in images:
        exif_data = process_exif_data(image_path)
        yield image_path, extract_important_data(exif_data, important_datalabels=important_datalabels)
This just iterates over all images thrown at it, and yields the image and the important data in the EXIF.
If you don't want to include the images without EXIF info in your final results, you can do it like this:
def extract_info(images, important_datalabels=('lat', 'lon', 'DateTimeOriginal')):
    for image_path in images:
        exif_data = process_exif_data(image_path)
        # pass the labels through so a custom selection is respected here too
        important_data = extract_important_data(exif_data, important_datalabels=important_datalabels)
        if important_data:
            yield image_path, important_data
main()
def main(image_dir=None, filename=None, important_datalabels=('lat', 'lon', 'DateTimeOriginal')):
    if image_dir is None:
        image_dir = '.'
    images = find_images(image_dir)
    info = extract_info(images, important_datalabels=important_datalabels)
    result_df = pd.DataFrame(columns=important_datalabels)
    for image_path, item in info:
        result_df.loc[image_path] = item
    if 'DateTimeOriginal' in important_datalabels:
        date_format = '%Y:%m:%d %H:%M:%S'
        result_df['DateTimeOriginal'] = pd.to_datetime(result_df['DateTimeOriginal'], format=date_format)
    if filename:
        result_df.to_excel(filename)
    return result_df
This is the method that really ties everything together.
- It looks for all the images in image_dir; if no extensions are passed on, it takes the default extensions in that method
- extracts the important info from those images
- makes an empty pandas.DataFrame with the important datalabels as columns
- starts filling this DataFrame
- changes the date to a datetime.datetime object
- if a filename is passed on, writes the DataFrame to this filename
result
For me this yielded
                                      lat   lon     DateTimeOriginal
.\data\images\image-13.jpg            NaN   NaN                  NaT
.\data\images\piazza-nite-2-big.jpg  None  None  2006-06-07 22:53:09
- Woah, thanks for this. I'm going to read it over and see what I can do and let you know any questions. Much appreciated! – BruceWayne, Jun 14, 2017 at 4:58
- Hm - I took your advice and am working in Python3. However, I can't even seem to get the find_images function to work. I uncommented the print() line, and nothing happens when I run it. I call the function, for testing, with find_images("D:\\user\\imageFolder") and nothing happens. It runs, no compile error, but nothing is printed to the console/terminal (whatever it's called). Then, I tried calling it with main() like you have it, but nothing happens too. ...what's the image_dir='.' do? – BruceWayne, Jun 15, 2017 at 3:32
- find_images is a generator, which is lazily evaluated. If you want to force evaluation, you can run list(find_images(<image_dir>)). Calling main() with the correct image_dir passes this generator on to extract_info, which iterates over it. The image_dir='.' is just a default image_dir for the main method. – Maarten Fabré, Jun 15, 2017 at 8:54
- Sorry for the delay on marking as answer - as mentioned I'm learning and this was a pretty dense (to me) answer, so it took some time going through it. Thanks again! :D – BruceWayne, Jul 4, 2017 at 2:47
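The lazy evaluation discussed in these comments can be seen in a tiny standalone sketch. The numbers generator is a made-up stand-in for find_images so that the example runs without any image folder.

```python
def numbers():
    """A tiny generator, standing in for find_images."""
    print("generator body started")
    for n in (1, 2, 3):
        yield n


gen = numbers()        # nothing is printed yet: the body has not started running
first = next(gen)      # now the body runs up to the first yield
rest = list(gen)       # list() drains whatever is left
print(first, rest)
```

This is why calling find_images(...) on its own appears to do nothing: the loop inside only runs once something iterates over the result.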
Here are some of my thoughts on your code:
- You should not leave a blank line in between returns in the _get_if_exist function;
- Functions should be separated by two blank lines;
- Variable names should be lowercase_with_underscores (unlike timeTaken, for example);
- Function names should also follow variable naming conventions (unlike writeToFile and saveFile).
All of these follow PEP 8. Some other recommendations I have:
- Functions will, if return is not explicitly called, return None by default, so there's no reason to use return None (under almost all circumstances);
- You can use r"Path/To/File" (raw string), so there's no need to use escape sequences:
  "Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters." (Lexical Analysis)
- Instead of manually opening and closing the file, you can use the with keyword (with open(file_name, "r") as f: opens a file in read mode and aliases it as f). This also takes care of closing the file for you and is generally more intuitive;
- The last part of the code could be wrapped in a main() function, which can then be called conditionally.
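A rough sketch of the raw-string and main() suggestions, under the assumption that "called conditionally" refers to the usual __name__ check. The body of main here is only a placeholder that counts .jpg files, and EXAMPLE_DIR is a made-up path.

```python
import os


def main(image_dir):
    """Placeholder for the processing loop; here it only counts .jpg files."""
    count = 0
    for root, dirs, filenames in os.walk(image_dir):
        for filename in filenames:
            if filename.lower().endswith(".jpg"):
                count += 1
    return count


# A raw string keeps backslashes literal, so no doubling is needed.
# The path itself is a made-up example.
EXAMPLE_DIR = r"D:\myUser\Pictures"

if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    print(main(os.curdir))
```

With this layout the functions can be imported from another module or an interactive session without the whole folder being processed as a side effect.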
As a response to your concern about the way you're declaring variables, it's generally a better idea to do this at the top of the file (but below the imports).
In your writeToFile() function, you could... well, I'll just rewrite it:
def write_to_file(*args, row, ws1):
    ws_ = ws1
    row_ = row
    for count, arg in enumerate(args):
        ws1.cell(column=count, row=row_, value=arg)
If you're unfamiliar with *args / **kwargs, read this.
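One possible variant of the same idea is sketched below. It puts the fixed arguments first, which is valid on both Python 2 and 3, and counts columns from 1 because openpyxl's rows and columns are 1-based. The names write_row and FakeSheet are hypothetical; FakeSheet only imitates the cell(column=..., row=..., value=...) call used in this thread so that the example runs without openpyxl installed.

```python
class FakeSheet(object):
    """Hypothetical stand-in for an openpyxl worksheet; it only records cell() calls."""

    def __init__(self):
        self.cells = {}

    def cell(self, column, row, value=None):
        if column < 1 or row < 1:
            raise ValueError("Row or column values must be at least 1")
        self.cells[(row, column)] = value


def write_row(ws, row, *values):
    """Write values into consecutive columns of one row, starting at column 1."""
    # Fixed arguments come first so the signature is valid on Python 2 and 3.
    for column, value in enumerate(values, start=1):
        ws.cell(column=column, row=row, value=value)


sheet = FakeSheet()
write_row(sheet, 1, "image.jpg", 51.5, -0.12, "2017:06:14 04:58:08")
print(sheet.cells)
```

Adding another value, such as an exposure setting, then only means passing one more argument to write_row.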
- Thanks for this! I implemented your suggestions, but can't seem to get write_to_file to work properly. I get an "Invalid Syntax" error when doing def write_to_file(*args, row, ws1): followed by those four lines. I'm calling it with write_to_file(image_name, lat, lon, time_taken, row, ws1) (note that I moved row, ws1 to the end, as I assume that *args will use the variables that come before the last two). Also, would I use the with open(... in my main() function? – BruceWayne, Jun 14, 2017 at 4:56
- Python 2 works slightly differently, my bad. Try def write_to_file(row, ws1, *args). Regarding the use of the context manager (with open()), yes, preferably put that in main() and indent everything once. – Daniel, Jun 14, 2017 at 5:05
- All the more reason for me to go to Python3.x ...anyways, I tried that and now my error is "Row or column values must be at least 1", despite my having that row = 1 line...Hmm. – BruceWayne, Jun 14, 2017 at 5:10
- Can you reproduce the exact error? Also, could you add an assert row >= 1 in main()? – Daniel, Jun 14, 2017 at 5:24
- Let me get back to you - I'm still trying to figure out how to do the main() thing. Is that this, __name__ == "__main__" or something else? – BruceWayne, Jun 14, 2017 at 5:40
PIL works with it and apparently not 3.x ...but actually, once I get the above kind of cleared up, I'll just ditch 2.7 and go to 3.x. :/ (I see there's Pillow for 3.x, so I should've just used that. Not that I chose 2.7 because PIL was there, just because the formulas I found for EXIF data all used PIL, so just thought to at least get that part understood, then just switch over after I get a handle of the basics.)