Split latitude/longitude by degree to make file names and folder directory names

Question 1

This code works and does exactly as I want, but I am wondering if/how it could be done faster/more efficiently? This runs quickly on a demo file, but bogs down when I introduce much larger files (3-100GB).

Premise:

Take a large list of latitudes and longitudes remove the negative values (WGS1984) and put them into a North, South, East, West space. Eliminate decimal places.

Use a shortened version of latitude and longitudes for directory creation.

Code:

import numpy as np
import pandas as pd
import h5py
hdf_file = '/path/to/file/lat_lon.h5'
ndir_root = '/output/path/where/directories/are/created/'
# split the file into two chunks to avoid memory error.
chunks = [0, 50, 100]
# open chunked file
for half in range(len(chunks) - 1):
 hf = h5py.File(hdf_file, 'r')
 print 'file loaded. reading meta lat/lon.'
 meta_latitude = hf['latitude'][chunks[half]:chunks[half + 1]]
 meta_longitude = hf['longitude'][chunks[half]:chunks[half + 1]]
 meta_df = pd.DataFrame({'longitude': meta_longitude, 'latitude': meta_latitude})
 print meta_df
 # create an empty list to store coordinates
 lat_list = []
 lon_list = []
 for lat in meta_df['latitude']:
 lat_list.append(lat)
 for lon in meta_df['longitude']:
 lon_list.append(lon)
 # take coordinates from list and convert to a numpy array
 latitude_list = np.asarray(lat_list, dtype=float)
 longitude_list = np.asarray(lon_list, dtype=float)
 # convert numpy array to a string
 lat_list_str = latitude_list.astype(str)
 # make copies to avoid contamination in for loop processing
 lat_list_str_copy_1 = lat_list_str
 lat_list_str_copy_2 = lat_list_str
 # remove negative values, add North, South, East West.
 letter = set('-')
 for lat_check in lat_list_str_copy_1:
 if letter & set(lat_check):
 lat_list_str3 = [i.replace('-', 'S') for i in lat_list_str_copy_1]
 else:
 lat_list_str3 = ['N' + lat_addchar for lat_addchar in lat_list_str_copy_1]
 # remove the decimal for directory creation
 lat = [i.replace('.', '') for i in lat_list_str3]
 lon_list_str = longitude_list.astype(str)
 # make copies to avoid contamination in for loop processing
 lon_list_str_copy_1 = lon_list_str
 lon_list_str_copy_2 = lon_list_str
 for lon_check in lon_list_str_copy_1:
 if letter & set(lon_check):
 lon_list_str3 = [i.replace('-', 'W') for i in lon_list_str_copy_1]
 else:
 lon_list_str3 = ['E' + lon_addchar for lon_addchar in lon_list_str_copy_1]
 lon = [i.replace('.', '') for i in lon_list_str3]
 for i in lat:
 if len(i) < 4:
 i = i + '0'
 for i in lon:
 if len(i) < 5:
 i = '0' + i
 meta_df['y'] = lat
 meta_df['x'] = lon
 meta_df['lon_list_str'] = lon_list_str
 meta_df['lat_list_str'] = lat_list_str
 # convert to int and take only values to left of decimal as folder name
 short_lon_int = [i.split('.')[0] for i in lon_list_str_copy_2]
 short_lat_int = [i.split('.')[0] for i in lat_list_str_copy_2]
 short_lon = map(int, short_lon_int)
 short_lat = map(int, short_lat_int)
 for i in short_lat:
 if i % 3 > 0:
 i -= 1
 for i in short_lon:
 if i % 3 > 0:
 i -= 1
 short_lat_str = map(str, short_lat)
 short_lon_str = map(str, short_lon)
 for lat_check in short_lat_str:
 if letter & set(lat_check):
 short_lat_int_final = [i.replace('-', 'S') for i in short_lat_str]
 else:
 short_lat_int_final = ['N' + lat_addchar for lat_addchar in short_lat_str]
 for lon_check in short_lon_str:
 if letter & set(lon_check):
 short_lon_int_final = [i.replace('-', 'W') for i in short_lon_str]
 else:
 short_lon_int_final = ['E' + lon_addchar for lon_addchar in short_lon_str]
 for a, b in zip(short_lon_int_final, short_lat_int_final):
 ndir = a + b
 print ndir
 if not os.path.exists(ndir_root + ndir):
 os.mkdir(ndir_root + ndir)

Please let me know if anything needs clarification. This is a smaller chunk out of a larger script so some variables appear undefined.

Edit: Per comment requests a sample file has been linked. link to hdf file with lat/lon data.

Question 2

Could you please include the code which leads to the creation of the meta_df object, as well as an example input file, so that we can try out your program without errors?

Question 3

Are you sure this code is doing what you expect? While reading it, I am beginning to have some doubts.

Question 4

Yep will post shortly once laptop is back on wifi. Are we allowed to upload files?... Demo file with lat/lon is small (4kb).

Question 5

You probably should upload the data file somewhere else and provide a link in the question.

Question 6

Yeah i got the feeling too, that your code isn't doing what you expect . For example :for lat_check in lat_list_str_copy_1 you are looping through the copy and there isn't any need to do that since you are only using the final item of the array, because you redefine each time your new array lat_list_str3. Same with your other fors.

Question 7

Bug and accidental quadratic runtime

for lat_check in lat_list_str_copy_1:
 if letter & set(lat_check):
 lat_list_str3 = [i.replace('-', 'S') for i in lat_list_str_copy_1]
 else:
 lat_list_str3 = ['N' + lat_addchar for lat_addchar in lat_list_str_copy_1]

This code (which occurs with variations four times in your program) is probably the cause for the program becoming extremely slow for large input files:

Inside the for loop you want to format only the current latitude value, but instead the whole list is rewritten each time. If you have a list with n = 1000 items, the inner code (i.replace('-', 'S') or 'N' + lat_addchar) is executed n² = 1 000 000 times, of which the first 999 000 times were wasted effort.

In addition, the result is wrong, because you either replace '-' by 'S' in all items, or prepend 'N' to all items, depending of the value of only the last item in the list.

Side note

letter = set('-')
if letter & set(lat_check):
 ...

This is actually a clever idea, if you wanted to check if any of several letters is contained in a string. However, for a single letter, you can simply write the following:

if '-' in lat_check:
 ...

(But checking for negative values can be done more easily, see below.)

Unused results

for i in lat:
 if len(i) < 4:
 i = i + '0'
for i in lon:
 if len(i) < 5:
 i = '0' + i

for i in short_lat:
 if i % 3 > 0:
 i -= 1
for i in short_lon:
 if i % 3 > 0:
 i -= 1

These have no effect and can simply be removed, because the resulting value of i is not used anywhere and is overwritten in the next loop iteration anyway.

meta_df['y'] = lat
meta_df['x'] = lon
meta_df['lon_list_str'] = lon_list_str
meta_df['lat_list_str'] = lat_list_str

This can be removed as well, because meta_df isn't used anywhere later on.

Since now the lists lat and lon aren't used anymore, all the computations leading to their creation can be removed, too. This involves the intermediate lists lat_list_str3 and lon_list_str3, as well as lat_list_str_copy_1 and lon_list_str_copy_1.

Assignments don't make copies

# make copies to avoid contamination in for loop processing
lat_list_str_copy_1 = lat_list_str
lat_list_str_copy_2 = lat_list_str
[...]
lon_list_str_copy_1 = lon_list_str
lon_list_str_copy_2 = lon_list_str

Besides lat_list_str_copy_1 and lon_list_str_copy_1 now not being used anymore, this doesn't do what you think:

These assignments only attach new names to the existing list objects, the do not create new copies of the lists!

However, making copies wouldn't have been necessary at all, because they were never modified.

You can just use the original name of the list wherever the *_copy_1 or *_copy_2 names are used.

This does create a new list, as you intended:

lat_list = []
for lat in meta_df['latitude']:
 lat_list.append(lat)

However, it could more simply be written as

lat_list = list(meta_df['latitude'])

(But it turns out that creating this list wasn't necessary either, see below.)

If you are astonished by how assignments don't make copies in Python, you can read a good explanation in this blog post.

Simplified approach

After these first steps of cleaning up, the way the coordinates take through your program becomes more apparent:

h5py.File → pd.DataFrame → list of numbers → NumPy float array → NumPy str array → list of strings (actual processing: remove parts after decimal point) → list of integers → list of strings → list of strings (actual processing: remove minus sign, add N/S/E/W prefix).

This chain is replicated for latitudes and for longitudes.

I hope you can agree with me that the number of intermediate steps seems excessive compared to the actual work done.

I suggest the following approach:

Treat (longitude, latitude) pairs as single objects.
Iterate directly over the contents of the h5py.File object, processing one coordinate at a time (using a generator), instead of storing many coordinates at once in lists or other containers. If h5py is implemented properly, this should lead to low memory consumption regardless of the input file size and make manual management of chunk processing unnecessary.
Perform only a single conversion, namely when formatting the numeric value of a coordinate as the string that is used as the name of a new directory.
Organize each task of which the program is composed into its own function. This has the following benefits:
- It is easier for future readers of the code to understand what the program does.
- It is easier to test whether the individual task is implemented correctly.
- It forces you to think more thoroughly about your program, leading to simpler code.
- New functionality can be added by adding more functions and reusing existing functions, instead of duplicating and modifying the same code, which is tedious and error-prone.

Here is how would write the program:

from itertools import izip
import errno
import h5py
import os
def generate_coordinates(input_filename):
 """Generate all (longitude, latitude) pairs contained in the input file."""
 with h5py.File(input_filename, 'r') as hf:
 for coordinate in izip(hf['longitude'], hf['latitude']):
 yield coordinate
def format_coordinate(coordinate):
 """Format a pair of angles (longitude, latitude) as an 8-character string:
 >>> format_coordinate((0.5, 12.3))
 'E000N012'
 >>> format_coordinate((-40.23, -138.652))
 'W040S138'
 """
 def format_parts():
 fmt = '{prefix}{value:03d}'
 for angle, directions in zip(coordinate, ['EW', 'NS']):
 if angle >= 0:
 yield fmt.format(prefix=directions[0], value=int(angle))
 else:
 yield fmt.format(prefix=directions[1], value=int(-angle))
 return ''.join(format_parts())
def make_sure_path_exists(path, verbose=True):
 """Create the directory at the specified path if it doesn't exist.
 If verbose, print the name of the created directory.
 """
 try:
 os.makedirs(path)
 except OSError as exception:
 if exception.errno != errno.EEXIST:
 raise
 else:
 if verbose:
 print 'created directory {}'.format(path)
if __name__ == '__main__':
 for c in generate_coordinates('/path/to/input_file'):
 dir_name = format_coordinate(c)
 make_sure_path_exists(os.path.join('/output/dir', dir_name))

Remarks

The docstrings at the beginning of the functions are used by the help system of Python, e.g. when using the help function in an interactive Python session, or when running the pydoc command from the command line. In addition, they can contain doctests, which not only provide examples for how to use a function, but can also be automatically executed in order to test the function.
Using the with syntax ensures that the input file is closed when the program is finished (h5py.File supports it).
Using izip in generate_coordinates prevents loading the entire file contents at once. If you are using Python 3, the regular zip function already behaves like this.
Inside the format_coordinate function there is a local function format_parts generating the two parts of the result for the latitude and the longitude, which are then joined together. It uses the str.format function to format each angle of the coordinate as a prefix and a three-digit zero-filled number.
I have taken the make_sure_path_exists function from this post on StackOverflow and modified it slightly.
The if __name__ == '__main__' idiom prevents the program to be executed when the file is not run as a script, but imported as a module.
os.path.join is the preferred way to combine paths, as opposed to manual string concatenation, because it is platform-independent and removes the possibility of errors such as combining A and B to AB instead of A/B.

Question 8

Masterclass, I ran the file without your recommended workflow for folder creation and did end up running into a bug not seen during small tests. I incorporated your recommended function and it ran seamlessly. Cheers!

Question 9

EDIT : I highly suggest you check out mkrieger1's answer

One loop for the whole process:

So the reason the whole process takes so long to complete is because you are looping for every operation over all your list again and again when you can simply do it in one go.

letter = set('-')
lat_result = []
lon_result = []
short_lon_result = []
short_lat_result = []

Since we are talking about latitude/longitude pairs we can use that to our advantage and access both lists in one go because we have the same count of items for both lists.

for count in range(len(lat_list_str)):
 # Latitude operation
 lat_check = lat_list_str[count]
 short_lat=int(float(lat_check))
 if short_lat % 3 > 0:
 short_lat -= 1
 short_lat = str(short_lat)

Short latitude has the same sign as lat_check.

 if letter & set(lat_check):
 temp_lat_check = lat_check.replace('-', 'S')
 short_lat = short_lat.replace('-', 'S')
 else:
 temp_lat_check = 'N' + lat_check
 short_lat = 'N' + short_lat
 lat = temp_lat_check.replace('.', '')
 if len(lat) < 5:
 lat = '0'+lat
 lat_result.append(lat)
 short_lat_result.append(short_lat)
 # Longitude operation
 lon_check = lon_list_str[count]
 short_lon=int(float(lon_check))
 if short_lon % 3 > 0:
 short_lon -= 1
 short_lon=str(short_lon)

Same goes for longitude, it has the same sign as lon_check.

 if letter & set(lon_check):
 temp_lon_check = lon_check.replace('-', 'W')
 short_lon = short_lon.replace('-', 'W')
 else:
 temp_lon_check = 'E' + lon_check
 short_lon = 'E' + short_lon
 lon = temp_lon_check.replace('.', '')
 if len(lon) < 5:
 lon = '0'+lon
 lon_result.append(lon)
 short_lon_result.append(short_lon)
meta_df['y'] = lat_result
meta_df['x'] = lon_result

To be honest I don't understand why you would store to your dataframe lon_list_str and lat_list_str again when they have the exact same values as your longitude and latitude keys. The following part seems unneeded to me:

meta_df['lon_list_str'] = lon_list_str
meta_df['lat_list_str'] = lat_list_str

Question 10

Yep one for loop makes a lot more sense, thanks! I did end up taking out if short_lon % 3 > 0: short_lon -= 1 because it kept changing the coords from 179 to 180, but other than that it worked really well!

mkrieger1 mkrieger1 1,7841 gold badge14 silver badges26 bronze badges · Accepted Answer · 2017-01-06 20:34:46Z

Bug and accidental quadratic runtime

for lat_check in lat_list_str_copy_1:
 if letter & set(lat_check):
 lat_list_str3 = [i.replace('-', 'S') for i in lat_list_str_copy_1]
 else:
 lat_list_str3 = ['N' + lat_addchar for lat_addchar in lat_list_str_copy_1]

This code (which occurs with variations four times in your program) is probably the cause for the program becoming extremely slow for large input files:

Inside the for loop you want to format only the current latitude value, but instead the whole list is rewritten each time. If you have a list with n = 1000 items, the inner code (i.replace('-', 'S') or 'N' + lat_addchar) is executed n² = 1 000 000 times, of which the first 999 000 times were wasted effort.

In addition, the result is wrong, because you either replace '-' by 'S' in all items, or prepend 'N' to all items, depending of the value of only the last item in the list.

Side note

letter = set('-')
if letter & set(lat_check):
 ...

This is actually a clever idea, if you wanted to check if any of several letters is contained in a string. However, for a single letter, you can simply write the following:

if '-' in lat_check:
 ...

(But checking for negative values can be done more easily, see below.)

Unused results

for i in lat:
 if len(i) < 4:
 i = i + '0'
for i in lon:
 if len(i) < 5:
 i = '0' + i

for i in short_lat:
 if i % 3 > 0:
 i -= 1
for i in short_lon:
 if i % 3 > 0:
 i -= 1

These have no effect and can simply be removed, because the resulting value of i is not used anywhere and is overwritten in the next loop iteration anyway.

meta_df['y'] = lat
meta_df['x'] = lon
meta_df['lon_list_str'] = lon_list_str
meta_df['lat_list_str'] = lat_list_str

This can be removed as well, because meta_df isn't used anywhere later on.

Since now the lists lat and lon aren't used anymore, all the computations leading to their creation can be removed, too. This involves the intermediate lists lat_list_str3 and lon_list_str3, as well as lat_list_str_copy_1 and lon_list_str_copy_1.

Assignments don't make copies

# make copies to avoid contamination in for loop processing
lat_list_str_copy_1 = lat_list_str
lat_list_str_copy_2 = lat_list_str
[...]
lon_list_str_copy_1 = lon_list_str
lon_list_str_copy_2 = lon_list_str

Besides lat_list_str_copy_1 and lon_list_str_copy_1 now not being used anymore, this doesn't do what you think:

These assignments only attach new names to the existing list objects, the do not create new copies of the lists!

However, making copies wouldn't have been necessary at all, because they were never modified.

You can just use the original name of the list wherever the *_copy_1 or *_copy_2 names are used.

This does create a new list, as you intended:

lat_list = []
for lat in meta_df['latitude']:
 lat_list.append(lat)

However, it could more simply be written as

lat_list = list(meta_df['latitude'])

(But it turns out that creating this list wasn't necessary either, see below.)

If you are astonished by how assignments don't make copies in Python, you can read a good explanation in this blog post.

Simplified approach

After these first steps of cleaning up, the way the coordinates take through your program becomes more apparent:

h5py.File → pd.DataFrame → list of numbers → NumPy float array → NumPy str array → list of strings (actual processing: remove parts after decimal point) → list of integers → list of strings → list of strings (actual processing: remove minus sign, add N/S/E/W prefix).

This chain is replicated for latitudes and for longitudes.

I hope you can agree with me that the number of intermediate steps seems excessive compared to the actual work done.

I suggest the following approach:

Treat (longitude, latitude) pairs as single objects.
Iterate directly over the contents of the h5py.File object, processing one coordinate at a time (using a generator), instead of storing many coordinates at once in lists or other containers. If h5py is implemented properly, this should lead to low memory consumption regardless of the input file size and make manual management of chunk processing unnecessary.
Perform only a single conversion, namely when formatting the numeric value of a coordinate as the string that is used as the name of a new directory.
Organize each task of which the program is composed into its own function. This has the following benefits:
- It is easier for future readers of the code to understand what the program does.
- It is easier to test whether the individual task is implemented correctly.
- It forces you to think more thoroughly about your program, leading to simpler code.
- New functionality can be added by adding more functions and reusing existing functions, instead of duplicating and modifying the same code, which is tedious and error-prone.

Here is how would write the program:

from itertools import izip
import errno
import h5py
import os
def generate_coordinates(input_filename):
 """Generate all (longitude, latitude) pairs contained in the input file."""
 with h5py.File(input_filename, 'r') as hf:
 for coordinate in izip(hf['longitude'], hf['latitude']):
 yield coordinate
def format_coordinate(coordinate):
 """Format a pair of angles (longitude, latitude) as an 8-character string:
 >>> format_coordinate((0.5, 12.3))
 'E000N012'
 >>> format_coordinate((-40.23, -138.652))
 'W040S138'
 """
 def format_parts():
 fmt = '{prefix}{value:03d}'
 for angle, directions in zip(coordinate, ['EW', 'NS']):
 if angle >= 0:
 yield fmt.format(prefix=directions[0], value=int(angle))
 else:
 yield fmt.format(prefix=directions[1], value=int(-angle))
 return ''.join(format_parts())
def make_sure_path_exists(path, verbose=True):
 """Create the directory at the specified path if it doesn't exist.
 If verbose, print the name of the created directory.
 """
 try:
 os.makedirs(path)
 except OSError as exception:
 if exception.errno != errno.EEXIST:
 raise
 else:
 if verbose:
 print 'created directory {}'.format(path)
if __name__ == '__main__':
 for c in generate_coordinates('/path/to/input_file'):
 dir_name = format_coordinate(c)
 make_sure_path_exists(os.path.join('/output/dir', dir_name))

Remarks

The docstrings at the beginning of the functions are used by the help system of Python, e.g. when using the help function in an interactive Python session, or when running the pydoc command from the command line. In addition, they can contain doctests, which not only provide examples for how to use a function, but can also be automatically executed in order to test the function.
Using the with syntax ensures that the input file is closed when the program is finished (h5py.File supports it).
Using izip in generate_coordinates prevents loading the entire file contents at once. If you are using Python 3, the regular zip function already behaves like this.
Inside the format_coordinate function there is a local function format_parts generating the two parts of the result for the latitude and the longitude, which are then joined together. It uses the str.format function to format each angle of the coordinate as a prefix and a three-digit zero-filled number.
I have taken the make_sure_path_exists function from this post on StackOverflow and modified it slightly.
The if __name__ == '__main__' idiom prevents the program to be executed when the file is not run as a script, but imported as a module.
os.path.join is the preferred way to combine paths, as opposed to manual string concatenation, because it is platform-independent and removes the possibility of errors such as combining A and B to AB instead of A/B.

Masterclass, I ran the file without your recommended workflow for folder creation and did end up running into a bug not seen during small tests. I incorporated your recommended function and it ran seamlessly. Cheers!

Stack Exchange Network

Split latitude/longitude by degree to make file names and folder directory names

2 Answers 2

Bug and accidental quadratic runtime

Unused results

Assignments don't make copies

Simplified approach

Remarks

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Split latitude/longitude by degree to make file names and folder directory names

2 Answers 2

Bug and accidental quadratic runtime

Unused results

Assignments don't make copies

Simplified approach

Remarks

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions