Bug and accidental quadratic runtime
for lat_check in lat_list_str_copy_1: if letter & set(lat_check): lat_list_str3 = [i.replace('-', 'S') for i in lat_list_str_copy_1] else: lat_list_str3 = ['N' + lat_addchar for lat_addchar in lat_list_str_copy_1]
This code (which occurs with variations four times in your program) is probably the cause for the program becoming extremely slow for large input files:
Inside the for
loop you want to format only the current latitude value, but instead the whole list is rewritten each time. If you have a list with n = 1000 items, the inner code (i.replace('-', 'S')
or 'N' + lat_addchar
) is executed n2 = 1 000 000 times, of which the first 999 000 times were wasted effort.
In addition, the result is wrong, because you either replace '-'
by 'S'
in all items, or prepend 'N'
to all items, depending of the value of only the last item in the list.
Unused results
for i in lat: if len(i) < 4: i = i + '0' for i in lon: if len(i) < 5: i = '0' + i
for i in short_lat: if i % 3 > 0: i -= 1 for i in short_lon: if i % 3 > 0: i -= 1
These have no effect and can simply be removed, because the resulting value of i
is not used anywhere and is overwritten in the next loop iteration anyway.
meta_df['y'] = lat meta_df['x'] = lon meta_df['lon_list_str'] = lon_list_str meta_df['lat_list_str'] = lat_list_str
This can be removed as well, because meta_df
isn't used anywhere later on.
Since now the lists lat
and lon
aren't used anymore, all the computations leading to their creation can be removed, too. This involves the intermediate lists lat_list_str3
and lon_list_str3
, as well as lat_list_str_copy_1
and lon_list_str_copy_1
.
Assignments don't make copies
# make copies to avoid contamination in for loop processing lat_list_str_copy_1 = lat_list_str lat_list_str_copy_2 = lat_list_str [...] lon_list_str_copy_1 = lon_list_str lon_list_str_copy_2 = lon_list_str
Besides lat_list_str_copy_1
and lon_list_str_copy_1
now not being used anymore, this doesn't do what you think:
These assignments only attach new names to the existing list objects, the do not create new copies of the lists!
However, making copies wouldn't have been necessary at all, because they were never modified.
You can simply use the original name of the list wherever the *_copy_1
or *_copy_2
names are used.
Simplified approach
After these first steps of cleaning up, the way the coordinates take through your program becomes more apparent:
h5py.File
→
pd.DataFrame
→
list of numbers →
NumPy float
array →
NumPy str
array →
list of strings (actual processing: remove parts after decimal point) →
list of integers →
list of strings →
list of strings (actual processing: remove minus sign, add N/S/E/W prefix).
This chain is replicated for latitudes and for longitudes.
I hope you can agree with me that the number of intermediate steps seems excessive compared to the actual work done.
I suggest the following approach:
Treat (longitude, latitude) pairs as single objects.
Iterate directly over the contents of the
h5py.File
object, processing one coordinate at a time (using a generator), instead of storing many coordinates at once in lists or other containers. Ifh5py
is implemented properly, this should lead to low memory consumption regardless of the input file size and make manual management of chunk processing unnecessary.Perform only a single conversion, namely when formatting the numeric value of a coordinate as the string that is used as the name of a new directory.
Organize each task of which the program is composed into its own function. This has the following benefits:
- It is easier for future readers of the code to understand what the program does.
- It is easier to test whether the individual task is implemented correctly.
- It forces you to think more thoroughly about your program, leading to simpler code.
- New functionality can be added by adding more functions and reusing existing functions, instead of duplicating and modifying the same code, which is tedious and error-prone.
Here is how would write the program:
from itertools import izip
import errno
import h5py
import os
def generate_coordinates(input_filename):
"""Generate all (longitude, latitude) pairs contained in the input file."""
with h5py.File(input_filename, 'r') as hf:
for coordinate in izip(hf['longitude'], hf['latitude']):
yield coordinate
def format_coordinate(coordinate):
"""Format a pair of numbers (longitude, latitude) as an 8-character string:
>>> format_coordinate((0.5, 12.3))
'E000N012'
>>> format_coordinate((-40.23, -138.652))
'W040S138'
"""
def format_parts():
fmt = '{prefix}{value:03d}'
for angle, directions in zip(coordinate, ['EW', 'NS']):
if angle >= 0:
yield fmt.format(prefix=directions[0], value=int(angle))
else:
yield fmt.format(prefix=directions[1], value=int(-angle))
return ''.join(format_parts())
def make_sure_path_exists(path, verbose=True):
"""Create the directory at the specified path if it doesn't exist.
If verbose, print the name of the created directory.
"""
try:
os.makedirs(path)
except OSError as exception:
if exception.errno != errno.EEXIST:
raise
else:
if verbose:
print 'created directory {}'.format(path)
if __name__ == '__main__':
for c in generate_coordinates('/path/to/input_file'):
dir_name = format_coordinate(c)
make_sure_path_exists(os.path.join('/output/dir', dir_name))
Remarks
The docstrings at the beginning of the functions are used by the help system of Python, e.g. when using the
help
function in an interactive Python session, or when running thepydoc
command from the command line. In addition, they can contain doctests, which not only provide examples for how to use a function, but can also be automatically executed in order to test the function.Using the
with
syntax ensures that the input file is closed when the program is finished (h5py.File
supports it).Using
izip
ingenerate_coordinates
prevents loading the entire file contents at once. If you are using Python 3, the regularzip
function already behaves like this.Inside the
format_coordinate
function there is a local functionformat_parts
generating the two parts of the result for the latitude and the longitude, which are then joined together. It uses thestr.format
function to format each angle of the coordinate as a prefix and a three-digit zero-filled number.I have taken the
make_sure_path_exists
function from this post on StackOverflow and modified it slightly.The
if __name__ == '__main__'
idiom prevents the program to be executed when the file is not run as a script, but imported as a module.os.path.join
is the preferred way to combine paths, as opposed to manual string concatenation, because it is platform-independent and removes the possibility of errors such as combiningA
andB
toAB
instead ofA/B
.
- 1.8k
- 1
- 14
- 26