I have written my first Python script that will go into live use at my workplace. I am relatively new to Python and I think this script is pretty good, but I would like your input.
It runs very fast as far as I can tell (~0.01 seconds). It is also my first time writing a script that takes arguments from the way it is called; these arguments are set
in the .bat file it is run from.
import os
import time
from itertools import izip_longest
import fnmatch
import datetime
from datetime import timedelta, datetime
import shutil
import time
from sys import argv

date, raw_dir, today_dir, archive_dir, error_file = argv[1:]

def merge(raw_dir, today_dir, archive_dir, date):
    """Read through all .CSVs in RAW directory and merge rows into Today's "full" CSV """
    archive_file = archive_dir + date + '.csv'
    with open(today_dir + 'TODAY.CSV', 'w') as f_obj:
        rows = []
        files = os.listdir(raw_dir)
        for fi in files:
            if fnmatch.fnmatch(fi, '*.csv') and not fnmatch.fnmatch(fi, 'TODAY.CSV'):
                rows.append(open(os.path.join(raw_dir, fi)).readlines())
        iter = izip_longest(*rows)
        for row in iter:
            f_obj.write(' '.join([field.strip() for field in row if field is not None]) + '\n')
    shutil.copyfile(today_dir + 'TODAY.CSV', archive_dir + date + '.csv')
def clear(raw_dir):
    """ remove any raw csv files that are older than 14 business days. Data itself
    is not lost because it has been archived in the full csv"""
    files = os.listdir(raw_dir)
    file_paths = []
    file_ints = []
    for f in files:
        file_paths.append(os.path.join(raw_dir, f))
    if len(file_paths) > 14:
        for f in files:
            file_ints.append(int(f[:8]))  # converts file names to ints from datenumber
        del_file = raw_dir + str(min(file_ints)) + '.csv'  # makes a valid file path from the lowest value datenumber
        os.remove(del_file)  # deletes lowest value
        print del_file + ' was deleted in clean up'
    else:
        return
I've left out the part where the functions are actually called, as it's just a simple if __name__ == "__main__" block and some try/except.
2 Answers
Clear
List Comprehension
It's an unneeded habit in Python to pre-declare your variables and fill them in a loop. Instead of

myList = []
for x in otherList:
    myList.append(foo(x))

you can use a list comprehension:

myList = [foo(x) for x in otherList]
# identical: myList = map(foo, otherList)
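Applied to the clear function under review, for instance, the file_paths loop collapses to one line:

file_paths = [os.path.join(raw_dir, f) for f in files]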
Clarity
for f in files:
    file_ints.append(int(f[:8]))  # converts file names to ints from datenumber

The comment explains that f[:8] returns the date number of the file, but how that happens is a mystery to me. Why don't you just use os.path.getctime?
file_ints = [os.path.getctime(f) for f in file_paths]  # getctime needs full paths, not the bare names from listdir
This next bit is confusing.

del_file = raw_dir + str(min(file_ints)) + '.csv'  # makes a valid file path from the lowest value datenumber

It only makes sense if your files' naming convention is based on their ctime, and you aren't defining that naming convention here. Don't rely on external rules. Your file_ints line up with your file_paths, so just use

del_file = file_paths[file_ints.index(min(file_ints))]
You might consider maintaining the alignment by zipping the two together, like this:

path_ctime = zip(file_paths, file_ints)
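These pairs also make it easy to pick the oldest file without any index bookkeeping, e.g.:

# each pair is (path, ctime); min keyed on ctime gives the oldest file
oldest_path, oldest_ctime = min(zip(file_paths, file_ints), key=lambda pair: pair[1])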
Assumptions
Do you get the files daily? Is the script guaranteed to be run daily? If either of these is ever false, then your clearing will be out of sync. You are not deleting every file older than 14 days; you are deleting the oldest file if there are more than 14. It could be 15 days old, it could be 3 days old.
Instead, iterate over all files and delete anything older than 14 days.
for path, ctime in zip(file_paths, file_ints):
    if is_too_old(ctime):  # implement externally. Use the time module to get today.
        os.remove(path)
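For reference, a minimal sketch of what is_too_old could look like, assuming plain calendar days (your docstring says business days, which would need extra handling):

import time

FOURTEEN_DAYS = 14 * 24 * 60 * 60  # seconds

def is_too_old(ctime):
    # True when the file's ctime is more than 14 days before now
    return time.time() - ctime > FOURTEEN_DAYS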
You are also assuming all files in this directory are .csv. I don't know the nature of the directory, but it could be possible that some jerk comes and drops a .txt or .xls in there. os.listdir will also return any subfolders that you might have. You might not have them, but if you did you surely wouldn't want to remove them. Using glob might be the safer bet. It will also return the full path, so you don't need the intermediate files variable.
file_paths = glob.glob(os.path.join(raw_dir, "*.csv"))
This is optional though. It might be more than fair to assume what is or isn't in your folder.
Results
import glob
def clear(raw_dir):
    """ remove any raw csv files that are older than 14 business days. Data itself
    is not lost because it has been archived in the full csv"""
    file_paths = glob.glob(os.path.join(raw_dir, "*.csv"))
    # Optional early exit assuming you get no less than 14 files a day.
    # if len(file_paths) <= 14:
    #     return
    file_ints = [os.path.getctime(f) for f in file_paths]
    for path, ctime in zip(file_paths, file_ints):
        if is_too_old(ctime):  # implement externally. Use the time module to get today.
            os.remove(path)
            print path + ' was deleted in clean up'
Merge
You declare this variable and then never use it:

archive_file = archive_dir + date + '.csv'

The same value is used at the end,

shutil.copyfile(today_dir + 'TODAY.CSV', archive_dir + date + '.csv')

but I say drop the declaration and leave the end as is, except please use os.path.join.
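That is, something like:

shutil.copyfile(os.path.join(today_dir, 'TODAY.CSV'),
                os.path.join(archive_dir, date + '.csv'))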
Order
The main issue with this procedure is the order in which you perform events. Currently, it is
- open output file
- read input files
- write output file.
It should be,
- read input files
- open output file
- write output file.
Reading before you create the output file will simplify the read to
rows = []
files = os.listdir(raw_dir)
for fi in files:
    if fnmatch.fnmatch(fi, '*.csv'):
        rows.append(open(os.path.join(raw_dir, fi)).readlines())
and using glob will make it simpler still:

rows = \
    [open(fi).readlines() for fi in glob.glob(os.path.join(raw_dir, "*.csv"))]
Clarity
I have no idea why you are transposing your rows.
iter = izip_longest(*rows)
Each row is the list of lines of a file, so

rows = [[line1_file1, line2_file1, ...], [line1_file2, line2_file2, ...], ...]

Then you transpose that, so you get

iter = [[line1_file1, line1_file2, ...], [line2_file1, line2_file2, ...]]

Each line of your output is then a join of the non-null lines of your files. Are you sure that's what you want? I could help you more, but I don't understand the process you are going through. Your output file looks like it could be
line1_file1 line1_file2 line1_file3
line2_file1 line2_file2 line2_file3
line3_file1 line3_file3
line4_file3
The point is that it's confusing, and there should be some explanation.
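For what it's worth, here is a minimal, runnable demonstration of that transposition (Python 2, to match your izip_longest import; the file contents are made up):

from itertools import izip_longest  # zip_longest in Python 3

# pretend these are the lines of three input files of different lengths
rows = [['a1\n', 'a2\n', 'a3\n'],
        ['b1\n', 'b2\n'],
        ['c1\n', 'c2\n', 'c3\n', 'c4\n']]

# izip_longest pads the shorter files with None
for row in izip_longest(*rows):
    print ' '.join(field.strip() for field in row if field is not None)

# prints:
# a1 b1 c1
# a2 b2 c2
# a3 c3
# c4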
Results
def merge(raw_dir, today_dir, archive_dir, date):
    """Read through all .CSVs in RAW directory and merge rows into Today's "full" CSV
    """
    rows = \
        [open(fi).readlines() for fi in glob.glob(os.path.join(raw_dir, "*.csv"))]
    merged_file = os.path.join(today_dir, 'TODAY.CSV')
    archive_file = os.path.join(archive_dir, date + '.csv')
    with open(merged_file, 'w') as f_obj:
        # whatever is going on here...
        for row in izip_longest(*rows):
            f_obj.write(' '.join([field.strip() for field in row if field is not None]) + '\n')
    shutil.copyfile(merged_file, archive_file)
- The reason for doing it using the smallest number is that it is only one file a day, and the process producing the .CSVs is run on a highly monitored and maintained system; it will never be allowed to not run. In regards to making sure only CSVs are in there, I have added a valid() function, which validates the contents and format of all directories and path names before the main script is actually run. – Joe Smart, Sep 23, 2014 at 8:25
I would use os.path.join everywhere; doing e.g. today_dir + 'TODAY.CSV' is error prone, as it will create an invalid path if e.g. today_dir has no '\' at the end. I would also ensure that the input values (today_dir, archive_dir, etc.) contain valid values.
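For example (paths made up for illustration, on Windows):

os.path.join('C:\\csv\\today', 'TODAY.CSV')  # 'C:\\csv\\today\\TODAY.CSV'
'C:\\csv\\today' + 'TODAY.CSV'               # 'C:\\csv\\todayTODAY.CSV' -- broken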
- Good ideas there! For checking the input values, I assume you mean check if they're all file paths or directories? – Joe Smart, Sep 22, 2014 at 10:38
- Yes, and make sure they exist. – Werner, Sep 23, 2014 at 12:33