I have written my first Python script that will go into live use at my workplace. I am relatively new to Python and I think this script is pretty good, but I would like your input.
It runs very fast as far as I can tell (~0.01 seconds). It is also my first time writing a script that takes arguments from the way it is called; these arguments are set
in the .bat file it is run from.
import os
import time
from itertools import izip_longest
import fnmatch
import datetime
from datetime import timedelta, datetime
import shutil
import time
from sys import argv

date, raw_dir, today_dir, archive_dir, error_file = argv[1:]

def merge(raw_dir, today_dir, archive_dir, date):
    """Read through all .CSVs in RAW directory and merge rows into Today's "full" CSV """
    archive_file = archive_dir + date + '.csv'
    with open(today_dir + 'TODAY.CSV', 'w') as f_obj:
        rows = []
        files = os.listdir(raw_dir)
        for fi in files:
            if fnmatch.fnmatch(fi, '*.csv') and not fnmatch.fnmatch(fi, 'TODAY.CSV'):
                rows.append(open(os.path.join(raw_dir, fi)).readlines())
        iter = izip_longest(*rows)
        for row in iter:
            f_obj.write(' '.join([field.strip() for field in row if field is not None]) + '\n')
    shutil.copyfile(today_dir + 'TODAY.CSV', archive_dir + date + '.csv')
def clear(raw_dir):
    """ remove any raw csv files that are older than 14 business days. Data itself
    is not lost because it has been archived in the full csv"""
    files = os.listdir(raw_dir)
    file_paths = []
    file_ints = []
    for f in files:
        file_paths.append(os.path.join(raw_dir, f))
    if len(file_paths) > 14:
        for f in files:
            file_ints.append(int(f[:8]))  # converts file names to ints from datenumber
        del_file = raw_dir + str(min(file_ints)) + '.csv'  # makes a valid file path from the lowest value datenumber
        os.remove(del_file)  # deletes lowest value
        print del_file + ' was deleted in clean up'
    else:
        return
I've left out the part where the functions are actually called, as it's just a simple if __name__ == "__main__" block and some try/except.
2 Answers
Clear
List Comprehension
It's an unneeded habit in Python to pre-declare your variables and fill them in a loop. Instead of

myList = []
for x in otherList:
    myList.append(foo(x))

you can use a list comprehension:

myList = [foo(x) for x in otherList]
# identical: myList = map(foo, otherList)
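Applied to the clear function under review, for instance, the file_paths loop collapses to one line:

file_paths = [os.path.join(raw_dir, f) for f in files]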
Clarity
for f in files:
    file_ints.append(int(f[:8]))  # converts file names to ints from datenumber

The comment explains that f[:8] returns the date number of the file, but how that happens is a mystery to me. Why don't you just use os.path.getctime?
file_ints = [os.path.getctime(f) for f in file_paths]  # getctime needs full paths, not the bare names from listdir
This next bit is confusing.

del_file = raw_dir + str(min(file_ints)) + '.csv'  # makes a valid file path from the lowest value datenumber

It only makes sense if your files' naming convention is based on their ctime, and you aren't defining that naming convention here. Don't rely on external rules. Your file_ints line up with your file_paths, so just use

del_file = file_paths[file_ints.index(min(file_ints))]
You might consider maintaining the alignment by zipping the two together, like this:

path_ctime = zip(file_paths, file_ints)
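These pairs also make it easy to pick the oldest file without any index bookkeeping, e.g.:

# each pair is (path, ctime); min keyed on ctime gives the oldest file
oldest_path, oldest_ctime = min(zip(file_paths, file_ints), key=lambda pair: pair[1])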
Assumptions
Do you get the files daily? Is the script guaranteed to be run daily? If either of these is ever false, then your clearing will be out of sync. You are not deleting every file older than 14 days; you are deleting the oldest file if there are more than 14. It could be 15 days old, it could be 3 days old.
Instead, iterate over all files and delete anything older than 14 days.
for path, ctime in zip(file_paths, file_ints):
    if is_too_old(ctime):  # implement externally. Use the time module to get today.
        os.remove(path)
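For reference, a minimal sketch of what is_too_old could look like, assuming plain calendar days (your docstring says business days, which would need extra handling):

import time

FOURTEEN_DAYS = 14 * 24 * 60 * 60  # seconds

def is_too_old(ctime):
    # True when the file's ctime is more than 14 days before now
    return time.time() - ctime > FOURTEEN_DAYS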
You are also assuming all files in this directory are .csv. I don't know the nature of the directory, but it could be possible that some jerk comes and drops a .txt or .xls in there. os.listdir will also return any subfolders that you might have. You might not have them, but if you did you surely wouldn't want to remove them. Using glob might be the safer bet. It will also return the full path, so you don't need the intermediate files variable.
file_paths = glob.glob(os.path.join(raw_dir, "*.csv"))
This is optional though. It might be more than fair to assume what is or isn't in your folder.
Results
import glob
def clear(raw_dir):
    """ remove any raw csv files that are older than 14 business days. Data itself
    is not lost because it has been archived in the full csv"""
    file_paths = glob.glob(os.path.join(raw_dir, "*.csv"))
    # Optional early exit assuming you get no less than 14 files a day.
    # if len(file_paths) <= 14:
    #     return
    file_ints = [os.path.getctime(f) for f in file_paths]
    for path, ctime in zip(file_paths, file_ints):
        if is_too_old(ctime):  # implement externally. Use the time module to get today.
            os.remove(path)
            print path + ' was deleted in clean up'
Merge
You declare this variable and then never use it:

archive_file = archive_dir + date + '.csv'

The same value is used at the end,

shutil.copyfile(today_dir + 'TODAY.CSV', archive_dir + date + '.csv')

but I say drop the declaration and leave the end as is, except please use os.path.join.
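That is, something like:

shutil.copyfile(os.path.join(today_dir, 'TODAY.CSV'),
                os.path.join(archive_dir, date + '.csv'))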
Order
The main issue with this procedure is the order in which you perform events. Currently, it is
- open output file
- read input files
- write output file.
It should be,
- read input files
- open output file
- write output file.
Reading before you create the output file will simplify the read to
rows = []
files = os.listdir(raw_dir)
for fi in files:
    if fnmatch.fnmatch(fi, '*.csv'):
        rows.append(open(os.path.join(raw_dir, fi)).readlines())
and using glob will make it simpler still:

rows = \
    [open(fi).readlines() for fi in glob.glob(os.path.join(raw_dir, "*.csv"))]
Clarity
I have no idea why you are transposing your rows.
iter = izip_longest(*rows)
Each row is the list of lines of a file, so

rows = [[line1_file1, line2_file1, ...], [line1_file2, line2_file2, ...], ...]

Then you transpose that, so you get

iter = [[line1_file1, line1_file2, ...], [line2_file1, line2_file2, ...]]

Each line of your output is then a join of the non-null lines of your files. Are you sure that's what you want? I could help you more, but I don't understand the process you are going through. Your output file looks like it could be
line1_file1 line1_file2 line1_file3
line2_file1 line2_file2 line2_file3
line3_file1 line3_file3
line4_file3
The point is that it's confusing, and there should be some explanation.
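For what it's worth, here is a minimal, runnable demonstration of that transposition (Python 2, to match your izip_longest import; the file contents are made up):

from itertools import izip_longest  # zip_longest in Python 3

# pretend these are the lines of three input files of different lengths
rows = [['a1\n', 'a2\n', 'a3\n'],
        ['b1\n', 'b2\n'],
        ['c1\n', 'c2\n', 'c3\n', 'c4\n']]

# izip_longest pads the shorter files with None
for row in izip_longest(*rows):
    print ' '.join(field.strip() for field in row if field is not None)

# prints:
# a1 b1 c1
# a2 b2 c2
# a3 c3
# c4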
Results
def merge(raw_dir, today_dir, archive_dir, date):
    """Read through all .CSVs in RAW directory and merge rows into Today's "full" CSV
    """
    rows = \
        [open(fi).readlines() for fi in glob.glob(os.path.join(raw_dir, "*.csv"))]
    merged_file = os.path.join(today_dir, 'TODAY.CSV')
    archive_file = os.path.join(archive_dir, date + '.csv')
    with open(merged_file, 'w') as f_obj:
        # whatever is going on here...
        for row in izip_longest(*rows):
            f_obj.write(' '.join([field.strip() for field in row if field is not None]) + '\n')
    shutil.copyfile(merged_file, archive_file)
- The reason for doing it using the smallest number is that it is only one file a day, and the process producing the .CSVs is run on a highly monitored and maintained system; it will never be allowed to not run. In regards to making sure only CSVs are in there, I have added a valid() function, which validates the contents and format of all directories and path names before the main script is actually run. – Joe Smart, Sep 23, 2014 at 8:25
I would use os.path.join everywhere; doing e.g. today_dir + 'TODAY.CSV' is error prone, as it will create an invalid path if e.g. today_dir has no '\' at the end. I would also ensure that the input values (today_dir, archive_dir, etc.) contain valid values.
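For example (paths made up for illustration, on Windows):

os.path.join('C:\\csv\\today', 'TODAY.CSV')  # 'C:\\csv\\today\\TODAY.CSV'
'C:\\csv\\today' + 'TODAY.CSV'               # 'C:\\csv\\todayTODAY.CSV' -- broken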
- Good ideas there! For checking the input values, I assume you mean check if they're all file paths or directories? – Joe Smart, Sep 22, 2014 at 10:38
- Yes, and make sure they exist. – Werner, Sep 23, 2014 at 12:33