I'm a freelancer and I recently wrote a python script that processes input xlsx files that contain data in a particular format and output a csv file in another format as required.
The code works well and fulfills the requirements, but I am looking for a code review with suggestions on performance, improved readability, better organization, better practices etc. It uses openpyxl library, and is quite long, as there are a lot of operations so I'm sharing you paste links below:
- utilities.py
- main.py (the main file)
Here are some bits of code from the file main.py (required packages are imported above):
println("Reading arguments for input directory..")
args = sys.argv
try:
input_dir_name = args[1]
except IndexError:
# invalid syntax used
println("ERROR! Please use correct syntax.")
println("Syntax is python main.py <input folder name> .")
println("Example: python main.py 2014年05月20日")
sys.exit(0)
# Select input and output directory
input_dir = 'input/' + input_dir_name
# Iterate over all files in the input directory
println("Reading input from directory ./" + input_dir + "..")
try:
# Get list of all files
allfiles = [ f for f in listdir(input_dir) if isfile(join(input_dir, f)) ]
except:
# Invalid directory
println("ERROR! The specified input directory was not found.")
println("Please make sure you are running the script in the correct directory.")
sys.exit(0)
# create output folder if output folder doesn't exist
output_dir = 'output/' + input_dir_name
if not os.path.exists(output_dir):
os.makedirs(output_dir)
for f in allfiles:
# initialize variables
matrix = []
rows = cols = i = j = 0
single_rows = break_row = break_col = 0
single_rows_complete = break_word_found = 0
println("New input file found: " + f)
# Get state and file name without extension
f_state = utilities.getFileState(f)
f_name = utilities.removeFileExt(f)
break_word, bw_is_string = estado(f_state)
if (break_word == ""):
println("Omitting input file: " + f + " as no break word was found for this state. State code was: " + f_state)
continue
input_file = input_dir + '/' + f
output_file = output_dir + '/' + f_name + '.csv'
# Open csv writer for output
writer = csv.writer(open(output_file, 'wb'))
println("Output file path: " + repr(output_file))
# Load first worksheet of input file for processing
println("Opening Input Excel Workbook..")
wb = load_workbook(filename = input_file, use_iterators = True)
ws = wb.active
# Count number of rows and columns
println("Counting number of rows and columns..")
for row in ws.iter_rows():
rows = rows + 1
if (rows == 1):
for cell in row:
cols = cols + 1
println("Complete. Total Rows = " + repr(rows) + " and Total Cols = " + repr(cols) + ".")
# define matrix 2d array for storing excel data
matrix = [[0 for x in xrange(cols)] for x in xrange(rows)]
# code continues...
and another..
# Generate output csv heading row
println("Generating Column Names for Output..")
if (break_col == 0): # Content starts from first column
csv_headers = ['Municipios']
else: # Content starts from second column, with first column being S.No.
csv_headers = ['S.No.', 'Municipios']
j = break_col + 1
i = break_row - 1
# loop over columns starting from break col's next col
while (j < cols):
# call the get col label method
col_label = utilities.getRecursiveColLabel(matrix, i, j, single_rows, break_row)
# format it so that there is no _ at the end
col_label = col_label[1:]
if (col_label[-1:] == '_'):
col_label = col_label[:-1]
# println("Column name for col=" + repr(j) + " is: " + col_label)
csv_headers.append(col_label)
j = j + 1 # col increment
# write into output file
writer.writerows([
(csv_headers)
])
println("Column names written to output successfully.")
# start from break row to write actual statistics data
i = break_row
println("Writing Statistics data to output..")
# loop over all rows
while (i < rows):
# this data var will be stored into the csv finally
stats_data = []
# If break word is in first column, that means the names are in the format <int> <str>
# println("In first col: " + repr(matrix[i][0]))
if (break_col == 0):
if (i == break_row):
# if this is the row containing breakword then format is only <str>
# stats_data.append(0)
stats_data.append(matrix[i][0].encode('utf-8'))
else:
# here format is <int> <str>
try:
parts = matrix[i][break_col].split()
try:
if utilities.RepresentsInt(parts[0]):
# stats_data.append(int(parts[0]))
stats_data.append(''.join(parts[1:]).encode('utf-8').strip())
else:
# if format is something else just append the value as it is
stats_data.append(''.join(matrix[i][break_col]).encode('utf-8'))
except:
stats_data.append(''.join(parts).encode('utf-8').strip())
except:
stats_data.append(''.join(str(matrix[i][break_col])))
else:
# if breakword is not in the first column, that means input is properly numbered with 123 in col A
# append directly into output
if (isinstance(matrix[i][0], basestring)):
stats_data.append(''.join(matrix[i][0]).encode('utf-8'))
else:
stats_data.append(matrix[i][0])
if (isinstance(matrix[i][1], basestring)):
stats_data.append(''.join(matrix[i][1]).encode('utf-8'))
else:
stats_data.append(matrix[i][1])
j = break_col + 1
# loop over all columns starting from next the break word col
# this does the actual statistics numbers in the input
while (j < cols):
# format output
# println("Here i= " + repr(i) + " and j=" + repr(j))
c = matrix[i][j]
if (isinstance(c, basestring)):
if (c.strip().lower() == "(x)"): # replace (x) values with 0
stats_data.append(0)
elif (c.strip() == u'\xad'): # special replacement for unicode problems
stats_data.append("-")
else: # if it is a string (-,chars) then it needs to be properly encoded
stats_data.append(c.encode('utf-8'))
elif utilities.RepresentsInt(c): # if the number is an integer
stats_data.append(int(c))
else: # if its not a string just append it as it is
stats_data.append(c)
j = j + 1 # col increment
# write current row to output csv
writer.writerows([
(stats_data)
])
i = i + 1 # row increment
println("Writing output complete successfully.")
println("Closing Input and Output files..")
1 Answer 1
NOTE: I wrote this answer before realizing that you used Python 2 and not Python 3. I don't think that anything in your code prevents you from using Python 3 instead; you only have some small changes to make and you could benefit from the latest cool features.
Use pathlib
If you can use Python 3.4, then you may want to use the new pathlib
module instead of the old os.path
. pathlib
provides a Path
class that achieves the same things as os.path
but in an object-oriented manner. Here is the beginning of your script rewritten with pathlib
(I skipped the irrelevant parts for clarity):
from pathlib import Path
try:
input_dir_name = Path(args[1])
except IndexError:
# ...
# Select input and output directory
input_dir = 'input' / input_dir_name
# Iterate over all files in the input directory
println("Reading input from directory" + str('.' / input_dir / '..'))
try:
# Get list of all files
allfiles = [ f for f in input_dir.iterdir() if f.is_file() ]
except:
# ...
# create output folder if output folder doesn't exist
output_dir = 'output' / input_dir_name
if not output_dir.exists():
output_dir.mkdir()
Working on paths is safer than working on strings since you don't have to handle the separators at all, nor do you have to struggle with drives on Windows. They are added when needed. Moreover, if you need a string to deal with legacy file-handling functions, you can still call str
on your path to get a nice string representation.
Use the with
statement
When doing file IO, you should use the with
statement. That way, you will have a nice identation to represent when a file is opened and when is closed. Moreover, you won't have to close it by hand, it will automatically be done at the end of the indented block:
# Open csv writer for output
with ouput_file.open('wb') as f:
writer = csv.writer(f)
println("Output file path: " + str(output_file))
# ...
# indentation ends, f is closed
Note that I considered that output_file
was a pathlib.Path
in this example too.
-
\$\begingroup\$ Thank you! Yes, I didn't use Python 3 as I am yet to get into its documentation. I will do that soon :) \$\endgroup\$Ananth– Ananth2014年08月13日 12:42:19 +00:00Commented Aug 13, 2014 at 12:42
-
\$\begingroup\$ And I haven't used with keyword ever till now. Thank you very much for that, I'll certainly keep that in mind from the next time! \$\endgroup\$Ananth– Ananth2014年08月13日 12:43:17 +00:00Commented Aug 13, 2014 at 12:43