Processing Excel files of a predefined template to generate output CSV files of another predefined template

Question 1

I'm a freelancer and I recently wrote a python script that processes input xlsx files that contain data in a particular format and output a csv file in another format as required.

The code works well and fulfills the requirements, but I am looking for a code review with suggestions on performance, improved readability, better organization, better practices etc. It uses openpyxl library, and is quite long, as there are a lot of operations so I'm sharing you paste links below:

utilities.py
main.py (the main file)

Here are some bits of code from the file main.py (required packages are imported above):

println("Reading arguments for input directory..")
args = sys.argv
try:
 input_dir_name = args[1]
except IndexError:
 # invalid syntax used
 println("ERROR! Please use correct syntax.")
 println("Syntax is python main.py <input folder name> .")
 println("Example: python main.py 2014年05月20日")
 sys.exit(0)
# Select input and output directory
input_dir = 'input/' + input_dir_name
# Iterate over all files in the input directory
println("Reading input from directory ./" + input_dir + "..")
try:
 # Get list of all files
 allfiles = [ f for f in listdir(input_dir) if isfile(join(input_dir, f)) ]
except:
 # Invalid directory
 println("ERROR! The specified input directory was not found.")
 println("Please make sure you are running the script in the correct directory.")
 sys.exit(0)
# create output folder if output folder doesn't exist
output_dir = 'output/' + input_dir_name
if not os.path.exists(output_dir):
 os.makedirs(output_dir)
for f in allfiles:
 # initialize variables
 matrix = []
 rows = cols = i = j = 0
 single_rows = break_row = break_col = 0
 single_rows_complete = break_word_found = 0
 println("New input file found: " + f)
 # Get state and file name without extension
 f_state = utilities.getFileState(f)
 f_name = utilities.removeFileExt(f)
 break_word, bw_is_string = estado(f_state)
 if (break_word == ""):
 println("Omitting input file: " + f + " as no break word was found for this state. State code was: " + f_state)
 continue
 input_file = input_dir + '/' + f
 output_file = output_dir + '/' + f_name + '.csv'
 # Open csv writer for output
 writer = csv.writer(open(output_file, 'wb'))
 println("Output file path: " + repr(output_file))
 # Load first worksheet of input file for processing
 println("Opening Input Excel Workbook..")
 wb = load_workbook(filename = input_file, use_iterators = True)
 ws = wb.active
 # Count number of rows and columns
 println("Counting number of rows and columns..")
 for row in ws.iter_rows():
 rows = rows + 1
 if (rows == 1):
 for cell in row:
 cols = cols + 1
 println("Complete. Total Rows = " + repr(rows) + " and Total Cols = " + repr(cols) + ".")
 # define matrix 2d array for storing excel data
 matrix = [[0 for x in xrange(cols)] for x in xrange(rows)]
 # code continues...

and another..

# Generate output csv heading row
println("Generating Column Names for Output..")
if (break_col == 0): # Content starts from first column
 csv_headers = ['Municipios']
else: # Content starts from second column, with first column being S.No.
 csv_headers = ['S.No.', 'Municipios']
j = break_col + 1
i = break_row - 1
# loop over columns starting from break col's next col
while (j < cols):
 # call the get col label method
 col_label = utilities.getRecursiveColLabel(matrix, i, j, single_rows, break_row)
 # format it so that there is no _ at the end
 col_label = col_label[1:]
 if (col_label[-1:] == '_'):
 col_label = col_label[:-1]
 # println("Column name for col=" + repr(j) + " is: " + col_label)
 csv_headers.append(col_label)
 j = j + 1 # col increment
# write into output file
writer.writerows([
 (csv_headers)
])
println("Column names written to output successfully.")
# start from break row to write actual statistics data
i = break_row
println("Writing Statistics data to output..")
# loop over all rows
while (i < rows):
 # this data var will be stored into the csv finally
 stats_data = []
 # If break word is in first column, that means the names are in the format <int> <str>
 # println("In first col: " + repr(matrix[i][0]))
 if (break_col == 0):
 if (i == break_row):
 # if this is the row containing breakword then format is only <str>
 # stats_data.append(0)
 stats_data.append(matrix[i][0].encode('utf-8'))
 else:
 # here format is <int> <str>
 try:
 parts = matrix[i][break_col].split()
 try:
 if utilities.RepresentsInt(parts[0]):
 # stats_data.append(int(parts[0]))
 stats_data.append(''.join(parts[1:]).encode('utf-8').strip())
 else:
 # if format is something else just append the value as it is
 stats_data.append(''.join(matrix[i][break_col]).encode('utf-8'))
 except:
 stats_data.append(''.join(parts).encode('utf-8').strip())
 except:
 stats_data.append(''.join(str(matrix[i][break_col])))
 else:
 # if breakword is not in the first column, that means input is properly numbered with 123 in col A
 # append directly into output
 if (isinstance(matrix[i][0], basestring)):
 stats_data.append(''.join(matrix[i][0]).encode('utf-8'))
 else:
 stats_data.append(matrix[i][0])
 if (isinstance(matrix[i][1], basestring)):
 stats_data.append(''.join(matrix[i][1]).encode('utf-8'))
 else:
 stats_data.append(matrix[i][1])
 j = break_col + 1
 # loop over all columns starting from next the break word col
 # this does the actual statistics numbers in the input
 while (j < cols):
 # format output
 # println("Here i= " + repr(i) + " and j=" + repr(j))
 c = matrix[i][j]
 if (isinstance(c, basestring)):
 if (c.strip().lower() == "(x)"): # replace (x) values with 0
 stats_data.append(0)
 elif (c.strip() == u'\xad'): # special replacement for unicode problems
 stats_data.append("-")
 else: # if it is a string (-,chars) then it needs to be properly encoded
 stats_data.append(c.encode('utf-8'))
 elif utilities.RepresentsInt(c): # if the number is an integer
 stats_data.append(int(c))
 else: # if its not a string just append it as it is
 stats_data.append(c)
 j = j + 1 # col increment
 # write current row to output csv
 writer.writerows([
 (stats_data)
 ])
 i = i + 1 # row increment
println("Writing output complete successfully.")
println("Closing Input and Output files..")

Question 2

NOTE: I wrote this answer before realizing that you used Python 2 and not Python 3. I don't think that anything in your code prevents you from using Python 3 instead; you only have some small changes to make and you could benefit from the latest cool features.

Use `pathlib`

If you can use Python 3.4, then you may want to use the new pathlib module instead of the old os.path. pathlib provides a Path class that achieves the same things as os.path but in an object-oriented manner. Here is the beginning of your script rewritten with pathlib (I skipped the irrelevant parts for clarity):

from pathlib import Path
try:
 input_dir_name = Path(args[1])
except IndexError:
 # ...
# Select input and output directory
input_dir = 'input' / input_dir_name
# Iterate over all files in the input directory
println("Reading input from directory" + str('.' / input_dir / '..'))
try:
 # Get list of all files
 allfiles = [ f for f in input_dir.iterdir() if f.is_file() ]
except:
 # ...
# create output folder if output folder doesn't exist
output_dir = 'output' / input_dir_name
if not output_dir.exists():
 output_dir.mkdir()

Working on paths is safer than working on strings since you don't have to handle the separators at all, nor do you have to struggle with drives on Windows. They are added when needed. Moreover, if you need a string to deal with legacy file-handling functions, you can still call str on your path to get a nice string representation.

Use the `with` statement

When doing file IO, you should use the with statement. That way, you will have a nice identation to represent when a file is opened and when is closed. Moreover, you won't have to close it by hand, it will automatically be done at the end of the indented block:

# Open csv writer for output
with ouput_file.open('wb') as f:
 writer = csv.writer(f)
 println("Output file path: " + str(output_file))
 # ...
# indentation ends, f is closed

Note that I considered that output_file was a pathlib.Path in this example too.

Question 3

Thank you! Yes, I didn't use Python 3 as I am yet to get into its documentation. I will do that soon :)

Question 4

And I haven't used with keyword ever till now. Thank you very much for that, I'll certainly keep that in mind from the next time!

Morwenn MorwennMorwenn 20.2k3 gold badges69 silver badges132 bronze badges · Accepted Answer · 2014-08-13 10:49:38Z

NOTE: I wrote this answer before realizing that you used Python 2 and not Python 3. I don't think that anything in your code prevents you from using Python 3 instead; you only have some small changes to make and you could benefit from the latest cool features.

Use `pathlib`

If you can use Python 3.4, then you may want to use the new pathlib module instead of the old os.path. pathlib provides a Path class that achieves the same things as os.path but in an object-oriented manner. Here is the beginning of your script rewritten with pathlib (I skipped the irrelevant parts for clarity):

from pathlib import Path
try:
 input_dir_name = Path(args[1])
except IndexError:
 # ...
# Select input and output directory
input_dir = 'input' / input_dir_name
# Iterate over all files in the input directory
println("Reading input from directory" + str('.' / input_dir / '..'))
try:
 # Get list of all files
 allfiles = [ f for f in input_dir.iterdir() if f.is_file() ]
except:
 # ...
# create output folder if output folder doesn't exist
output_dir = 'output' / input_dir_name
if not output_dir.exists():
 output_dir.mkdir()

Working on paths is safer than working on strings since you don't have to handle the separators at all, nor do you have to struggle with drives on Windows. They are added when needed. Moreover, if you need a string to deal with legacy file-handling functions, you can still call str on your path to get a nice string representation.

Use the `with` statement

When doing file IO, you should use the with statement. That way, you will have a nice identation to represent when a file is opened and when is closed. Moreover, you won't have to close it by hand, it will automatically be done at the end of the indented block:

# Open csv writer for output
with ouput_file.open('wb') as f:
 writer = csv.writer(f)
 println("Output file path: " + str(output_file))
 # ...
# indentation ends, f is closed

Note that I considered that output_file was a pathlib.Path in this example too.

Thank you! Yes, I didn't use Python 3 as I am yet to get into its documentation. I will do that soon :)
And I haven't used with keyword ever till now. Thank you very much for that, I'll certainly keep that in mind from the next time!

Stack Exchange Network

Processing Excel files of a predefined template to generate output CSV files of another predefined template

1 Answer 1

Use `pathlib`

Use the `with` statement

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Processing Excel files of a predefined template to generate output CSV files of another predefined template

1 Answer 1

Use pathlib

Use the with statement

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use `pathlib`

Use the `with` statement