I wrote a quick script and wanted some critique. I don't get to practice my coding often, even in one language, so I may make some idiomatic errors.
The program works and processes 145,555 entries (873,330 lines in the original data) in approximately five seconds. It may need to handle up to a few million entries later.
Any critique is welcome. The program is explained below.
NB: The code is HEAVILY commented because non-technical users may need to look into the script.
"""
Author: KyleHB
lang: python 3.2
Project Details
The goal of this script to read in multiple files, sort the entries,
& then output them to an output file in a specified format.
The input has 5 lines per data entry, the output has 1. As below.
The data below is truncated & I have added line numbers so the equation for
the character extraction is easily understood. I have used <--- & ---> to
denote notes that I have added for readibility.
Date between **, likes this *1234*, is data that is needed in the output file:
0> 0000001
1> *1234 5678 9101 1121* <--Unique 19 digit card PAN
2> <--Blank line-->
3> 01/18 02/90
4> 420 *1234* <-- 4 or 3 digit branchCode
5> <-- redacted track data -->
The output format is for each data entry is as follows:
serialNumbeer,PAN,branchCode.
The length of each element is: 7,19,(4 or 3).
The serialNumber is ordinal and generated for the output file for readibility.
The entries are sorted first by branchCode and then by PAN.
"""
import glob
from time import strftime as t_stamp
import os

readIn = 0
wroteOut = 0
dupeList = []
# Creates the timestamp for the output files.
stamp = t_stamp('%Y%m%d')


def combine_files():
    """This function reads in all files, including subdirectories,
    in the input folder & combines them into a temporary file."""
    # Reads all the files & folders in the input directory.
    readFiles = glob.glob('input/*/*.*')
    # Creates temporary file to host all the combined entries.
    with open('.Combined', 'wb') as outfile:
        for f in readFiles:
            with open(f, 'rb') as infile:
                # Combines all the files using the for loops.
                outfile.write(infile.read())
    outfile.close()


def extract_data():
    """This function extracts the data from the combined file.
    It reads the entries into a dictionary to prevent duplicate records.
    It then converts it to a list with the branch code first.
    Finally it passes the list to the sorting function."""
    # Counts all the lines in the combined file.
    lineCount = sum(1 for line in open('.Combined'))
    f = open('.Combined')
    # Reads in all the lines.
    lines = f.readlines()
    # Temporary dictionary to prevent duplication.
    tempDict = {}
    global dupeList
    for x in range(int(lineCount/6)):
        # Points to each PAN and copies it.
        PAN = lines[x*6+1][0:19]
        # Points to each branch code. Strips the trailing space & pads it.
        branchCode = '%04d' % int(lines[x*6+4][5:9].strip())
        # Checks if the PAN already exists in the tempDict.
        if PAN in tempDict:
            # If it does, it appends the PAN to the dupeList.
            dupeList.append(PAN+','+branchCode)
        # Adds the PAN to the dictionary whether or not it exists already.
        tempDict[PAN] = branchCode
    global readIn
    # Sets the global variable to the number of entries processed.
    readIn = int(lineCount/6)
    f.close()
    # Deletes the temporary .Combined file.
    os.remove('.Combined')
    sort_data(tempDict)


def sort_data(genDict):
    """This function is used to sort the list of entries passed to it."""
    newList = []
    # Converts the dictionary into a list, which is sortable.
    for key, value in genDict.items():
        # branchCode first so it sorts correctly.
        newList.append([value, key])
    newList.sort()
    write_output(newList)


def write_output(finalList):
    """This function writes the final values to the output file."""
    outFile = open('REDACTED_%s.txt' % stamp, 'w')
    global wroteOut
    wroteOut = len(finalList)
    x = 1
    for item in finalList:
        # Constructs the lines to be written to the output file.
        finalOut = str('%07d' % x + ',' + item[1] + ',' + item[0]) + '\n'
        if x == wroteOut:
            finalOut = finalOut[0:-1]
        outFile.write(finalOut)
        x += 1


def write_duplicate():
    """This function checks if the dupeList is empty;
    if it is not, then it writes the list to file."""
    if not dupeList:
        # If the list is empty, then exit the function.
        return
    dupeFile = open('dupefile_%s.txt' % stamp, 'w')
    x = 0
    for item in dupeList:
        tempStr = item + "\n"
        if x == len(dupeList)-1:
            tempStr = tempStr[0:-1]
        dupeFile.write(tempStr)
        x += 1


def print_result():
    """Prints the results of the script."""
    duplicates = str(readIn-wroteOut)
    result = str(readIn) + ' Entries read' + '\n' \
        + str(wroteOut) + ' Entries written' + '\n' \
        + duplicates + ' Duplicates'
    print(result)


combine_files()
extract_data()
write_duplicate()
print_result()
1 Answer
You shouldn't be documenting every line of the script. The code itself should tell someone WHAT it is doing, while comments within the code are better at describing WHY the code is doing something. Pairing this style of commenting with descriptive function names and good docstrings for those functions should make the code as a whole more readable.
I can think of two exceptions where it's OK for a comment to say what the code is doing.
- The block of code is doing something tricky that will be hard to immediately discern. This is usually paired with a comment explaining why the code should be written this way instead of a more straightforward way.
- The code is used and edited by people who don't see themselves as programmers. They might just be writing a short script to pull some collected data from a database and do some statistical analysis on it. In general, these types of people don't have the strongest programming background and might not be well versed in the language and standard libraries.
As it is, some of the documentation of specific lines is not correct. Here is one example:

    readFiles = glob.glob('input/*/*.*')  # reads all the files & folders in the input directory

glob() does not read in the files. It produces the file paths that match the query string.
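A quick demonstration of the difference (the directory tree built here is illustrative, not from the original script):

```python
import glob
import os
import tempfile

# Build a small sample tree so the glob pattern has something to match.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'input', 'branch_a'))
with open(os.path.join(root, 'input', 'branch_a', 'cards.txt'), 'w') as f:
    f.write('sample data')

# glob() only returns a list of matching path strings; nothing is opened yet.
paths = glob.glob(os.path.join(root, 'input', '*', '*.*'))
print(paths)  # ['.../input/branch_a/cards.txt']

# Reading the files is a separate step:
contents = [open(p).read() for p in paths]
print(contents)  # ['sample data']
```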
It seems like you don't fully understand what the context manager does when you open a file.
    with open('.Combined', 'wb') as outfile:
        for f in readFiles:
            with open(f, 'rb') as infile:
                outfile.write(infile.read())
    outfile.close()
When the code exits the with block, the context manager's __exit__() will be called. In the case of file objects, this will automatically close the file for you. This means you don't need the last line. In other places in the script, you don't use a context manager at all for an open file where you could.
An added benefit of a with block is that __exit__() will still be called if an exception is thrown. This is not the case when you explicitly call close() later in the function.
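You can see the exception-safety behaviour directly (the file path here is made up for the demonstration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# The context manager closes the file even when an exception escapes the block.
try:
    with open(path, 'w') as f:
        f.write('partial output')
        raise RuntimeError('simulated failure')
except RuntimeError:
    pass

print(f.closed)  # True: __exit__() ran despite the exception
```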
It is generally better to pass arguments into a function than to use the global keyword. It makes it more explicit where values are being changed, and it can make the code more testable, as there is less global state to be concerned with.
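As a sketch, extract_data could take its input and return its results instead of mutating module-level state. The function below reuses the script's six-line record layout; the names are mine, not from the original:

```python
def extract_entries(lines, lines_per_entry=6):
    """Return (unique, duplicates) rather than touching globals."""
    unique = {}
    duplicates = []
    for i in range(len(lines) // lines_per_entry):
        # Same slicing as the original script: PAN on the second line,
        # branch code on the fifth line of each record.
        pan = lines[i * lines_per_entry + 1][0:19]
        branch = '%04d' % int(lines[i * lines_per_entry + 4][5:9].strip())
        if pan in unique:
            duplicates.append(pan + ',' + branch)
        unique[pan] = branch
    return unique, duplicates


# The caller now sees exactly what goes in and what comes out:
record = ['0000001\n', '1234 5678 9101 1121\n', '\n',
          '01/18 02/90\n', '420  1234\n', 'redacted\n']
unique, duplicates = extract_entries(record + record)  # same entry twice
print(len(unique), len(duplicates))  # 1 1
```

Now readIn is simply len(lines) // 6 at the call site, and the function can be unit-tested with a handful of hand-written lines, as above.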
Replace most of the hard-coded numbers and strings with named constants. Using a constant makes it clearer what a value represents and when the same value means the same thing. Every instance of 6 might mean the same thing, but without a named constant telling you that, you have to check each case when 6 needs to change to 7.
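For this script that might look like the following. The names are illustrative; the values come from the six-line record layout described in the question's docstring:

```python
# Record layout constants (names are suggestions, not from the original).
LINES_PER_ENTRY = 6    # each card record spans six input lines
PAN_LINE = 1           # the PAN is the second line of a record
PAN_LENGTH = 19
BRANCH_LINE = 4        # the branch code is on the fifth line


def record_count(line_count):
    # Intent is now self-documenting compared to int(lineCount/6).
    return line_count // LINES_PER_ENTRY


print(record_count(873330))  # 145555, matching the question's numbers
```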
Comment: Great, thank you. The code is heavily commented because non-technical people may need to look at it at some point in the future. I have removed the close(), and I understand what glob does now. Initially I just needed it to work, as it was on a deadline; I am not a developer by trade. – Dinocv, Jul 31, 2015
Comment: You shouldn't use # for docstrings. That's what """ is for.
Comment: I must be the n+1th person to say that thisStyle should be replaced by this_style!