I wrote a quick script and wanted some critique. I don't get to practice my coding often, even in one language, so I may make some idiomatic errors.
The program works and processes 145,555 entries (873,330 lines in the original data) in approximately five seconds. It may need to handle up to a few million entries later.
Any critique is welcome. The program is explained below.
NB: The code is HEAVILY commented because non-technical users may need to look into the script.
"""
Author: KyleHB
lang: python 3.2
Project Details
The goal of this script to read in multiple files, sort the entries,
& then output them to an output file in a specified format.
The input has 5 lines per data entry, the output has 1. As below.
The data below is truncated & I have added line numbers so the equation for
the character extraction is easily understood. I have used <--- & ---> to
denote notes that I have added for readibility.
Date between **, likes this *1234*, is data that is needed in the output file:
0> 0000001
1> *1234 5678 9101 1121* <--Unique 19 digit card PAN
2> <--Blank line-->
3> 01/18 02/90
4> 420 *1234* <-- 4 or 3 digit branchCode
5> <-- redacted track data -->
The output format is for each data entry is as follows:
serialNumbeer,PAN,branchCode.
The length of each element is: 7,19,(4 or 3).
The serialNumber is ordinal and generated for the output file for readibility.
The entries are sorted first by branchCode and then by PAN.
"""
import glob
from time import strftime as t_stamp
import os

readIn = 0
wroteOut = 0
dupeList = []
# Creates the timestamp for the output files.
stamp = t_stamp('%Y%m%d')


def combine_files():
    """This function reads in all files, including subdirectories,
    in the input folder & combines them into a temporary file."""
    # Reads all the files & folders in the input directory.
    readFiles = glob.glob('input/*/*.*')
    # Creates temporary file to host all the combined entries.
    with open('.Combined', 'wb') as outfile:
        for f in readFiles:
            with open(f, 'rb') as infile:
                # Combines all the files using the for loops.
                outfile.write(infile.read())
    outfile.close()


def extract_data():
    """This function extracts the data from the combined file.
    It reads the entries into a dictionary to prevent duplicate records.
    It then converts it to a list with the branch code first.
    Finally it passes the list to the sorting function."""
    # Counts all the lines in the combined file.
    lineCount = sum(1 for line in open('.Combined'))
    f = open('.Combined')
    # Reads in all the lines.
    lines = f.readlines()
    # Temporary dictionary to prevent duplication.
    tempDict = {}
    global dupeList
    for x in range(int(lineCount/6)):
        # Points to each PAN and copies it.
        PAN = lines[x*6+1][0:19]
        # Points to each branch code. Strips the trailing space & pads it.
        branchCode = '%04d' % int(lines[x*6+4][5:9].strip())
        # Checks if the PAN already exists in the tempDict.
        if PAN in tempDict:
            # If it does, it appends the PAN to the dupeList.
            dupeList.append(PAN+','+branchCode)
        # Adds the PAN to the dictionary whether or not it exists already.
        tempDict[PAN] = branchCode
    global readIn
    # Sets the global variable to the number of entries processed.
    readIn = int(lineCount/6)
    f.close()
    # Deletes the temporary .Combined file.
    os.remove('.Combined')
    sort_data(tempDict)


def sort_data(genDict):
    """This function is used to sort the list of entries passed to it."""
    newList = []
    # Converts the dictionary into a list, which is sortable.
    for key, value in genDict.items():
        # branchCode first so it sorts correctly.
        newList.append([value, key])
    newList.sort()
    write_output(newList)


def write_output(finalList):
    """This function writes the final values to the output file."""
    outFile = open('REDACTED_%s.txt' % stamp, 'w')
    global wroteOut
    wroteOut = len(finalList)
    x = 1
    for item in finalList:
        # Constructs the lines to be written to the output file.
        finalOut = str('%07d' % x + ',' + item[1] + ',' + item[0]) + '\n'
        if x == wroteOut:
            finalOut = finalOut[0:-1]
        outFile.write(finalOut)
        x += 1


def write_duplicate():
    """This function checks if the dupeList is empty;
    if it is not, then it writes the list to file."""
    if not dupeList:
        # If the list is empty, then exit the function.
        return
    dupeFile = open('dupefile_%s.txt' % stamp, 'w')
    x = 0
    for item in dupeList:
        tempStr = item + "\n"
        if x == len(dupeList)-1:
            tempStr = tempStr[0:-1]
        dupeFile.write(tempStr)
        x += 1


def print_result():
    """Prints the results of the script."""
    duplicates = str(readIn-wroteOut)
    result = str(readIn) + ' Entries read' + '\n' \
        + str(wroteOut) + ' Entries written' + '\n' \
        + duplicates + ' Duplicates'
    print(result)


combine_files()
extract_data()
write_duplicate()
print_result()
1 Answer
You shouldn't be documenting every line of the script. The code itself should tell someone WHAT it is doing, while comments within the code are better at describing WHY the code is doing something. Pairing this style of commenting with descriptive function names and good docstrings for those functions should make the code as a whole more readable.
I can think of two exceptions where it's OK for a comment to say what the code is doing.
- The block of code is doing something tricky that will be hard to immediately discern. This is usually paired with a comment explaining why the code should be written this way instead of a more straightforward way.
- The code is used and edited by people who don't see themselves as programmers. They might just be writing a short script to pull some collected data from a database and do some statistical analysis on it. In general, these types of people don't have the strongest programming background and might not be well versed in the language and standard libraries.
As it is, some of the documentation of specific lines is not correct. Here is one example:

    readFiles = glob.glob('input/*/*.*')  # reads all the files & folders in the input directory

glob() does not read in the files. It produces the file paths that match the query string.
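A quick demonstration of the difference (the directory tree built here is illustrative, not from the original script):

```python
import glob
import os
import tempfile

# Build a small sample tree so the glob pattern has something to match.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'input', 'branch_a'))
with open(os.path.join(root, 'input', 'branch_a', 'cards.txt'), 'w') as f:
    f.write('sample data')

# glob() only returns a list of matching path strings; nothing is opened yet.
paths = glob.glob(os.path.join(root, 'input', '*', '*.*'))
print(paths)  # ['.../input/branch_a/cards.txt']

# Reading the files is a separate step:
contents = [open(p).read() for p in paths]
print(contents)  # ['sample data']
```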
It seems like you don't fully understand what the context manager does when you open a file.
    with open('.Combined', 'wb') as outfile:
        for f in readFiles:
            with open(f, 'rb') as infile:
                outfile.write(infile.read())
    outfile.close()
When the code exits the with block, the context manager's __exit__() will be called. In the case of file objects, this will automatically close the file for you. This means you don't need the last line. In other places in the script, you don't use a context manager at all for an open file where you could.
An added benefit of a with block is that __exit__() will still be called if an exception is thrown. This is not the case when you explicitly call close() later in the function.
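You can see the exception-safety behaviour directly (the file path here is made up for the demonstration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# The context manager closes the file even when an exception escapes the block.
try:
    with open(path, 'w') as f:
        f.write('partial output')
        raise RuntimeError('simulated failure')
except RuntimeError:
    pass

print(f.closed)  # True: __exit__() ran despite the exception
```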
It is generally better to pass arguments into a function than to use the global keyword. It makes it more explicit where values are being changed, and it can make the code more testable, as there is less global state to be concerned with.
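As a sketch, extract_data could take its input and return its results instead of mutating module-level state. The function below reuses the script's six-line record layout; the names are mine, not from the original:

```python
def extract_entries(lines, lines_per_entry=6):
    """Return (unique, duplicates) rather than touching globals."""
    unique = {}
    duplicates = []
    for i in range(len(lines) // lines_per_entry):
        # Same slicing as the original script: PAN on the second line,
        # branch code on the fifth line of each record.
        pan = lines[i * lines_per_entry + 1][0:19]
        branch = '%04d' % int(lines[i * lines_per_entry + 4][5:9].strip())
        if pan in unique:
            duplicates.append(pan + ',' + branch)
        unique[pan] = branch
    return unique, duplicates


# The caller now sees exactly what goes in and what comes out:
record = ['0000001\n', '1234 5678 9101 1121\n', '\n',
          '01/18 02/90\n', '420  1234\n', 'redacted\n']
unique, duplicates = extract_entries(record + record)  # same entry twice
print(len(unique), len(duplicates))  # 1 1
```

Now readIn is simply len(lines) // 6 at the call site, and the function can be unit-tested with a handful of hand-written lines, as above.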
Replace most of the hard-coded numbers and strings with named constants. Using a constant makes it clearer what a value represents and when the same value means the same thing. Every instance of 6 might mean the same thing, but without a named constant telling you that, you have to check each case when 6 needs to change to 7.
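For this script that might look like the following. The names are illustrative; the values come from the six-line record layout described in the question's docstring:

```python
# Record layout constants (names are suggestions, not from the original).
LINES_PER_ENTRY = 6    # each card record spans six input lines
PAN_LINE = 1           # the PAN is the second line of a record
PAN_LENGTH = 19
BRANCH_LINE = 4        # the branch code is on the fifth line


def record_count(line_count):
    # Intent is now self-documenting compared to int(lineCount/6).
    return line_count // LINES_PER_ENTRY


print(record_count(873330))  # 145555, matching the question's numbers
```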
Comment: Great, thank you. The code is heavily commented because non-technical people may need to look at it at some point in the future. I have removed the close(), and I understand what glob does now. Initially I just needed it to work, as it was on a deadline; I am not a developer by trade. – Dinocv, Jul 31, 2015
Comment: You shouldn't use # for docstrings. That's what """ is for.
Comment: I must be the n+1th person to say that thisStyle should be replaced by this_style!