This is a follow up to my original post:
I've written a class that takes a file, validates the formatting of the lines from an input file and writes the set of valid lines to an output file. Each line of the file should have a first name, last name, phone number, color, and zip code. A zip code is valid if it has only 5 characters, a phone number can have only 10 digits (in addition to dashes/parentheses in appropriate places).
In the last post I received some good feedback and added in a few of the suggestions. The main changes I made were to the is_color()
and is_phone_number()
methods as I switched to using regex to validate the input strings. I needed to keep the is_name()
function the same because it needed to check for a couple of different configurations and arrangements of the first and last name. I also kept validate_line()
the same because the function needed to know the indices of each token of the line (i.e. the phone number, the color, and the zip code) and reformat the line if necessary, returning a dictionary of those values to be used by parse_file()
. Also, I mainly wanted to focus on extensibility here as opposed to conciseness.
The code seems to be working but I just wanted to make sure that there was nothing that was being done inefficiently or that is not optimal stylistically.
The updated code is below:
""" file_formatter module
The class contained in this module validates a CSV file based on a set of internally
specified accepted formats and generates a JSON file containing normalized forms of the
valid lines from the CSV file.
Example:
The class in this module can be imported and passed an initial value for the input data
file from the command line like this:
$ python package_name name_of_data_file.in
Classes:
FileFormatter: Takes an input file and output its valid lines to a result file.
"""
import json
import re
class FileFormatter:
""" Takes an input file and output its valid lines to a result file.
Validates the formatting of the lines from an input file and writes the set of valid lines
to an output file.
Attributes:
info_configs: A list containing lists of "accepted" configurations of the data from each line of the input file.
in_file_name: Name of the input file.
res_file_name: Name of the output file.
"""
info_configs = [["phone","color","zip"], ["color","zip","phone"], ["zip","phone","color"]]
def __init__(self,start_file_name,out_file_name):
"""Initialize FileFormatter class with the input and output file names."""
self.in_file_name = start_file_name
self.res_file_name = out_file_name
def validate_line(self, line):
"""Validates that each line is in the correct format.
Takes a line from a file, validate that the first two elements are properly formatted
names, then validates that the remaining elements (phone number, zip code, color)
in the line are properly formatted.
Args:
line: A line from a file
Returns:
A list of tokenized elements from the original line (string) in the correct order
according to the specified format. For example:
[Lastname, Firstname, (703)-742-0996, Blue, 10013] or
[Firstname, Lastname, Red, 11237, 703 955 0373] or
[Firstname, Lastname, 10013, 646 111 0101, Green]
If a value of None is returned, some element in the line wasn't in the correct format.
"""
line = tokenize(line)
if len(line) != 5:
return None
full_name = (line[0], line[1])
if not is_name(full_name):
return None
entry = { "color": "", "firstname": "", "lastname": "", "phonenumber": "", "zipcode": ""}
config = ["","",""]
phone_idx = 0
zip_idx = 0
color_idx = 0
for i in range(2,len(line)):
if is_phone_number(line[i]):
phone_idx = i-2
config[phone_idx] = "phone"
if is_zip_code(line[i]):
zip_idx = i-2
config[zip_idx] = "zip"
if is_color(line[i]):
color_idx = i-2
config[color_idx] = "color"
if config in self.info_configs: # if the phone number, zip code, and color have been found and are in correct order
if phone_idx == 0:
line[0], line[1] = line[1], line[0]
line = [token.strip(" ") for token in line]
line = [token.replace(",","") for token in line]
line[-1] = line[-1].replace("\n","")
entry["firstname"] = line[0]
entry["lastname"] = line[1]
entry["color"] = line[color_idx+2]
entry["phonenumber"] = line[phone_idx+2]
entry["zipcode"] = line[zip_idx+2]
return entry
return None
def parse_file(self):
"""Parses an input file, validates the formatting of its lines, and writes a JSON file with the properly formatted lines.
Iterates through the input file validating each line. Creates a dictionary that contains
a list of entries comprised of valid lines from the input file. Creates a JSON object
of normalized data sorted in ascending order by a tuple of (lastname, firstname) for each line.
"""
lines_dict = {}
json_dict = {}
errors = []
with open(self.in_file_name,'r') as info_file:
for line_number, line in enumerate(info_file):
valid_line = self.validate_line(line)
if valid_line:
lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
else:
errors.append(line_number)
json_dict["entries"] = [lines_dict[key] for key in sorted(lines_dict.keys(), reverse = True)] # sort by (lastname, firstname,) key value
json_dict["errors"] = errors
with open(self.res_file_name,'w') as out_file:
json.dump(json_dict, out_file, indent = 2)
# utility methods used by the FileFormatter class
def is_phone_number(token):
"""Determines if the passed in string represents a properly formatted 10-digit phone number.
Takes a string and validates that it matches one of two valid formats specified for a phone number.
Uses regular expression parsing to validate that the sequence of characters is an pattern match to one of
the valid formats.
Args:
token: A fragment of a line of a file
Returns:
A boolean indicating if the string is a properly formatted phone number.
"""
token = token.strip(" ")
token = token.replace("\n","")
token = token.replace(",","")
return (re.match(r'\(\d{3}\)-\d{3}-\d{4}$', token) is not None or
re.match(r'\d{3} \d{3} \d{4}$', token) is not None)
def is_zip_code(token):
"""Determines if the passed in string represents a properly formatted 5-digit zip code.
Takes a string and using regular expression parsing to validate that it matches the valid
format specified for a zip code. Validates that the string doesn't contain more than 5 numbers.
Args:
token: A fragment of a line of a file
Returns:
A boolean indicating if the string is a properly formatted zip code.
"""
token = token.strip(" ")
token = token.replace("\n","")
token = token.replace(",","")
return re.match(r'\d{5}$', token) is not None
def is_color(token):
"""Determines if the passed in string represents a color.
Takes a string and uses regular expression parsing to validate that
it matches the valid format specified for a color. Validates that
it is only a one word color.
Args:
token: A fragment of a line of a file
Returns:
A boolean indicating if the string is a properly formatted color.
"""
token = token.strip(" ")
token = token.replace("\n","")
token = token.replace(",","")
return re.match(r'[a-z]*$', token) is not None
def is_name(name_tuple):
"""Determines if the first two elements in a file line (names) are correctly formatted.
Takes a tuple of elements and validates that they match one of two valid formats. Either both
words end in a comma or the second one does while the first one doesn't.
Args:
name_tuple: A tuple of two elements (first and last name) from a line in a file
Returns:
A boolean indicating if the elements (names) in the tuple are correctly formatted.
"""
names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
name1_comma = False
name2_comma = False
for i in range(2):
curr_len = len(names[i])
for j in range(curr_len):
if not names[i][j].isalpha() and j < curr_len -1:
return False
if j == curr_len - 1 and i == 0 and names[i][j] == ',':
name1_comma = True
if j == curr_len - 1 and i == 1 and names[i][j] == ',':
name2_comma = True
comma_first_case = name1_comma and name2_comma # both have commas
comma_second_case = not name1_comma and name2_comma # name2 has comma, name 1 doesnt
if not (comma_first_case or comma_second_case):
return False
return True
def tokenize(line):
"""Splits the passed in string on the delimiter and return a list of tokens.
Takes a string and splits it on a delimter while maintaining the delimiter in its
original position in the string. If the first word in the string doesn't end with a comma,
the split operation will yield four tokens instead of five so the first two words (names) are broken
up by the space character.
Args:
line: A string to be broken up into tokens based on a delimiter.
Returns:
A list of tokens (words) from the passed in line.
"""
delim = ","
tokens = [e + delim for e in line.split(delim) if e]
if len(tokens) == 4:
names = tokens[0].split(" ")
names[0] += delim
names[1] = " " + names[1]
info = tokens[1:]
tokens = []
tokens.extend(names)
tokens.extend(info)
return tokens
2 Answers 2
is_name
: First comma is irrelevant
You say:
Either both words end in a comma or
the second one does while the first one doesn't.
So the truth table is:
First comma | Second comma | Output
True | True | True
False | True | True
True | False | False
False | False | False
Looking at this table it is clear that the return value of the function is_name
is exactly the boolean value of Second comma
so you can avoid the first comma
variable altogether, as it is irrelevant:
def is_name(name_tuple):
"""Docstring..."""
names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
name2_comma = False
for i in range(2):
curr_len = len(names[i])
for j in range(curr_len):
if not names[i][j].isalpha() and j < curr_len -1:
return False
if j == curr_len - 1 and i == 1 and names[i][j] == ',':
name2_comma = True
return name2_comma
This is better because it is clearly simpler.
-
\$\begingroup\$ Yeah this is definitely simpler. I'll make these changes. Thanks! \$\endgroup\$loremIpsum1771– loremIpsum17712017年04月05日 22:31:29 +00:00Commented Apr 5, 2017 at 22:31
Regexes to the rescue again! At the end of my answer to your last question, I arrived at a way to validate the whole line in one go. What I missed is that you actually need the different components to assign them to fields.
Enter capture groups. Just by putting ()
around (parts of) a regex, the content will be put into a group:
name_comma = r'([A-Z][a-z]*), ([A-Z][a-z]*)'
name_no_comma = r'([A-Z][a-z]*) ([A-Z][a-z]*)'
phone_paren = r'(\(\d{3}\)-\d{3}-\d{4})'
phone_space = r'(\d{3} \d{3} \d{4})'
zip_code = r'(\d{5})'
color = r'([A-Z]?[a-z]*)'
# Lastname, Firstname, (703)-742-0996, Blue, 10013
# Firstname Lastname, Red, 11237, 703 955 0373
# Firstname, Lastname, 10013, 646 111 0101, Green
ACCEPTABLE_FORMATS = [", ".join([name_comma, phone_paren, color, zip_code]),
", ".join([name_no_comma, color, zip_code, phone_space]),
", ".join([name_comma, zip_code, phone_space, color])]
def validate_line(line):
for pattern in ACCEPTABLE_FORMATS:
match = re.match(pattern, line)
if match is not None:
return match.groups()
For the three given examples this returns these tuples:
>>> validate_line("Firstname Lastname, Red, 11237, 703 955 0373")
('Firstname', 'Lastname', 'Red', '11237', '703 955 0373')
>>> validate_line("Lastname, Firstname, (703)-742-0996, Blue, 10013")
('Lastname', 'Firstname', '(703)-742-0996', 'Blue', '10013')
>>> validate_line("Firstname, Lastname, 10013, 646 111 0101, Green")
('Firstname', 'Lastname', '10013', '646 111 0101', 'Green')
Having a look at https://docs.python.org/2/library/re.html does probably not hurt. You can even give the groups names, and get directly a dict out:
name_comma = r'(?P<lastname>[A-Z][a-z]*), (?P<firstname>[A-Z][a-z]*)'
name_no_comma = r'(?P<firstname>[A-Z][a-z]*) (?P<lastname>[A-Z][a-z]*)'
phone_paren = r'(?P<phonenumber>\(\d{3}\)-\d{3}-\d{4})'
phone_space = r'(?P<phonenumber>\d{3} \d{3} \d{4})'
zip_code = r'(?P<zipcode>\d{5})'
color = r'(?P<color>[A-Z]?[a-z]*)'
...
def validate_line(line):
for pattern in ACCEPTABLE_FORMATS:
match = re.match(pattern, line)
if match is not None:
return match.groupdict()
With these dicts, you can just update
your records.
>>> validate_line("Firstname Lastname, Red, 11237, 703 955 0373")
{'color': 'Red', 'lastname': 'Lastname', 'phonenumber': '703 955 0373', 'zipcode': '11237', 'firstname': 'Firstname'}
>>> validate_line("Lastname, Firstname, (703)-742-0996, Blue, 10013")
{'color': 'Blue', 'lastname': 'Lastname', 'phonenumber': '(703)-742-0996', 'zipcode': '10013', 'firstname': 'Firstname'}
>>> validate_line("Firstname, Lastname, 10013, 646 111 0101, Green")
{'color': 'Green', 'lastname': 'Firstname', 'phonenumber': '646 111 0101', 'zipcode': '10013', 'firstname': 'Lastname'}
Note that you could pre-compile your regexes, which, theoretically, should make them faster. But Python already does that for you and caches it, so it does not really change it. But you can, of course:
ACCEPTABLE_FORMATS = map(lambda s: re.compile(", ".join(s)),
[[name_comma, phone_paren, color, zip_code],
[name_no_comma, color, zip_code, phone_space],
[name_comma, zip_code, phone_space, color]])
def validate_line(line):
for pattern in ACCEPTABLE_FORMATS:
match = pattern.match(line)
if match is not None:
return match.groupdict()
-
\$\begingroup\$ Do you think it would make sense to pre-compile the regexes beforehand? And, may be define them in upper-case since they are "constants"..thanks, great answer. \$\endgroup\$alecxe– alecxe2017年04月05日 19:34:37 +00:00Commented Apr 5, 2017 at 19:34
-
\$\begingroup\$ @alecxe The latter, definitely, for the former, Python caches regexes, so you won't gain anything (have a look at stackoverflow.com/questions/452104/…) \$\endgroup\$Graipher– Graipher2017年04月05日 19:37:17 +00:00Commented Apr 5, 2017 at 19:37
-
\$\begingroup\$ Right, I am thinking this way..we may expect the second, third and latter calls with the same regex be retrieved from cache..but, the first time
re.match()
is called the engine is going to compile the expression - this is going to happen during the runtime. But, if we callre.compile()
when defining the expressions, we would do this during the compile time..hope that makes sense. Thanks. \$\endgroup\$alecxe– alecxe2017年04月05日 19:42:07 +00:00Commented Apr 5, 2017 at 19:42 -
\$\begingroup\$ @alecxe True, if you do it yourself it will be compiled right at the beginning, so all calls are of the same time. \$\endgroup\$Graipher– Graipher2017年04月05日 19:42:57 +00:00Commented Apr 5, 2017 at 19:42
-
\$\begingroup\$ It looks like this does handle the case for when the first and last names appear out of order, but would this design be considered extensible? Would it not be better to separate the individual pieces of functionality into separate methods? \$\endgroup\$loremIpsum1771– loremIpsum17712017年04月05日 21:44:49 +00:00Commented Apr 5, 2017 at 21:44