I've written a class that takes an input file, validates the formatting of its lines, and writes the set of valid lines to an output file. Each line of the file should have a first name, last name, phone number, color, and zip code. A zip code is valid if it has exactly 5 digits; a phone number must have exactly 10 digits (in addition to dashes/parentheses in appropriate places). The accepted formats of each line of the input file are the following:
Lastname, Firstname, (703)-742-0996, Blue, 10013
Firstname Lastname, Red, 11237, 703 955 0373
Firstname, Lastname, 10013, 646 111 0101, Green
The program needs to write a JSON object with all of the valid lines from the input file in a list sorted in ascending alphabetical order by (last name, first name).
These are the test cases I ran with it as well as the JSON output. I think I've identified all of the edge cases with the tests but I could have missed something. This code should exemplify good design choices and extensibility and should be production quality. Should anything be added/removed from the solution to meet these requirements?
Also, any tests that would make the code fail are welcome.
The code for the solution is below:
__main__.py
import sys
from file_formatter import FileFormatter
if __name__ == "__main__":
    formatter = FileFormatter(sys.argv[-1],"result.out")
    formatter.parse_file()
file_formatter.py
""" file_formatter module
The class contained in this module validates a CSV file based on a set of internally
specified accepted formats and generates a JSON file containing normalized forms of the
valid lines from the CSV file.
Example:
The class in this module can be imported and passed an initial value for the input data
file from the command line like this:
$ python example_program.py name_of_data_file.in
Classes:
FileFormatter: Takes an input file and output its valid lines to a result file.
"""
import json
class FileFormatter:
    """ Takes an input file and output its valid lines to a result file.
    Validates the formatting of the lines from an input file and writes the set of valid lines
    to an output file.
    Attributes:
        info_configs: A list containing lists of "accepted" configurations of the data from each line of the input file.
        in_file_name: Name of the input file.
        res_file_name: Name of the output file.
    """
    info_configs = [["phone","color","zip"], ["color","zip","phone"], ["zip","phone","color"]]
    def __init__(self,start_file_name,out_file_name):
        """Initialize FileFormatter class with the input and output file names."""
        self.in_file_name = start_file_name
        self.res_file_name = out_file_name
    def validate_line(self, line):
        """Validates that each line is in the correct format.
        Takes a line from a file, validate that the first two elements are properly formatted
        names, then validates that the remaining elements (phone number, zip code, color)
        in the line are properly formatted.
        Args:
            line: A line from a file
        Returns:
            A list of tokenized elements from the original line (string) in the correct order
            according to the specified format. For example:
            [Lastname, Firstname, (703)-742-0996, Blue, 10013] or
            [Firstname, Lastname, Red, 11237, 703 955 0373] or
            [Firstname, Lastname, 10013, 646 111 0101, Green]
            If a value of None is returned, some element in the line wasn't in the correct format.
        """
        line = tokenize(line)
        if len(line) != 5:
            return None
        full_name = (line[0],line[1])
        if not is_name(full_name):
            return None
        config = ["","",""]
        entry = { "color": "", "firstname": "", "lastname": "", "phonenumber": "", "zipcode": ""}
        phone_idx = 0
        zip_idx = 0
        color_idx = 0
        for i in range(2,len(line)):
            if is_phone_number(line[i]):
                phone_idx = i-2
                config[phone_idx] = "phone"
            if is_zip_code(line[i]):
                zip_idx = i-2
                config[zip_idx] = "zip"
            if is_color(line[i]):
                color_idx = i-2
                config[color_idx] = "color"
        if config in self.info_configs: # if the phone number, zip code, and color have been found and are in correct order
            if phone_idx == 0:
                line[0], line[1] = line[1], line[0]
            line = [token.strip(" ") for token in line]
            line = [token.replace(",","") for token in line]
            line[len(line)-1] = line[len(line)-1].replace("\n","")
            entry["firstname"] = line[0]
            entry["lastname"] = line[1]
            entry["color"] = line[color_idx+2]
            entry["phonenumber"] = line[phone_idx+2]
            entry["zipcode"] = line[zip_idx+2]
            return entry
        return None
    def parse_file(self):
        """Parses an input file, validates the formatting of its lines, and writes a JSON file with the properly formatted lines.
        Iterates through the input file validating each line. Creates a dictionary that contains
        a list of entries comprised of valid lines from the input file. Creates a JSON object
        of normalized data sorted in ascending order by a tuple of (lastname, firstname) for each line.
        """
        lines_dict = {}
        json_dict = {}
        errors = []
        with open(self.in_file_name,'r') as info_file:
            i = 0
            for line in info_file:
                valid_line = self.validate_line(line)
                if valid_line:
                    lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
                else:
                    errors.append(i)
                i += 1
        json_dict["entries"] = [lines_dict[key] for key in sorted(lines_dict.keys(), reverse = True)] # sort by (lastname, firstname,) key value
        json_dict["errors"] = errors
        with open(self.res_file_name,'w') as out_file:
            json.dump(json_dict, out_file, indent = 2)
# utility methods for parsing the file
def tokenize(line):
    """Splits the passed in string on the delimiter and return a list of tokens.
    Takes a string and splits it on a delimter while maintaining the delimiter in its
    original position in the string. If the first word in the string doesn't end with a comma,
    the split operation will yield four tokens instead of five so the first two words (names) are broken
    up by the space character.
    Args:
        line: A string to be broken up into tokens based on a delimiter.
    Returns:
        A list of tokens (words) from the passed in line.
    """
    delim = ","
    tokens = [e + delim for e in line.split(delim) if e]
    if len(tokens) == 4:
        names = tokens[0].split(" ")
        names[0] = names[0] + delim
        names[1] = " " + names[1]
        info = tokens[1:]
        tokens = []
        tokens.extend(names)
        tokens.extend(info)
    return tokens
def is_name(name_tuple):
    """Determines if the first two elements in a file line (names) are correctly formatted.
    Takes a tuple of elements and validates that they match one of two valid formats. Either both
    words end in a comma or the second one does while the first one doesn't.
    Args:
        name_tuple: A tuple of two elements (first and last name) from a line in a file
    Returns:
        A boolean indicating if the elements (names) in the tuple are correctly formatted.
    """
    names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
    comma_first_case = False
    comma_second_case = False
    name1_comma = False
    name2_comma = False
    for i in range(2):
        curr_len = len(names[i])
        for j in range(curr_len):
            if not names[i][j].isalpha() and j < curr_len -1:
                return False
            if j == curr_len - 1 and i == 0 and names[i][j] == ',':
                name1_comma = True
            if j == curr_len - 1 and i == 1 and names[i][j] == ',':
                name2_comma = True
    comma_first_case = name1_comma and name2_comma # both have commas
    comma_second_case = not name1_comma and name2_comma # name2 has comma, name 1 doesnt
    if not (comma_first_case or comma_second_case):
        return False
    return True
def is_phone_number(token):
    """Determines if the passed in string represents a properly formatted 10-digit phone number.
    Takes a string and validates that it matches one of two valid formats specified for a phone number.
    Validates that the sequence of characters is an exact match to one of the valid formats.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted phone number.
    """
    token = token.strip(" ")
    char_sequence = []
    case_1 = ["paren","number","number","number","paren","dash","number","number","number","dash","number","number","number","number"]
    case_2 = ["number","number","number","space","number","number","number","space","number","number","number","number"]
    for char in token:
        is_paren = char == "(" or char == ")"
        is_dash = char == "-"
        is_ws = char == " "
        if represents_int(char):
            char_sequence.append("number")
        if is_paren:
            char_sequence.append("paren")
        if is_dash:
            char_sequence.append("dash")
        if is_ws:
            char_sequence.append("space")
    if char_sequence == case_1 or char_sequence == case_2:
        return True
    return False
def is_color(token):
    """Determines if the passed in string represents a color.
    Takes a string and validates that it matches the valid formats specified for a color.
    Validates that it is only a one word color.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted color.
    """
    token = token.strip(" ")
    for i in range(len(token)):
        if token[i] != "," and token[i] != "\n":
            if not token[i].isalpha() or not token[i].islower() :
                return False
    return True
def is_zip_code(token):
    """Determines if the passed in string represents a properly formatted 5-digit zip code.
    Takes a string and validates that it matches the valid formats specified for a zip code.
    Validates that the string doesn't contain more than 5 numbers.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted zip code.
    """
    token = token.strip(" ")
    digit_count = 0
    for digit in token:
        if digit != "," and digit != "\n":
            if represents_int(digit):
                digit_count += 1
            else:
                return False
    if digit_count != 5:
        return False
    return True
def represents_int(char):
    """Determines if the passed in character represents an integer.
    Takes a char and attempts to convert it to an integer.
    Args:
        char: A character
    Returns:
        A boolean indicating if the passed in character represents an integer.
    Raises:
        ValueError: An error occured when trying to convert the character to an integer.
    """
    try:
        int(char)
        return True
    except ValueError:
        return False
if __name__ == "__main__":
    formatter= FileFormatter("data.in","result.out")
    formatter.parse_file()
Hello, you can use the Python "export emails from Gmail contacts CSV" project from GitHub; see github.com/bestofg/python-export-gmail-emails-from-contacts-csv – mahdi bahri, Jul 6, 2017 at 16:39
2 Answers
Your function is_phone_number is a prime example of where regular expressions are useful. You are basically trying to implement one yourself here!
You can either use two different patterns here:
import re
def is_phone_number(token):
    token = token.strip(" ")
    return (re.match(r'\(\d{3}\)-\d{3}-\d{4}$', token) is not None or
            re.match(r'\d{3} \d{3} \d{4}$', token) is not None)
Here, \d is any digit, \d{n} is a run of n digits and $ is the end of the string (to make sure there is nothing after the valid phone number).
You could also combine it to one pattern:
def is_phone_number(token):
    token = token.strip(" ")
    return re.match(r'\(?\d{3}\)?[ -]\d{3}[ -]\d{4}$', token) is not None
This second pattern has the caveat that it allows phone numbers that are mixes of the two patterns, like (123 456-1235, so I would stick to the two patterns.
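To make the difference concrete, here is a small sketch that runs both variants on a few sample tokens. The names is_phone_number_strict and is_phone_number_combined are only labels I made up for this comparison; the patterns are the ones shown above.

```python
import re

# The two separate patterns and the combined pattern from above.
PAREN_FORMAT = re.compile(r'\(\d{3}\)-\d{3}-\d{4}$')
SPACE_FORMAT = re.compile(r'\d{3} \d{3} \d{4}$')
COMBINED_FORMAT = re.compile(r'\(?\d{3}\)?[ -]\d{3}[ -]\d{4}$')


def is_phone_number_strict(token):
    """Accept only one of the two exact formats (hypothetical name)."""
    token = token.strip(" ")
    return (PAREN_FORMAT.match(token) is not None or
            SPACE_FORMAT.match(token) is not None)


def is_phone_number_combined(token):
    """Accept anything the single combined pattern allows (hypothetical name)."""
    token = token.strip(" ")
    return COMBINED_FORMAT.match(token) is not None


# The third sample mixes both formats; the fourth has one digit too many.
for sample in ["(703)-742-0996", "703 955 0373", "(123 456-1235", "703 955 03735"]:
    print(sample, is_phone_number_strict(sample), is_phone_number_combined(sample))
```

Only the mixed token (123 456-1235 is treated differently: the combined pattern accepts it, the two separate patterns do not.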
Your functions is_color and is_zip_code seem broken to me. Since you skip over commas, "blue,green" would be a valid one-word color and "50,364" a valid ZIP code.
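One way to confirm this is to run condensed copies of the two checks on exactly those inputs. This is only a sketch: the _original suffix on the names is mine, and the digit test is a simplified stand-in for represents_int that behaves the same for the ASCII inputs used here.

```python
def is_color_original(token):
    """Condensed copy of the color check from the question."""
    token = token.strip(" ")
    for char in token:
        # Commas and newlines are skipped, wherever they appear in the token.
        if char != "," and char != "\n":
            if not char.isalpha() or not char.islower():
                return False
    return True


def is_zip_code_original(token):
    """Condensed copy of the ZIP code check from the question."""
    token = token.strip(" ")
    digit_count = 0
    for char in token:
        if char != "," and char != "\n":
            # Simplified stand-in for represents_int (assumption: ASCII input).
            if char in "0123456789":
                digit_count += 1
            else:
                return False
    return digit_count == 5


print(is_color_original("blue,green"))   # True, although it is two words
print(is_zip_code_original("50,364"))    # True, although a comma splits the digits
```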
I would use something like this:
def is_zip_code(token):
    return re.match(r'\d{5}$', token) is not None

def is_color(token):
    return re.match(r'[a-z]*$', token) is not None
The former makes sure that the token is a string of five digits and the latter makes sure that the token consists only of lower-case letters. With these two in place, represents_int is no longer needed.
The function is_name is more complicated. But I would use str.endswith and exit early:
def is_name(name_tuple):
    # Build a list: in Python 3, map() returns an iterator that cannot be indexed
    name = [part.strip() for part in name_tuple]
    if not name[1].endswith(","):
        return False
    if not name[1][:-1].isalpha():
        return False
    if not (name[0].isalpha() or name[0].endswith(",") and name[0][:-1].isalpha()):
        return False
    return True
Which can be combined to:
def is_name(name_tuple):
    # Build a list: in Python 3, map() returns an iterator that cannot be indexed
    name = [part.strip() for part in name_tuple]
    return (name[1].endswith(",") and
            name[1][:-1].isalpha() and
            (name[0].isalpha() or
             name[0].endswith(",") and name[0][:-1].isalpha()))
In retrospect, I don't understand why you insist on keeping the delimiter on the string in the tokenize function. It seems like it would be way easier to drop it there and work with a tokenized list afterwards...
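As a rough sketch of what dropping the delimiter could look like (tokenize_plain is a made-up name, and the rest of your code would need to be adapted to tokens without trailing commas):

```python
def tokenize_plain(line):
    """Split a line on commas and strip surrounding whitespace (hypothetical helper)."""
    tokens = [token.strip() for token in line.split(",")]
    # "Firstname Lastname, Red, ..." yields four tokens, so split the name on the space.
    if len(tokens) == 4:
        names = tokens[0].split()
        if len(names) == 2:
            tokens = names + tokens[1:]
    return tokens


print(tokenize_plain("Lastname, Firstname, (703)-742-0996, Blue, 10013\n"))
print(tokenize_plain("Firstname Lastname, Red, 11237, 703 955 0373\n"))
```

One thing this version loses is whether the two names were separated by a comma or a space. If that distinction matters for validation, the function would have to report it separately, for example as a second return value.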
You could also just write one regex to rule them all (actually three, one for each of your three input formats):
name_comma = r'[a-z]*, [a-z]*'
name_no_comma = r'[a-z]* [a-z]*'
phone_paren = r'\(\d{3}\)-\d{3}-\d{4}'
phone_space = r'\d{3} \d{3} \d{4}'
zip_code = r'\d{5}'
color = r'[a-z]*'
# Lastname, Firstname, (703)-742-0996, Blue, 10013
# Firstname Lastname, Red, 11237, 703 955 0373
# Firstname, Lastname, 10013, 646 111 0101, Green
acceptable_formats = [", ".join([name_comma, phone_paren, color, zip_code]),
                      ", ".join([name_no_comma, color, zip_code, phone_space]),
                      ", ".join([name_comma, zip_code, phone_space, color])]

def validate_line(line):
    # Anchor each pattern with $ so nothing may follow a valid line
    return any(re.match(pattern + '$', line) is not None
               for pattern in acceptable_formats)
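Since you also need the individual fields for the JSON output, the same idea can be extended with named groups. The following is a sketch under a few assumptions of mine: the helper names group and parse_line are made up, the keys mirror the entry dictionary in your code, and letters of either case are accepted so that capitalised names and colors match.

```python
import re


def group(name, pattern):
    """Wrap a pattern in a named group (hypothetical helper)."""
    return '(?P<{}>{})'.format(name, pattern)


# Assumption: names and colors may be capitalised, so both cases are allowed.
WORD = r'[A-Za-z]+'
LAST = group("lastname", WORD)
FIRST = group("firstname", WORD)
COLOR = group("color", WORD)
ZIP = group("zipcode", r'\d{5}')
PHONE_PAREN = group("phonenumber", r'\(\d{3}\)-\d{3}-\d{4}')
PHONE_SPACE = group("phonenumber", r'\d{3} \d{3} \d{4}')

# One anchored pattern per accepted line format.
FORMATS = [re.compile(", ".join(parts) + "$") for parts in (
    [LAST, FIRST, PHONE_PAREN, COLOR, ZIP],
    [FIRST + " " + LAST, COLOR, ZIP, PHONE_SPACE],
    [FIRST, LAST, ZIP, PHONE_SPACE, COLOR],
)]


def parse_line(line):
    """Return a dict of fields for a valid line, or None for an invalid one."""
    stripped = line.strip()
    for pattern in FORMATS:
        match = pattern.match(stripped)
        if match:
            return match.groupdict()
    return None


print(parse_line("Firstname Lastname, Red, 11237, 703 955 0373\n"))
print(parse_line("not a valid line"))
```

With this approach the validation and the extraction of the fields happen in one step, and adding a fourth format means adding one more entry to the list.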
Thanks for the suggestions. Regex is definitely a good idea to check the phone numbers and colors. I'm not sure why it didn't occur to me. – loremIpsum1771, Apr 5, 2017 at 18:41
I think most of us eventually encounter a CSV to JSON converter problem in our careers.
When I last did something similar, I used the csvschema package (it is a bit outdated at the moment, but does the job). Defining your own "csv structure" class will conveniently encapsulate your field types and validation logic. The represents_int() function will be replaced with a built-in IntColumn field. Other is_* functions will be replaced with custom columns.
Or, at the very least, using the csv module might help with the tokenizing part.
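For the tokenizing part, a minimal sketch with the standard csv module could look like this. The function name read_rows and the in-memory sample are mine; skipinitialspace=True tells the reader to ignore the space that follows each comma.

```python
import csv
import io


def read_rows(file_obj):
    """Return each line of a comma-separated file as a list of tokens (hypothetical helper)."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    return [row for row in reader]


# In-memory stand-in for the input file, using two of the accepted formats.
sample = io.StringIO(
    "Lastname, Firstname, (703)-742-0996, Blue, 10013\n"
    "Firstname Lastname, Red, 11237, 703 955 0373\n"
)
for row in read_rows(sample):
    print(row)
```

The second format still comes back as four tokens, so the first token would have to be split on the space afterwards.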
Some other notes about the code:
- comma_first_case and comma_second_case don't need to be defined as False since you overwrite them later on
- names[0] = names[0] + delim can be rewritten as names[0] += delim
- remove the extra spaces around = when passing keyword arguments
- add an extra space after the commas when passing multiple arguments to functions
- instead of manually supporting the i counter in the parse_file() function, use enumerate():
    for line_number, line in enumerate(info_file):
        valid_line = self.validate_line(line)
        if valid_line:
            lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
        else:
            errors.append(line_number)
- you can use negative indexing, replacing line[len(line)-1] with line[-1]
- separate top-level function and class definitions with two blank lines
- you don't need to put the double underscores around your script file name
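To illustrate the enumerate() point in isolation, here is a small self-contained sketch. The function find_invalid_lines and the toy validity check are made up for the demonstration; in your code the check would be self.validate_line.

```python
import io


def find_invalid_lines(lines, is_valid):
    """Collect the zero-based numbers of the lines that fail the check."""
    errors = []
    # enumerate() yields (index, item) pairs, so no manual counter is needed.
    for line_number, line in enumerate(lines):
        if not is_valid(line):
            errors.append(line_number)
    return errors


# Toy input and toy check (assumption: a line is valid if it has two commas).
sample = io.StringIO("a, b, c\nbroken\nd, e, f\n")
print(find_invalid_lines(sample, lambda line: line.count(",") == 2))  # [1]
```

If you would rather report line numbers starting at 1, enumerate(lines, start=1) does that.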
And, overall, really good job documenting the code. Note that now, when the code changes, you need to keep the documentation up-to-date with the code appropriately.
Thanks for the feedback! How are you suggesting that I use the enumerate() function? It seems like it just returns a list of values in the length of the object. I'm trying to just return the lines of the file where there are errors. – loremIpsum1771, Apr 5, 2017 at 18:39
@loremIpsum1771 sure, updated with a sample. Thanks. – alecxe, Apr 5, 2017 at 18:41
Oh ok. So here I'm assuming that in for line_number, line in enumerate(info_file) the indices of enumerate are being used as pointers so that you can pass line to the validate_line() function but then also print out line_number as an int? – loremIpsum1771, Apr 5, 2017 at 18:54
@loremIpsum1771 yup, it's a Pythonic way to have both indexes and actual items at the same time. – alecxe, Apr 5, 2017 at 18:55
Oh ok, cool. Lastly, what did you mean by "you don't need to put the double underscores around your script file name"? I only put the underscores around main.py. I made a follow up post with the changes btw. – loremIpsum1771, Apr 5, 2017 at 19:03