Validating a CSV list of contacts and convert it to JSON - Follow Up

Question 1

This is a follow up to my original post:

I've written a class that takes a file, validates the formatting of the lines from an input file and writes the set of valid lines to an output file. Each line of the file should have a first name, last name, phone number, color, and zip code. A zip code is valid if it has only 5 characters, a phone number can have only 10 digits (in addition to dashes/parentheses in appropriate places).

In the last post I received some good feedback and added in a few of the suggestions. The main changes I made were to the is_color() and is_phone_number() methods as I switched to using regex to validate the input strings. I needed to keep the is_name() function the same because it needed to check for a couple of different configurations and arrangements of the first and last name. I also kept validate_line() the same because the function needed to know the indices of each token of the line (i.e. the phone number, the color, and the zip code) and reformat the line if necessary, returning a dictionary of those values to be used by parse_file(). Also, I mainly wanted to focus on extensibility here as opposed to conciseness.

The code seems to be working but I just wanted to make sure that there was nothing that was being done inefficiently or that is not optimal stylistically.

The updated code is below:

""" file_formatter module
The class contained in this module validates a CSV file based on a set of internally 
specified accepted formats and generates a JSON file containing normalized forms of the
valid lines from the CSV file.
Example:
 The class in this module can be imported and passed an initial value for the input data
 file from the command line like this:
 $ python package_name name_of_data_file.in
Classes:
 FileFormatter: Takes an input file and output its valid lines to a result file.
"""
import json
import re
class FileFormatter:
 """ Takes an input file and output its valid lines to a result file.
 Validates the formatting of the lines from an input file and writes the set of valid lines
 to an output file.
 Attributes:
 info_configs: A list containing lists of "accepted" configurations of the data from each line of the input file.
 in_file_name: Name of the input file.
 res_file_name: Name of the output file.
 """
 info_configs = [["phone","color","zip"], ["color","zip","phone"], ["zip","phone","color"]] 
 def __init__(self,start_file_name,out_file_name):
 """Initialize FileFormatter class with the input and output file names."""
 self.in_file_name = start_file_name
 self.res_file_name = out_file_name
 def validate_line(self, line):
 """Validates that each line is in the correct format.
 Takes a line from a file, validate that the first two elements are properly formatted
 names, then validates that the remaining elements (phone number, zip code, color)
 in the line are properly formatted.
 Args:
 line: A line from a file
 Returns:
 A list of tokenized elements from the original line (string) in the correct order
 according to the specified format. For example:
 [Lastname, Firstname, (703)-742-0996, Blue, 10013] or
 [Firstname, Lastname, Red, 11237, 703 955 0373] or
 [Firstname, Lastname, 10013, 646 111 0101, Green]
 If a value of None is returned, some element in the line wasn't in the correct format. 
 """
 line = tokenize(line)
 if len(line) != 5:
 return None
 full_name = (line[0], line[1])
 if not is_name(full_name):
 return None
 entry = { "color": "", "firstname": "", "lastname": "", "phonenumber": "", "zipcode": ""}
 config = ["","",""]
 phone_idx = 0
 zip_idx = 0
 color_idx = 0
 for i in range(2,len(line)):
 if is_phone_number(line[i]):
 phone_idx = i-2
 config[phone_idx] = "phone"
 if is_zip_code(line[i]):
 zip_idx = i-2
 config[zip_idx] = "zip"
 if is_color(line[i]):
 color_idx = i-2
 config[color_idx] = "color"
 if config in self.info_configs: # if the phone number, zip code, and color have been found and are in correct order
 if phone_idx == 0:
 line[0], line[1] = line[1], line[0]
 line = [token.strip(" ") for token in line]
 line = [token.replace(",","") for token in line]
 line[-1] = line[-1].replace("\n","")
 entry["firstname"] = line[0]
 entry["lastname"] = line[1]
 entry["color"] = line[color_idx+2]
 entry["phonenumber"] = line[phone_idx+2]
 entry["zipcode"] = line[zip_idx+2]
 return entry
 return None
 def parse_file(self):
 """Parses an input file, validates the formatting of its lines, and writes a JSON file with the properly formatted lines.
 Iterates through the input file validating each line. Creates a dictionary that contains
 a list of entries comprised of valid lines from the input file. Creates a JSON object 
 of normalized data sorted in ascending order by a tuple of (lastname, firstname) for each line.
 """
 lines_dict = {}
 json_dict = {}
 errors = []
 with open(self.in_file_name,'r') as info_file:
 for line_number, line in enumerate(info_file):
 valid_line = self.validate_line(line)
 if valid_line: 
 lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
 else:
 errors.append(line_number)
 json_dict["entries"] = [lines_dict[key] for key in sorted(lines_dict.keys(), reverse = True)] # sort by (lastname, firstname,) key value
 json_dict["errors"] = errors
 with open(self.res_file_name,'w') as out_file:
 json.dump(json_dict, out_file, indent = 2)
# utility methods used by the FileFormatter class
def is_phone_number(token):
 """Determines if the passed in string represents a properly formatted 10-digit phone number.
 Takes a string and validates that it matches one of two valid formats specified for a phone number.
 Uses regular expression parsing to validate that the sequence of characters is an pattern match to one of 
 the valid formats.
 Args:
 token: A fragment of a line of a file
 Returns:
 A boolean indicating if the string is a properly formatted phone number.
 """
 token = token.strip(" ")
 token = token.replace("\n","")
 token = token.replace(",","")
 return (re.match(r'\(\d{3}\)-\d{3}-\d{4}$', token) is not None or
 re.match(r'\d{3} \d{3} \d{4}$', token) is not None)
def is_zip_code(token):
 """Determines if the passed in string represents a properly formatted 5-digit zip code.
 Takes a string and using regular expression parsing to validate that it matches the valid
 format specified for a zip code. Validates that the string doesn't contain more than 5 numbers.
 Args:
 token: A fragment of a line of a file
 Returns:
 A boolean indicating if the string is a properly formatted zip code.
 """
 token = token.strip(" ")
 token = token.replace("\n","")
 token = token.replace(",","")
 return re.match(r'\d{5}$', token) is not None
def is_color(token):
 """Determines if the passed in string represents a color.
 Takes a string and uses regular expression parsing to validate that 
 it matches the valid format specified for a color. Validates that 
 it is only a one word color.
 Args:
 token: A fragment of a line of a file
 Returns:
 A boolean indicating if the string is a properly formatted color.
 """
 token = token.strip(" ")
 token = token.replace("\n","")
 token = token.replace(",","")
 return re.match(r'[a-z]*$', token) is not None
def is_name(name_tuple):
 """Determines if the first two elements in a file line (names) are correctly formatted.
 Takes a tuple of elements and validates that they match one of two valid formats. Either both 
 words end in a comma or the second one does while the first one doesn't.
 Args:
 name_tuple: A tuple of two elements (first and last name) from a line in a file
 Returns:
 A boolean indicating if the elements (names) in the tuple are correctly formatted.
 """
 names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
 name1_comma = False
 name2_comma = False
 for i in range(2):
 curr_len = len(names[i]) 
 for j in range(curr_len):
 if not names[i][j].isalpha() and j < curr_len -1: 
 return False
 if j == curr_len - 1 and i == 0 and names[i][j] == ',':
 name1_comma = True
 if j == curr_len - 1 and i == 1 and names[i][j] == ',':
 name2_comma = True
 comma_first_case = name1_comma and name2_comma # both have commas
 comma_second_case = not name1_comma and name2_comma # name2 has comma, name 1 doesnt
 if not (comma_first_case or comma_second_case):
 return False
 return True
def tokenize(line):
 """Splits the passed in string on the delimiter and return a list of tokens.
 Takes a string and splits it on a delimter while maintaining the delimiter in its
 original position in the string. If the first word in the string doesn't end with a comma,
 the split operation will yield four tokens instead of five so the first two words (names) are broken
 up by the space character.
 Args:
 line: A string to be broken up into tokens based on a delimiter.
 Returns:
 A list of tokens (words) from the passed in line.
 """
 delim = ","
 tokens = [e + delim for e in line.split(delim) if e]
 if len(tokens) == 4:
 names = tokens[0].split(" ")
 names[0] += delim
 names[1] = " " + names[1]
 info = tokens[1:]
 tokens = []
 tokens.extend(names)
 tokens.extend(info)
 return tokens

Question 2

`is_name`: First comma is irrelevant

You say:

Either both words end in a comma or 
the second one does while the first one doesn't.

So the truth table is:

 First comma | Second comma | Output
 True | True | True
 False | True | True
 True | False | False
 False | False | False

Looking at this table it is clear that the return value of the function is_name is exactly the boolean value of Second comma so you can avoid the first comma variable altogether, as it is irrelevant:

def is_name(name_tuple):
 """Docstring..."""
 names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
 name2_comma = False
 for i in range(2):
 curr_len = len(names[i]) 
 for j in range(curr_len):
 if not names[i][j].isalpha() and j < curr_len -1: 
 return False
 if j == curr_len - 1 and i == 1 and names[i][j] == ',':
 name2_comma = True
 return name2_comma

This is better because it is clearly simpler.

Question 3

Yeah this is definitely simpler. I'll make these changes. Thanks!

Question 4

Regexes to the rescue again! At the end of my answer to your last question, I arrived at a way to validate the whole line in one go. What I missed is that you actually need the different components to assign them to fields.

Enter capture groups. Just by putting () around (parts of) a regex, the content will be put into a group:

name_comma = r'([A-Z][a-z]*), ([A-Z][a-z]*)' 
name_no_comma = r'([A-Z][a-z]*) ([A-Z][a-z]*)'
phone_paren = r'(\(\d{3}\)-\d{3}-\d{4})'
phone_space = r'(\d{3} \d{3} \d{4})'
zip_code = r'(\d{5})'
color = r'([A-Z]?[a-z]*)'
# Lastname, Firstname, (703)-742-0996, Blue, 10013
# Firstname Lastname, Red, 11237, 703 955 0373
# Firstname, Lastname, 10013, 646 111 0101, Green
ACCEPTABLE_FORMATS = [", ".join([name_comma, phone_paren, color, zip_code]),
 ", ".join([name_no_comma, color, zip_code, phone_space]),
 ", ".join([name_comma, zip_code, phone_space, color])]
def validate_line(line):
 for pattern in ACCEPTABLE_FORMATS:
 match = re.match(pattern, line)
 if match is not None:
 return match.groups()

For the three given examples this returns these tuples:

>>> validate_line("Firstname Lastname, Red, 11237, 703 955 0373")
('Firstname', 'Lastname', 'Red', '11237', '703 955 0373')
>>> validate_line("Lastname, Firstname, (703)-742-0996, Blue, 10013")
('Lastname', 'Firstname', '(703)-742-0996', 'Blue', '10013')
>>> validate_line("Firstname, Lastname, 10013, 646 111 0101, Green")
('Firstname', 'Lastname', '10013', '646 111 0101', 'Green')

Having a look at https://docs.python.org/2/library/re.html does probably not hurt. You can even give the groups names, and get directly a dict out:

name_comma = r'(?P<lastname>[A-Z][a-z]*), (?P<firstname>[A-Z][a-z]*)' 
name_no_comma = r'(?P<firstname>[A-Z][a-z]*) (?P<lastname>[A-Z][a-z]*)'
phone_paren = r'(?P<phonenumber>\(\d{3}\)-\d{3}-\d{4})'
phone_space = r'(?P<phonenumber>\d{3} \d{3} \d{4})'
zip_code = r'(?P<zipcode>\d{5})'
color = r'(?P<color>[A-Z]?[a-z]*)'
...
def validate_line(line):
 for pattern in ACCEPTABLE_FORMATS:
 match = re.match(pattern, line)
 if match is not None:
 return match.groupdict()

With these dicts, you can just update your records.

>>> validate_line("Firstname Lastname, Red, 11237, 703 955 0373")
{'color': 'Red', 'lastname': 'Lastname', 'phonenumber': '703 955 0373', 'zipcode': '11237', 'firstname': 'Firstname'}
>>> validate_line("Lastname, Firstname, (703)-742-0996, Blue, 10013")
{'color': 'Blue', 'lastname': 'Lastname', 'phonenumber': '(703)-742-0996', 'zipcode': '10013', 'firstname': 'Firstname'}
>>> validate_line("Firstname, Lastname, 10013, 646 111 0101, Green")
{'color': 'Green', 'lastname': 'Firstname', 'phonenumber': '646 111 0101', 'zipcode': '10013', 'firstname': 'Lastname'}

Note that you could pre-compile your regexes, which, theoretically, should make them faster. But Python already does that for you and caches it, so it does not really change it. But you can, of course:

ACCEPTABLE_FORMATS = map(lambda s: re.compile(", ".join(s)), 
 [[name_comma, phone_paren, color, zip_code],
 [name_no_comma, color, zip_code, phone_space],
 [name_comma, zip_code, phone_space, color]])
def validate_line(line):
 for pattern in ACCEPTABLE_FORMATS:
 match = pattern.match(line)
 if match is not None:
 return match.groupdict()

Question 5

Do you think it would make sense to pre-compile the regexes beforehand? And, may be define them in upper-case since they are "constants"..thanks, great answer.

Question 6

@alecxe The latter, definitely, for the former, Python caches regexes, so you won't gain anything (have a look at stackoverflow.com/questions/452104/…)

Question 7

Right, I am thinking this way..we may expect the second, third and latter calls with the same regex be retrieved from cache..but, the first time re.match() is called the engine is going to compile the expression - this is going to happen during the runtime. But, if we call re.compile() when defining the expressions, we would do this during the compile time..hope that makes sense. Thanks.

Question 8

@alecxe True, if you do it yourself it will be compiled right at the beginning, so all calls are of the same time.

Question 9

It looks like this does handle the case for when the first and last names appear out of order, but would this design be considered extensible? Would it not be better to separate the individual pieces of functionality into separate methods?

Caridorc CaridorcCaridorc 28k7 gold badges54 silver badges137 bronze badges · Accepted Answer · 2017-04-05 20:05:25Z

`is_name`: First comma is irrelevant

You say:

Either both words end in a comma or 
the second one does while the first one doesn't.

So the truth table is:

 First comma | Second comma | Output
 True | True | True
 False | True | True
 True | False | False
 False | False | False

Looking at this table it is clear that the return value of the function is_name is exactly the boolean value of Second comma so you can avoid the first comma variable altogether, as it is irrelevant:

def is_name(name_tuple):
 """Docstring..."""
 names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
 name2_comma = False
 for i in range(2):
 curr_len = len(names[i]) 
 for j in range(curr_len):
 if not names[i][j].isalpha() and j < curr_len -1: 
 return False
 if j == curr_len - 1 and i == 1 and names[i][j] == ',':
 name2_comma = True
 return name2_comma

This is better because it is clearly simpler.

Yeah this is definitely simpler. I'll make these changes. Thanks!

Stack Exchange Network

Validating a CSV list of contacts and convert it to JSON - Follow Up

2 Answers 2

`is_name`: First comma is irrelevant

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Validating a CSV list of contacts and convert it to JSON - Follow Up

2 Answers 2

is_name: First comma is irrelevant

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

`is_name`: First comma is irrelevant