Validating a CSV list of contacts and converting it to JSON

Question 1

I've written a class that reads an input file, validates the formatting of each line, and writes the set of valid lines to an output file. Each line of the file should have a first name, last name, phone number, color, and zip code. A zip code is valid if it has exactly 5 characters; a phone number must have exactly 10 digits (in addition to dashes/parentheses in the appropriate places). The accepted formats of each line of the input file are the following:

Lastname, Firstname, (703)-742-0996, Blue, 10013
Firstname Lastname, Red, 11237, 703 955 0373
Firstname, Lastname, 10013, 646 111 0101, Green

The program needs to write a JSON object with all of the valid lines from the input file in a list sorted in ascending alphabetical order by (last name, first name).

These are the test cases I ran with it as well as the JSON output. I think I've identified all of the edge cases with the tests but I could have missed something. This code should exemplify good design choices and extensibility and should be production quality. Should anything be added/removed from the solution to meet these requirements?

Also, any tests that would make the code fail are welcome.

The code for the solution is below:

__main__.py

import sys
from file_formatter import FileFormatter
if __name__ == "__main__":
    formatter = FileFormatter(sys.argv[-1],"result.out")
    formatter.parse_file()

file_formatter.py

""" file_formatter module
The class contained in this module validates a CSV file based on a set of internally
specified accepted formats and generates a JSON file containing normalized forms of the
valid lines from the CSV file.
Example:
    The class in this module can be imported and passed an initial value for the input data
    file from the command line like this:
        $ python example_program.py name_of_data_file.in
Classes:
    FileFormatter: Takes an input file and output its valid lines to a result file.
"""
import json
class FileFormatter:
    """ Takes an input file and output its valid lines to a result file.
    Validates the formatting of the lines from an input file and writes the set of valid lines
    to an output file.
    Attributes:
        info_configs: A list containing lists of "accepted" configurations of the data from each line of the input file.
        in_file_name: Name of the input file.
        res_file_name: Name of the output file.
    """
    info_configs = [["phone","color","zip"], ["color","zip","phone"], ["zip","phone","color"]]
    def __init__(self,start_file_name,out_file_name):
        """Initialize FileFormatter class with the input and output file names."""
        self.in_file_name = start_file_name
        self.res_file_name = out_file_name
    def validate_line(self, line):
        """Validates that each line is in the correct format.
        Takes a line from a file, validate that the first two elements are properly formatted
        names, then validates that the remaining elements (phone number, zip code, color)
        in the line are properly formatted.
        Args:
            line: A line from a file
        Returns:
            A list of tokenized elements from the original line (string) in the correct order
            according to the specified format. For example:
            [Lastname, Firstname, (703)-742-0996, Blue, 10013] or
            [Firstname, Lastname, Red, 11237, 703 955 0373] or
            [Firstname, Lastname, 10013, 646 111 0101, Green]
            If a value of None is returned, some element in the line wasn't in the correct format.
        """
        line = tokenize(line)
        if len(line) != 5:
            return None
        full_name = (line[0],line[1])
        if not is_name(full_name):
            return None
        config = ["","",""]
        entry = { "color": "", "firstname": "", "lastname": "", "phonenumber": "", "zipcode": ""}
        phone_idx = 0
        zip_idx = 0
        color_idx = 0
        for i in range(2,len(line)):
            if is_phone_number(line[i]):
                phone_idx = i-2
                config[phone_idx] = "phone"
            if is_zip_code(line[i]):
                zip_idx = i-2
                config[zip_idx] = "zip"
            if is_color(line[i]):
                color_idx = i-2
                config[color_idx] = "color"
        if config in self.info_configs: # if the phone number, zip code, and color have been found and are in correct order
            if phone_idx == 0:
                line[0], line[1] = line[1], line[0]
            line = [token.strip(" ") for token in line]
            line = [token.replace(",","") for token in line]
            line[len(line)-1] = line[len(line)-1].replace("\n","")
            entry["firstname"] = line[0]
            entry["lastname"] = line[1]
            entry["color"] = line[color_idx+2]
            entry["phonenumber"] = line[phone_idx+2]
            entry["zipcode"] = line[zip_idx+2]
            return entry
        return None
    def parse_file(self):
        """Parses an input file, validates the formatting of its lines, and writes a JSON file with the properly formatted lines.
        Iterates through the input file validating each line. Creates a dictionary that contains
        a list of entries comprised of valid lines from the input file. Creates a JSON object
        of normalized data sorted in ascending order by a tuple of (lastname, firstname) for each line.
        """
        lines_dict = {}
        json_dict = {}
        errors = []
        with open(self.in_file_name,'r') as info_file:
            i = 0
            for line in info_file:
                valid_line = self.validate_line(line)
                if valid_line:
                    lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
                else:
                    errors.append(i)
                i += 1
        json_dict["entries"] = [lines_dict[key] for key in sorted(lines_dict.keys(), reverse = True)] # sort by (lastname, firstname,) key value
        json_dict["errors"] = errors
        with open(self.res_file_name,'w') as out_file:
            json.dump(json_dict, out_file, indent = 2)
# utility methods for parsing the file
def tokenize(line):
    """Splits the passed in string on the delimiter and return a list of tokens.
    Takes a string and splits it on a delimiter while maintaining the delimiter in its
    original position in the string. If the first word in the string doesn't end with a comma,
    the split operation will yield four tokens instead of five so the first two words (names) are broken
    up by the space character.
    Args:
        line: A string to be broken up into tokens based on a delimiter.
    Returns:
        A list of tokens (words) from the passed in line.
    """
    delim = ","
    tokens = [e + delim for e in line.split(delim) if e]
    if len(tokens) == 4:
        names = tokens[0].split(" ")
        names[0] = names[0] + delim
        names[1] = " " + names[1]
        info = tokens[1:]
        tokens = []
        tokens.extend(names)
        tokens.extend(info)
    return tokens
def is_name(name_tuple):
    """Determines if the first two elements in a file line (names) are correctly formatted.
    Takes a tuple of elements and validates that they match one of two valid formats. Either both
    words end in a comma or the second one does while the first one doesn't.
    Args:
        name_tuple: A tuple of two elements (first and last name) from a line in a file
    Returns:
        A boolean indicating if the elements (names) in the tuple are correctly formatted.
    """
    names = (name_tuple[0].strip(" "), name_tuple[1].strip(" "))
    comma_first_case = False
    comma_second_case = False
    name1_comma = False
    name2_comma = False
    for i in range(2):
        curr_len = len(names[i])
        for j in range(curr_len):
            if not names[i][j].isalpha() and j < curr_len -1:
                return False
            if j == curr_len - 1 and i == 0 and names[i][j] == ',':
                name1_comma = True
            if j == curr_len - 1 and i == 1 and names[i][j] == ',':
                name2_comma = True
    comma_first_case = name1_comma and name2_comma # both have commas
    comma_second_case = not name1_comma and name2_comma # name2 has comma, name 1 doesnt
    if not (comma_first_case or comma_second_case):
        return False
    return True
def is_phone_number(token):
    """Determines if the passed in string represents a properly formatted 10-digit phone number.
    Takes a string and validates that it matches one of two valid formats specified for a phone number.
    Validates that the sequence of characters is an exact match to one of the valid formats.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted phone number.
    """
    token = token.strip(" ")
    char_sequence = []
    case_1 = ["paren","number","number","number","paren","dash","number","number","number","dash","number","number","number","number"]
    case_2 = ["number","number","number","space","number","number","number","space","number","number","number","number"]
    for char in token:
        is_paren = char == "(" or char == ")"
        is_dash = char == "-"
        is_ws = char == " "
        if represents_int(char):
            char_sequence.append("number")
        if is_paren:
            char_sequence.append("paren")
        if is_dash:
            char_sequence.append("dash")
        if is_ws:
            char_sequence.append("space")
    if char_sequence == case_1 or char_sequence == case_2:
        return True
    return False
def is_color(token):
    """Determines if the passed in string represents a color.
    Takes a string and validates that it matches the valid formats specified for a color.
    Validates that it is only a one word color.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted color.
    """
    token = token.strip(" ")
    for i in range(len(token)):
        if token[i] != "," and token[i] != "\n":
            if not token[i].isalpha() or not token[i].islower() :
                return False
    return True
def is_zip_code(token):
    """Determines if the passed in string represents a properly formatted 5-digit zip code.
    Takes a string and validates that it matches the valid formats specified for a zip code.
    Validates that the string doesn't contain more than 5 numbers.
    Args:
        token: A fragment of a line of a file
    Returns:
        A boolean indicating if the string is a properly formatted zip code.
    """
    token = token.strip(" ")
    digit_count = 0
    for digit in token:
        if digit != "," and digit != "\n":
            if represents_int(digit):
                digit_count += 1
            else:
                return False
    if digit_count != 5:
        return False
    return True
def represents_int(char):
    """Determines if the passed in character represents an integer.
    Takes a char and attempts to convert it to an integer.
    Args:
        char: A character
    Returns:
        A boolean indicating if the passed in character represents an integer.
    Raises:
        ValueError: An error occurred when trying to convert the character to an integer.
    """
    try:
        int(char)
        return True
    except ValueError:
        return False
if __name__ == "__main__":
    formatter= FileFormatter("data.in","result.out")
    formatter.parse_file()

Question 2

Hello, you could look at the python-export-gmail-emails-from-contacts-csv project on GitHub: github.com/bestofg/python-export-gmail-emails-from-contacts-csv

Question 3

Your function is_phone_number is a prime example of a use case for regular expressions. You are basically trying to implement one yourself here!

You can either use two different patterns here:

import re
def is_phone_number(token):
    token = token.strip(" ")
    return (re.match(r'\(\d{3}\)-\d{3}-\d{4}$', token) is not None or
            re.match(r'\d{3} \d{3} \d{4}$', token) is not None)

Here, \d is any digit, \d{n} is a run of n digits and $ is the end of the string (to make sure there is nothing after the valid phone number).

You could also combine it to one pattern:

def is_phone_number(token):
    token = token.strip(" ")
    return re.match(r'\(?\d{3}\)?[ -]\d{3}[ -]\d{4}$', token) is not None

This second pattern has the caveat that it allows phone numbers that mix the two formats, like (123 456-1235, so I would stick to the two patterns.
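
To make the caveat concrete, here is a small sketch that runs both approaches on the two accepted formats and on the mixed input. The helper names is_phone_strict and is_phone_combined are only for illustration and are not part of the original code.

```python
import re

# The two strict patterns, one per accepted phone format.
STRICT_PATTERNS = (r'\(\d{3}\)-\d{3}-\d{4}$', r'\d{3} \d{3} \d{4}$')
# The single combined pattern, which is more permissive.
COMBINED_PATTERN = r'\(?\d{3}\)?[ -]\d{3}[ -]\d{4}$'


def is_phone_strict(token):
    """True only if the token matches one of the two exact formats."""
    token = token.strip(" ")
    return any(re.match(p, token) is not None for p in STRICT_PATTERNS)


def is_phone_combined(token):
    """True if the token matches the merged, looser pattern."""
    token = token.strip(" ")
    return re.match(COMBINED_PATTERN, token) is not None


for sample in ["(703)-742-0996", "703 955 0373", "(123 456-1235"]:
    print(sample, is_phone_strict(sample), is_phone_combined(sample))
```

Both helpers accept the first two samples; only the combined pattern accepts the mixed one.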

Your functions is_color and is_zip_code seem broken to me. Since you skip over commas, "blue,green" would be a valid one-word color and "50,364" a valid ZIP-code.
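
A quick way to see this is to run condensed copies of the two checks on those inputs. This is a sketch: the loops are shortened and str.isdigit stands in for represents_int, but the comma-skipping logic is the same as in the question.

```python
def is_color(token):
    """Condensed copy of the question's check: commas and newlines are skipped."""
    token = token.strip(" ")
    for char in token:
        if char != "," and char != "\n":
            if not char.isalpha() or not char.islower():
                return False
    return True


def is_zip_code(token):
    """Condensed copy of the question's check: counts digits, skipping commas."""
    token = token.strip(" ")
    digit_count = 0
    for char in token:
        if char != "," and char != "\n":
            if char.isdigit():  # stand-in for represents_int(char)
                digit_count += 1
            else:
                return False
    return digit_count == 5


print(is_color("blue,green"))  # accepted although it is two words
print(is_zip_code("50,364"))   # accepted although it contains a comma
```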

I would use something like this:

def is_zip_code(token):
    return re.match(r'\d{5}$', token) is not None
def is_color(token):
    return re.match(r'[a-z]*$', token) is not None

The former makes sure that the token is a string of five digits and the latter makes sure that the token consists only of lower-case letters. With these in place, represents_int is no longer needed.

The function is_name is more complicated. But I would use str.endswith and exit early:

def is_name(name_tuple):
    # list(...) is needed in Python 3, where map() returns a non-indexable iterator
    name = list(map(str.strip, name_tuple))
    if not name[1].endswith(","):
        return False
    if not name[1][:-1].isalpha():
        return False
    if not (name[0].isalpha() or name[0].endswith(",") and name[0][:-1].isalpha()):
        return False
    return True

Which can be combined to:

def is_name(name_tuple):
    # list(...) is needed in Python 3, where map() returns a non-indexable iterator
    name = list(map(str.strip, name_tuple))
    return (name[1].endswith(",") and
            name[1][:-1].isalpha() and
            (name[0].isalpha() or
             name[0].endswith(",") and name[0][:-1].isalpha()))

In retrospect, I don't understand why you insist on keeping the delimiter on the string in the tokenize function. It seems like it would be way easier to drop it here and work with a tokenized list afterwards...
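
As a rough sketch of what dropping the delimiter could look like (an assumption about how one might restructure it, not the original code), tokenize can strip each piece and split an unseparated "Firstname Lastname" pair on whitespace. The validators would then need to be adapted, since they currently expect the trailing commas.

```python
def tokenize(line):
    """Split a line on commas and return stripped tokens without delimiters.

    If the names are written as "Firstname Lastname" (no comma between
    them), the split yields four tokens, so the first token is split on
    whitespace to give the two names as separate tokens.
    """
    tokens = [token.strip() for token in line.strip().split(",")]
    if len(tokens) == 4:
        names = tokens[0].split()
        tokens = names + tokens[1:]
    return tokens


print(tokenize("Lastname, Firstname, (703)-742-0996, Blue, 10013\n"))
print(tokenize("Firstname Lastname, Red, 11237, 703 955 0373\n"))
```

Note that this version no longer records whether the first name was followed by a comma, so a check like is_name would have to get that information another way.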

You could also just write one regex to rule them all (actually three, one each for each of your three input formats):

name_comma = r'[a-z]*, [a-z]*'
name_no_comma = r'[a-z]* [a-z]*'
phone_paren = r'\(\d{3}\)-\d{3}-\d{4}'
phone_space = r'\d{3} \d{3} \d{4}'
zip_code = r'\d{5}'
color = r'[a-z]*'
# Lastname, Firstname, (703)-742-0996, Blue, 10013
# Firstname Lastname, Red, 11237, 703 955 0373
# Firstname, Lastname, 10013, 646 111 0101, Green
acceptable_formats = [", ".join([name_comma, phone_paren, color, zip_code]),
                      ", ".join([name_no_comma, color, zip_code, phone_space]),
                      ", ".join([name_comma, zip_code, phone_space, color])]
def validate_line(line):
    # fullmatch rejects trailing text after a valid line; IGNORECASE lets
    # [a-z] accept capitalized words such as "Lastname" or "Blue"
    return any(re.fullmatch(pattern, line.strip(), re.IGNORECASE) is not None
               for pattern in acceptable_formats)
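
Whole-line patterns like these only tell you whether a line is valid. One possible extension, sketched below, is to use named groups so that the same match also yields the fields for the JSON output. All names here (LINE_PATTERNS, parse_line, convert) are hypothetical, the patterns use [A-Za-z]+ rather than [a-z]* so capitalized words are accepted, and unlike the original code this sketch keeps duplicate names instead of overwriting them.

```python
import json
import re

FIRST = r'(?P<firstname>[A-Za-z]+)'
LAST = r'(?P<lastname>[A-Za-z]+)'
COLOR = r'(?P<color>[A-Za-z]+)'
ZIP = r'(?P<zipcode>\d{5})'
PHONE_PAREN = r'(?P<phonenumber>\(\d{3}\)-\d{3}-\d{4})'
PHONE_SPACE = r'(?P<phonenumber>\d{3} \d{3} \d{4})'

# One compiled pattern per accepted line format.
LINE_PATTERNS = [re.compile(", ".join(parts)) for parts in (
    (LAST, FIRST, PHONE_PAREN, COLOR, ZIP),
    (FIRST + " " + LAST, COLOR, ZIP, PHONE_SPACE),
    (FIRST, LAST, ZIP, PHONE_SPACE, COLOR),
)]


def parse_line(line):
    """Return a dict of fields for a valid line, or None otherwise."""
    for pattern in LINE_PATTERNS:
        match = pattern.fullmatch(line.strip())
        if match is not None:
            return match.groupdict()
    return None


def convert(lines):
    """Collect valid entries sorted by (lastname, firstname) plus error line numbers."""
    entries, errors = [], []
    for line_number, line in enumerate(lines):
        entry = parse_line(line)
        if entry is None:
            errors.append(line_number)
        else:
            entries.append(entry)
    entries.sort(key=lambda e: (e["lastname"], e["firstname"]))
    return {"entries": entries, "errors": errors}


result = convert([
    "Smith, Ann, (703)-742-0996, Blue, 10013\n",
    "Bob Jones, Red, 11237, 703 955 0373\n",
    "not a contact\n",
])
print(json.dumps(result, indent=2, sort_keys=True))
```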

Question 4

Thanks for the suggestions. Regex is definitely a good idea to check the phone numbers and colors. I'm not sure why it didn't occur to me.

Question 5

I think most of us eventually encounter a CSV to JSON converter problem in our careers.

When I last did something similar, I used the csvschema package (it is a bit outdated at the moment, but does the job). Defining your own "csv structure" class conveniently encapsulates your field types and validation logic. represents_int() would be replaced with the built-in IntColumn field, and the other is_* functions with custom columns.

Or, at the very least, using the csv module might help with the tokenizing part.
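
For instance, a minimal sketch of tokenizing with the standard csv module might look like this. The sample lines are the placeholder formats from the question, and io.StringIO stands in for the real input file.

```python
import csv
import io

# Stand-in for the opened input file.
sample = io.StringIO(
    "Lastname, Firstname, (703)-742-0996, Blue, 10013\n"
    "Firstname Lastname, Red, 11237, 703 955 0373\n"
)

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(sample, skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(len(row), row)
```

The second format yields four fields, because the two names are not separated by a comma, so the first field would still need to be split on the space.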

Some other notes about the code:

comma_first_case and comma_second_case don't need to be defined as False since you overwrite them later on
names[0] = names[0] + delim can be rewritten as names[0] += delim
remove the extra spaces around = when passing keyword arguments
add an extra space after the commas when passing multiple arguments to functions

Instead of manually maintaining the i counter in the parse_file() function, use enumerate():

for line_number, line in enumerate(info_file):
    valid_line = self.validate_line(line)
    if valid_line:
        lines_dict[(valid_line["lastname"],valid_line["firstname"])] = valid_line
    else:
        errors.append(line_number)

you can use negative indexing, replacing line[len(line)-1] with line[-1]
separate top-level function and class definitions with two blank lines
you don't need to put the double underscores around your script file name

And, overall, really good job documenting the code. Note that whenever the code changes, you need to keep the documentation up to date with it.

Question 6

Thanks for the feedback! How are you suggesting that I use the enumerate() function? It seems like it just returns a list of values in the length of the object. I'm trying to just return the lines of the file where there are errors.

Question 7

@loremIpsum1771 sure, updated with a sample. Thanks.

Question 8

Oh ok. So here I'm assuming for line_number, line in enumerate(info_file) the indices of enumerate are being used as pointers so that you can pass line to the validate_line() function but then also print out line_number as an int?

Question 9

@loremIpsum1771 yup, it's a Pythonic way to have both indexes and actual items at the same time.
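
A tiny standalone sketch of that behavior, with io.StringIO standing in for the opened file and a substring test standing in for the real validation:

```python
import io

# Stand-in for the opened input file.
info_file = io.StringIO("first line\nsecond line\nthird line\n")

errors = []
for line_number, line in enumerate(info_file):
    # enumerate yields (index, item) pairs, so both are available here.
    if "second" in line:  # placeholder for a failed validation
        errors.append(line_number)

print(errors)

# The optional start argument changes the first index, e.g. for 1-based line numbers.
numbered = list(enumerate(["a", "b"], start=1))
print(numbered)
```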

Question 10

Oh ok, cool. Lastly, what did you mean by "you don't need to put the double underscores around your script file name" ? I only put the underscores around main.py. I made a follow up post with the changes btw.

Graipher · Accepted Answer · 2017-04-05 14:52:44Z

