Comparing phone numbers across CSVs Python

Question 1

(continuation from Speeding up and fixing phone numbers from CSVs with Regex)

I'm pulling all of the phone numbers from all CSVs in two different directories, outputting them in a single simple format to two different files, and then comparing those two files for which numbers are in one but not the other.

I'm failing in speed (of execution), style, and results. Here's what I was trying:

import csv
import re
import glob
import string
with open('Firstlist.csv', 'wb') as out:
 with open('Secondlist.csv', 'wb') as out2:
 with open('SecondnotinFirst.csv', 'wb') as out3:
 with open('FirstnotinSecond.csv', 'wb') as out4:
 seen = set()
 regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
 out_writer = csv.writer(out)
 out_writer.writerow([])
 csv_files = glob.glob('First\*.csv')
 for filename in csv_files:
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 s1 = column.strip()
 match = regex.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 if canonical_phone not in seen:
 seen.add(canonical_phone)
 for val in seen:
 out_writer.writerow([val])
 seen2 = set()
 out_writer2 = csv.writer(out2)
 out_writer2.writerow([])
 csv_files2 = glob.glob('Second\*.csv')
 for filename in csv_files2:
 with open(filename, 'rbU') as ifile2:
 read2 = csv.reader(ifile2)
 for row in read2:
 for column in row:
 s2 = column.strip()
 match2 = regex.search(s2)
 if match2:
 canonical_phone2 = re.sub(r'\D', '', match2.group(0))
 if canonical_phone2 not in seen2:
 seen2.add(canonical_phone2)
 for val in seen2:
 out_writer2.writerow([val])
 out_writer3 = csv.writer(out3)
 for val in seen2 not in seen:
 out_writer3.writerow([val])
 out_writer4 = csv.writer(out4)
 for val in seen not in seen2:
 out_writer4.writerow([val])

Question 2

Some of the operations on built-in sets might make your code simpler (and possibly faster): docs.python.org/2/library/stdtypes.html#set

Question 3

Here's my 5 cents as non-python guy

Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

rename your out, out2, and the corresponding writers. Instead use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies for your writers.

Also seen2 is not a good name.. What do you store in that variable? Name it after that: second_seen is definitely better, than seen2

Question 4

These kinds of general tips make codereview so great for someone like myself who has no clue what I'm doing. It's embarrassing to chalk up my line-by-line blindly-wandering code, but I only learn from trial and error.

Question 5

Hopefully I understand what your trying to do...

Take all phone numbers from all files in directory A, strip phone numbers out, storing it in a summary file.
Do the same for directory B.
Get a file of what is not in directory A, and another that is not in directory B.

I would break the problem down into a few steps:

Process each file in directory A, save the summary.
Do the same for directory B.
Compare A_summary with B_summary.
Compare B_summary with A_summary.

Here is what I would change the code to in order to get going toward a solution.

import csv
import re
import os
PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
def process_file(filename, phone_numbers=None):
 """ Add phone numbers from a file to the set of phone_numbers. """
 if phone_numbers = None:
 phone_numbers = set()
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 s1 = column.strip()
 match = PHONE_PATTERN.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 if canonical_phone not in phone_numbers:
 phone_numbers.add(canonical_phone)
 return phone_numbers
# Check each directory, creating a set to tally up the unique numbers.
for directory in ['directory_A_path', 'directory_B_path']:
 phone_numbers = set()
 # Process all the files in a directory
 for filename in os.listdir(directory):
 if(filename.endswith(".csv")):
 phone_numbers = process_file(filename, phone_numbers)
 # Create a summary file for the directory and write the numbers
 with open(directory + "_summary.csv") as summary:
 summary_writer = csv.writer(summary)
 for phone_number in phone_numbers:
 summary_writer.writerow([phone_number])
summary_A = process_file('directory_A_path_summary.csv')
summary_B = process_file('directory_A_path_summary.csv')

I think you could then diff summaries and write them out. Hopefully it is helpful.

Question 6

A simple way to avoid excessive indentation is to combine all of the with blocks into one:

with open('List1.csv', 'wb') as out1, \
 open('List2.csv', 'wb') as out2, \
 open('List3.csv', 'wb') as out3, \
 open('List4.csv', 'wb') as out4:
 seen = set()
 # etc.

I've renamed out → out1 for parallelism.

However, another problem is that the code is repetitive and monolithic. Ideally, there should be a function, say read_unique_phone_numbers_from_csv_files(glob). There is a handy data structure for that: collections.OrderedDict.

def read_unique_phone_numbers_from_csv_files(csv_glob):
 regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
 unique_phones = collections.OrderedDict()
 csv_files = glob.glob(csv_glob)
 for filename in csv_files:
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 # column.strip() is superfluous
 match = regex.search(column)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_phones[canonical_phone] = True
 return unique_phones

With that function defined, the main loop should become quite simple.

Question 7

Try to avoid deeply nesting code as much as possible. Deeply nested code has high cyclomatic complexity -- too many possible execution paths. The higher complexity, the harder to test, more error prone, fragile.

The deep nesting of out...out3 file handlers is especially bad, because they are independent from each other. Try to write this more "horizontally", without nesting them within each other.

Another big issue with your code is duplicated logic:

You could write the logic reading from a file, and use it twice when you read the two input files
You could write the logic writing to a file, and use it 4 times when you write the output files

If you do something more than once, you should extract to a separate method so it's less typing, and if you have to fix a bug, you can fix it once, which is easier and more robust.

This is pointless, you're not using it, so remove it:

import string

Putting it all together (untested):

def read_unique_numbers_from_reader(reader):
 unique_numbers = set()
 for row in reader:
 for column in row:
 s1 = column.strip()
 match = regex.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_numbers.add(canonical_phone)
 return unique_numbers
def read_unique_numbers_from_files(glob_pattern):
 unique_numbers = set()
 for filename in glob.glob(glob_pattern):
 with open(filename, 'rbU') as ifile:
 reader = csv.reader(ifile)
 unique_numbers.update(read_unique_numbers_from_reader(reader))
 return unique_numbers
def write_numbers(filename, numbers):
 with open(filename, 'wb') as out:
 out_writer = csv.writer(out)
 out_writer.writerow([])
 for val in numbers:
 out_writer.writerow([val])
numbers_in_first = read_unique_numbers_from_files('First*.csv')
write_numbers('Firstlist.csv', numbers_in_first)
numbers_in_second = read_unique_numbers_from_files('Second*.csv')
write_numbers('Secondlist.csv', numbers_in_second)
write_numbers('SecondnotinFirst.csv', numbers_in_second - numbers_in_first)
write_numbers('FirstnotinSecond.csv', numbers_in_first - numbers_in_second)

Vogel612 Vogel612 25.5k7 gold badges59 silver badges141 bronze badges · Answer 1 · 2014-04-15 08:32:26Z

Here's my 5 cents as non-python guy

Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

rename your out, out2, and the corresponding writers. Instead use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies for your writers.

Also seen2 is not a good name.. What do you store in that variable? Name it after that: second_seen is definitely better, than seen2

These kinds of general tips make codereview so great for someone like myself who has no clue what I'm doing. It's embarrassing to chalk up my line-by-line blindly-wandering code, but I only learn from trial and error.

clutton clutton 8398 silver badges13 bronze badges · Answer 2 · 2014-04-16 02:00:15Z

Hopefully I understand what your trying to do...

Take all phone numbers from all files in directory A, strip phone numbers out, storing it in a summary file.
Do the same for directory B.
Get a file of what is not in directory A, and another that is not in directory B.

I would break the problem down into a few steps:

Process each file in directory A, save the summary.
Do the same for directory B.
Compare A_summary with B_summary.
Compare B_summary with A_summary.

Here is what I would change the code to in order to get going toward a solution.

import csv
import re
import os
PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
def process_file(filename, phone_numbers=None):
 """ Add phone numbers from a file to the set of phone_numbers. """
 if phone_numbers = None:
 phone_numbers = set()
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 s1 = column.strip()
 match = PHONE_PATTERN.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 if canonical_phone not in phone_numbers:
 phone_numbers.add(canonical_phone)
 return phone_numbers
# Check each directory, creating a set to tally up the unique numbers.
for directory in ['directory_A_path', 'directory_B_path']:
 phone_numbers = set()
 # Process all the files in a directory
 for filename in os.listdir(directory):
 if(filename.endswith(".csv")):
 phone_numbers = process_file(filename, phone_numbers)
 # Create a summary file for the directory and write the numbers
 with open(directory + "_summary.csv") as summary:
 summary_writer = csv.writer(summary)
 for phone_number in phone_numbers:
 summary_writer.writerow([phone_number])
summary_A = process_file('directory_A_path_summary.csv')
summary_B = process_file('directory_A_path_summary.csv')

I think you could then diff summaries and write them out. Hopefully it is helpful.

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Answer 3 · 2014-04-16 06:39:53Z

A simple way to avoid excessive indentation is to combine all of the with blocks into one:

with open('List1.csv', 'wb') as out1, \
 open('List2.csv', 'wb') as out2, \
 open('List3.csv', 'wb') as out3, \
 open('List4.csv', 'wb') as out4:
 seen = set()
 # etc.

I've renamed out → out1 for parallelism.

However, another problem is that the code is repetitive and monolithic. Ideally, there should be a function, say read_unique_phone_numbers_from_csv_files(glob). There is a handy data structure for that: collections.OrderedDict.

def read_unique_phone_numbers_from_csv_files(csv_glob):
 regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
 unique_phones = collections.OrderedDict()
 csv_files = glob.glob(csv_glob)
 for filename in csv_files:
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 # column.strip() is superfluous
 match = regex.search(column)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_phones[canonical_phone] = True
 return unique_phones

With that function defined, the main loop should become quite simple.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Answer 4 · 2014-04-16 06:04:33Z

Try to avoid deeply nesting code as much as possible. Deeply nested code has high cyclomatic complexity -- too many possible execution paths. The higher complexity, the harder to test, more error prone, fragile.

The deep nesting of out...out3 file handlers is especially bad, because they are independent from each other. Try to write this more "horizontally", without nesting them within each other.

Another big issue with your code is duplicated logic:

You could write the logic reading from a file, and use it twice when you read the two input files
You could write the logic writing to a file, and use it 4 times when you write the output files

If you do something more than once, you should extract to a separate method so it's less typing, and if you have to fix a bug, you can fix it once, which is easier and more robust.

This is pointless, you're not using it, so remove it:

import string

Putting it all together (untested):

def read_unique_numbers_from_reader(reader):
 unique_numbers = set()
 for row in reader:
 for column in row:
 s1 = column.strip()
 match = regex.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_numbers.add(canonical_phone)
 return unique_numbers
def read_unique_numbers_from_files(glob_pattern):
 unique_numbers = set()
 for filename in glob.glob(glob_pattern):
 with open(filename, 'rbU') as ifile:
 reader = csv.reader(ifile)
 unique_numbers.update(read_unique_numbers_from_reader(reader))
 return unique_numbers
def write_numbers(filename, numbers):
 with open(filename, 'wb') as out:
 out_writer = csv.writer(out)
 out_writer.writerow([])
 for val in numbers:
 out_writer.writerow([val])
numbers_in_first = read_unique_numbers_from_files('First*.csv')
write_numbers('Firstlist.csv', numbers_in_first)
numbers_in_second = read_unique_numbers_from_files('Second*.csv')
write_numbers('Secondlist.csv', numbers_in_second)
write_numbers('SecondnotinFirst.csv', numbers_in_second - numbers_in_first)
write_numbers('FirstnotinSecond.csv', numbers_in_first - numbers_in_second)

Stack Exchange Network

Comparing phone numbers across CSVs Python

4 Answers 4

Naming:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Comparing phone numbers across CSVs Python

4 Answers 4

Naming:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions