5
\$\begingroup\$

(continuation from Speeding up and fixing phone numbers from CSVs with Regex)

I'm pulling all of the phone numbers from all CSVs in two different directories, outputting them in a single simple format to two different files, and then comparing those two files for which numbers are in one but not the other.

I'm failing in speed (of execution), style, and results. Here's what I was trying:

import csv
import re
import glob
import string
with open('Firstlist.csv', 'wb') as out:
 with open('Secondlist.csv', 'wb') as out2:
 with open('SecondnotinFirst.csv', 'wb') as out3:
 with open('FirstnotinSecond.csv', 'wb') as out4:
 seen = set()
 regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
 out_writer = csv.writer(out)
 out_writer.writerow([])
 csv_files = glob.glob('First\*.csv')
 for filename in csv_files:
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 s1 = column.strip()
 match = regex.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 if canonical_phone not in seen:
 seen.add(canonical_phone)
 for val in seen:
 out_writer.writerow([val])
 seen2 = set()
 out_writer2 = csv.writer(out2)
 out_writer2.writerow([])
 csv_files2 = glob.glob('Second\*.csv')
 for filename in csv_files2:
 with open(filename, 'rbU') as ifile2:
 read2 = csv.reader(ifile2)
 for row in read2:
 for column in row:
 s2 = column.strip()
 match2 = regex.search(s2)
 if match2:
 canonical_phone2 = re.sub(r'\D', '', match2.group(0))
 if canonical_phone2 not in seen2:
 seen2.add(canonical_phone2)
 for val in seen2:
 out_writer2.writerow([val])
 out_writer3 = csv.writer(out3)
 for val in seen2 not in seen:
 out_writer3.writerow([val])
 out_writer4 = csv.writer(out4)
 for val in seen not in seen2:
 out_writer4.writerow([val])
asked Apr 15, 2014 at 8:01
\$\endgroup\$
1

4 Answers 4

3
\$\begingroup\$

Here's my 5 cents as non-python guy

Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

rename your out, out2, and the corresponding writers. Instead use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies for your writers.

Also seen2 is not a good name.. What do you store in that variable? Name it after that: second_seen is definitely better, than seen2

answered Apr 15, 2014 at 8:32
\$\endgroup\$
1
  • \$\begingroup\$ These kinds of general tips make codereview so great for someone like myself who has no clue what I'm doing. It's embarrassing to chalk up my line-by-line blindly-wandering code, but I only learn from trial and error. \$\endgroup\$ Commented Apr 16, 2014 at 16:39
3
\$\begingroup\$

Hopefully I understand what your trying to do...

  1. Take all phone numbers from all files in directory A, strip phone numbers out, storing it in a summary file.
  2. Do the same for directory B.
  3. Get a file of what is not in directory A, and another that is not in directory B.

I would break the problem down into a few steps:

  1. Process each file in directory A, save the summary.
  2. Do the same for directory B.
  3. Compare A_summary with B_summary.
  4. Compare B_summary with A_summary.

Here is what I would change the code to in order to get going toward a solution.

import csv
import re
import os
PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
def process_file(filename, phone_numbers=None):
 """ Add phone numbers from a file to the set of phone_numbers. """
 if phone_numbers = None:
 phone_numbers = set()
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 s1 = column.strip()
 match = PHONE_PATTERN.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 if canonical_phone not in phone_numbers:
 phone_numbers.add(canonical_phone)
 return phone_numbers
# Check each directory, creating a set to tally up the unique numbers.
for directory in ['directory_A_path', 'directory_B_path']:
 phone_numbers = set()
 # Process all the files in a directory
 for filename in os.listdir(directory):
 if(filename.endswith(".csv")):
 phone_numbers = process_file(filename, phone_numbers)
 # Create a summary file for the directory and write the numbers
 with open(directory + "_summary.csv") as summary:
 summary_writer = csv.writer(summary)
 for phone_number in phone_numbers:
 summary_writer.writerow([phone_number])
summary_A = process_file('directory_A_path_summary.csv')
summary_B = process_file('directory_A_path_summary.csv')

I think you could then diff summaries and write them out. Hopefully it is helpful.

answered Apr 16, 2014 at 2:00
\$\endgroup\$
3
\$\begingroup\$

A simple way to avoid excessive indentation is to combine all of the with blocks into one:

with open('List1.csv', 'wb') as out1, \
 open('List2.csv', 'wb') as out2, \
 open('List3.csv', 'wb') as out3, \
 open('List4.csv', 'wb') as out4:
 seen = set()
 # etc.

I've renamed outout1 for parallelism.

However, another problem is that the code is repetitive and monolithic. Ideally, there should be a function, say read_unique_phone_numbers_from_csv_files(glob). There is a handy data structure for that: collections.OrderedDict.

def read_unique_phone_numbers_from_csv_files(csv_glob):
 regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
 unique_phones = collections.OrderedDict()
 csv_files = glob.glob(csv_glob)
 for filename in csv_files:
 with open(filename, 'rbU') as ifile:
 read = csv.reader(ifile)
 for row in read:
 for column in row:
 # column.strip() is superfluous
 match = regex.search(column)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_phones[canonical_phone] = True
 return unique_phones

With that function defined, the main loop should become quite simple.

answered Apr 16, 2014 at 6:39
\$\endgroup\$
3
\$\begingroup\$

Try to avoid deeply nesting code as much as possible. Deeply nested code has high cyclomatic complexity -- too many possible execution paths. The higher complexity, the harder to test, more error prone, fragile.

The deep nesting of out...out3 file handlers is especially bad, because they are independent from each other. Try to write this more "horizontally", without nesting them within each other.


Another big issue with your code is duplicated logic:

  • You could write the logic reading from a file, and use it twice when you read the two input files

  • You could write the logic writing to a file, and use it 4 times when you write the output files

If you do something more than once, you should extract to a separate method so it's less typing, and if you have to fix a bug, you can fix it once, which is easier and more robust.


This is pointless, you're not using it, so remove it:

import string

Putting it all together (untested):

def read_unique_numbers_from_reader(reader):
 unique_numbers = set()
 for row in reader:
 for column in row:
 s1 = column.strip()
 match = regex.search(s1)
 if match:
 canonical_phone = re.sub(r'\D', '', match.group(0))
 unique_numbers.add(canonical_phone)
 return unique_numbers
def read_unique_numbers_from_files(glob_pattern):
 unique_numbers = set()
 for filename in glob.glob(glob_pattern):
 with open(filename, 'rbU') as ifile:
 reader = csv.reader(ifile)
 unique_numbers.update(read_unique_numbers_from_reader(reader))
 return unique_numbers
def write_numbers(filename, numbers):
 with open(filename, 'wb') as out:
 out_writer = csv.writer(out)
 out_writer.writerow([])
 for val in numbers:
 out_writer.writerow([val])
numbers_in_first = read_unique_numbers_from_files('First*.csv')
write_numbers('Firstlist.csv', numbers_in_first)
numbers_in_second = read_unique_numbers_from_files('Second*.csv')
write_numbers('Secondlist.csv', numbers_in_second)
write_numbers('SecondnotinFirst.csv', numbers_in_second - numbers_in_first)
write_numbers('FirstnotinSecond.csv', numbers_in_first - numbers_in_second)
answered Apr 16, 2014 at 6:04
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.