Here's the exercise in brief:
Consider the following file, before.csv:

```
A; ; B;
B; A; H;
C; ; D;
D; C; G;
E; D; F;
F; E; H;
G; D; ;
H; G; ;
```
And a modified version of the file, after.csv:

```
A; ; B;
B; A; H;
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;
```
The first field of the CSV is a unique identifier of each line. The exercise consists of detecting the changes applied to the file, by comparing before and after.
There are 3 types of changes you should detect:
- ADDED (line is present in after.csv but not in before.csv)
- REMOVED (line is present in before.csv but not in after.csv)
- MODIFIED (line is present in both, but the second and/or third field is modified)
In my example, there are three changes:
- ADDED line (K)
- REMOVED line (H)
- MODIFIED line (D)
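For reference, here is the report I expect my script (below) to print for this example (line order may vary, since dicts are unordered in Python 2):

```
3 changes
ADD	K
DELETE	H
MODIFY	D	col2	C	BECOMES	
```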
And my code:
```python
import collections
import csv
import sys


class P_CSV(dict):
    '''A P_CSV is a dict representation of the csv file:
    {"id": dict(csvfile)}'''
    fieldnames = ["id", "col2", "col3"]

    def __init__(self, input):
        map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
            skipinitialspace=True))

    def readline(self, line):
        self[line["id"]] = line

    def get_elem(self, name):
        for i in self:
            if i == name:
                return self[i]


class Change:
    '''a Change element will be instantiated
    each time a difference is found'''

    def __init__(self, *args):
        self.args = args

    def echo(self):
        print "\t".join(self.args)


class P_Comparator(collections.Counter):
    '''This class holds 2 P_CSV objects and counts
    the number of occurrences of each line.'''

    def __init__(self, in_pcsv, out_pcsv):
        self.change_list = []
        self.in_pcsv = in_pcsv
        self.out_pcsv = out_pcsv
        self.readfile(in_pcsv, 1)
        self.readfile(out_pcsv, -1)

    def readfile(self, file, factor):
        for key in file:
            self[key] += factor

    def scan(self):
        for i in self:
            if self[i] == -1:
                self.change_list.append(Change("ADD", i))
            elif self[i] == 1:
                self.change_list.append(Change("DELETE", i))
            else:  # element exists in both files. Check if modified
                j = J_Comparator(self.in_pcsv.get_elem(i),
                                 self.out_pcsv.get_elem(i))
                if len(j) > 0:
                    self.change_list += j


class J_Comparator(list):
    '''This class compares the attributes of two lines and returns a list
    of Change objects, one for every difference'''

    def __init__(self, indict, outdict):
        for i in indict:
            if indict[i] != outdict[i]:
                self.append(Change("MODIFY", indict["id"], i,
                                   indict[i], "BECOMES", outdict[i]))
        if len(self) == 0:
            self = None


# Main
p = P_Comparator(P_CSV(open(sys.argv[1], "rb")), P_CSV(open(sys.argv[2], "rb")))
p.scan()
print "{} changes".format(len(p.change_list))
[c.echo() for c in p.change_list]
```
In real life, the code is supposed to compare two very large files (>6500 lines) with many more fields (>10).

How can I improve both the style of my code and the performance of the script? For the record, I'm using Python 2.7.
4 Answers
Implementation
- The P_CSV trick is a good idea.
- I don't know if `input` is supposed to be a file name, a string, a file object, and so on (this is Python's fault, but still). Please use a better name and document that it is a file.
- What does `{"id": dict(csvfile)}` mean in your docstring?
- `get_elem` could be implemented with `return self.get(name)`. The way you're doing it is misleading, since you're relying on the fact that no `return` means `return None`.
- This means you can either remove `get_elem` or find a better name explaining that it's just like `get` except that it returns `None` instead of throwing an exception.
- I guess you want to use `__str__` in `Change`, and maybe `__repr__` (but not `echo`); see the sketch after this list.
- Do you really need `Change` as it is? Simply store lists in `change_list`, instead of `Change`s.
- `readfile`? I'd say `readcsv`, since it is no longer a file, but a `P_CSV`. If you have a more descriptive name for what those files contain, then use that instead.
- `J_Comparator` doesn't work as requested in the exercise, since it also says what columns were modified.
- Setting a list to `None` is wrong. "No values" is the empty list. You can then use `self.change_list.extend(j)` without needing to worry about the empty list. It's much more elegant.
- Why P and J for the comparators?
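To illustrate the `__str__`/`__repr__` point, here is a minimal sketch (a suggestion, not the only way to do it):

```python
class Change(object):
    '''One detected difference; args are the columns of the report line.'''
    def __init__(self, *args):
        self.args = args

    def __str__(self):
        # what echo() currently does; `print change` now just works
        return "\t".join(self.args)

    def __repr__(self):
        # unambiguous form, handy when inspecting change_list in a debugger
        return "Change{!r}".format(self.args)
```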
Performance
Performance is good: you're using a linear algorithm, even though you're going through the files twice. If you're worried about very, very large files that won't fit in memory, you can use the assumption that the files are sorted to advance through both files simultaneously, making sure to always have the same unique id in both files. I don't think this is needed; 6500 lines is quite small!
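If you ever do hit that limit, the sorted-merge idea could look roughly like this (a sketch: `diff_sorted` and the `(id, fields)` pairs are my own names, and both inputs must already be sorted by id):

```python
def diff_sorted(before_rows, after_rows):
    '''Walk two iterators of (id, fields) pairs, both sorted by id,
    yielding one (kind, id) tuple per difference. Memory use is O(1).'''
    sentinel = object()
    a = next(before_rows, sentinel)
    b = next(after_rows, sentinel)
    while a is not sentinel or b is not sentinel:
        if b is sentinel or (a is not sentinel and a[0] < b[0]):
            yield ("DELETE", a[0])          # id only in the old file
            a = next(before_rows, sentinel)
        elif a is sentinel or b[0] < a[0]:
            yield ("ADD", b[0])             # id only in the new file
            b = next(after_rows, sentinel)
        else:                               # same id in both files
            if a[1] != b[1]:
                yield ("MODIFY", a[0])
            a = next(before_rows, sentinel)
            b = next(after_rows, sentinel)
```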
Thank you so much! I'll try to make my variable names clearer and more descriptive. I'll also get rid of the `get_elem` method; I have just realized that `P_CSV` inherits from `dict`, so I can access `in_pcsv[i]` directly. I should also consider raising exceptions in case it's empty. Also thank you for both the `__repr__` and `list.extend()` tips :) – rahmu, Mar 6, 2012 at 13:34
I'll start from the very first method of your code, and then we'll see how everything could be improved.
Let's start with a matter of style. This line:
```
def __init__(self, input):
    map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
        skipinitialspace=True))                                              ^
                                                                             |
                                                     there's no need for this
```
could just be:

```python
def __init__(self, input):
    map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",
                                      skipinitialspace=True))
```
Further style guidance can be found in PEP 8.
Also, that use of `map` seems hackish; it would be clearer this way:
```python
def __init__(self, csvfile):
    for dct in csv.DictReader(csvfile, self.fieldnames, delimiter=";",
                              skipinitialspace=True):
        self.readline(dct)
```
I've also changed the argument name to `csvfile` to state more clearly what that argument is, and to avoid unnecessarily shadowing the built-in `input()`.
But I'm sure you see that it still looks somewhat ugly, for two reasons:

- You don't need the `readline` method, but the `dict` `__init__`.
- You've subclassed the wrong class.
Yes: why not subclass `csv.DictReader`? If you go and take a look into the source file `csv.py`, you'll see that `DictReader` is actually quite a simple class, so maybe you could just copy what you need and write your own. The bare bones could be something like:
```python
import csv

class IdDictReader:
    def __init__(self, f, fieldnames, dialect="excel", *args, **kwds):
        self.fieldnames = fieldnames
        self.reader = csv.reader(f, dialect, *args, **kwds)
        self.dialect = dialect
        self.line_num = 0

    def __iter__(self):
        return self

    def next(self):
        row = self.reader.next()
        self.line_num = self.reader.line_num
        while row == []:  # skip blank lines
            row = self.reader.next()
        d = {row[0]: dict(zip(self.fieldnames, row[1:]))}
        return d
```
In the same way you'll need to write an `IdDictWriter`.
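For symmetry, a bare-bones `IdDictWriter` might look something like this (an untested sketch mirroring the reader; error handling omitted):

```python
import csv

class IdDictWriter:
    '''Writes {id: {field: value}} mappings back out as CSV rows.'''
    def __init__(self, f, fieldnames, dialect="excel", **kwds):
        self.fieldnames = fieldnames  # the non-id columns, in order
        self.writer = csv.writer(f, dialect, **kwds)

    def writerow(self, iddict):
        for _id, fields in iddict.items():
            # id first, then the remaining columns in fieldnames order
            self.writer.writerow([_id] + [fields.get(k, "")
                                          for k in self.fieldnames])
```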
This alone will improve your performance a bit, but we haven't touched the core of your system yet.
Now you should also lose the `P_Comparator` class, because a `Counter` seems fragile. I would go with `set`s.
So let's say that `after` and `before` are two `dict`s organized exactly like yours (thanks to the `IdDictReader`):

```python
before = {'B': {'col2': 'A', 'col3': 'H'}, ...}
after = {...}
```
```python
before_ids = set(before)
after_ids = set(after)

added = after_ids - before_ids
deleted = before_ids - after_ids
common = before_ids & after_ids
```
Now you'll just need to see which are changed:

```python
for _id in common:
    if before[_id] != after[_id]:
        # ...
```
`set`s in Python are very well implemented, so I'd say that you should see a considerable performance improvement. But when it comes to performance, the only thing to do is time it, profile it, and see what's best.
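Putting the pieces together, the whole comparison collapses to a few lines. A self-contained sketch (the dict literals stand in for what the reader would produce):

```python
def compare(before, after):
    '''before/after: {id: {field: value}} mappings. Returns three id lists.'''
    before_ids = set(before)
    after_ids = set(after)
    added = sorted(after_ids - before_ids)
    deleted = sorted(before_ids - after_ids)
    modified = sorted(_id for _id in before_ids & after_ids
                      if before[_id] != after[_id])
    return added, deleted, modified

before = {"D": {"col2": "C", "col3": "G"}, "H": {"col2": "G", "col3": ""}}
after = {"D": {"col2": "", "col3": "G"}, "K": {"col2": "", "col3": "E"}}
print compare(before, after)  # (['K'], ['H'], ['D'])
```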
Thank you. I just replaced the `Counter` with `set`s. It works like a charm, although for my volume of files (6500+ lines, 10 fields) the performance gain is minimal (~1-2%) on my machine. What's wrong with my use of `map()`? – rahmu, Mar 7, 2012 at 12:58
Based on http://wiki.python.org/moin/TimeComplexity, where the theoretical run times are described as:

- List: `x in s` -> O(n)
- Set: `x in s` -> O(1) average, O(n) worst case
- Dict: get item -> O(1) average, O(n) worst case

...this is the fastest diff function I could come up with, using O(1) lookups as much as possible:
```python
import sys  # a bit superfluous if you delete the "__main__" part at the bottom.


def diff(before, after, delimiter=';'):
    # Here I use lists to maintain the ability to cycle through the fields by position.
    oldlist = before.split(delimiter)
    newlist = after.split(delimiter)
    # Here I use sets to benefit from the O(1) lookup time.
    old = set(oldlist)
    new = set(newlist)
    inboth = old & new  # saves us from looking at a lot of redundant stuff.
    unchanged, added, removed, modified = {}, {}, {}, {}  # I love multiple assignment.
    for index in range(max(len(oldlist), len(newlist))):
        if index < len(newlist):
            if newlist[index] in inboth:
                # print('\t', newlist[index])  # uncomment to see diff.
                unchanged[index] = newlist[index]
            if newlist[index] not in old:
                # print('+:\t', newlist[index])  # uncomment to see diff.
                added[index] = newlist[index]
        if index < len(oldlist) and oldlist[index] not in new:
            # print('-:\t', oldlist[index])  # uncomment to see diff.
            removed[index] = oldlist[index]
        if index in added and index in removed:
            modified[index] = str(removed.pop(index)) + "-->" + str(added.pop(index))
    return unchanged, added, removed, modified


def main():
    if sys.version_info <= (3, 2):
        sys.stdout.write("Sorry, requires Python 3.2.x, not Python 2.x\n")
        sys.exit(1)
    before = "1;2;Bla;4;5;6;7;Bla2;8"
    after = "1;2;Bob;4;5;7;Bob2;8;9;10"
    # modified: Bla -> Bob, Bla2 -> Bob2
    # removed: 6
    # added: 9, 10
    unchanged, added, removed, modified = diff(before, after)
    print("-------------------\n",
          "asis:\t", len(unchanged), "\n",
          "adds:\t", len(added), "\n",
          "rems:\t", len(removed), "\n",
          "mods:\t", len(modified))


if __name__ == '__main__':
    main()
```
I think that a simpler approach could be to compare the two files' "indexes" using list comprehensions. What I would propose: create an id list for the before and after files, extracted from the CSV (as you already do). So you would end up with something like:
```python
before = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
after = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'K']
```
and then you can:
```python
removed = [x for x in before if x not in after]
added = [x for x in after if x not in before]
```
Everything else looks good to me. Hope this helps!
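One caveat, raised in the comment below: `x not in after` scans the whole list, so the two comprehensions are O(n*m) together. Converting to sets first keeps the same one-liners while making each membership test O(1) on average (a small sketch of that variant):

```python
before = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
after = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'K']

before_set = set(before)
after_set = set(after)

removed = [x for x in before if x not in after_set]  # ['H']
added = [x for x in after if x not in before_set]    # ['K']
```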
This would be perfectly clear, but wouldn't it be super slow? You parse the whole after file for each before entry AND THEN do the same the other way around! In my script you parse each file once and increment/decrement the counter accordingly. If the result is not 0, then you know there was a change. – rahmu, Mar 6, 2012 at 12:56
… `id`s? Because if you can, I would suggest a different approach, i.e. parsing the two files side by side.

… the `Change` class in particular will have multiple methods, but that's beyond the scope of this code review. (I'm assuming you meant the Unix `diff` tool. If you meant some Python utility then please show me some link.) Maybe you reckon I run `diff` on both files and pipe the output to my script? – rahmu