Here's the exercise in brief:
Consider the following file, before.csv:

```
A; ; B;
B; A; H;
C; ; D;
D; C; G;
E; D; F;
F; E; H;
G; D; ;
H; G; ;
```
And a modified version of the file, after.csv:

```
A; ; B;
B; A; H;
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;
```
The first field of the CSV is a unique identifier of each line. The exercise consists of detecting the changes applied to the file, by comparing before and after.
There are 3 types of changes you should detect:
- ADDED (line is present in after.csv but not in before.csv)
- REMOVED (line is present in before.csv but not in after.csv)
- MODIFIED (line is present in both, but the second and/or third field is modified)
In my example, there are three changes:
- ADDED line (K)
- REMOVED line (H)
- MODIFIED line (D)
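For reference, here is the report I expect my script (below) to print for this example (line order may vary, since dicts are unordered in Python 2):

```
3 changes
ADD	K
DELETE	H
MODIFY	D	col2	C	BECOMES	
```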
And my code:
```python
import collections
import csv
import sys


class P_CSV(dict):
    '''A P_CSV is a dict representation of the csv file:
    {"id": dict(csvfile)}'''
    fieldnames = ["id", "col2", "col3"]

    def __init__(self, input):
        map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
            skipinitialspace=True))

    def readline(self, line):
        self[line["id"]] = line

    def get_elem(self, name):
        for i in self:
            if i == name:
                return self[i]


class Change:
    '''a Change element will be instantiated
    each time a difference is found'''

    def __init__(self, *args):
        self.args = args

    def echo(self):
        print "\t".join(self.args)


class P_Comparator(collections.Counter):
    '''This class holds 2 P_CSV objects and counts
    the number of occurrences of each line.'''

    def __init__(self, in_pcsv, out_pcsv):
        self.change_list = []
        self.in_pcsv = in_pcsv
        self.out_pcsv = out_pcsv
        self.readfile(in_pcsv, 1)
        self.readfile(out_pcsv, -1)

    def readfile(self, file, factor):
        for key in file:
            self[key] += factor

    def scan(self):
        for i in self:
            if self[i] == -1:
                self.change_list.append(Change("ADD", i))
            elif self[i] == 1:
                self.change_list.append(Change("DELETE", i))
            else:  # element exists in both files. Check if modified
                j = J_Comparator(self.in_pcsv.get_elem(i),
                                 self.out_pcsv.get_elem(i))
                if len(j) > 0:
                    self.change_list += j


class J_Comparator(list):
    '''This class compares the attributes of two lines and returns a list
    of Change objects, one for every difference'''

    def __init__(self, indict, outdict):
        for i in indict:
            if indict[i] != outdict[i]:
                self.append(Change("MODIFY", indict["id"], i,
                                   indict[i], "BECOMES", outdict[i]))
        if len(self) == 0:
            self = None


# Main
p = P_Comparator(P_CSV(open(sys.argv[1], "rb")), P_CSV(open(sys.argv[2], "rb")))
p.scan()
print "{} changes".format(len(p.change_list))
[c.echo() for c in p.change_list]
```
In real life, the code is supposed to compare two very large files (>6500 lines) with many more fields (>10).

How can I improve both the style of my code and the performance of the script? For the record, I'm using Python 2.7.
4 Answers
Implementation
- The P_CSV trick is a good idea.
- I don't know if `input` is supposed to be a file name, a string, a file object, and so on (this is Python's fault, but still). Please use a better name and document that it is a file.
- What does `{"id": dict(csvfile)}` mean in your docstring?
- `get_elem` could be implemented with `return self.get(name)`. The way you're doing it is misleading, since you're relying on the fact that no `return` means `return None`.
- This means you can either remove `get_elem` or find a better name explaining that it's just like `get` except that it returns `None` instead of throwing an exception.
- I guess you want to use `__str__` in `Change`, and maybe `__repr__` (but not `echo`); see the sketch after this list.
- Do you really need `Change` as it is? Simply store lists in `change_list`, instead of `Change`s.
- `readfile`? I'd say `readcsv`, since it is no longer a file, but a `P_CSV`. If you have a more descriptive name for what those files contain, then use that instead.
- `J_Comparator` doesn't work as requested in the exercise, since it also says what columns were modified.
- Setting a list to `None` is wrong. "No values" is the empty list. You can then use `self.change_list.extend(j)` without needing to worry about the empty list. It's much more elegant.
- Why P and J for the comparators?
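To illustrate the `__str__`/`__repr__` point, here is a minimal sketch (a suggestion, not the only way to do it):

```python
class Change(object):
    '''One detected difference; args are the columns of the report line.'''
    def __init__(self, *args):
        self.args = args

    def __str__(self):
        # what echo() currently does; `print change` now just works
        return "\t".join(self.args)

    def __repr__(self):
        # unambiguous form, handy when inspecting change_list in a debugger
        return "Change{!r}".format(self.args)
```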
Performance
Performance is good: you're using a linear algorithm, even though you're going through the files twice. If you're worried about very, very large files that won't fit in memory, you can use the assumption that the files are sorted to advance through both files simultaneously, making sure to always have the same unique id in both files. I don't think this is needed; 6500 lines is quite small!
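If you ever do hit that limit, the sorted-merge idea could look roughly like this (a sketch: `diff_sorted` and the `(id, fields)` pairs are my own names, and both inputs must already be sorted by id):

```python
def diff_sorted(before_rows, after_rows):
    '''Walk two iterators of (id, fields) pairs, both sorted by id,
    yielding one (kind, id) tuple per difference. Memory use is O(1).'''
    sentinel = object()
    a = next(before_rows, sentinel)
    b = next(after_rows, sentinel)
    while a is not sentinel or b is not sentinel:
        if b is sentinel or (a is not sentinel and a[0] < b[0]):
            yield ("DELETE", a[0])          # id only in the old file
            a = next(before_rows, sentinel)
        elif a is sentinel or b[0] < a[0]:
            yield ("ADD", b[0])             # id only in the new file
            b = next(after_rows, sentinel)
        else:                               # same id in both files
            if a[1] != b[1]:
                yield ("MODIFY", a[0])
            a = next(before_rows, sentinel)
            b = next(after_rows, sentinel)
```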
Thank you so much! I'll try to make my variable names clearer and more descriptive. I'll also get rid of the `get_elem` method; I have just realized that `P_CSV` inherits from `dict`, so I can access `in_pcsv[i]` directly. I should also consider raising exceptions in case it's empty. Also thank you for both the `__repr__` and `list.extend()` tips :) – rahmu, Mar 6, 2012 at 13:34
I'll start from the very first method of your code, and then we'll see how everything could be improved.
Let's start with a matter of style. This line:
```
def __init__(self, input):
    map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
        skipinitialspace=True))                                              ^
                                                                             |
                                                     there's no need for this
```
could just be:

```python
def __init__(self, input):
    map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",
                                      skipinitialspace=True))
```
Further style guidance can be found in PEP 8.
Also, that use of `map` seems hackish; it would be clearer this way:
```python
def __init__(self, csvfile):
    for dct in csv.DictReader(csvfile, self.fieldnames, delimiter=";",
                              skipinitialspace=True):
        self.readline(dct)
```
I've also changed the argument name to `csvfile` to state more clearly what that argument is, and to avoid unnecessarily shadowing the built-in `input()`.
But I'm sure you see that it still looks somewhat ugly, for two reasons:

- You don't need the `readline` method, but the `dict` `__init__`.
- You've subclassed the wrong class.
Yes: why not subclass `csv.DictReader`? If you go and take a look into the source file `csv.py`, you'll see that `DictReader` is actually quite a simple class, so maybe you could just copy what you need and write your own. The bare bones could be something like:
```python
import csv

class IdDictReader:
    def __init__(self, f, fieldnames, dialect="excel", *args, **kwds):
        self.fieldnames = fieldnames
        self.reader = csv.reader(f, dialect, *args, **kwds)
        self.dialect = dialect
        self.line_num = 0

    def __iter__(self):
        return self

    def next(self):
        row = self.reader.next()
        self.line_num = self.reader.line_num
        while row == []:  # skip blank lines
            row = self.reader.next()
        d = {row[0]: dict(zip(self.fieldnames, row[1:]))}
        return d
```
In the same way you'll need to write an `IdDictWriter`.
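For symmetry, a bare-bones `IdDictWriter` might look something like this (an untested sketch mirroring the reader; error handling omitted):

```python
import csv

class IdDictWriter:
    '''Writes {id: {field: value}} mappings back out as CSV rows.'''
    def __init__(self, f, fieldnames, dialect="excel", **kwds):
        self.fieldnames = fieldnames  # the non-id columns, in order
        self.writer = csv.writer(f, dialect, **kwds)

    def writerow(self, iddict):
        for _id, fields in iddict.items():
            # id first, then the remaining columns in fieldnames order
            self.writer.writerow([_id] + [fields.get(k, "")
                                          for k in self.fieldnames])
```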
This alone will improve your performance a bit, but we haven't touched the core of your system yet.
Now you should also lose the `P_Comparator` class, because a `Counter` seems fragile. I would go with `set`s.
So let's say that `after` and `before` are two `dict`s organized exactly like yours (thanks to the `IdDictReader`):

```python
before = {'B': {'col2': 'A', 'col3': 'H'}, ...}
after = {...}
```
```python
before_ids = set(before)
after_ids = set(after)

added = after_ids - before_ids
deleted = before_ids - after_ids
common = before_ids & after_ids
```
Now you'll just need to see which are changed:

```python
for _id in common:
    if before[_id] != after[_id]:
        # ...
```
`set`s in Python are very well implemented, so I'd say that you should see a considerable performance improvement. But when it comes to performance, the only thing to do is time it, profile it, and see what's best.
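Putting the pieces together, the whole comparison collapses to a few lines. A self-contained sketch (the dict literals stand in for what the reader would produce):

```python
def compare(before, after):
    '''before/after: {id: {field: value}} mappings. Returns three id lists.'''
    before_ids = set(before)
    after_ids = set(after)
    added = sorted(after_ids - before_ids)
    deleted = sorted(before_ids - after_ids)
    modified = sorted(_id for _id in before_ids & after_ids
                      if before[_id] != after[_id])
    return added, deleted, modified

before = {"D": {"col2": "C", "col3": "G"}, "H": {"col2": "G", "col3": ""}}
after = {"D": {"col2": "", "col3": "G"}, "K": {"col2": "", "col3": "E"}}
print compare(before, after)  # (['K'], ['H'], ['D'])
```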
Thank you. I just replaced the `Counter` with `set`s. It works like a charm, although for my volume of files (6500+ lines, 10 fields) the performance gain is minimal (~1-2%) on my machine. What's wrong with my use of `map()`? – rahmu, Mar 7, 2012 at 12:58
Based on http://wiki.python.org/moin/TimeComplexity, where the theoretical run times are described as:

- List: `x in s` -> O(n)
- Set: `x in s` -> O(1) average, O(n) worst case
- Dict: get item -> O(1) average, O(n) worst case

...this is the fastest diff function I could come up with, using O(1) lookups as much as possible:
```python
import sys  # a bit superfluous if you delete the "__main__" part at the bottom.


def diff(before, after, delimiter=';'):
    # Here I use lists to maintain the ability to cycle through the fields by position.
    oldlist = before.split(delimiter)
    newlist = after.split(delimiter)
    # Here I use sets to benefit from the O(1) lookup time.
    old = set(oldlist)
    new = set(newlist)
    inboth = old & new  # saves us from looking at a lot of redundant stuff.
    unchanged, added, removed, modified = {}, {}, {}, {}  # I love multiple assignment.
    for index in range(max(len(oldlist), len(newlist))):
        if index < len(newlist):
            if newlist[index] in inboth:
                # print('\t', newlist[index])  # uncomment to see diff.
                unchanged[index] = newlist[index]
            if newlist[index] not in old:
                # print('+:\t', newlist[index])  # uncomment to see diff.
                added[index] = newlist[index]
        if index < len(oldlist) and oldlist[index] not in new:
            # print('-:\t', oldlist[index])  # uncomment to see diff.
            removed[index] = oldlist[index]
        if index in added and index in removed:
            modified[index] = str(removed.pop(index)) + "-->" + str(added.pop(index))
    return unchanged, added, removed, modified


def main():
    if sys.version_info <= (3, 2):
        sys.stdout.write("Sorry, requires Python 3.2.x, not Python 2.x\n")
        sys.exit(1)
    before = "1;2;Bla;4;5;6;7;Bla2;8"
    after = "1;2;Bob;4;5;7;Bob2;8;9;10"
    # modified: Bla -> Bob, Bla2 -> Bob2
    # removed: 6
    # added: 9, 10
    unchanged, added, removed, modified = diff(before, after)
    print("-------------------\n",
          "asis:\t", len(unchanged), "\n",
          "adds:\t", len(added), "\n",
          "rems:\t", len(removed), "\n",
          "mods:\t", len(modified))


if __name__ == '__main__':
    main()
```
I think that a simpler approach could be to compare the two files' "indexes" using list comprehensions. What I would propose: create an id list for the before and after files, extracted from the CSV (as you already do). So you would end up with something like:
```python
before = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
after = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'K']
```
and then you can:
```python
removed = [x for x in before if x not in after]
added = [x for x in after if x not in before]
```
Everything else looks good to me. Hope this helps!
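One caveat, raised in the comment below: `x not in after` scans the whole list, so the two comprehensions are O(n*m) together. Converting to sets first keeps the same one-liners while making each membership test O(1) on average (a small sketch of that variant):

```python
before = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
after = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'K']

before_set = set(before)
after_set = set(after)

removed = [x for x in before if x not in after_set]  # ['H']
added = [x for x in after if x not in before_set]    # ['K']
```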
This would be perfectly clear, but wouldn't it be super slow? You parse the whole after file for each before entry AND THEN do the same the other way around! In my script you parse each file once and increment/decrement the counter accordingly. If the result is not 0, then you know there was a change. – rahmu, Mar 6, 2012 at 12:56
… `id`s? Because if you can, I would suggest a different approach, i.e. parsing the two files side by side.

… the `Change` class in particular will have multiple methods, but that's beyond the scope of this code review. (I'm assuming you meant the Unix `diff` tool. If you meant some Python utility then please show me some link.) Maybe you reckon I run `diff` on both files and pipe the output to my script? – rahmu