Im trying to output the difference between 2 csv files by two columns and create a third csv file. How can I make the following code compare by columns 0 and 3.
import csv
f1 = open ("ted.csv")
oldFile1 = csv.reader(f1, delimiter=',')
oldList1 = list(oldFile1)
f2 = open ("ted2.csv")
newFile2 = csv.reader(f2, delimiter=',')
newList2 = list(newFile2)
f1.close()
f2.close()
output1 = set(tuple(row) for row in newList2 if row not in oldList1)
output2 = set(tuple(row) for row in oldList1 if row not in newList2)
with open('Michal_K.csv','w') as csvfile:
wr = csv.writer(csvfile,delimiter=',')
for line in (output2).difference(output1):
wr.writerow(line)
-
2This is the kind of thing pandas was written for. Take a look at that library!AZhao– AZhao2015年07月19日 18:36:50 +00:00Commented Jul 19, 2015 at 18:36
1 Answer 1
If you want the rows from ted.csv that do not have any of the same third and fourth column elements as ted2, create a set of those elements from the ted2 and check each row from ted.csv before writing:
with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w') as out:
r1, r2 = csv.reader(f1), csv.reader(f2)
st = set((row[0], row[3]) for row in r1)
wr = csv.writer(out)
for row in (row for row in r2 if (row[0],row[3]) not in st):
wr.writerow(row)
If you actually want something like the symmetric difference where you get unique rows from both then make a set of each third and fourth columns from both files :
from itertools import chain
with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w') as out:
r1, r2 = csv.reader(f1), csv.reader(f2)
st1 = set((row[0], row[3]) for row in r1)
st2 = set((row[0], row[3]) for row in r2)
f1.seek(0), f2.seek(0)
wr = csv.writer(out)
r1, r2 = csv.reader(f1), csv.reader(f2)
output1 = (row for row in r1 if (row[0], row[3]) not in st2)
output2 = (row for row in r2 if (row[0], row[3]) not in st1)
for row in chain.from_iterable((output1, output2)):
wr.writerow(row)
answered Jul 19, 2015 at 18:19
Padraic Cunningham
181k30 gold badges264 silver badges327 bronze badges
Sign up to request clarification or add additional context in comments.
7 Comments
Michal K
Thanks , Im after the rows that dont have the same element in row[0] and row[3] . Will still try the second approach to figure out the difference.
Padraic Cunningham
The second approach should give you the symmetric difference based on the first and fourth columns
Michal K
Second approach gives me a list index out of range.
Padraic Cunningham
Then you don't have at least four values in each row, add a link to the data if possible
Michal K
Had a good look at all the files the first two have 4 columns the third one is an empty file. Could that be the issue File "C:/testcsv/Pandaman.py", line 28, in <genexpr> st = set((row[0], row[3]) for row in r1) IndexError: list index out of range Ive also tried row[2] but same results
|
lang-py