
I'm trying to output the difference between two CSV files, compared on two columns, and write it to a third CSV file. How can I make the following code compare by columns 0 and 3?

import csv
f1 = open ("ted.csv")
oldFile1 = csv.reader(f1, delimiter=',')
oldList1 = list(oldFile1)
f2 = open ("ted2.csv")
newFile2 = csv.reader(f2, delimiter=',')
newList2 = list(newFile2)
f1.close()
f2.close()
output1 = set(tuple(row) for row in newList2 if row not in oldList1)
output2 = set(tuple(row) for row in oldList1 if row not in newList2)
with open('Michal_K.csv', 'w') as csvfile:
    wr = csv.writer(csvfile, delimiter=',')
    for line in output2.difference(output1):
        wr.writerow(line)
Alasdair
asked Jul 19, 2015 at 18:13
    This is the kind of thing pandas was written for. Take a look at that library! Commented Jul 19, 2015 at 18:36

1 Answer


If you want the rows from ted.csv whose first and fourth column values (row[0] and row[3]) do not appear together in any row of ted2.csv, create a set of those pairs from ted2.csv and check each row from ted.csv before writing:

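import csv

# newline='' avoids blank lines between written rows on Windows (Python 3)
with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w', newline='') as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # Key pairs (first and fourth column) present in ted2.csv
    st = set((row[0], row[3]) for row in r2)
    wr = csv.writer(out)
    # Write the ted.csv rows whose key pair is missing from ted2.csv
    for row in r1:
        if (row[0], row[3]) not in st:
            wr.writerow(row)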
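
To see the filtering step on its own, here is a minimal sketch that runs on in-memory data instead of real files. The helper name rows_not_in_other, its key_cols parameter, and the sample rows are made up for illustration; they are not part of the question or the csv module.

```python
import csv
import io


def rows_not_in_other(rows, other_rows, key_cols=(0, 3)):
    """Return the rows whose key columns do not occur in other_rows."""
    # Collect the key tuples seen in the other file
    seen = {tuple(row[i] for i in key_cols) for row in other_rows}
    # Keep only rows whose key tuple is absent from that set
    return [row for row in rows if tuple(row[i] for i in key_cols) not in seen]


# Hypothetical sample data standing in for ted.csv and ted2.csv
ted = io.StringIO("a,1,x,10\nb,2,y,20\nc,3,z,30\n")
ted2 = io.StringIO("a,9,q,10\nc,3,z,99\n")

result = rows_not_in_other(list(csv.reader(ted)), list(csv.reader(ted2)))
print(result)  # [['b', '2', 'y', '20'], ['c', '3', 'z', '30']]
```

The row starting with a is dropped because ted2 also has a row with 'a' in column 0 and '10' in column 3, even though the middle columns differ. The row starting with c is kept because its fourth column differs ('30' vs '99').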

If you actually want something like the symmetric difference, where you get the rows unique to each file, make a set of the first and fourth column values from each file:

import csv
from itertools import chain

with open("ted.csv") as f1, open("ted2.csv") as f2, open('foo.csv', 'w', newline='') as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    st1 = set((row[0], row[3]) for row in r1)
    st2 = set((row[0], row[3]) for row in r2)
    # Rewind both files so they can be read a second time
    f1.seek(0)
    f2.seek(0)
    wr = csv.writer(out)
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # Rows whose key pair appears in only one of the two files
    output1 = (row for row in r1 if (row[0], row[3]) not in st2)
    output2 = (row for row in r2 if (row[0], row[3]) not in st1)
    for row in chain(output1, output2):
        wr.writerow(row)
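
If both files fit comfortably in memory (an assumption, nothing in the question says how large they are), a variant is to read each file into a list once, which avoids the seek(0) calls. The following is only a sketch on in-memory sample data; the names key and symmetric_difference_by_key are hypothetical helpers, not library functions.

```python
import csv
import io
from itertools import chain


def key(row, cols=(0, 3)):
    """Build the comparison key from the chosen columns."""
    return tuple(row[i] for i in cols)


def symmetric_difference_by_key(rows1, rows2):
    """Return rows from either list whose key is absent from the other list."""
    keys1 = {key(r) for r in rows1}
    keys2 = {key(r) for r in rows2}
    only1 = (r for r in rows1 if key(r) not in keys2)
    only2 = (r for r in rows2 if key(r) not in keys1)
    return list(chain(only1, only2))


# Hypothetical sample data standing in for ted.csv and ted2.csv
rows1 = list(csv.reader(io.StringIO("a,1,x,10\nb,2,y,20\n")))
rows2 = list(csv.reader(io.StringIO("a,7,k,10\nd,4,w,40\n")))

result = symmetric_difference_by_key(rows1, rows2)
print(result)  # [['b', '2', 'y', '20'], ['d', '4', 'w', '40']]
```

The rows starting with a share the same (column 0, column 3) pair in both inputs, so neither is written; the b row is unique to the first input and the d row to the second.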
answered Jul 19, 2015 at 18:19

7 Comments

Thanks, I'm after the rows that don't have the same element in row[0] and row[3]. I will still try the second approach to figure out the difference.
The second approach should give you the symmetric difference based on the first and fourth columns
Second approach gives me a list index out of range.
Then you don't have at least four values in each row. Add a link to the data if possible.
I had a good look at all the files: the first two have 4 columns and the third one is an empty file. Could that be the issue?
File "C:/testcsv/Pandaman.py", line 28, in <genexpr>
st = set((row[0], row[3]) for row in r1)
IndexError: list index out of range
I've also tried row[2] but got the same result.
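
One possible cause of that IndexError, not confirmed anywhere in this thread, is a blank or short line in one of the input files: csv.reader yields an empty list for a blank line, so row[0] fails. A minimal sketch of a guard, assuming such rows can simply be skipped; valid_rows and min_cols are hypothetical names and the sample data is made up:

```python
import csv
import io


def valid_rows(reader, min_cols=4):
    """Yield only the rows that have at least min_cols fields."""
    for row in reader:
        if len(row) >= min_cols:
            yield row


# Hypothetical sample data containing a blank line and a short row
data = io.StringIO("a,1,x,10\n\nb,2\nc,3,z,30\n")

# Building the key set no longer raises IndexError on the bad lines
keys = {(row[0], row[3]) for row in valid_rows(csv.reader(data))}
```

Silently skipping rows can hide real data problems, so it may be worth logging or counting the skipped lines before relying on the output.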