Manipulating csv files with Python

Question 1

I'm trying to output the difference between two CSV files, compared on two columns, and write it to a third CSV file. How can I make the following code compare by columns 0 and 3?

import csv
f1 = open("ted.csv")
oldFile1 = csv.reader(f1, delimiter=',')
oldList1 = list(oldFile1)
f2 = open("ted2.csv")
newFile2 = csv.reader(f2, delimiter=',')
newList2 = list(newFile2)
f1.close()
f2.close()
output1 = set(tuple(row) for row in newList2 if row not in oldList1)
output2 = set(tuple(row) for row in oldList1 if row not in newList2)
with open('Michal_K.csv', 'w') as csvfile:
    wr = csv.writer(csvfile, delimiter=',')
    for line in (output2).difference(output1):
        wr.writerow(line)

Question 2

This is the kind of thing pandas was written for. Take a look at that library!

Question 3

If you want the rows from ted.csv that do not share their first and fourth column values with any row in ted2.csv, create a set of those values from ted2.csv and check each row from ted.csv before writing:

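import csv

# newline="" is what the csv module expects for files it writes (Python 3).
with open("ted.csv") as f1, open("ted2.csv") as f2, open("foo.csv", "w", newline="") as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # Keys (first and fourth columns) of every row in ted2.csv.
    st = set((row[0], row[3]) for row in r2)
    wr = csv.writer(out)
    # Write only the ted.csv rows whose key is not in ted2.csv.
    for row in r1:
        if (row[0], row[3]) not in st:
            wr.writerow(row)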
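
As a minimal, self-contained sketch of the same idea, the filter can be wrapped in a function that takes any iterable of lines, which makes it easy to try without touching the real files. The function name `rows_missing_from`, its `key_cols` parameter, and the sample data are hypothetical and not part of the original answer.

```python
import csv
import io


def rows_missing_from(base_lines, other_lines, key_cols=(0, 3)):
    """Return rows of base_lines whose key columns do not appear in other_lines."""
    # Build the set of keys seen in the "other" data.
    other_keys = {
        tuple(row[i] for i in key_cols) for row in csv.reader(other_lines)
    }
    # Keep only the base rows whose key is absent from that set.
    return [
        row
        for row in csv.reader(base_lines)
        if tuple(row[i] for i in key_cols) not in other_keys
    ]


# Hypothetical sample data standing in for ted.csv and ted2.csv.
ted = io.StringIO("a,1,x,100\nb,2,y,200\nc,3,z,300\n")
ted2 = io.StringIO("a,9,q,100\nc,3,z,999\n")

result = rows_missing_from(ted, ted2)
print(result)
```

Only the row starting with "a" is dropped here, because it is the only one whose first and fourth values both occur together in the second data set.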

If you actually want something like the symmetric difference, where you get the unique rows from both files, make a set of the first and fourth column values from each file:

import csv
from itertools import chain

with open("ted.csv") as f1, open("ted2.csv") as f2, open("foo.csv", "w", newline="") as out:
    r1, r2 = csv.reader(f1), csv.reader(f2)
    # First pass: collect the (first, fourth) column keys of each file.
    st1 = set((row[0], row[3]) for row in r1)
    st2 = set((row[0], row[3]) for row in r2)
    # Rewind both files and create fresh readers for the second pass.
    f1.seek(0)
    f2.seek(0)
    r1, r2 = csv.reader(f1), csv.reader(f2)
    wr = csv.writer(out)
    # Rows of each file whose key does not occur in the other file.
    output1 = (row for row in r1 if (row[0], row[3]) not in st2)
    output2 = (row for row in r2 if (row[0], row[3]) not in st1)
    wr.writerows(chain(output1, output2))
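
If both files fit in memory, a possible variation is to read each file into a list once, which avoids rewinding the file handles. The sketch below assumes that is acceptable; `symmetric_key_difference`, `key_cols`, and the sample rows are hypothetical names and data used only for illustration.

```python
import csv
import io
from itertools import chain


def symmetric_key_difference(rows_a, rows_b, key_cols=(0, 3)):
    """Return rows whose key columns occur in only one of the two row lists."""
    def key(row):
        return tuple(row[i] for i in key_cols)

    keys_a = {key(row) for row in rows_a}
    keys_b = {key(row) for row in rows_b}
    # Rows of each list whose key never occurs in the other list.
    only_a = (row for row in rows_a if key(row) not in keys_b)
    only_b = (row for row in rows_b if key(row) not in keys_a)
    return list(chain(only_a, only_b))


# Hypothetical sample data standing in for the two input files.
rows_a = list(csv.reader(io.StringIO("a,1,x,100\nb,2,y,200\n")))
rows_b = list(csv.reader(io.StringIO("a,9,q,100\nd,4,w,400\n")))

unique_rows = symmetric_key_difference(rows_a, rows_b)
print(unique_rows)
```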

Question 4

Thanks, I'm after the rows that don't have the same element in row[0] and row[3]. I will still try the second approach to figure out the difference.

Question 5

The second approach should give you the symmetric difference based on the first and fourth columns.

Question 6

The second approach gives me a "list index out of range" error.

Question 7

Then you don't have at least four values in each row. Add a link to the data if possible.

Question 8

I had a good look at all the files: the first two have 4 columns and the third one is an empty file. Could that be the issue?

File "C:/testcsv/Pandaman.py", line 28, in <genexpr>
    st = set((row[0], row[3]) for row in r1)
IndexError: list index out of range

I've also tried row[2], but got the same result.
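
One way to find out which rows trigger the IndexError is to separate the rows that have enough fields from those that do not before indexing into them. The thread does not confirm the cause, so this sketch only assumes two common possibilities: blank lines, which `csv.reader` returns as empty lists, and lines that use a delimiter other than a comma, which come back as a single field. The helper `split_by_width`, its `min_cols` parameter, and the sample data are hypothetical.

```python
import csv
import io


def split_by_width(lines, min_cols=4):
    """Separate parsed rows into usable rows and (row_number, row) rejects."""
    usable, rejected = [], []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) >= min_cols:
            usable.append(row)
        else:
            # Too short to index with row[3]; keep it for inspection.
            rejected.append((row_number, row))
    return usable, rejected


# Hypothetical sample: one blank line and one semicolon-delimited line.
sample = io.StringIO("a,1,x,100\n\nb;2;y;200\nc,3,z,300\n")

usable, rejected = split_by_width(sample)
print(usable)
print(rejected)
```

Printing the rejected rows shows whether the short rows are empty or simply split on the wrong delimiter, which would point to either skipping them or passing a different `delimiter` to `csv.reader`.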

Padraic Cunningham · Accepted Answer · 2015-07-19 18:19:02Z

