I want to go through each line of a .csv file and check whether the first field of one line is the same as the first field of the next line, and so on. If a match is found, I would like to ignore both lines that contain the same field and keep only the lines for which there is no match.
Here is an example dataset (no_dup.txt):
Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Here is the output that I wanted:
Ac_Gene_ID M_Gene_ID
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636
Here is my code that works, but I want to see how it can be improved:
import sys
in_file = sys.argv[1]
out_file = sys.argv[2]
entries = {}
entries1 = {}
with open(in_file, 'r') as fh_in:
    for line in fh_in:
        if line.startswith('E'):
            line = line.strip()
            line = line.split()
            entry = line[0]
            if entry in entries:
                entries[entry].append(line)
            else:
                entries[entry] = [line]
with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            fh_out.write("{} \n".format(val))
with open('no_dup_out.txt', 'r') as fh_in2:
    for line in fh_in2:
        line = line.strip()
        line = line.split()
        entry = line[1]
        if entry in entries1:
            entries1[entry].append(line)
        else:
            entries1[entry] = [line]
with open(out_file, 'w') as fh_out2:
    for kee, val in entries1.iteritems():
        if len(val) == 1:
            fh_out2.write("{} \n".format(val))
The output that I am getting:
[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]]
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]
2 Answers
This part
if entry in entries:
    entries[entry].append(line)
else:
    entries[entry] = [line]
definitely smells like it could/should be written with setdefault or defaultdict.
This would be, for instance, entries.setdefault(entry, []).append(line).
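As a minimal sketch of the two suggestions above, the snippet below groups a few rows from the question's dataset by their first field, once with setdefault and once with defaultdict. The sample list `lines` and the names `fields` and `entries_dd` are made up for illustration; they are not part of the original script.

```python
from collections import defaultdict

# Sample rows taken from the question's dataset (header line omitted).
lines = [
    "ENSGMOG00000015632 ENSORLG00000010573",
    "ENSGMOG00000015632 ENSORLG00000010585",
    "ENSGMOG00000003747 ENSORLG00000006947",
]

# Variant 1: dict.setdefault inserts an empty list the first time a key is seen.
entries = {}
for line in lines:
    fields = line.split()
    entries.setdefault(fields[0], []).append(fields)

# Variant 2: defaultdict(list) supplies the empty list automatically.
entries_dd = defaultdict(list)
for line in lines:
    fields = line.split()
    entries_dd[fields[0]].append(fields)
```

Both variants should end up with the same grouping; which one reads better is largely a matter of taste.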
Avoid re-assigning the same variable again and again, as it makes it harder to understand what the variable is supposed to represent.
line = line.strip()
line = line.split()
could be written: splitted_list = line.strip().split()
You are iterating over the keys ("kee"?) and values of a dictionary but ignoring the actual key.
The convention is to use _ as the variable name for throw-away values, so you could write: for _, val in entries.iteritems():
However, it would probably be better to just iterate over the values using itervalues, values or viewvalues.
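To illustrate iterating over the values only, here is a small sketch. It assumes Python 3, where values() returns a view; under Python 2, as used in the question, itervalues() would be the closest equivalent. The dictionary contents are copied from the question's dataset, and `unique_rows` is a name chosen for this example.

```python
# Rows already grouped by their first field, as the question's loop would build them.
entries = {
    "ENSGMOG00000015632": [
        ["ENSGMOG00000015632", "ENSORLG00000010573"],
        ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ],
    "ENSGMOG00000003747": [["ENSGMOG00000003747", "ENSORLG00000006947"]],
}

# The key is never used, so iterate over the grouped rows directly
# and keep only the groups that contain a single row.
unique_rows = [rows[0] for rows in entries.values() if len(rows) == 1]
```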
It's odd that you write no_dup_out.txt, then immediately read it back in again. Couldn't you just construct entries1 from entries without doing file I/O?
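One way this could look, as a sketch rather than a drop-in replacement: filter the rows in memory by the first column, then filter the survivors by the second column, with no intermediate file. The helper `keep_unique` and the variable names are hypothetical, the rows are the question's sample data with the header omitted, and the code assumes Python 3.7+ so that dictionaries preserve insertion order.

```python
from collections import defaultdict


def keep_unique(rows, column):
    """Return the rows whose value in `column` occurs exactly once (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return [group[0] for group in groups.values() if len(group) == 1]


rows = [
    ["ENSGMOG00000015632", "ENSORLG00000010573"],
    ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ["ENSGMOG00000003747", "ENSORLG00000006947"],
    ["ENSGMOG00000003748", "ENSORLG00000004636"],
]

# First pass on column 0, second pass on column 1, all in memory.
by_first = keep_unique(rows, 0)
by_both = keep_unique(by_first, 1)

# Join the fields explicitly instead of formatting the list itself,
# which is what produced the bracketed output shown in the question.
for row in by_both:
    print(" ".join(row))
```

Because the rows stay as lists of fields between the two passes, nothing has to be parsed back from a textual representation.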
This code has some weird behaviour, though, that you should be aware of. Consider the following example:
Elephant apple
Elephant banana
Eel apple
If you uniquify the data set based on the first column, then by the second column, as you have done in your program, you'll obtain the result:
Eel apple
However, if you were to uniquify the data set based on the second column, then by the first column, you would obtain instead:
Elephant banana
I don't know enough about the motivation behind the code to say whether either of those is the desired outcome. Or perhaps all three rows should be eliminated? In any case, the intended behaviour should be thoroughly described in a docstring to avoid misunderstandings.
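The order dependence described above can be checked with a short sketch. The helper `drop_repeated` and the variable names are invented for this example, and the third variant (counting both columns on the original data before filtering) is only one possible reading of "all three rows should be eliminated".

```python
from collections import Counter


def drop_repeated(rows, column):
    """Keep the rows whose value in `column` appears exactly once (hypothetical helper)."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] == 1]


rows = [("Elephant", "apple"), ("Elephant", "banana"), ("Eel", "apple")]

# Same filter, applied in the two possible orders.
first_then_second = drop_repeated(drop_repeated(rows, 0), 1)
second_then_first = drop_repeated(drop_repeated(rows, 1), 0)

# Alternative: count both columns on the original data, then filter once.
first_counts = Counter(row[0] for row in rows)
second_counts = Counter(row[1] for row in rows)
simultaneous = [
    row for row in rows
    if first_counts[row[0]] == 1 and second_counts[row[1]] == 1
]
```

Whichever behaviour is intended, stating it in the function's docstring, as suggested above, would make the choice explicit.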