Comparing two columns in two different rows

Question 1

I want to go through each line of a .csv file and check whether the first field of one line is the same as the first field of the next line, and so on. If there is a match, I would like to drop both lines that share that field and keep only the lines that have no match.

Here is an example dataset (no_dup.txt):

Ac_Gene_ID M_Gene_ID
ENSGMOG00000015632 ENSORLG00000010573
ENSGMOG00000015632 ENSORLG00000010585
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636

Here is the output that I wanted:

Ac_Gene_ID M_Gene_ID
ENSGMOG00000003747 ENSORLG00000006947
ENSGMOG00000003748 ENSORLG00000004636

Here is my code that works, but I want to see how it can be improved:

import sys
in_file = sys.argv[1]
out_file = sys.argv[2]
entries = {}
entries1 = {}
with open(in_file, 'r') as fh_in:
    for line in fh_in:
        if line.startswith('E'):
            line = line.strip()
            line = line.split()
            entry = line[0]
            if entry in entries:
                entries[entry].append(line)
            else:
                entries[entry] = [line]
with open('no_dup_out.txt', 'w') as fh_out:
    for kee, val in entries.iteritems():
        if len(val) == 1:
            fh_out.write("{} \n".format(val))
with open('no_dup_out.txt', 'r') as fh_in2:
    for line in fh_in2:
        line = line.strip()
        line = line.split()
        entry = line[1]
        if entry in entries1:
            entries1[entry].append(line)
        else:
            entries1[entry] = [line]
with open(out_file, 'w') as fh_out2:
    for kee, val in entries1.iteritems():
        if len(val) == 1:
            fh_out2.write("{} \n".format(val))

The output that I am getting:

[["[['ENSGMOG00000003747',", "'ENSORLG00000006947']]"]] 
[["[['ENSGMOG00000003748',", "'ENSORLG00000004636']]"]]

SylvainD · Answer 1 · 2015-06-12 17:31:21Z

This part

 if entry in entries:
     entries[entry].append(line)
 else:
     entries[entry] = [line]

definitely smells like it could (and should) be written with setdefault or defaultdict.

For instance, this would be entries.setdefault(entry, []).append(line).
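As a minimal sketch of both options, written for Python 3: the sample rows and the names `rows`, `by_setdefault` and `by_defaultdict` are made up for illustration and are not part of the original program.

```python
from collections import defaultdict

# Hypothetical sample rows, already split into fields.
rows = [
    ["ENSGMOG00000015632", "ENSORLG00000010573"],
    ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ["ENSGMOG00000003747", "ENSORLG00000006947"],
]

# Option 1: dict.setdefault inserts an empty list the first time a key is seen,
# then returns the list stored under that key.
by_setdefault = {}
for fields in rows:
    by_setdefault.setdefault(fields[0], []).append(fields)

# Option 2: defaultdict(list) creates the empty list on first access to a key.
by_defaultdict = defaultdict(list)
for fields in rows:
    by_defaultdict[fields[0]].append(fields)
```

Both loops build the same grouping; defaultdict tends to read better when the same dictionary is filled in several places.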

Avoid re-assigning the same variable again and again, as it makes it harder to understand what the variable is supposed to represent.

 line = line.strip()
 line = line.split()

could be written as: splitted_list = line.strip().split()
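A small sketch of the same idea, using `fields` and `gene_id` as illustrative names. Note that `str.split()` called with no arguments already discards leading and trailing whitespace, so the `strip()` call is optional in this case.

```python
line = "ENSGMOG00000003747 ENSORLG00000006947\n"

# Keep the raw text in `line` and give the parsed result its own name.
fields = line.split()
gene_id = fields[0]
```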

You are iterating over the keys ("kee"?) and values of a dictionary but ignoring the actual key.

The convention is to use _ as the variable name for throw-away values, so you could write: for _, val in entries.iteritems():. However, it would probably be better to just iterate over the values using itervalues, values or viewvalues.
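A short sketch of iterating over the values only. It is written for Python 3, where `iteritems`, `itervalues` and `viewvalues` no longer exist and `values()` returns a view; the `entries` content below is hypothetical sample data.

```python
# Hypothetical grouped data: key -> list of rows sharing that key.
entries = {
    "ENSGMOG00000015632": [
        ["ENSGMOG00000015632", "ENSORLG00000010573"],
        ["ENSGMOG00000015632", "ENSORLG00000010585"],
    ],
    "ENSGMOG00000003747": [
        ["ENSGMOG00000003747", "ENSORLG00000006947"],
    ],
}

# Only the values are needed, so iterate over them directly
# and keep the single row of every group that has exactly one member.
unique_rows = [rows[0] for rows in entries.values() if len(rows) == 1]
```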

score 1 · Answer 2 · 2015-06-12 17:48:48Z

It's odd that you write no_dup_out.txt, then immediately read it back in again. Couldn't you just construct entries1 from entries without doing file I/O?

This code has some weird behaviour, though, that you should be aware of. Consider the following example:

Elephant apple
Elephant banana
Eel apple

If you uniquify the data set based on the first column, then by the second column, as you have done in your program, you'll obtain the result:

Eel apple

However, if you were to uniquify the data set based on the second column, then by the first column, you would obtain instead:

Elephant banana

I don't know enough about the motivation behind the code to say whether either of those is the desired outcome. Or perhaps all three rows should be eliminated? In any case, the intended behaviour should be thoroughly described in a docstring to avoid misunderstandings.
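To make the order dependence concrete, here is a sketch that filters in memory, with no intermediate file. The helper `keep_unique` and the variable names are hypothetical, and this is only one possible reading of the intended behaviour.

```python
from collections import defaultdict


def keep_unique(rows, column):
    """Return the rows whose value in `column` occurs exactly once in `rows`.

    The surviving rows keep their input order.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return [row for row in rows if len(groups[row[column]]) == 1]


rows = [
    ["Elephant", "apple"],
    ["Elephant", "banana"],
    ["Eel", "apple"],
]

# First column, then second column (the order used in the question's program).
first_then_second = keep_unique(keep_unique(rows, 0), 1)

# Second column, then first column.
second_then_first = keep_unique(keep_unique(rows, 1), 0)

# Both filters applied to the original data: a row survives only if
# neither of its columns repeats anywhere in the input.
unique_in_both = [r for r in keep_unique(rows, 0) if r in keep_unique(rows, 1)]

print(first_then_second)  # [['Eel', 'apple']]
print(second_then_first)  # [['Elephant', 'banana']]
print(unique_in_both)     # []
```

The three results differ, which is why the docstring should state which of these behaviours is intended.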
