What the following code does is:
The key field is column0 (where sometimes there could be single key, and sometimes keys separated by comma).
The rest of the columns are also either single or comma separated. But based on the single key value from column0, the goal is to collect all the values as a set in the rest of the columns.
import sys
import csv
dict={}
def getVal(k):
try:
v = dict[k]
except:
v= None
return v
# First read as a line and first transformation separates first column
# and stores into a table with key,value (where value is the remaining)
for line in sys.stdin:
line = line.strip()
row = line.split('\t')
l = len(row)
keyvalues = row[0]
items = keyvalues.split(",")
for item in items:
key = item
value=[]
for i in range(1, l):
value.append(row[i].split(","))
if getVal(key) == None:
dict[key] = value
else:
prevval = dict[key]
cols = len(prevval)
newvalue=[]
newlist = []
for j in range(0,cols):
newlist = prevval[j]+value[j]
newset = set(newlist)
newvalue.append(list(newset))
dict[key] = newvalue
for k,v in dict.items():
rowstr = k+"\t"
ncols = len(v)
for items in v:
cols=""
for item in items:
cols +=item+","
cols = cols[0:len(cols)-1]
rowstr += cols+"\t"
print rowstr
Sample input
3,15,75 1635,1762 878 3425 121,122,123
15 1762 871 3475 121,125,126
3 1585,1590,5192 882,832,841 3200,3211 120,121,122,123,124
I'm getting the results correctly as expected, but I would like to know any improvements to make on the code.
2 Answers 2
Indentation matters a lot in Python. Please follow the PEP 8 standard, which is 4 spaces.
This code feels very procedural. Python can be much more expressive than that, if you take advantage of features like list comprehensions.
You've used a lot of variables: line
, row
, l
, keyvalues
, items
, item
, key
, value
, i
, prevval
, cols
, newvalue
, newlist
, j
, newset
— too many for me to keep track of what each represents. You could reduce the mental load by eliminating variables for ephemeral expressions. For example, you could eliminate items
by writing for item in keyvalues.split(",")
.
I see that you considered using the csv
module, but didn't. If you did, you could have simplified
for line in sys.stdin: line = line.strip() row = line.split('\t')
... to for row in csv.reader(fileinput.input(), delimiter='\t')
. I recommend fileinput
instead of sys.stdin
so that the user has the option of passing a filename on the command line.
The getVal(k)
function can just be replaced by dict.get(k, None)
. It would be better to avoid choosing dict
as the name of a variable, since it shadows the dict()
constructor.
Suggested solution
import csv
import fileinput
data = {}
for row in csv.reader(fileinput.input(), delimiter='\t'):
values = [set(col.split(',')) for col in row]
for key in values.pop(0): # Numbers in column 0
data[key] = [
new_col.union(old_col)
for new_col, old_col in zip(values, data.get(key, values))
]
for key, values in data.items():
print '\t'.join([key] + [','.join(col) for col in values])
Dictionary usage
Your getVal
function is the get
method of standard dictionaries. Use dict.get(key)
instead of getVal(key)
. However, 3 things that are wrong with it:
It uses the global variable
dict
to retrieve values from, you should have parametrize it with the dictionary instead:def getVal(dict_, key): try: v = dict_[key] except: v = None return v
It uses
dict
as a variable name, shadowing the builtin function.- It uses a raw
except
, catching the expectedKeyError
but potentially catching more serious issues. Always specify the kind of exception you are expecting.
Now for getVal
usage, you set a value into dict[key]
if getVal
returned None
otherwise you process further. In such case, you will set the value because the flow got into the except
in your getVal
function, meaning key
is not in dict
. It's quite simple to test in python:
if key not in dict:
dict[key] = value
else:
prevval = ...
No need for getVal
anymore...