Combining values corresponding to the same key

Question 1

What the following code does is:

The key field is column0 (where sometimes there could be single key, and sometimes keys separated by comma).

The rest of the columns are also either single or comma separated. But based on the single key value from column0, the goal is to collect all the values as a set in the rest of the columns.

import sys
import csv
dict={}
def getVal(k):
 try:
 v = dict[k]
 except:
 v= None
 return v
# First read as a line and first transformation separates first column
# and stores into a table with key,value (where value is the remaining)
for line in sys.stdin:
 line = line.strip()
 row = line.split('\t')
 l = len(row)
 keyvalues = row[0]
 items = keyvalues.split(",")
 for item in items:
 key = item
 value=[]
 for i in range(1, l):
 value.append(row[i].split(","))
 if getVal(key) == None:
 dict[key] = value
 else:
 prevval = dict[key]
 cols = len(prevval)
 newvalue=[]
 newlist = []
 for j in range(0,cols):
 newlist = prevval[j]+value[j]
 newset = set(newlist)
 newvalue.append(list(newset))
 dict[key] = newvalue
for k,v in dict.items():
 rowstr = k+"\t"
 ncols = len(v)
 for items in v:
 cols=""
 for item in items:
 cols +=item+","
 cols = cols[0:len(cols)-1]
 rowstr += cols+"\t"
 print rowstr

Sample input

3,15,75 1635,1762 878 3425 121,122,123
15 1762 871 3475 121,125,126
3 1585,1590,5192 882,832,841 3200,3211 120,121,122,123,124

I'm getting the results correctly as expected, but I would like to know any improvements to make on the code.

Question 2

Indentation matters a lot in Python. Please follow the PEP 8 standard, which is 4 spaces.

This code feels very procedural. Python can be much more expressive than that, if you take advantage of features like list comprehensions.

You've used a lot of variables: line, row, l, keyvalues, items, item, key, value, i, prevval, cols, newvalue, newlist, j, newset — too many for me to keep track of what each represents. You could reduce the mental load by eliminating variables for ephemeral expressions. For example, you could eliminate items by writing for item in keyvalues.split(",").

I see that you considered using the csv module, but didn't. If you did, you could have simplified

for line in sys.stdin:
 line = line.strip()
 row = line.split('\t')

... to for row in csv.reader(fileinput.input(), delimiter='\t'). I recommend fileinput instead of sys.stdin so that the user has the option of passing a filename on the command line.

The getVal(k) function can just be replaced by dict.get(k, None). It would be better to avoid choosing dict as the name of a variable, since it shadows the dict() constructor.

Dictionary usage

Your getVal function is the get method of standard dictionaries. Use dict.get(key) instead of getVal(key). However, 3 things that are wrong with it:

It uses the global variable dict to retrieve values from, you should have parametrize it with the dictionary instead:
```
def getVal(dict_, key):
 try:
 v = dict_[key]
 except:
 v = None
 return v
```
It uses dict as a variable name, shadowing the builtin function.
It uses a raw except, catching the expected KeyError but potentially catching more serious issues. Always specify the kind of exception you are expecting.

Now for getVal usage, you set a value into dict[key] if getVal returned None otherwise you process further. In such case, you will set the value because the flow got into the except in your getVal function, meaning key is not in dict. It's quite simple to test in python:

if key not in dict:
 dict[key] = value
else:
 prevval = ...

No need for getVal anymore...

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Accepted Answer · 2016-04-29 08:34:52Z

Indentation matters a lot in Python. Please follow the PEP 8 standard, which is 4 spaces.

This code feels very procedural. Python can be much more expressive than that, if you take advantage of features like list comprehensions.

You've used a lot of variables: line, row, l, keyvalues, items, item, key, value, i, prevval, cols, newvalue, newlist, j, newset — too many for me to keep track of what each represents. You could reduce the mental load by eliminating variables for ephemeral expressions. For example, you could eliminate items by writing for item in keyvalues.split(",").

I see that you considered using the csv module, but didn't. If you did, you could have simplified

for line in sys.stdin:
 line = line.strip()
 row = line.split('\t')

... to for row in csv.reader(fileinput.input(), delimiter='\t'). I recommend fileinput instead of sys.stdin so that the user has the option of passing a filename on the command line.

The getVal(k) function can just be replaced by dict.get(k, None). It would be better to avoid choosing dict as the name of a variable, since it shadows the dict() constructor.

Stack Exchange Network

Combining values corresponding to the same key

Sample input

2 Answers 2

Suggested solution

Dictionary usage

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Combining values corresponding to the same key

Sample input

2 Answers 2

Suggested solution

Dictionary usage

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions