3

I have a list of sublists, e.g. ([1, 2], [1, 56], [2, 787], [2, 98], [3, 90]), which is built by appending values to it inside a for loop.

I am working in Python, and I want to add the 2nd element of each sublist where the 1st elements are same. In my example, I want to add 2+56 (both have 1 as the 1st element), 787+98 (both have 2 as the 1st element), and keep 90 as it is, because only one sublist has 3 as its 1st element.

I'm not sure how to do this.

Here is my code:

import urllib, re
from itertools import groupby
import collections
import itertools, operator
text = urllib.urlopen("some html page").read()
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)  # storing contents from the BODY tag
values = [line.split() for line in data.splitlines()]  # list with the BODY data
# values contains elements like [[65, 67], [112, 123, 12], [387, 198, 09]]
# it contains elements with length 2 and 3.
# I am only concerned with elements of length 3.
# In the for loop, I am doing this, and passing it to 2 functions.
def function1 (docid, doclen, tf):
    new=[];
    avgdoclen = 288;
    tf = float(x[2]);
    doclen = float(x[1]);
    answer1 = tf / (tf + 0.5 + (1.5*doclen/avgdoclen));
    q = function2(docid, doclen, tf)
    production = answer1 * q  # this is the production of
    new.append(docid)  # I want to add all the production values where docid are same.
    new.append(production)
    return answer1
def function2 (docid, doclen, tf):
    avgdoclen = 288;
    querylen = 12;
    tf= float(x[2]);
    answer2 = tf/(tf + 0.5 + (1.5*querylen/avgdoclen));
    return answer2
for x in values:
    if len(x)==3:
        okapi_doc(x[0], x[1], x[2])
        okapi_query(x[0], x[1], x[2])

I want to add all the production values where the docid is the same. Now when I print new, I get the following output:

['112', 0.3559469323909391]
['150', 0.31715060007742935]
['158', 0.122025819265144]
['176', 0.3862207694241891]
['188', 0.5057900225015092]
['236', 0.12628982528263102]
['251', 0.12166336633663369]
['334', 0.5851519557155408]

This is not a list. When I print new[0][0] I get 1, but I want to get 112. Is there something wrong with append?

asked Feb 4, 2012 at 23:29
8
  • Well the first thing I see is that in function1, you create production and new and then throw both away. Commented Feb 5, 2012 at 0:43
  • Append works fine. What do you think happens to new after the function exits? You have to return it and put it in a list to get a list of news. Commented Feb 5, 2012 at 0:47
  • @ghbhatt: "new" obviously isn't the whole list, it's simply the temporary name you give to each of the 2-element lists there. new[0] is the first entry, i.e. the string "112", and new[0][0] is the first character of that string, i.e. "1". You're not actually accumulating anything, because as senderle notes, you throw it away. Commented Feb 5, 2012 at 0:49
  • Also, why are the semicolons in this code? Commented Feb 5, 2012 at 0:49
  • @DSM: I want to store the docids and production values in a new list. how can i do that? please help. I am new to python, and hence the syntax errors. Commented Feb 5, 2012 at 1:03
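
What the comments above describe can be sketched as follows: the function returns its [docid, production] pair, and the caller appends each pair to a single list that outlives the call. This is only a sketch under assumptions. The names score_pair and rows and the sample rows are hypothetical; the constants and the formula are copied from function1 and function2 in the question.

```python
AVG_DOC_LEN = 288.0   # constants taken from the question
QUERY_LEN = 12.0

def score_pair(docid, doclen, tf):
    """Return [docid, production] for one row (hypothetical helper)."""
    doclen = float(doclen)
    tf = float(tf)
    doc_part = tf / (tf + 0.5 + 1.5 * doclen / AVG_DOC_LEN)
    query_part = tf / (tf + 0.5 + 1.5 * QUERY_LEN / AVG_DOC_LEN)
    return [docid, doc_part * query_part]

# Made-up rows in the same shape as `values` (strings from line.split()).
values = [['65', '67'], ['112', '123', '12'], ['387', '198', '9'], ['112', '50', '3']]

rows = []                      # one list that survives every function call
for x in values:
    if len(x) == 3:            # skip the 2-element rows
        rows.append(score_pair(x[0], x[1], x[2]))

print(rows[0][0])              # the whole docid string, not its first character
```

With the pairs collected in rows, any of the summing approaches in the answers below can be applied to that list.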

4 Answers

2

This is pretty straightforward. dict.get(key, default) returns the value if the key exists, or a default.

totals = {}
for k,v in data:
 totals[k] = totals.get(k, 0) + v
answered Feb 5, 2012 at 0:09

5 Comments

Your code is the same as: from collections import defaultdict;totals = defaultdict(int); for k, v in data: totals[k] += v
And based on the comments to questions below, your version and my modification will not work.
@hughdbrown, you're misunderstanding the comments. They're referring to the Counter-based solution. The defaultdict-based solution is correct.
@senderle -- interesting. I thought that Counter was just a python 2.7 specialization of collections.defaultdict(int).
@hughdbrown, it is, but Counter just counts instances of a particular key in a flat sequence. It doesn't do any summation or anything like that. In other words, for Counter to be helpful here, you'd have to pass it a list like this: [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3....].
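
To illustrate the distinction drawn in the last comment, here is a small sketch using the sample list from the question. Counter over a flat sequence of keys only counts occurrences, while summing the second elements per key needs an explicit accumulation step. The variable names are illustrative.

```python
from collections import Counter, defaultdict

pairs = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]

# Counter on a flat sequence of keys only counts how often each key occurs.
counts = Counter(key for key, _ in pairs)

# Summing the values per key needs an accumulation step.
totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value

print(dict(counts))
print(dict(totals))
```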
2

This might be a job for itertools:

>>> import itertools, operator
>>> l = sorted([[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]])
>>> keys_groups = itertools.groupby(l, key=operator.itemgetter(0))
>>> sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups]
>>> sums
[[1, 58], [2, 885], [3, 90]]

Note that for groupby to work as expected, the items have to be sorted by the key given. In this case, since the key is the first item in each pair, a plain sorted() call was enough; for a more general solution, you should pass a key parameter when sorting the list.

>>> l2 = [[787, 2], [98, 2], [90, 3], [2, 1], [56, 1]]
>>> l2.sort(key=operator.itemgetter(1))
>>> l2
[[2, 1], [56, 1], [787, 2], [98, 2], [90, 3]]
>>> keys_groups = itertools.groupby(l2, key=operator.itemgetter(1))
>>> sums = [[key, sum(i[0] for i in group)] for key, group in keys_groups]
>>> sums
[[1, 58], [2, 885], [3, 90]]
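
The note about sorting can be checked with a short sketch: groupby only merges adjacent items that share a key, so an unsorted list yields the same key more than once. The helper name sum_by_key and the shuffled sample data are assumptions made for illustration.

```python
import itertools
import operator

unsorted_pairs = [[1, 2], [2, 787], [1, 56], [3, 90], [2, 98]]
first = operator.itemgetter(0)

def sum_by_key(pairs):
    # groupby only merges *adjacent* items that share a key.
    return [[key, sum(item[1] for item in group)]
            for key, group in itertools.groupby(pairs, key=first)]

without_sort = sum_by_key(unsorted_pairs)
with_sort = sum_by_key(sorted(unsorted_pairs, key=first))

print(without_sort)   # keys 1 and 2 each show up twice
print(with_sort)      # one entry per key
```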

Works fine with the data you posted. I edited it a bit to make the example more realistic.

>>> l = [['112', 0.3559469323909391], ['150', 0.31715060007742935], 
 ['158',0.122025819265144], ['176', 0.3862207694241891],
 ['188', 0.5057900225015092], ['377', 0.12628982528263102], 
 ['251', 0.12166336633663369], ['334', 0.5851519557155408], 
 ['334', 0.14663484486873507], ['112', 0.2345038167938931], 
 ['377', 0.10694516971279373], ['112', 0.28981132075471694]]
>>> l.sort(key=operator.itemgetter(0))
>>> keys_groups = itertools.groupby(l, key=operator.itemgetter(0))
>>> sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups]
>>> sums
[['112', 0.88026206993954914], ['150', 0.31715060007742935], 
 ['158', 0.122025819265144], ['176', 0.38622076942418909], 
 ['188', 0.50579002250150917], ['251', 0.12166336633663369], 
 ['334', 0.73178680058427581], ['377', 0.23323499499542477]]

Note that as WolframH points out, sorting will generally increase the time complexity; but Python's sort algorithm is smart enough to make use of runs in data, so it might not -- it all depends on the data. Still, if your data is highly anti-sorted, Winston Ewert's defaultdict-based solution may be better. (But ignore that first Counter snippet -- I have no idea what's going on there.)

A couple of notes on how to create a list -- there are lots of ways, but the two basic ways in Python are as follows -- first a list comprehension:

>>> def simple_function(x):
...     return [x, x ** 2]
... 
>>> in_data = range(10)
>>> out_data = [simple_function(x) for x in in_data]
>>> out_data
[[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 49], [8, 64], [9, 81]]

And second, a for loop:

>>> out_data = []
>>> for x in in_data:
...     out_data.append(simple_function(x))
... 
>>> out_data
[[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 49], [8, 64], [9, 81]]
answered Feb 4, 2012 at 23:49

8 Comments

it gives me the following error: "sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups] TypeError: 'float' object is not subscriptable"
@ghbhatt: please print out what your list is; this code should work perfectly.
sorting increases the time complexity from O(n) to O(n log n).
@ghbhatt: those terms work fine with this code, you're missing something. Could you post the results of set(tuple(map(type, x)) for x in whatever_your_list_is_called)? Something there isn't what you think it is.
@WolframH, good point. But it does depend on the data; timsort is pretty clever about taking advantage of runs in semi-sorted data.
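
The diagnostic suggested in the comments can be tried on small made-up data. The sketch below assumes one plausible cause of the error, a flat [docid, value] list passed where a list of pairs is expected; that cause is a guess and is not confirmed anywhere in the thread.

```python
good = [['112', 0.35], ['150', 0.31]]   # list of [docid, value] pairs
flat = ['112', 0.35]                    # a single pair, not a list of pairs

# For well-formed data every row has the same (str, float) shape.
shapes = set(tuple(map(type, x)) for x in good)
print(shapes)

# With the flat list the "rows" are a str and a float; indexing the float fails.
try:
    [row[1] for row in flat]
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)
```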
1
import collections
result = collections.defaultdict(int) # works like a dictionary
# but all keys have a default value of zero
for key, value in mylist:
 result[key] += value 
print result
answered Feb 4, 2012 at 23:32

12 Comments

it gives me the following error: "for key, value in new_list: ValueError: too many values to unpack"
Your first bit of code there won't work, since 'dict'-ing the list of lists has key collisions.
Your Counter example doesn't work. dict keeps only the last value for each key.
@ghbhatt, make sure all your "pairs" in the list are actually pairs. There's one item in there that contains three or more items.
@RobWouters: yes indeed there is. when i run the 1st version, i get the following error: "print collections.Counter(dict(new_list)) ValueError: dictionary update sequence element #0 has length 3; 2 is required." but when i print len(new_list) it shows 2 for every element. i do not understand where the value is 3.
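
One way to follow the advice above is to list the rows that are not pairs before trying to unpack them. This is a sketch on made-up data; bad_rows is a hypothetical name, and new_list stands in for the list mentioned in the error messages.

```python
from collections import defaultdict

new_list = [['112', 0.35], ['150', 0.31, 0.02], ['112', 0.23]]  # made-up data

# Rows that cannot be unpacked as `key, value`, with their positions.
bad_rows = [(i, row) for i, row in enumerate(new_list) if len(row) != 2]
print(bad_rows)

# Sum only the well-formed pairs; defaultdict(float) starts each total at 0.0.
result = defaultdict(float)
for row in new_list:
    if len(row) == 2:
        key, value = row
        result[key] += value
print(dict(result))
```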
0

The fact that you:

want to add the 2nd element of each sublist where the 1st elements are same

makes me think that you want to be using a dict rather than a list: a dict is optimised for retrieving the 2nd value based on the 1st.

Some code along the lines of:

oldvalue = mydict.get(firstvalue, 0)
newvalue = oldvalue + secondvalue
mydict[firstvalue] = newvalue

would let you build up the dict as you go - or if that's not feasible, it will let you calculate the sums in only a single pass over the list.

Quick spin in the interpreter just to test this out:

>>> l = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]
>>> mydict = {}
>>> for firstvalue, secondvalue in l:
...     oldvalue = mydict.get(firstvalue, 0)
...     newvalue = oldvalue + secondvalue
...     mydict[firstvalue] = newvalue
... 
>>> print mydict
{1: 58, 2: 885, 3: 90}

Looks fairly close to what you want.
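
If a list of [key, total] sublists is needed rather than a dict, the result can be converted back. A minimal sketch, assuming the keys are sortable and that ordering by key is acceptable:

```python
pairs = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]

mydict = {}
for firstvalue, secondvalue in pairs:
    mydict[firstvalue] = mydict.get(firstvalue, 0) + secondvalue

# Turn the dict back into [key, total] sublists, ordered by key.
sums = [[key, mydict[key]] for key in sorted(mydict)]
print(sums)
```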

answered Feb 5, 2012 at 2:40
