list manipulation in python

Question 1

I have a list with sublists in it, e.g. [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]], which is created by appending values to it inside a for loop.

I am working in Python, and I want to add the 2nd elements of the sublists whose 1st elements are the same. In my example, I want to add 2+56 (both have 1 as the 1st element), 787+98 (both have 2 as the 1st element), and keep 90 as it is because only one sublist has 3 as its 1st element.

I'm not sure how to do this.

Here is my code:

import urllib, re
from itertools import groupby
import collections
import itertools, operator

text = urllib.urlopen("some html page").read()
# storing contents from the BODY tag
data = re.compile(r'.*?<BODY>(.*?)<HR>', re.DOTALL).match(text).group(1)
values = [line.split() for line in data.splitlines()]  # list with the BODY data
# values contains elements like [[65, 67], [112, 123, 12], [387, 198, 09]].
# It contains elements of length 2 and 3.
# I am only concerned with the elements of length 3.
# In the for loop, I am doing this, and passing it to 2 functions.

def function1 (docid, doclen, tf):
    new=[];
    avgdoclen = 288;
    tf = float(x[2]);
    doclen = float(x[1]);
    answer1 = tf / (tf + 0.5 + (1.5*doclen/avgdoclen));
    q = function2(docid, doclen, tf)
    production = answer1 * q  # this is the production of
    new.append(docid)  # I want to add all the production values where docid are same.
    new.append(production)
    return answer1

def function2 (docid, doclen, tf):
    avgdoclen = 288;
    querylen = 12;
    tf= float(x[2]);
    answer2 = tf/(tf + 0.5 + (1.5*querylen/avgdoclen));
    return answer2

for x in values:
    if len(x)==3:
        okapi_doc(x[0], x[1], x[2])
        okapi_query(x[0], x[1], x[2])

I want to add all the production values where the docid is the same. Now when I print new, I get the following output:

['112', 0.3559469323909391]
['150', 0.31715060007742935]
['158', 0.122025819265144]
['176', 0.3862207694241891]
['188', 0.5057900225015092]
['236', 0.12628982528263102]
['251', 0.12166336633663369]
['334', 0.5851519557155408]

This is not a list. When I print new[0][0] I get 1. I want to get 112 when I print new[0][0]. Is there something wrong with append?

Question 2

Well the first thing I see is that in function1, you create production and new and then throw both away.

Question 3

Append works fine. What do you think happens to new after the function exits? You have to return it and put it in a list to get a list of news.

Question 4

@ghbhatt: "new" obviously isn't the whole list, it's simply the temporary name you give to each of the 2-element lists there. new[0] is the first entry, i.e. the string "112", and new[0][0] is the first character of that string, i.e. "1". You're not actually accumulating anything, because as senderle notes, you throw it away.
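To make the comments above concrete, here is a minimal sketch (Python 3) of returning each [docid, production] pair from the function and collecting the pairs in an outer list. The names okapi_doc and okapi_query are borrowed from the loop in the question, the constants 288 and 12 come from the question's code, and the sample rows are made up for illustration. This is one possible arrangement, not necessarily what the asker ended up with.

```python
AVG_DOC_LEN = 288  # constant taken from the question's code
QUERY_LEN = 12     # constant taken from the question's code


def okapi_query(tf):
    # Query-side factor, same formula as function2 in the question.
    return tf / (tf + 0.5 + 1.5 * QUERY_LEN / AVG_DOC_LEN)


def okapi_doc(docid, doclen, tf):
    # Document-side factor, same formula as function1 in the question.
    doc_score = tf / (tf + 0.5 + 1.5 * doclen / AVG_DOC_LEN)
    # Return the pair instead of building a local list that is thrown away.
    return [docid, doc_score * okapi_query(tf)]


# Hypothetical rows as produced by line.split(): every field is a string.
values = [["65", "67"], ["112", "123", "12"], ["387", "198", "9"], ["112", "200", "3"]]

new = []  # the outer list lives outside the function
for x in values:
    if len(x) == 3:
        new.append(okapi_doc(x[0], float(x[1]), float(x[2])))

print(new[0])        # first [docid, production] pair
print(new[0][0])     # the docid string of the first pair
print(new[0][0][0])  # the first character of that docid string
```

With the outer list in place, new[0][0] is the whole docid string, and only new[0][0][0] is its first character.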

Question 5

Also, why are the semicolons in this code?

Question 6

@DSM: I want to store the docids and production values in a new list. How can I do that? Please help. I am new to Python, hence the syntax errors.

Question 7

This is pretty straightforward. dict.get(key, default) returns the value if the key exists, or a default.

totals = {}
for k,v in data:
 totals[k] = totals.get(k, 0) + v
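As a quick check of the snippet above, here is the same loop wrapped in a small function and run on the sample data from the question (Python 3; the function name sum_by_key is only illustrative).

```python
def sum_by_key(pairs):
    """Sum the second element of each pair, grouped by the first element."""
    totals = {}
    for k, v in pairs:
        # get() returns 0 the first time a key is seen, so no KeyError.
        totals[k] = totals.get(k, 0) + v
    return totals


data = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]
totals = sum_by_key(data)
print(totals)  # {1: 58, 2: 885, 3: 90}
```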

Question 8

Your code is the same as: from collections import defaultdict;totals = defaultdict(int); for k, v in data: totals[k] += v

Question 9

And based on the comments to questions below, your version and my modification will not work.

Question 10

@hughdbrown, you're misunderstanding the comments. They're referring to the Counter-based solution. The defaultdict-based solution is correct.

Question 11

@senderle -- interesting. I thought that Counter was just a python 2.7 specialization of collections.defaultdict(int).

Question 12

@hughdbrown, it is, but Counter just counts instances of a particular key in a flat sequence. It doesn't do any summation or anything like that. In other words, for Counter to be helpful here, you'd have to pass it a list like this: [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3....].
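A small sketch of the distinction being discussed (Python 3, sample data from the question). Passing a flat sequence to Counter counts occurrences, while calling dict() on the list of pairs silently keeps only the last value per key, which is why the Counter(dict(...)) idea cannot give sums. A Counter can still be used as a plain accumulator with +=, in the same way as defaultdict(int). That last part is my addition, not something claimed in the thread.

```python
from collections import Counter

# Counting: Counter tallies how often each item occurs in a flat sequence.
flat = [1, 1, 2, 3, 3, 3]
counts = Counter(flat)
print(counts[1], counts[2], counts[3])  # 2 1 3

# dict() on pairs: later values overwrite earlier ones for the same key.
pairs = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]
collapsed = dict(pairs)
print(collapsed)  # {1: 56, 2: 98, 3: 90}

# Accumulating: missing keys in a Counter read as 0, so += works directly.
sums = Counter()
for key, value in pairs:
    sums[key] += value
print(dict(sums))  # {1: 58, 2: 885, 3: 90}
```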

Question 13

This might be a job for itertools:

>>> import itertools, operator
>>> l = sorted([[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]])
>>> keys_groups = itertools.groupby(l, key=operator.itemgetter(0))
>>> sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups]
>>> sums
[[1, 58], [2, 885], [3, 90]]

Note that for groupby to work as expected, the items have to be sorted by the key given. In this case, since the key is the first item in each pair, a plain sorted() call was enough; for a more general solution, you should pass the same key parameter when sorting the list.

>>> l2 = [[787, 2], [98, 2], [90, 3], [2, 1], [56, 1]]
>>> l2.sort(key=operator.itemgetter(1))
>>> l2
[[2, 1], [56, 1], [787, 2], [98, 2], [90, 3]]
>>> keys_groups = itertools.groupby(l2, key=operator.itemgetter(1))
>>> sums = [[key, sum(i[0] for i in group)] for key, group in keys_groups]
>>> sums
[[1, 58], [2, 885], [3, 90]]
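To see why the sort matters, here is a short sketch (Python 3) that runs the same groupby expression on a shuffled copy of the question's data, first without sorting and then with it. The helper name sum_groups is only illustrative.

```python
import itertools
import operator

first = operator.itemgetter(0)


def sum_groups(items):
    # groupby only merges *adjacent* items that share a key.
    return [[key, sum(value for _, value in group)]
            for key, group in itertools.groupby(items, key=first)]


pairs = [[1, 2], [2, 787], [1, 56], [3, 90], [2, 98]]

unsorted_result = sum_groups(pairs)
print(unsorted_result)  # [[1, 2], [2, 787], [1, 56], [3, 90], [2, 98]]

sorted_result = sum_groups(sorted(pairs, key=first))
print(sorted_result)    # [[1, 58], [2, 885], [3, 90]]
```

Without the sort, keys 1 and 2 each show up twice in the output because their pairs are not next to each other.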

Works fine with the data you posted. I edited it a bit to make the example more realistic.

>>> l = [['112', 0.3559469323909391], ['150', 0.31715060007742935],
...      ['158', 0.122025819265144], ['176', 0.3862207694241891],
...      ['188', 0.5057900225015092], ['377', 0.12628982528263102],
...      ['251', 0.12166336633663369], ['334', 0.5851519557155408],
...      ['334', 0.14663484486873507], ['112', 0.2345038167938931],
...      ['377', 0.10694516971279373], ['112', 0.28981132075471694]]
>>> l.sort(key=operator.itemgetter(0))
>>> keys_groups = itertools.groupby(l, key=operator.itemgetter(0))
>>> sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups]
>>> sums
[['112', 0.88026206993954914], ['150', 0.31715060007742935], 
 ['158', 0.122025819265144], ['176', 0.38622076942418909], 
 ['188', 0.50579002250150917], ['251', 0.12166336633663369], 
 ['334', 0.73178680058427581], ['377', 0.23323499499542477]]

Note that as WolframH points out, sorting will generally increase the time complexity; but Python's sort algorithm is smart enough to make use of runs in data, so it might not -- it all depends on the data. Still, if your data is highly anti-sorted, Winston Ewert's defaultdict-based solution may be better. (But ignore that first Counter snippet -- I have no idea what's going on there.)

A couple of notes on how to create a list -- there are lots of ways, but the two basic ways in Python are as follows -- first a list comprehension:

>>> def simple_function(x):
...     return [x, x ** 2]
... 
>>> in_data = range(10)
>>> out_data = [simple_function(x) for x in in_data]
>>> out_data
[[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 49], [8, 64], [9, 81]]

And second, a for loop:

>>> out_data = []
>>> for x in in_data:
...     out_data.append(simple_function(x))
... 
>>> out_data
[[0, 0], [1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36], [7, 49], [8, 64], [9, 81]]

Question 14

it gives me the following error: "sums = [[key, sum(i[1] for i in group)] for key, group in keys_groups] TypeError: 'float' object is not subscriptable"

Question 15

@ghbhatt: please print out what your list is; this code should work perfectly.

Question 16

Sorting increases the time complexity from O(n) to O(n log n).

Question 17

@ghbhatt: those terms work fine with this code, you're missing something. Could you post the results of set(tuple(map(type, x)) for x in whatever_your_list_is_called)? Something there isn't what you think it is.
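For anyone following along, here is a sketch of what that diagnostic shows (Python 3, made-up rows). On a well-formed list it returns a single shape, (str, float). The helper describe is a hypothetical variant I added so that a bare float in the list is reported rather than raising an error of its own.

```python
clean = [["112", 0.35], ["150", 0.31]]
# The expression from the comment: one tuple of types per row, de-duplicated.
clean_shapes = set(tuple(map(type, x)) for x in clean)
print(clean_shapes)  # {(<class 'str'>, <class 'float'>)}


def describe(item):
    # Report the type names inside a row, or the type name of a non-row item.
    if isinstance(item, (list, tuple)):
        return tuple(type(field).__name__ for field in item)
    return type(item).__name__


# Hypothetical broken input: one row too long, one item that is not a row.
rows = [["112", 0.35], ["150", 0.31], ["158", 0.12, "extra"], 0.5]
shapes = {describe(x) for x in rows}
print(shapes)
```

Any shape other than ('str', 'float') in the second result points at an entry that the groupby expression cannot handle.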

Question 18

@WolframH, good point. But it does depend on the data; timsort is pretty clever about taking advantage of runs in semi-sorted data.

Question 19

import collections
result = collections.defaultdict(int) # works like a dictionary
# but all keys have a default value of zero
for key, value in mylist:
 result[key] += value 
print result
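The same idea as a self-contained Python 3 sketch, using made-up docid/score pairs shaped like the asker's output. The values are chosen so the float sums are exact. defaultdict(float) is used here only because the values are floats; defaultdict(int) behaves the same way, since 0 + 0.25 is still 0.25.

```python
from collections import defaultdict

# Hypothetical [docid, score] pairs in the shape the question prints.
mylist = [["112", 0.25], ["334", 0.5], ["112", 0.5], ["334", 0.125]]

result = defaultdict(float)  # missing keys start at 0.0
for key, value in mylist:
    result[key] += value

print(dict(result))  # {'112': 0.75, '334': 0.625}
```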

Question 20

it gives me the following error: "for key, value in new_list: ValueError: too many values to unpack"

Question 21

Your first bit of code there won't work, since 'dict'-ing the list of lists has key collisions.

Question 22

Your Counter example doesn't work. dict keeps only the last value for each key.

Question 23

@ghbhatt, make sure all your "pairs" in the list are actually pairs. There's one item in there that contains three or more items.

Question 24

@RobWouters: yes indeed there is. When I run the 1st version, I get the following error: "print collections.Counter(dict(new_list)) ValueError: dictionary update sequence element #0 has length 3; 2 is required." But when I print len(new_list) it shows 2 for every element. I do not understand where the value is 3.
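One way to locate the offending entries is to check the length of each element rather than of the list itself. A minimal sketch (Python 3) with made-up data follows. Note that len(new_list) is the number of entries in the outer list, so it says nothing about how long each entry is.

```python
# Hypothetical list in which one entry is not a pair.
new_list = [["112", 0.25], ["150", 0.5, "oops"], ["158", 0.125]]

# Collect (index, entry) for every entry that does not have exactly 2 items.
bad = [(i, item) for i, item in enumerate(new_list) if len(item) != 2]
print(bad)  # [(1, ['150', 0.5, 'oops'])]

# Unpacking "for key, value in ..." is only safe once the non-pairs are gone.
pairs_only = [item for item in new_list if len(item) == 2]
for key, value in pairs_only:
    print(key, value)
```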

Question 25

The fact that you:

want to add the 2nd element of each sublist where the 1st elements are same

makes me think that you want to be using a dict rather than a list - a dict is optimised for retrieving the 2nd value based on the 1st

Some code along the lines of:

oldvalue = mydict.get(firstvalue, 0)
newvalue = oldvalue + secondvalue
mydict[firstvalue] = newvalue

would let you build up the dict as you go - or if that's not feasible, it will let you calculate the sums in only a single pass over the list.

Quick spin in the interpreter just to test this out:

>>> l = [[1, 2], [1, 56], [2, 787], [2, 98], [3, 90]]
>>> mydict = {}
>>> for firstvalue, secondvalue in l:
...     oldvalue = mydict.get(firstvalue, 0)
...     newvalue = oldvalue + secondvalue
...     mydict[firstvalue] = newvalue
... 
>>> print mydict
{1: 58, 2: 885, 3: 90}

Looks fairly close to what you want.
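If a list of [key, sum] pairs is needed rather than a dict, as in the question, the dict can be converted back at the end. A small sketch (Python 3), assuming the keys are sortable:

```python
mydict = {1: 58, 2: 885, 3: 90}

# items() yields (key, value) tuples; sorting them orders the output by key.
sums = [[key, total] for key, total in sorted(mydict.items())]
print(sums)  # [[1, 58], [2, 885], [3, 90]]
```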

adriaticc · Answer 1 · 2012-02-05 00:09:10Z


senderle · Answer 2 · 2012-02-04 23:49:50Z


Winston Ewert · Answer 3 · 2012-02-04 23:32:38Z


James Polley · Answer 4 · 2012-02-05 02:40:07Z

