Issue 14828: itertools.groupby not working as expected

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59033

classification

Title:	itertools.groupby not working as expected
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	not a bug
Assigned To:	Nosy List:	Jiba, petri.lehtinen
Priority:	normal	Keywords:

Created on 2012年05月16日 09:51 by Jiba, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Messages (4)
msg160822 - (view)	Author: Jiba (Jiba)	Date: 2012年05月16日 09:51
In some situation, itertools.groupby fails to group the objects, and produces several groups with the same key. For example, the following code : from itertools import * class P(object): def __init__(self, key): self.key = key p1 = P(1) p2 = P(2) p3 = P(1) for key, ps in groupby([p1, p2, p3], lambda p: p.key): print "group", key for p in ps: print " - object", p Produces the following result : group 1 - object <__main__.P object at 0xb73d6acc> group 2 - object <__main__.P object at 0xb73d6aec> group 1 - object <__main__.P object at 0xb73d6b0c> While I would expect to have only a single group 1, e.g. something like : group 1 - object <__main__.P object at 0xb73d6acc> - object <__main__.P object at 0xb73d6b0c> group 2 - object <__main__.P object at 0xb73d6aec> It seems that this bug also affects Python 3 (tested on Python 3.1.2)
msg160825 - (view)	Author: Petri Lehtinen (petri.lehtinen) * (Python committer)	Date: 2012年05月16日 10:23
groupby() changes the group when the key changes in the input it iterates. If you want to have p1 and p3 to go to the same group, you need to sort the input by P.key first. This is clearly documented, too: The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.
msg160828 - (view)	Author: Jiba (Jiba)	Date: 2012年05月16日 11:08
Ok, I understand. However, in my initial problem, the sequence passed to groupby was a set, e.g. (modifying my previous example) : groupby(set([p1, p2, p3]), lambda p: p.key) If I understand well how groupby() works, the result of a groupby performed on a set is unpredictable, since it depends of the order of the items when iterating over the set. Perhaps the behavior of groupby() should be modified for unsorted sequences, possibly not taking the order into account, or raising an error ?
msg160829 - (view)	Author: Petri Lehtinen (petri.lehtinen) * (Python committer)	Date: 2012年05月16日 11:17
You're right, the result over a set would be unpredictable. The point of the itertools module is to be able to a) cope with massive amounts of data and b) be a set of tools instead of complete solutions for all problems. Because of both of the points above, groupby() doesn't load all the data into memory or attempt to sort the data by itself. Furthermore, there's no way for groupby() to know whether the iterable it's passed is going to yield sorted or unsorted data. It's your responsibility to know whether the iterable you're passing is already sorted or not, and sort it first, if it's possible and there's a need to do so.

History
Date	User	Action	Args
2022年04月11日 14:57:30	admin	set	github: 59033
2012年05月16日 11:17:40	petri.lehtinen	set	messages: + msg160829
2012年05月16日 11:08:15	Jiba	set	messages: + msg160828
2012年05月16日 10:23:07	petri.lehtinen	set	status: open -> closed nosy: + petri.lehtinen messages: + msg160825 resolution: not a bug
2012年05月16日 09:51:34	Jiba	create

homepage