So I've run into a few occasions where I'd want to sort a list according to a criteria, then for all loosely sorted sublists (a sequence of equal elements), I'd want to sort those sublists according to another criteria. I previously did something like this (assuming there are m
criteria.
sorted(sorted(sorted...(sorted(lst, key = keys[m]), key = keys[m-1]), ..., key = keys[1]), key = keys[0]).
Generally, I had around two sorting criteria so it wasn't an issue. Working on a problem that required sorting according to 3 criteria led me to generalise my method, which led me to noticing that it had atrocious resource usage (best case time complexity is O(m*s)
(where s
is the time complexity of sorted()
, and best case space complexity is O(m*n)
(where n
is the length of the list). This prompted me to seek more efficient ways of doing this, and after some thought, discussion and feedback from others, here's what I have:
ChainedSort.py
from collections import deque
"""
* Finds all sublists that have equal elements (according to a provided function) in a given sorted list.
* Params:
* `lst`: The sorted list to be searched.
* `f`: The function that is used to compare the items in the list.
* Return:
* A list containing 2 tuples of the form `(start, stop)` where `start` and `stop` give the slice of equal values in the list. If no elements in the list are equal this is empty.
"""
def checkSort(lst, f = lambda x: x): #Finds all sublists that have equal elements in a given sorted list.
i, slices, ln = 0, [], len(lst)
while i < ln-1:
if f(lst[i]) == f(lst[i+1]):
j = i+1
while f(lst[i]) == f(lst[j]):
if j+1 == ln: break
j += 1
slices.append((i, j)) #The slice can be directly accessed using `lst[tpl[0]:tpl[1]]` where `tpl` is the tuple denoting the slice.
i = j
continue #`i` should not be further increased.
i += 1 #Increment `i` normally if `i+1` != `i`.
return slices
"""
* Returns a sorted list whose elements are drawn from a given iterable, according to a list of provided keys.
* It is equivalent to a lazily evaluated form of `sorted(lst, key = lambda x: (key1(x), key2(x), key3(x) [,..., keym(x)]))`. The lazy evaluation provides the same `O(s)` (where `s` is the time complexity of `sorted()`) best case as a simple `sorted()` with only one key. On the other hand, the best case of the above (due to its strict evaluation) is `O(m*s)` where `m` is the number of supplied keys. This function would be very useful in cases where some of the key functions are expensive to evaluate. The process can be described as:
* Elements are ordered according to criterion 1.
* Elements equal by criterion 1 are ordered according to criterion 2.
* ...
* ...
* ...
* Elements equal by criterion i are ordered according to criterion i+1.
* ...
* ...
* ...
* Elements equal by criterion n-1 are ordered according to criterion n.
* This process is being referred to as "Chained Sort".
* Algorithm for Chained Sort:
* Step 1: lst := sorted(itr, keys[i]), ln := len(keys)
* Step 2: i := i+1
* Step 3: If i < ln and some elements in the list are equal according to criterion i:
* Step 4: while there exist elements equal according to criterion i:
* Step 5: Find the slices of all elements equal according to criterion i.
* Step 6: Sort those slices according to criterion i.
* Step 7: i := i+1
* [End of Step 4 while loop.]
* [End of Step 3 if block.]
* Step 8: Return lst.
* Params:
* `itr`: The iterable to be sorted.
* `keys`: A deque containing all the sorting keys, in the order in which they are to be evaluated.
* Return:
* `lst`: The sorted contents of `itr`, after applying the chained sort.
"""
def chainedSort(itr, keys = deque([lambda x: x])):
lst = sorted(itr, key = keys.popleft())
check = [] if not keys else checkSort(lst, keys.popleft())
while keys and check:
k = keys.popleft()
for tpl in check: #Sort all equal slices with the current key.
i, j = tpl[0], tpl[1]
lst[i:j] = sorted(lst[i:j], key = k)
if keys:
k, tmp = keys.popleft(), []
"""
* We only need to check those slices of the list that were not strictly sorted under the previous key. Slices of the list that were strictly sorted under a previous key should not be sorted under the next key. As such, rather than iterating through the list every time to find the loosely sorted elements, we only need to search among the loosely sorted elements under the previous key, as the set of loosely sorted elements cannot increase upon successive sorts.
"""
for tpl in check:
i, j = tpl[0], tpl[1]
tmp.extend(checkSort(lst[i:j], k))
check = tmp
else: check = []
return lst
2 Answers 2
I don't see any good reason to require that the keys are put in a deque. It feels like clobbering the calling site. Instead I’d use variable number of arguments to ease usage:
def chained_sort(iterable, *keys):
if not keys:
keys = deque([lambda x: x])
else:
keys = deque(keys)
# etc
Also note:
- that using a mutable default argument is prone to errors;
- the use of
snake_case
for the function name to conform to PEP8; - it would be a good idea to add a
reverse
argument to mimic thesorted
signature.
Now I have the feeling that the sorting behaviour of Python already does what you want. No need to extend it. In fact, sorting tuples in Python already uses multiple stages in that it sort on the first element first, then on the second element, then on the third, if any, and so on:
>>> sorted([(1, 2, 3), (4, 5, 6), (1, 1, 1), (2, 2, 2), (4, 4, 4), (2, 1, 2), (2, 1, 0)])
[(1, 1, 1), (1, 2, 3), (2, 1, 0), (2, 1, 2), (2, 2, 2), (4, 4, 4), (4, 5, 6)]
So you can leverage this behaviour to implement yours: just have the key argument of sorted
return a tuple of all the keys for each item, and let Python sort those tuples:
def chained_sort(iterable, *keys, reverse=False):
if not keys:
key = None
else:
def key(item):
return tuple(f(item) for f in keys)
return sorted(iterable, key=key, reverse=reverse)
Usage being
>>> import operator
>>> a = [{'one': 'foo', 'two': 42}, {'one': 'bar', 'two': 1337}, {'one': 'baz', 'two': 1664}, {'one': 'foo', 'two': 1492}, {'one': 'bar', 'two': 2}, {'one': 'baz', 'two': 0}]
>>> chained_sort(a, operator.itemgetter('one'), lambda x: x['two'] < 1000)
[{'one': 'bar', 'two': 1337}, {'one': 'bar', 'two': 2}, {'one': 'baz', 'two': 1664}, {'one': 'baz', 'two': 0}, {'one': 'foo', 'two': 1492}, {'one': 'foo', 'two': 42}]
Lastly docstrings should come just after the function declaration, not before. Read PEP257 for indsights.
-
\$\begingroup\$ As far as I'm aware, Python's inbuilt
sorted()
doesn't have lazy evaluation, which means that all key functions are evaluated for each item in the iterable? If I'm wrong about this please correct me. I'll add areverse
parameter, and look into replacing thedeque()
. I am somewhat unclear on camel case, since that's how I learned to code, and the code in my book ("Automate the Boring Stuff with Python") uses it. \$\endgroup\$Tobi Alafin– Tobi Alafin2019年02月12日 13:55:00 +00:00Commented Feb 12, 2019 at 13:55 -
1\$\begingroup\$ Right, you’ll have to call each key on each item and then sort according to the resulting list of tuples. But as far as I can tell, your code eventually does call each key on each item as well; just not necessarily in the same order. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2019年02月12日 13:59:05 +00:00Commented Feb 12, 2019 at 13:59
-
\$\begingroup\$ It doesn't call each key on each item if it's not needed. It evaluates the key function lazily. If items are sorted according to the 1st key, it stops there. They both have the same worst case, but as written, my code would perform better on average (and especially better if the last key functions are costly to evaluate). \$\endgroup\$Tobi Alafin– Tobi Alafin2019年02月14日 01:26:13 +00:00Commented Feb 14, 2019 at 1:26
This algorithm does not sort anything. In fact, it is broken in at least two ways. Let's illustrate that with a simple example:
import operator, itertools
array = list(zip(range(10), itertools.repeat(5), reversed(range(10))))
keys = deque(map(operator.itemgetter, range(3)))
print(chainedSort(array, keys))
print(keys)
Here we are contructing a list of 3-tuples that are already sorted, and we are asking to sort it according to their first element first, then their second one and then the third one. Since their first element is sorted, the principles behind your algorithm should make it so that the second and third comparison function are never executed (since there is no repeating elements according to the first comparison function). However the results are very deceptive:
[(8, 5, 1), (7, 5, 2), (6, 5, 3), (5, 5, 4), (4, 5, 5), (3, 5, 6), (2, 5, 7), (1, 5, 8), (0, 5, 9), (9, 5, 0)]
deque([])
Not only is it not sorted in any meaningful way, but the keys have been completely depleted; meaning that all 3 functions have been called. This is something that was meant to be avoided. Also note that the keys
passed to this function are modified in-place, which is an unexpected side-effect that should at least be documented, at best avoided. Using variable number of arguments is prefered here to simplify the calling site:
def chained_sort(iterable, *keys):
if not keys:
return sorted(iterable)
keys = deque(keys)
# rest of the code
Now about that "sorting" behaviour, I see at least 3 issues:
- you check for item equality with the next key function instead of the same that you just used to sort; this is not desirable as you are trying to find groups of elements that compare equal according to the actual sort so you can sort them with the following function: you must use the same key function that you used to sort; this is why in my example the result is mostly reversed, the second key function find the whole list equal so your algorithm end up sorting it according to the third key function;
- your index management is a mess in
checkSort
and mainly lead to skipping the last element in your slices; - you call
checkSort
on sub-lists but apply the resulting slices on the original list, meaning you won't sort sub-lists but random bit of the original list.
Fixing the second point is easy enough as you really don't need to special case for i+1
before assigning to j
. Just use a simpler loop and then find out if the slice [i:j]
represents more than one element to add it to slices
. A simple substraction tells you how many items are in the slice. Two other points worth noting: you should cache the value of f(lst[i])
instead of recomputing it for all elements that compare equal, and you can use a slice
object instead of storing these weird tuples:
>>> a = 'just a testing string'
>>> s = slice(7, 14)
>>> a[s]
'testing'
>>> a[7:14]
'testing'
Also transforming this function into a generator can simplify its writting:
def check_sort(array, cmp_function):
i, ln = 0, len(array)
while i < ln-1:
key = cmp_function(array[i])
j = i + 1
while j != ln and key == cmp_function(array[j]):
j += 1
if j - i > 1: # Keep only slices of more than 1 elements
yield slice(i, j)
i = j
But while loops feels weird so I’d rather write it like:
def check_sort(array, cmp_function):
i = None
key = None
for j, item in enumerate(array):
if i is None:
i = j
key = cmp_function(item)
continue
keyed = cmp_function(item)
if keyed != key:
if j - i > 1:
yield slice(i, j)
i = j
key = keyed
if i is not None and i != j:
# Special case of the last slice as it can't be
# generated from the loop. Be sure to check that
# 1) it exists and 2) it is more than 1 element.
yield slice(i, j + 1)
But, all in all, we’re still left with 2 issues.
So, what you really need is a way to group elements by the same key you used to sort them. Fortunately, this is exactly what itertools.groupby
is for. Once your groups are formed, just check their length to know if you must apply the next key function on this very same group or if you are done for this path. The simplest implementation uses recursion:
import itertools
def check_sorted(iterable, key, next_keys=(), reverse=False):
for _, group in itertools.groupby(iterable, key=key):
grouped = list(group)
if len(grouped) > 1:
yield chained_sort(grouped, *next_keys, reverse=reverse)
else:
yield grouped
def chained_sort(iterable, *keys, reverse=False):
if not keys:
return sorted(iterable, reverse=reverse)
keys = iter(keys)
key = next(keys)
result = sorted(iterable, key=key, reverse=reverse)
return list(itertools.chain.from_iterable(check_sorted(result, key, keys, reverse)))
Since recursion is limited in Python, you’ll have to write a looping version if you plan on supporting more than 500 key functions at once.
Explore related questions
See similar questions with these tags.
input = [[0, -1, 1], [1, 2, 3]]
, andoutput = [[1, 2, 3], [-1, 0, 1]]
. \$\endgroup\$