This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2015年02月20日 16:25 by wbolster, last changed 2022年04月11日 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| repro-sorting.py | Tiger-222, 2021年09月02日 15:43 | |||
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 8131 | merged | methane, 2018年07月06日 07:52 | |
| Messages (18) | |||
|---|---|---|---|
| msg236302 - (view) | Author: wouter bolsterlee (wbolster) * | Date: 2015年02月20日 16:25 | |
The JSON encoder uses a lambda function for the sort(key=...) invocation used to sort the keys in a JSON object in case sort_keys=True is passed: https://hg.python.org/cpython/file/46bfddb14cbe/Lib/json/encoder.py#l352 Instead of having a lambda, operator.itemgetter(0) can be used here, which is a minor performance gain. It should be noted that many other internals of the JSON encoder have also been fine-tuned (manually) for performance reasons. |
|||
| msg236345 - (view) | Author: Josh Rosenberg (josh.r) * (Python triager) | Date: 2015年02月21日 00:43 | |
Is it even legal to have non-string keys in a JSON object? If they must be strings, and they must be unique, I don't think a key argument is necessary (and it would save the generation of the key array; not doing the work is faster than doing the work more efficiently after all), since the default tuple comparison would work fine; the first element would always be unequal, so the second elements would never be compared, right? I'm not 100% on this with the rich comparison operator approach, but my attempts to trigger a failure haven't worked (TimSort or the tuple comparison, or both, are probably smarter about this than I am). |
|||
| msg236423 - (view) | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2015年02月23日 01:13 | |
I'm -0 on this proposal. There is a small speed-up to be had here: $ python3.4 -m timeit -s 'f=lambda kv: kv[0]' -s 's=[10, 20]' 'f(s)' 10000000 loops, best of 3: 0.105 usec per loop $ python3.4 -m timeit -s 'import operator' -s 'f=operator.itemgetter(0)' -s 's=[10, 20]' 'f(s)' 10000000 loops, best of 3: 0.0717 usec per loop That said, it is applied only n-times and is likely insignificant when compared to the O(n log n) sort. I think it is better to not add an intermodule dependency when we don't have to. Also if you're using CPython, then the C version of JSON is used instead of this code. So, any timings or optimizations should be tested against the other implementations (Jython or PyPy) that use them. In those implementations, I don't think itemgetter() has any performance benefit over the lambda. |
|||
| msg236434 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年02月23日 10:44 | |
CPython: $ python3.4 -m timeit -s 'f = lambda kv: kv[0]' -s 's = list(dict.fromkeys(range(1000)).items())' -- 'sorted(s, key=f)' 1000 loops, best of 3: 904 usec per loop $ python3.4 -m timeit -s 'import operator' -s 'f = operator.itemgetter(0)' -s 's = list(dict.fromkeys(range(1000)).items())' -- 'sorted(s, key=f)' 1000 loops, best of 3: 462 usec per loop PyPy: $ pypy -m timeit -s 'f = lambda kv: kv[0]' -s 's = list(dict.fromkeys(range(1000)).items())' -- 'sorted(s, key=f)' 1000 loops, best of 3: 1.23 msec per loop $ pypy -m timeit -s 'import operator' -s 'f = operator.itemgetter(0)' -s 's = list(dict.fromkeys(range(1000)).items())' -- 'sorted(s, key=f)' 1000 loops, best of 3: 1.27 msec per loop |
|||
| msg236435 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2015年02月23日 10:56 | |
If I remember correctly, the complexity and performance of sort/sorted depends if the data set is sorted or not. You may recreated the list/dictionary at each iteration to get performances closer to "items = sorted(dct.items(), key=lambda kv: kv[0])" (dict keys are not sorted by their content, especially with strings). |
|||
| msg236436 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2015年02月23日 11:09 | |
Oh, forget my comment. sorted() never changes the input list, so the microbenchmark is ok. |
|||
| msg236437 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年02月23日 11:10 | |
Unless we use recent PyPy with ordered dicts. |
|||
| msg236438 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2015年02月23日 11:15 | |
> That said, it is applied only n-times and is likely insignificant when compared to the O(n log n) sort. (...) 904 usec => 462 usec is very significant: it's 49% faster. So I'm ok for the change. Note: PyPy JIT may not be able to optimize operator.itemgetter, whereas it optimizes the simple lambda function. You may report this performance issue to PyPy. PyPy may use different code in json for best performances. |
|||
| msg236439 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年02月23日 11:26 | |
My interpretation of the results. In CPython using operator.itemgetter() makes sorting up to 2 times faster (only 25% faster with string keys), in PyPy it make sorting slightly slower (because operator.itemgetter() is implemented in Python, but the lambda is more specialized and could be better optimized). The lambda looks more clear to me, and I think there is the possibility to optimize CPython to replace the body of such functions with builtins from the operator module. |
|||
| msg236440 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2015年02月23日 11:30 | |
"My interpretation of the results. In CPython using operator.itemgetter() makes sorting up to 2 times faster" If I remember correctly, Python functions implemented in C don't create a Python frame. The Python frame is an important cost in term of performances. |
|||
| msg236441 - (view) | Author: wouter bolsterlee (wbolster) * | Date: 2015年02月23日 12:57 | |
Using IPython and CPython 3.4: >>> d = dict.fromkeys(map(str, range(1000))) Current implementation: >>> %timeit sorted(d.items(), key=lambda kv: kv[0]) 1000 loops, best of 3: 605 μs per loop >>> %timeit sorted(d.items(), key=lambda kv: kv[0]) 1000 loops, best of 3: 614 μs per loop Proposed change: >>> %timeit sorted(d.items(), key=operator.itemgetter(0)) 1000 loops, best of 3: 527 μs per loop >>> %timeit sorted(d.items(), key=operator.itemgetter(0)) 1000 loops, best of 3: 523 μs per loop Alternative without 'key' arg, since all keys in a JSON object must be strings, so the tuples from .items() can be compared directly.: >>> %timeit sorted(d.items()) 1000 loops, best of 3: 755 μs per loop >>> %timeit sorted(d.items()) 1000 loops, best of 3: 756 μs per loop As you can see, the operator approach seems the fastest on CPython 3.4, even faster than not having a 'key' arg at all (possibly because it avoids doing a tuple compare, which descends into a string compare for the first item). |
|||
| msg236442 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年02月23日 13:13 | |
Could you measure the effect on json.encode()? |
|||
| msg236545 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015年02月24日 21:05 | |
May be there is a time to optimize creating Python frames (at least for the case when one frame is created and destroyed multiple times). May be frame pool? Or cached one frame per function? |
|||
| msg321157 - (view) | Author: Inada Naoki (methane) * (Python committer) | Date: 2018年07月06日 07:47 | |
In C version, we don't use key func.
items = PyMapping_Items(dct);
if (items == NULL)
goto bail;
if (s->sort_keys && PyList_Sort(items) < 0) {
Like that, how about removing key function completely?
|
|||
| msg321171 - (view) | Author: Inada Naoki (methane) * (Python committer) | Date: 2018年07月06日 12:32 | |
I can't confirm significant performance benefit. sort_keys is not bottleneck of pure Python encoder. BTW, benefit from C encoder is more significant. It's about 3x faster. Adding `indent` option support to C encoder may be good target for new contributors who want to write C. |
|||
| msg321195 - (view) | Author: Inada Naoki (methane) * (Python committer) | Date: 2018年07月06日 23:55 | |
New changeset e25399b40cd15620e77c9ad2ed24549006ae9b47 by INADA Naoki in branch 'master': bpo-23493: json: Change sort_keys in Python encoder same to C (GH-8131) https://github.com/python/cpython/commit/e25399b40cd15620e77c9ad2ed24549006ae9b47 |
|||
| msg400929 - (view) | Author: Mickaël Schoentgen (Tiger-222) * | Date: 2021年09月02日 15:43 | |
I am wondering why the change was not backported to 3.6 and 3.7?
It introduces different behavior.
For instance, I need to keep duplicate keys from JSON data (because it is allowed by the RFC and it is a missing feature for tools such like HTTPie).
Have a look at repro-sorting.py.
On Python 3.6 and 3.7, the output is not sorted:
{
"dps": {
"1630064726": 5.0,
"1630064726": 3.0,
"1630064726": 6.0
}
}
Starting with Python 3.8, the output is sorted as expected:
{
"dps": {
"1630064726": 3.0,
"1630064726": 5.0,
"1630064726": 6.0
}
}
I could open pull requests for both 3.6 and 3.7 branches, if you think it is worth and allowed by the current maintenance status.
|
|||
| msg400934 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2021年09月02日 16:04 | |
> I am wondering why the change was not backported to 3.6 and 3.7? These branches no longer accept optimizations or bugfixes, only security fixes: https://devguide.python.org/#status-of-python-branches |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:58:13 | admin | set | github: 67681 |
| 2021年09月02日 16:04:31 | vstinner | set | messages: + msg400934 |
| 2021年09月02日 15:43:59 | Tiger-222 | set | files:
+ repro-sorting.py nosy: + Tiger-222 messages: + msg400929 |
| 2018年07月06日 23:55:06 | methane | set | messages: + msg321195 |
| 2018年07月06日 12:32:12 | methane | set | status: open -> closed resolution: rejected messages: + msg321171 stage: patch review -> resolved |
| 2018年07月06日 07:52:44 | methane | set | keywords:
+ patch stage: patch review pull_requests: + pull_request7709 |
| 2018年07月06日 07:47:44 | methane | set | nosy:
+ methane messages: + msg321157 |
| 2015年03月02日 07:58:25 | ezio.melotti | set | nosy:
+ ezio.melotti |
| 2015年02月24日 21:05:40 | serhiy.storchaka | set | messages: + msg236545 |
| 2015年02月24日 01:09:43 | Arfrever | set | nosy:
+ Arfrever |
| 2015年02月23日 13:13:05 | serhiy.storchaka | set | messages: + msg236442 |
| 2015年02月23日 13:09:10 | serhiy.storchaka | set | status: languishing -> open |
| 2015年02月23日 12:57:49 | wbolster | set | messages: + msg236441 |
| 2015年02月23日 11:30:10 | vstinner | set | messages: + msg236440 |
| 2015年02月23日 11:26:40 | serhiy.storchaka | set | priority: normal -> low status: open -> languishing messages: + msg236439 |
| 2015年02月23日 11:15:16 | vstinner | set | messages: + msg236438 |
| 2015年02月23日 11:10:40 | serhiy.storchaka | set | messages: + msg236437 |
| 2015年02月23日 11:09:48 | vstinner | set | messages: + msg236436 |
| 2015年02月23日 10:56:11 | vstinner | set | nosy:
+ vstinner messages: + msg236435 |
| 2015年02月23日 10:44:11 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg236434 |
| 2015年02月23日 01:13:16 | rhettinger | set | versions:
- Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.6 nosy: + rhettinger messages: + msg236423 type: performance |
| 2015年02月21日 00:43:21 | josh.r | set | nosy:
+ josh.r messages: + msg236345 |
| 2015年02月20日 16:25:56 | wbolster | create | |