
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: [subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey)
Type: Stage: patch review
Components: Subinterpreters Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: corona10, h-vetinari, pablogsal, seberg, shihai1991, vstinner
Priority: normal Keywords: patch

Created on 2020-05-05 17:12 by vstinner, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 19939 merged vstinner, 2020-05-05 17:19
PR 23976 closed vstinner, 2020-12-28 15:32
PR 24575 merged vstinner, 2021-02-19 11:50
PR 24596 closed corona10, 2021-02-20 06:19
PR 24597 closed corona10, 2021-02-20 06:24
Messages (11)
msg368182 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-05 17:12
_PyThreadState_GET() gets the current Python thread state from the _PyRuntime.gilstate.tstate_current atomic variable.
When I experimented with a per-interpreter GIL (bpo-40512), I ran into issues with _PyThreadState_GET(), which didn't return the expected Python thread state.
I propose to modify _PyThreadState_GET() in the experimental isolated subinterpreters mode to get and set the current Python thread state using a Thread Local Storage: autoTSSkey.
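To illustrate the idea of keeping the current thread state in a thread-specific storage slot, here is a minimal sketch using the POSIX pthread key API. The names thread_state, tstate_key, tstate_swap() and tstate_get() are hypothetical stand-ins and not CPython's actual code; CPython uses its own PyThread_tss_* wrappers around autoTSSkey.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for PyThreadState; not CPython's real struct. */
typedef struct {
    int id;
} thread_state;

/* One process-wide key; each thread sees its own value behind it
 * (assumed to play the role of autoTSSkey). */
static pthread_key_t tstate_key;
static pthread_once_t tstate_key_once = PTHREAD_ONCE_INIT;

static void create_key(void)
{
    int err = pthread_key_create(&tstate_key, NULL);
    assert(err == 0);
    (void)err;
}

/* Store new_state for the calling thread and return the previous value. */
static thread_state *tstate_swap(thread_state *new_state)
{
    pthread_once(&tstate_key_once, create_key);
    thread_state *old = pthread_getspecific(tstate_key);
    pthread_setspecific(tstate_key, new_state);
    return old;
}

/* Read the calling thread's current state (NULL if none was set). */
static thread_state *tstate_get(void)
{
    pthread_once(&tstate_key_once, create_key);
    return pthread_getspecific(tstate_key);
}

/* Thread body: sets its own state, then returns the state it observes. */
static void *worker(void *arg)
{
    tstate_swap((thread_state *)arg);
    return tstate_get();
}

/* Returns 1 if a second thread's state does not leak into the caller. */
static int demo_isolated(void)
{
    thread_state main_state = {1}, other_state = {2};
    pthread_t t;
    void *seen = NULL;

    tstate_swap(&main_state);
    if (pthread_create(&t, NULL, worker, &other_state) != 0)
        return 0;
    pthread_join(t, &seen);
    return seen == &other_state && tstate_get() == &main_state;
}
```

Calling demo_isolated() should return 1: the worker thread reads back its own state, and the calling thread still sees the state it stored before the worker ran.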
msg368188 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-05 17:56
New changeset e838a9324c1719bb917ca81ede8d766b5cb551f4 by Victor Stinner in branch 'master':
bpo-40522: _PyThreadState_Swap() sets autoTSSkey (GH-19939)
https://github.com/python/cpython/commit/e838a9324c1719bb917ca81ede8d766b5cb551f4
msg380324 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-11-04 13:58
See also bpo-15751: "Make the PyGILState API compatible with subinterpreters".
msg383894 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-28 14:21
There are many ways to get the current interpreter (interp) and the current Python thread state (tstate).
Public C API, opaque function call:
* PyInterpreterState_Get() (new in Python 3.9)
* PyThreadState_Get()
Internal C API, static inline functions:
* _PyInterpreterState_GET()
* _PyThreadState_GET()
There are so many variants that I wrote notes for myself:
https://pythondev.readthedocs.io/pystate.html
This issue is about optimizing _PyInterpreterState_GET() and _PyThreadState_GET() which are supposed to be the most efficient implementations.
Currently, _PyInterpreterState_GET() is implemented as _PyThreadState_GET()->interp, and _PyThreadState_GET() is implemented as:
 _Py_atomic_load_relaxed(_PyRuntime.gilstate.tstate_current)
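As a rough model of the current scheme described above, the following sketch keeps the current thread state in one process-wide atomic pointer and derives the interpreter from it. It uses C11 <stdatomic.h> rather than CPython's _Py_atomic_* macros, and the struct and function names are simplified, hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CPython structs. */
typedef struct interp_state { int id; } interp_state;
typedef struct thread_state { interp_state *interp; } thread_state;

/* Process-wide "current thread state", modeled on
 * _PyRuntime.gilstate.tstate_current. Zero-initialized (NULL). */
static _Atomic(thread_state *) tstate_current;

/* One relaxed atomic load of the global. */
static inline thread_state *get_tstate(void)
{
    return atomic_load_explicit(&tstate_current, memory_order_relaxed);
}

/* Derived from the thread state: a second, dependent load.
 * The caller must ensure a thread state is set. */
static inline interp_state *get_interp(void)
{
    return get_tstate()->interp;
}

static inline void set_tstate(thread_state *ts)
{
    atomic_store_explicit(&tstate_current, ts, memory_order_relaxed);
}
```

Because the variable is shared by every thread, this model only makes sense while a single thread at a time holds the GIL, which is the limitation this issue is about.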
--
To find the _PyInterpreterState_GET() machine code, I read the PyInterpreterState_Get() assembly code (not optimized, it adds a tstate==NULL test) and the PyTuple_New() assembly code, since PyTuple_New() now needs to get the current interpreter:
static struct _Py_tuple_state *
get_tuple_state(void)
{
 PyInterpreterState *interp = _PyInterpreterState_GET();
 return &interp->tuple;
}
To find the _PyThreadState_GET() machine code, I read the PyThreadState_Get() assembly code.
I looked at the x86-64 machine code generated by GCC -O3 (no LTO, no PGO, it should not be relevant here), using GCC 10.2.1 on Fedora 33.
_PyThreadState_GET():
 mov rax,QWORD PTR [rip+0x2292b1] # 0x743118 <_PyRuntime+568>
_PyInterpreterState_GET():
 mov rax,QWORD PTR [rip+0x22a7dd] # 0x743118 <_PyRuntime+568>
 mov rax,QWORD PTR [rax+0x10]
By default, Python is built with -fPIC: _PyRuntime variable does not have a fixed address.
$ objdump -t ./python|grep '\<_PyRuntime\>'
0000000000742ee0 g O .bss	00000000000002a0 _PyRuntime
The "[rip+0x2292b1] # 0x743118 <_PyRuntime+568>" indirection is needed by PIC.
msg383895 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-28 14:25
>_PyInterpreterState_GET():
> mov rax,QWORD PTR [rip+0x22a7dd] # 0x743118 <_PyRuntime+568>
> mov rax,QWORD PTR [rax+0x10]
While working on bpo-39465, I wrote PR 20767 to optimize _PyInterpreterState_GET(): single instruction instead of two:
* Add _PyRuntimeState.interp_current member: atomic variable
* _PyThreadState_Swap() sets _PyRuntimeState.interp_current
But I failed to measure any performance difference.
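The two bullet points above can be sketched as follows: the swap function updates a second atomic variable holding the interpreter, so reading the interpreter becomes one load instead of two. This is only an illustration with C11 atomics and hypothetical names (swap_tstate, interp_current as a plain global); it is not the code of PR 20767.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CPython structs. */
typedef struct interp_state { int id; } interp_state;
typedef struct thread_state { interp_state *interp; } thread_state;

/* Both globals are zero-initialized (NULL). */
static _Atomic(thread_state *) tstate_current;
static _Atomic(interp_state *) interp_current;  /* cached copy of tstate->interp */

/* Install new_ts as the current thread state, keep the cached interpreter
 * in sync, and return the previous thread state. */
static thread_state *swap_tstate(thread_state *new_ts)
{
    thread_state *old = atomic_exchange_explicit(&tstate_current, new_ts,
                                                 memory_order_relaxed);
    atomic_store_explicit(&interp_current,
                          new_ts != NULL ? new_ts->interp : NULL,
                          memory_order_relaxed);
    return old;
}

/* Single load: no need to go through the thread state. */
static inline interp_state *get_interp(void)
{
    return atomic_load_explicit(&interp_current, memory_order_relaxed);
}
```

The cost moves to the swap, which now writes two variables; since swaps are much rarer than reads, that trade-off looks reasonable on paper, even though no difference could be measured here.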
msg383899 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-28 15:53
PR 23976 stores the current interpreter and the current Python thread state into a Thread Local Storage (TLS) using the GCC/clang __thread keyword.
On x86-64 using LTO and GCC -O3, _PyThreadState_GET() and _PyInterpreterState_GET() become a single MOV.
Assembly code using LTO and gcc -O3.
_PyThreadState_GET() in _PySys_GetObjectId():
 0x00000000004adabe <+14>: mov rbx,QWORD PTR fs:0xfffffffffffffff8
_PyThreadState_GET() in PyThreadState_Get():
 0x000000000046b660 <+0>: mov rax,QWORD PTR fs:0xfffffffffffffff8
_PyInterpreterState_GET() in PyTuple_New():
 0x000000000048dfcc <+12>: mov rax,QWORD PTR fs:0xfffffffffffffff0
_PyInterpreterState_GET() in PyState_FindModule():
 0x000000000044bf20 <+16>: mov rax,QWORD PTR fs:0xfffffffffffffff0
---
Note: Without LTO, sometimes there is an indirection:
_PyThreadState_GET() in _PySys_GetObjectId(), 2 MOV (PIC indirection):
 mov rax,QWORD PTR [rip+0x1eb270] # 0x713fe0
 # rax = 0xfffffffffffffff0 (-16)
 mov r13,QWORD PTR fs:[rax]
_PyInterpreterState_GET() in PyTuple_New(), 2 MOV (PIC indirection):
 mov rax,QWORD PTR [rip+0x294d95] # 0x713ff8
 mov rax,QWORD PTR fs:[rax]
An optimized Python should always be built with LTO.
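For readers unfamiliar with the __thread keyword, here is a minimal sketch of the approach: both the thread state and the interpreter live in thread-local variables that a swap function updates together. The names tls_tstate, tls_interp and tls_swap() are hypothetical, the structs are simplified stand-ins, and this is not the actual code of PR 23976. It assumes GCC or clang and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CPython structs. */
typedef struct interp_state { int id; } interp_state;
typedef struct thread_state { interp_state *interp; } thread_state;

/* __thread is the GCC/clang extension named above; C11 spells it
 * _Thread_local. Each thread gets its own zero-initialized copy. */
static __thread thread_state *tls_tstate;
static __thread interp_state *tls_interp;

/* Update both thread-local variables for the calling thread. */
static void tls_swap(thread_state *ts)
{
    tls_tstate = ts;
    tls_interp = (ts != NULL) ? ts->interp : NULL;
}

/* Thread body: checks that it starts with empty TLS, installs its own
 * state, and returns the interpreter it then observes. */
static void *worker(void *arg)
{
    if (tls_tstate != NULL)
        return NULL;
    tls_swap(arg);
    return tls_interp;
}

/* Returns 1 if each thread sees only its own state. */
static int demo_tls(void)
{
    interp_state i1 = {1}, i2 = {2};
    thread_state t1 = {&i1}, t2 = {&i2};
    pthread_t th;
    void *seen = NULL;

    tls_swap(&t1);
    if (pthread_create(&th, NULL, worker, &t2) != 0)
        return 0;
    pthread_join(th, &seen);
    return seen == &i2 && tls_tstate == &t1 && tls_interp == &i1;
}
```

Reading tls_tstate or tls_interp is a plain variable access in C; whether it compiles to one fs-relative MOV or to two instructions depends on the TLS model the compiler can pick, which is what the LTO remark above is about.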
msg383900 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-28 15:59
PR 23976 should make _PyInterpreterState_GET() more efficient, and it prepares the code base for one GIL per interpreter (which is the purpose of this issue ;-)).
msg383940 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-28 23:59
Mark Shannon's experiment using __thread:
* https://mail.python.org/archives/list/python-dev@python.org/thread/RPSTDB6AEMIACJFZKCKIRFTVLAJQLAS2/
* https://github.com/python/cpython/compare/master...markshannon:threadstate_in_tls
He added "extern __thread struct _ts *_Py_tls_tstate;" to Include/cpython/pystate.h.
msg383967 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-29 08:12
> An optimized Python should always be built with LTO.
On macOS it is quite challenging to activate LTO, so normally optimized builds are only done with PGO. Also, on Windows I am not sure it is possible to use LTO. Same for many other platforms.
msg384057 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-12-30 11:38
One GIL per interpreter requires storing the tstate per thread. I don't see any other option. We need to replace the global _PyRuntime atomic variable with a TLS variable. I'm trying to reduce the overhead, but it's hard to beat the performance of an atomic variable.
That's also why I modified many internal C functions to pass tstate explicitly to subfunctions, to avoid any possible overhead of getting tstate.
https://vstinner.github.io/cpython-pass-tstate.html
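The "pass tstate explicitly" pattern can be sketched like this: the public entry point looks up the thread state once, and the internal helpers receive it as a parameter instead of fetching it again. All names here (get_tstate, enter_call, call_nested, public_call, the lookups counter) are hypothetical and only illustrate the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for PyThreadState. */
typedef struct thread_state { int recursion_depth; } thread_state;

static thread_state *current_tstate;  /* stand-in for the global or TLS slot */
static int lookups;                   /* counts how often the slot is read */

/* The (potentially costly) lookup that the pattern tries to do only once. */
static thread_state *get_tstate(void)
{
    lookups++;
    return current_tstate;
}

/* Internal helpers take tstate as a parameter: no lookup needed. */
static void enter_call(thread_state *tstate) { tstate->recursion_depth++; }
static void leave_call(thread_state *tstate) { tstate->recursion_depth--; }

/* Recurse 'levels' times and return the deepest recursion depth reached. */
static int call_nested(thread_state *tstate, int levels)
{
    if (levels == 0)
        return tstate->recursion_depth;
    enter_call(tstate);
    int deepest = call_nested(tstate, levels - 1);
    leave_call(tstate);
    return deepest;
}

/* Public entry point: fetch tstate once, then pass it down. */
static int public_call(int levels)
{
    thread_state *tstate = get_tstate();
    return call_nested(tstate, levels);
}
```

With this structure, the number of lookups stays at one per public call regardless of how deep the internal call chain goes.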
Pablo:
> On macOS it is quite challenging to activate LTO, so normally optimized builds are only done with PGO.
Oh right, I forgot macOS. I should check how TLS is compiled on macOS. IMO two MOVs instead of one MOV is not a major performance bottleneck.
The best would be to avoid the pthread_getspecific() function, which is less efficient than a TLS variable. The glibc implementation uses an array for a few variables (first 32 variables?) and then a slower hash table.
Pablo:
> Also, on Windows I am not sure it is possible to use LTO. Same for many other platforms.
I will check how it's implemented on Windows.
We cannot use TLS on all platforms, since it requires C11 features which are not available on all platforms. Also, the implementation depends on the architecture.
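One common way to cope with this is a feature-detection macro that picks a thread-local keyword when the compiler offers one and falls back to a pthread key otherwise. The sketch below is an assumption about how such a fallback could look; MY_THREAD_LOCAL, set_current() and get_current() are hypothetical names, and the pthread fallback only applies to POSIX platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Pick a thread-local storage keyword if the compiler provides one. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L \
    && !defined(__STDC_NO_THREADS__)
#  define MY_THREAD_LOCAL _Thread_local        /* C11 keyword */
#elif defined(_MSC_VER)
#  define MY_THREAD_LOCAL __declspec(thread)   /* MSVC extension */
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_THREAD_LOCAL __thread             /* GCC/clang extension */
#endif

#ifdef MY_THREAD_LOCAL

/* Fast path: a plain thread-local variable. */
static MY_THREAD_LOCAL void *current_state;

static void set_current(void *p) { current_state = p; }
static void *get_current(void) { return current_state; }

#else

/* Slow path (POSIX only): a thread-specific key, read through
 * pthread_getspecific() on every access. */
#include <pthread.h>

static pthread_key_t state_key;
static pthread_once_t state_key_once = PTHREAD_ONCE_INIT;

static void make_key(void) { (void)pthread_key_create(&state_key, NULL); }

static void set_current(void *p)
{
    pthread_once(&state_key_once, make_key);
    pthread_setspecific(state_key, p);
}

static void *get_current(void)
{
    pthread_once(&state_key_once, make_key);
    return pthread_getspecific(state_key);
}

#endif
```

Callers only use set_current() and get_current(), so the choice between the two storage mechanisms stays hidden behind the same interface.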
msg387312 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-02-19 12:21
New changeset 62078101ea1be5d2fc472a3f0d9d135e0bd5cd38 by Victor Stinner in branch 'master':
bpo-40522: Replace PyThreadState_GET() with PyThreadState_Get() (GH-24575)
https://github.com/python/cpython/commit/62078101ea1be5d2fc472a3f0d9d135e0bd5cd38
History
Date                 User        Action  Args
2022-04-11 14:59:30  admin       set     github: 84702
2021-06-29 17:34:08  h-vetinari  set     nosy: + h-vetinari
2021-02-20 06:24:30  corona10    set     pull_requests: + pull_request23375
2021-02-20 06:19:44  corona10    set     pull_requests: + pull_request23374
2021-02-19 12:21:54  vstinner    set     messages: + msg387312
2021-02-19 11:50:56  vstinner    set     pull_requests: + pull_request23354
2020-12-30 16:57:41  shihai1991  set     nosy: + shihai1991
2020-12-30 16:46:04  corona10    set     nosy: + corona10
2020-12-30 11:38:22  vstinner    set     messages: + msg384057
2020-12-29 08:12:44  pablogsal   set     nosy: + pablogsal
                                         messages: + msg383967
2020-12-28 23:59:58  vstinner    set     messages: + msg383940
2020-12-28 15:59:22  vstinner    set     messages: + msg383900
2020-12-28 15:53:43  vstinner    set     messages: + msg383899
2020-12-28 15:32:27  vstinner    set     pull_requests: + pull_request22821
2020-12-28 14:25:32  vstinner    set     messages: + msg383895
2020-12-28 14:21:31  vstinner    set     messages: + msg383894
2020-11-05 18:20:59  seberg      set     nosy: + seberg
2020-11-04 13:58:15  vstinner    set     messages: + msg380324
2020-05-15 00:36:11  vstinner    set     components: + Subinterpreters, - Interpreter Core
                                         title: Subinterpreters: get the current Python interpreter state from Thread Local Storage (autoTSSkey) -> [subinterpreters] Get the current Python interpreter state from Thread Local Storage (autoTSSkey)
2020-05-05 17:56:52  vstinner    set     messages: + msg368188
2020-05-05 17:19:49  vstinner    set     keywords: + patch
                                         stage: patch review
                                         pull_requests: + pull_request19254
2020-05-05 17:12:14  vstinner    create
