homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: speed-up conversion between unicode widths
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, loewis, meador.inge, pitrou, python-dev
Priority: normal Keywords: patch

Created on 2011年10月08日 22:18 by pitrou, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unroll_convert_bytes.patch pitrou, 2011年10月08日 22:18 review
unroll.c loewis, 2011年10月09日 03:06
Messages (8)
msg145192 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011年10月08日 22:18
This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
Example micro-benchmark:
./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"
-> before:
100000 loops, best of 3: 14.9 usec per loop
-> after:
100000 loops, best of 3: 9.19 usec per loop
msg145193 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011年10月08日 22:29
Antoine Pitrou wrote:
> 
> New submission from Antoine Pitrou <pitrou@free.fr>:
> 
> This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
> 
> Example micro-benchmark:
> 
> ./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"
> 
> -> before:
> 100000 loops, best of 3: 14.9 usec per loop
> -> after:
> 100000 loops, best of 3: 9.19 usec per loop
Before going further with this, I'd suggest you have a look at your
compiler settings. Such optimizations are normally performed by the
compiler and don't need to be implemented in C, making maintenance
harder.
The fact that Windows doesn't exhibit the same performance difference
suggests that the optimizer is not using the same level or feature
set as on Linux. MSVC is at least as good at optimizing code as gcc,
often better.
I tested using memchr() when writing those "naive" loops. It turned
out that using memchr() was slower than using the direct loops. memchr()
is inlined by the compiler just like the direct loop and the generated
code for the direct version is often easier to optimize for the compiler
than the memchr() one, since it receives more knowledge about the used
data types.
msg145194 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011年10月08日 22:34
> Before going further with this, I'd suggest you have a look at your
> compiler settings.
They are set by the configure script:
gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
-Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
Objects/unicodeobject.o Objects/unicodeobject.c
> Such optimizations are normally performed by the
> compiler and don't need to be implemented in C, making maintenance
> harder.
The fact that the glibc includes such optimization (in much more
sophisticated form) suggests to me that many compilers don't perform
these optimizations automically.
> I tested using memchr() when writing those "naive" loops.
memchr() is mentioned in another issue, #13134.
> memchr()
> is inlined by the compiler just like the direct loop
I don't think so. If you look at the glibc's memchr() implementation,
it's a sophisticated routine, not a trivial loop. Perhaps you're
thinking about memcpy().
> and the generated
> code for the direct version is often easier to optimize for the compiler
> than the memchr() one, since it receives more knowledge about the used
> data types.
?? Data types are fixed in the memchr() definition, there's no knowledge
to be gained by inlining.
msg145200 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011年10月09日 00:49
On Sat, Oct 8, 2011 at 5:34 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes  -I. -I./Include  -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c
>
>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automically.
I agree. This is more of an optimized runtime library problem than
code optimization problem.
>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.
Yeah, this conversation is really more relevant to issue13134, but I will
reply to these here anyway ....
>> memchr()
>> is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().
Without link-time optimization enabled, I doubt the toolchain can
"inline" 'memchr'
in the traditional sense (i.e. inserting the body of the routine
inline). Even if it could,
the inline heuristics would most likely choose not to. I don't think we use LTO
with GCC, but I think we might with VC++.
GCC does something else called builtin folding that is more likely.
For example,
'memchr ("bca", 'c', 3)' gets replace with instructions to compute a pointer
index into "bca". This won't happen in this case because all of the 'memchr'
arguments are all variable.
>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.
I think what Marc-Andre is alluding to is that the first parameter of
'memchr' is 'void *'
which could (in theory) limit optimization opportunities. Where as if
it knew that
the data being searched is a 'char *' or something it could take
advantage of that.
msg145205 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011年10月09日 03:06
Marc-Andre: gcc will normally not unroll loops, unless -funroll-loops is given on the command line. Then, it will unroll many loops, and do so with 8 iterations per outer loop. This typically causes significant code bloat, which is why unrolling is normally disabled and left to the programmer.
For those who want to experiment with this, I attach a C file with just the code in question. Compile this with your favorite compiler settings, and see what the compile generates. clang, on an x64 system, compiles the original loop into
LBB0_2: ## =>This Inner Loop Header: Depth=1
 movzbl (%rdi), %eax
 movw %ax, (%rdx)
 incq %rdi
 addq 2,ドル %rdx
 decq %rsi
 jne LBB0_2
and the unrolled loop into
LBB1_2: ## %.lr.ph6
 ## =>This Inner Loop Header: Depth=1
 movzbl (%rdi,%rcx), %r8d
 movw %r8w, (%rdx)
 movzbl 1(%rdi,%rcx), %r8d
 movw %r8w, 2(%rdx)
 movzbl 2(%rdi,%rcx), %r8d
 movw %r8w, 4(%rdx)
 movzbl 3(%rdi,%rcx), %r8d
 movw %r8w, 6(%rdx)
 addq 8,ドル %rdx
 addq 4,ドル %rcx
 cmpq %rax, %rcx
 jl LBB1_2
msg145252 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011年10月09日 11:33
Antoine Pitrou wrote:
> 
>> I tested using memchr() when writing those "naive" loops.
> 
> memchr() is mentioned in another issue, #13134.
Looks like I posted the comment to the wrong ticket.
msg145360 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011年10月11日 19:07
New changeset 5b077c962a16 by Antoine Pitrou in branch 'default':
Issue #13136: speed up conversion between different character widths.
http://hg.python.org/cpython/rev/5b077c962a16 
msg145361 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011年10月11日 19:08
Patch committed. It is of course still not as fast as memcpy, but it's a small step towards improving performance.
History
Date User Action Args
2022年04月11日 14:57:22adminsetgithub: 57345
2011年10月11日 19:08:08pitrousetstatus: open -> closed
resolution: fixed
messages: + msg145361

stage: patch review -> resolved
2011年10月11日 19:07:13python-devsetnosy: + python-dev
messages: + msg145360
2011年10月09日 11:33:30lemburgsetmessages: + msg145252
2011年10月09日 03:06:10loewissetfiles: + unroll.c
nosy: + loewis
messages: + msg145205

2011年10月09日 00:49:51meador.ingesetnosy: + meador.inge
messages: + msg145200
2011年10月08日 22:34:50pitrousetmessages: + msg145194
2011年10月08日 22:29:12lemburgsetnosy: + lemburg
messages: + msg145193
2011年10月08日 22:18:30pitroucreate

AltStyle によって変換されたページ (->オリジナル) /