This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2012年05月07日 22:09 by vstinner, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| format_writer.patch | vstinner, 2012年05月07日 22:09 | review | ||
| format_writer-2.patch | vstinner, 2012年05月08日 23:58 | review | ||
| issue14744-bench-1.py | serhiy.storchaka, 2012年05月09日 08:03 | |||
| inline_unicode_writer.patch | vstinner, 2012年05月09日 11:53 | review | ||
| dont_overallocate.patch | vstinner, 2012年05月13日 02:26 | review | ||
| benchmark.py | vstinner, 2012年05月13日 02:27 | |||
| REPORT_32BIT_2.7_3.2_writer | vstinner, 2012年05月23日 21:10 | |||
| REPORT_32BIT_3.3 | vstinner, 2012年05月23日 21:10 | |||
| REPORT_64BIT_2.7_3.2_writer | vstinner, 2012年05月23日 21:10 | |||
| REPORT_64BIT_3.3 | vstinner, 2012年05月23日 21:11 | |||
| REPORT_64BIT_PATCH | vstinner, 2012年05月23日 21:38 | |||
| faster-format.patch | vstinner, 2012年05月23日 21:50 | review | ||
| report_windows7 | vstinner, 2012年05月29日 22:35 | |||
| Repositories containing patches | |||
|---|---|---|---|
| http://hg.python.org/features/faster-format/ | |||
| Messages (37) | |||
|---|---|---|---|
| msg160176 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月07日 22:09 | |
Since 7be716a47e9d (issue #14716), str.format() uses the "unicode_writer" API. I propose to continue the work in this direction to avoid more temporary buffers. Python 3.3: 1000000 loops, best of 3: 0.573 usec per loop 100000 loops, best of 3: 16.4 usec per loop 1000000 loops, best of 3: 0.705 usec per loop 100000 loops, best of 3: 2.61 usec per loop Python 3.3 + patch (compared to Python 3.3): 1000000 loops, best of 3: 0.516 usec per loop (-10%) 100000 loops, best of 3: 13.2 usec per loop (-20%) 1000000 loops, best of 3: 0.574 usec per loop (-18%) 100000 loops, best of 3: 2.59 usec per loop (-1%) -- If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example. -- Benchmark Python 3 / Python 2 bytes: python -m timeit -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")' python -m timeit -s 'fmt="{0:s}"*100' 'fmt.format("ABCDEF")' python -m timeit -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)' python -m timeit -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)' Benchmark Python 2 unicode: python -m timeit -s 'fmt=u"{0}.{1}.{2}"' 'fmt.format(u"http", u"client", u"HTTPConnection")' python -m timeit -s 'fmt=u"{0:s}"*100' 'fmt.format(u"ABCDEF")' python -m timeit -s 'fmt=u" [line {0:2d}] "' 'fmt.format(5)' python -m timeit -s 'fmt=u"x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)' Python 2.7 bytes: 1000000 loops, best of 3: 0.393 usec per loop 100000 loops, best of 3: 9.72 usec per loop 1000000 loops, best of 3: 0.337 usec per loop 1000000 loops, best of 3: 1.56 usec per loop Python 2.7 wide: 1000000 loops, best of 3: 0.443 usec per loop 100000 loops, best of 3: 10.3 usec per loop 1000000 loops, best of 3: 0.785 usec per loop 100000 loops, best of 3: 2.48 usec per loop Python 3.2 wide: 1000000 loops, best of 3: 0.457 usec per loop 100000 loops, best of 3: 10.5 usec per loop 1000000 loops, best of 3: 0.538 usec per loop 100000 loops, best of 3: 2.36 usec per loop |
|||
| msg160177 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月07日 22:19 | |
Comments on the patch. -PyAPI_FUNC(PyObject *) _PyComplex_FormatAdvanced(PyObject *obj, +PyAPI_FUNC(int) _PyComplex_FormatWriter(PyObject *obj, Even if it is a private function, I prefer to rename it because its API does change. /* Use the inlined version in unicodeobject.c */ #define _PyUnicodeWriter_prepare _PyUnicodeWriter_prepare_inline #define _PyUnicodeWriter_write_substr _PyUnicodeWriter_write_substr_inline #define _PyUnicodeWriter_write_str _PyUnicodeWriter_write_str_inline #define _PyUnicodeWriter_write_char _PyUnicodeWriter_write_char_inline Inlining may be removed to simplify the code (but inlining does speed up the code a little bit). Or the opposite: this code should be moved to a new "unicodewriterinline.h" file which would be included by unicodeobject.c and formatter_unicode.c. |
|||
| msg160186 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月08日 05:32 | |
> If this patch is accepted, it's more to go even deeper: use _PyUnicodeWriter in long_to_decimal_string() for example. long_to_decimal_string() is already creates a string of known size. How _PyUnicodeWriter can help here? Issue3451 looks much more promising for int formatting. But it will take a lot of time to carefully check this. |
|||
| msg160189 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月08日 08:37 | |
_PyUnicodeWriter in long_to_decimal_string() for example.
>
> long_to_decimal_string() is already creates a string of known size. How
> _PyUnicodeWriter can help here?
"x={}".format(123) uses a temporary buffer for "123". Using
_PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of
the characters.
|
|||
| msg160190 - (view) | Author: Mark Dickinson (mark.dickinson) * (Python committer) | Date: 2012年05月08日 09:02 | |
> Issue3451 looks much more promising for int formatting. But it will take > a lot of time to carefully check this. I disagree: Issue 3451 is about *asymptotically* fast base conversion, and the changes proposed there are only going to kick in for numbers with hundreds of digits; it's not going to affect the common case at all. |
|||
| msg160194 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月08日 10:38 | |
> "x={}".format(123) uses a temporary buffer for "123".
This, apparently, is inevitable. I doubt that it is able to considerably
optimize, not worsened str(int) (which is optimal for current
algorithm). Note that the more complex formatting (with width) will
still require the temporary buffer.
Be very careful not to cause regress.
> Using
> _PyUnicodeWriter even to format 123 would avoid a malloc() and a copy of
> the characters.
Fill the ascii buffer and then copying can be cheaper than using
_PyUnicodeWriter with general non-ascii string.
|
|||
| msg160236 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月08日 23:58 | |
> Fill the ascii buffer and then copying can be cheaper than using
> _PyUnicodeWriter with general non-ascii string.
Here is a new patch using _PyUnicodeWriter directly in longobject.c.
According to my benchmark (see below), formating a small number (5 decimal digits) is 17% faster with my patch version 2 compared to tip, and 38% faster compared to Python 3.3 before my optimizations on str%tuples or str.format(). Creating a temporary PyUnicode is not cheap, at least for short strings.
str%tuple and str.format() allocates len(format_string)+100 ASCII characters at the beginning, which is enough for "x={}".format(12345) for example. So only a resize is needed, and it looks like resizing is cheap.
I'm not completly satisfied of the usage of Py_LOCAL_INLINE in unicodeobject.c for _PyUnicodeWriter methods. The same "hacks" (?) should be used in formatter_unicode.c.
Shell script (bench.sh) used to benchmark:
--------
echo -n "{0}.{1}.{2}: "; ./python -m timeit -r 10 -s 'fmt="{0}.{1}.{2}"' 'fmt.format("http", "client", "HTTPConnection")'
echo -n " [line {0:2d}] : "; ./python -m timeit -r 10 -s 'fmt=" [line {0:2d}] "' 'fmt.format(5)'
echo -n "str: "; ./python -m timeit -r 10 -s 'fmt="{0}"*100' 'fmt.format("ABCDEF")'
echo -n "str conv: "; ./python -m timeit -r 10 -s 'fmt="{0:s}"*100' 'fmt.format("ABCDEF")'
echo -n "long x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12345)'
echo -n "float x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345)'
echo -n "complex x 3: "; ./python -m timeit -r 10 -s 'fmt="x={0} x={0} x={0}"' 'fmt.format(12.345+2j)'
echo -n "long, float, complex: "; ./python -m timeit -r 10 -s 'fmt="x={} y={} z={}"' 'fmt.format(12345, 12.345, 12.345+2j)'
echo -n "huge long: "; ./python -m timeit -r 10 -s 'import math; huge=math.factorial(2000); fmt="x={}"' 'fmt.format(huge)'
--------
Results:
--------
3.3:
{0}.{1}.{2}: 1000000 loops, best of 10: 0.394 usec per loop
[line {0:2d}] : 1000000 loops, best of 10: 0.519 usec per loop
str: 100000 loops, best of 10: 7.01 usec per loop
str conv: 100000 loops, best of 10: 13.3 usec per loop
long x 3: 1000000 loops, best of 10: 0.569 usec per loop
float x 3: 1000000 loops, best of 10: 1.62 usec per loop
complex x 3: 100000 loops, best of 10: 3.34 usec per loop
long, float, complex: 100000 loops, best of 10: 2.08 usec per loop
huge long: 1000 loops, best of 10: 666 usec per loop
3.3 + format_writer.patch :
{0}.{1}.{2}: 1000000 loops, best of 10: 0.412 usec per loop (+5%)
[line {0:2d}] : 1000000 loops, best of 10: 0.461 usec per loop (-11%)
str: 100000 loops, best of 10: 6.85 usec per loop (-2%)
str conv: 100000 loops, best of 10: 11.1 usec per loop (-17%)
long x 3: 1000000 loops, best of 10: 0.605 usec per loop (+6%)
float x 3: 1000000 loops, best of 10: 1.57 usec per loop (-3%)
complex x 3: 100000 loops, best of 10: 3.54 usec per loop (+6%)
long, float, complex: 100000 loops, best of 10: 2.19 usec per loop (+5%)
huge long: 1000 loops, best of 10: 665 usec per loop (0%)
3.3 + format_writer-2.patch :
{0}.{1}.{2}: 1000000 loops, best of 10: 0.378 usec per loop (-4%)
[line {0:2d}] : 1000000 loops, best of 10: 0.454 usec per loop (-13%)
str: 100000 loops, best of 10: 6.18 usec per loop (-12%)
str conv: 100000 loops, best of 10: 10.9 usec per loop (-18%)
long x 3: 1000000 loops, best of 10: 0.471 usec per loop (-17%)
float x 3: 1000000 loops, best of 10: 1.37 usec per loop (-15%)
complex x 3: 100000 loops, best of 10: 3.4 usec per loop (+2%)
long, float, complex: 1000000 loops, best of 10: 1.93 usec per loop (-7%)
huge long: 1000 loops, best of 10: 665 usec per loop (0%)
--------
|
|||
| msg160244 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年05月09日 05:38 | |
> According to my benchmark (see below), formating a small number (5 > decimal digits) is 17% faster with my patch version 2 compared to tip, > and 38% faster compared to Python 3.3 before my optimizations on str% > tuples or str.format(). Creating a temporary PyUnicode is not cheap, > at least for short strings. A 17% improvement on a micro-benchmark is not much. There will probably be no visible difference in real-world code. Also, if creating temporary PyUnicodes is not cheap, perhaps we could have a freelist for them? |
|||
| msg160259 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月09日 08:03 | |
> Here is a new patch using _PyUnicodeWriter directly in longobject.c.
It may be worth to do it in a separate issue?
decimal digits) is 17% faster with my patch version 2 compared to tip,
and 38% faster compared to Python 3.3 before my optimizations on str%
tuples or str.format(). Creating a temporary PyUnicode is not cheap, at
least for short strings.
Here is my benchmark script (attached) and the results:
Python 3.3 (vanilla):
1.65 str(12345)
2.6 '{}'.format(12345)
2.69 'A{}'.format(12345)
3.16 '\x80{}'.format(12345)
3.23 '\u0100{}'.format(12345)
3.32 '\U00010000{}'.format(12345)
4.6 '{:-10}'.format(12345)
4.89 'A{:-10}'.format(12345)
5.53 '\x80{:-10}'.format(12345)
5.71 '\u0100{:-10}'.format(12345)
5.63 '\U00010000{:-10}'.format(12345)
4.6 '{:,}'.format(12345)
4.71 'A{:,}'.format(12345)
5.28 '\x80{:,}'.format(12345)
5.65 '\u0100{:,}'.format(12345)
5.59 '\U00010000{:,}'.format(12345)
Python 3.3 + format_writer.patch:
1.72 str(12345)
2.74 '{}'.format(12345)
2.99 'A{}'.format(12345)
3.4 '\x80{}'.format(12345)
3.52 '\u0100{}'.format(12345)
3.51 '\U00010000{}'.format(12345)
4.24 '{:-10}'.format(12345)
4.6 'A{:-10}'.format(12345)
5.16 '\x80{:-10}'.format(12345)
6.87 '\u0100{:-10}'.format(12345)
6.83 '\U00010000{:-10}'.format(12345)
4.12 '{:,}'.format(12345)
4.6 'A{:,}'.format(12345)
5.09 '\x80{:,}'.format(12345)
6.63 '\u0100{:,}'.format(12345)
6.42 '\U00010000{:,}'.format(12345)
Python 3.3 + format_writer-2.patch:
1.91 str(12345)
2.44 '{}'.format(12345)
2.61 'A{}'.format(12345)
3.08 '\x80{}'.format(12345)
3.31 '\u0100{}'.format(12345)
3.13 '\U00010000{}'.format(12345)
4.57 '{:-10}'.format(12345)
4.96 'A{:-10}'.format(12345)
5.52 '\x80{:-10}'.format(12345)
7.01 '\u0100{:-10}'.format(12345)
7.34 '\U00010000{:-10}'.format(12345)
4.42 '{:,}'.format(12345)
4.76 'A{:,}'.format(12345)
5.16 '\x80{:,}'.format(12345)
7.2 '\u0100{:,}'.format(12345)
6.74 '\U00010000{:,}'.format(12345)
As you can see, there is a regress, and sometimes it is not less than
improvement.
|
|||
| msg160277 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月09日 11:53 | |
> Inlining may be removed to simplify the code Attached inline_unicode_writer.patch does inline the code but also call only unicode_writer_prepare() once for each argument in PyUnicode_Format(). The patch removes unicode_writer_write_char() and unicode_writer_write_str() which had an overhead for the following patches (format_writer.patch, format_writer-2.patch). > As you can see, there is a regress, and sometimes > it is not less than improvement. Oh yes, thanks for your benchmark. I will analyze why my patch slows down these cases and try to update my patch to be applicable after inline_unicode_writer.patch. I suppose that the _PyUnicodeWriter API has a little overhead which is seen by microbenchmarks. |
|||
| msg160313 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年05月09日 20:32 | |
New changeset 6c8a117f8966 by Victor Stinner in branch 'default': Issue #14744: Inline unicode_writer_write_char() and unicode_write_str() http://hg.python.org/cpython/rev/6c8a117f8966 |
|||
| msg160503 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月13日 02:26 | |
To prepare a deeper change, here is a first simple patch. Change how the size of the _PyUnicodeWriter buffer is handled:
* overallocate by 100% (instead of 25%) and allocate at least 100 characters
* don't overallocate when possible
It is possible to not overallocate when the format string has no prefix nor suffix and when there is only one argument. For example, "%s" % "abc" does directly allocate the exact buffer size and so don't need to call realloc() at the end.
The patch does also micro-optimize (cleanup?) _PyUnicodeWriter_Finish.
When it's possible to not overallocate, the speed up is around 10% for short strings (I suppose that it's much better to longer strings).
--
Output of the attached benchmark.py script:
Linux-3.3.2-1.fc16.x86_64-x86_64-with-fedora-16-Verne
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
=== 3.3 ===
16 ms: N=200; fmt="{0}-"*N; arg="abc"; fmt.format(arg)
5.77 ms: N=200; L=3; fmt="%s"*N; args=("a"*L,)*N; fmt % args
648 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(**args)
427 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args
531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(*args)
453 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args
180 ns: fmt="{}"; arg="abc"; fmt.format(arg)
228 ns: fmt="{}"; arg=123; fmt.format(arg)
196 ns: fmt="x={}"; arg="abc"; fmt.format(arg)
244 ns: fmt="x={}"; arg=123; fmt.format(arg)
175 ns: str(12345)
233 ns: '{}'.format(12345)
243 ns: 'A{}'.format(12345)
276 ns: '\x80{}'.format(12345)
292 ns: '\u0100{}'.format(12345)
285 ns: '\U00010000{}'.format(12345)
417 ns: '{:-10}'.format(12345)
425 ns: 'A{:-10}'.format(12345)
461 ns: '\x80{:-10}'.format(12345)
488 ns: '\u0100{:-10}'.format(12345)
461 ns: '\U00010000{:-10}'.format(12345)
403 ns: '{:,}'.format(12345)
416 ns: 'A{:,}'.format(12345)
455 ns: '\x80{:,}'.format(12345)
469 ns: '\u0100{:,}'.format(12345)
448 ns: '\U00010000{:,}'.format(12345)
108 ns: fmt="%s"; arg="abc"; fmt % arg
163 ns: fmt="%s"; arg=123; fmt % arg
121 ns: fmt="%s:"; arg="abc"; fmt % arg
=== 3.3 + dont_overallocate.patch ===
15.5 ms: N=200; fmt="{0}-"*N; arg="abc"; fmt.format(arg)
6 ms: N=200; L=3; fmt="%s"*N; args=("a"*L,)*N; fmt % args
599 ns: s="The {k1} is {k2} the {k3}."; args={"k1": "x", "k2": "y", "k3": "z"}; s.format(**args)
424 ns: s="The %(k1)s is %(k2)s the %(k3)s."; args={"k1":"x","k2":"y","k3":"z",}; s%args
531 ns: s="x={}, y={}, z={}"; args=(123, 456, 789); s.format(*args)
441 ns: s="x=%s, y=%u, z=%x"; args=(123, 456, 789); s%args
155 ns: fmt="{}"; arg="abc"; fmt.format(arg)
206 ns: fmt="{}"; arg=123; fmt.format(arg)
204 ns: fmt="x={}"; arg="abc"; fmt.format(arg)
247 ns: fmt="x={}"; arg=123; fmt.format(arg)
172 ns: str(12345)
209 ns: '{}'.format(12345)
248 ns: 'A{}'.format(12345)
257 ns: '\x80{}'.format(12345)
265 ns: '\u0100{}'.format(12345)
263 ns: '\U00010000{}'.format(12345)
354 ns: '{:-10}'.format(12345)
404 ns: 'A{:-10}'.format(12345)
415 ns: '\x80{:-10}'.format(12345)
457 ns: '\u0100{:-10}'.format(12345)
430 ns: '\U00010000{:-10}'.format(12345)
369 ns: '{:,}'.format(12345)
418 ns: 'A{:,}'.format(12345)
437 ns: '\x80{:,}'.format(12345)
459 ns: '\u0100{:,}'.format(12345)
448 ns: '\U00010000{:,}'.format(12345)
76 ns: fmt="%s"; arg="abc"; fmt % arg
133 ns: fmt="%s"; arg=123; fmt % arg
118 ns: fmt="%s:"; arg="abc"; fmt % arg
|
|||
| msg160513 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年05月13日 09:56 | |
> When it's possible to not overallocate, the speed up is around 10% for > short strings (I suppose that it's much better to longer strings). Well, please post a benchmark for long strings, then :-) I think 10% on a micro-benchmark is not worth the complication. This code is already complicated enough. |
|||
| msg160565 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月13日 20:10 | |
>> When it's possible to not overallocate, the speed up is around 10% for >> short strings (I suppose that it's much better to longer strings). > Well, please post a benchmark for long strings, then :-) Ok, here you have. I don't understand why results are so random. There is no performance regression, but there are interesting speed up. Original: 200 ns: L=10 ** 3; fmt="%s"; args="a" * L; fmt % args 370 ns: L=10 ** 4; fmt="%s"; args="a" * L; fmt % args 3.85 ms: L=10 ** 5; fmt="%s"; args="a" * L; fmt % args 334 ms: L=10 ** 6; fmt="%s"; args="a" * L; fmt % args 2.87 ms: L=10 ** 7; fmt="%s"; args="a" * L; fmt % args 23.9 ms: L=10 ** 8; fmt="%s"; args="a" * L; fmt % args Patched: 120 ns (-40%): L=10 ** 3; fmt="%s"; args="a" * L; fmt % args 276 ns (-25%): L=10 ** 4; fmt="%s"; args="a" * L; fmt % args 3.7 ms (-4%): L=10 ** 5; fmt="%s"; args="a" * L; fmt % args 67.6 ms (-80%): L=10 ** 6; fmt="%s"; args="a" * L; fmt % args 1.38 ms (-51%): L=10 ** 7; fmt="%s"; args="a" * L; fmt % args 23.5 ms (-2%): L=10 ** 8; fmt="%s"; args="a" * L; fmt % args |
|||
| msg160567 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月13日 20:14 | |
Not quite honest contrexample:
./python -m timeit -s "f='[{}]'.format;s='A'*100" "f(s)"
Python 3.3: 1000000 loops, best of 3: 1.67 usec per loop
Python 3.3 + dont_overallocate.patch: 100000 loops, best of 3: 2.01 usec per loop
|
|||
| msg160568 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年05月13日 20:15 | |
> >> When it's possible to not overallocate, the speed up is around 10% for > >> short strings (I suppose that it's much better to longer strings). > > Well, please post a benchmark for long strings, then :-) > > Ok, here you have. I don't understand why results are so random. There > is no performance regression, but there are interesting speed up. Do you have anything more interesting than fmt="%s" ? |
|||
| msg160574 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月13日 20:41 | |
> Not quite honest contrexample I agree, this example is not honest :-) It's because of the magical value 100 used as initial size of the buffer. The speed is the same for shorter or longer strings. |
|||
| msg160578 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月13日 21:18 | |
It seems to me that the proposed changes are too tricky and too dirty for such a modest gain. It seems to me, this effect can be achieved easier (special-casing "%s" and "{}" to return str(arg)?).
If you want to get really impressive results, try to compile a pattern into an optimized internal representation and cache it (as in the struct module). This completely different approach may work.
|
|||
| msg160579 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月13日 21:26 | |
> Do you have anything more interesting than fmt="%s" ? and > It seems to me that the proposed changes are too tricky and too dirty for such a modest gain. To be honest, I didn't write dont_overallocate.patch to speed up formatting strings, but it's a patch to prepare another change on str.format. The patch limits the overhead of _PyUnicodeWriter for str.format. I proposed the change as a separated patch because I prefer simple patches: they are easier to review and to benchmark. But I agree the speed up is less impressive :-) I also prefer small patches for practical reasons. It's not easy to exchange huge patches on the tracker and to manipulate them. I always hesitate to start a new HG repository for a specific issue. I will rewrite my format_writer-2.patch based on dont_overallocate.patch. It looks like you are waiting for the full patch. |
|||
| msg160581 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年05月13日 21:32 | |
> I will rewrite my format_writer-2.patch based on > dont_overallocate.patch. It looks like you are waiting for the full > patch. Well, there's no point in committing the first patch if the second one doesn't give an interesting speedup. |
|||
| msg160610 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月14日 12:05 | |
I created a new repository to optimize str.format and str%args. |
|||
| msg161459 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月23日 21:20 | |
Because I don't know what should be tested, I wrote a lot a tests in the bench_str.py script. To run the benchmark, use: ./python benchmark.py --file=FILE script bench_str.py Then to compare results: ./python benchmark.py compare_to FILE1 FILE2 FILE3 ... Download scripts from: https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py https://bitbucket.org/haypo/misc/src/tip/python/bench_str.py (I wrote the benchmark.py tool for this issue, it's a a generic tool to create and compare benchmarks.) I attached results on 32 and 64 bits. * Python 2.7 vs 3.2 vs 3.3 (faster-format branch): REPORT_32BIT_2.7_3.2_writer and REPORT_64BIT_2.7_3.2_writer * Python 3.3, UCS4 buffer vs PyAccu vs _PyUnicodeWriter: REPORT_32BIT_3.3 and REPORT_64BIT_3.3 |
|||
| msg161460 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012年05月23日 21:26 | |
When posting benchmark numbers, can you please only compared patched against unpatched? I don't think we care about performance compared to 3.2 or 2.7 here, and it would make things more readable. |
|||
| msg161461 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月23日 21:34 | |
For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower:
fmt="x={}"; arg=12.345; fmt.format(arg)
fmt="{}:"; arg=12.345; fmt.format(arg)
fmt="x=%s"; arg="\u20ac" * 3; fmt % arg
fmt="%s:"; arg="abc"; fmt % arg
fmt="%s:"; arg="\u20ac" * 3; fmt % arg
fmt="\u20ac[%s]"; arg="abc"; fmt % arg
fmt="\u20ac[%s]"; arg="\u20ac" * 3; fmt % arg
fmt="\u20ac[%s]"; arg=12.345; fmt % arg
fmt="\u20ac[%s]"; arg=2j; fmt % arg
|
|||
| msg161462 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月23日 21:38 | |
> When posting benchmark numbers, can you please only compared > patched against unpatched? Here you have: REPORT_64BIT_PATCH. |
|||
| msg161464 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月23日 21:50 | |
faster-format.patch: Patch for Python 3.3 optimizing str%args and str.format(args), use _PyUnicodeWriter deeper in formatting. The patch uses different optimizations:
* if the result is just a string, copy the string by reference, don't copy it by value. It's not something new, this optimization was already used by the PyAccu API. Examples:
- "{}".format(str)
- "%s".format(str)
* avoid a temporary buffer to format integers (base 2, 8, 10, 16). Examples:
- "decimal=%s".format(int)
- "hex=%x".format(int)
- "%o".format(int)
- "{}".format(int)
- "{:x}".format(int)
* don't overallocate the last argument of a format string. Example:
- "x=%s".format("A" * 4096)
|
|||
| msg161484 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月24日 07:59 | |
> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower: Perhaps most of these problems can be solved if instead of the boolean flag (overallocate/no overallocate) to use the Py_ssize_t parameter that indicates by how much should you overallocate (it is the length of the suffix in the format). |
|||
| msg161494 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月24日 09:57 | |
>> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer API and PyAccu API in quite all cases, with a speedup between 30% and 100%. But there are some cases where the _PyUnicodeWriter API is slower: > > Perhaps most of these problems can be solved if instead of the boolean > flag (overallocate/no overallocate) to use the Py_ssize_t parameter that > indicates by how much should you overallocate (it is the length of the > suffix in the format). There is not only a flag (flags.overallocate): there is also the min_length, which is used and helps for str%args and str.format(args). My patch contains a lot of "tricks" to limit overallocation, e.g. don't overallocate if we are writing the last part of the output. Computing exactly the size of the buffer gives the best performance because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for example to modify PyUnicode_Format() to parse the format string twice: first to compute the size of the output buffer, second to write characters. In my experience, parsing the format string twice is more expensive than reallocating the buffer (PyUnicode_READ is expensive), especially on short and simple format strings. I tried different methods to allocate the buffer of _PyUnicodeWriter: change the overallocation factor (0%, 25%, 50%, 100%), only overallocate +100 characters, etc. But I failed to find something better than the proposed patch. At least I can say than always disabling overallocation slows done many cases: when there is a suffix after an argument, or when there are more than one argument. Feel free to experiment other methods to estimate the size of the output buffer. |
|||
| msg161784 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月28日 14:01 | |
So, do you have any comment or complain? Or can I commit the patch? Le 24 mai 2012 11:57, "STINNER Victor" <report@bugs.python.org> a écrit : > > STINNER Victor <victor.stinner@gmail.com> added the comment: > > >> For Python 3.3, _PyUnicodeWriter API is faster than the Py_UCS4 buffer > API and PyAccu API in quite all cases, with a speedup between 30% and 100%. > But there are some cases where the _PyUnicodeWriter API is slower: > > > > Perhaps most of these problems can be solved if instead of the boolean > > flag (overallocate/no overallocate) to use the Py_ssize_t parameter that > > indicates by how much should you overallocate (it is the length of the > > suffix in the format). > > There is not only a flag (flags.overallocate): there is also the > min_length, which is used and helps for str%args and str.format(args). > > My patch contains a lot of "tricks" to limit overallocation, e.g. > don't overallocate if we are writing the last part of the output. > > Computing exactly the size of the buffer gives the best performance > because it avoids a resize in _PyUnicodeWriter_Finish(). I tried for > example to modify PyUnicode_Format() to parse the format string twice: > first to compute the size of the output buffer, second to write > characters. In my experience, parsing the format string twice is more > expensive than reallocating the buffer (PyUnicode_READ is expensive), > especially on short and simple format strings. > > I tried different methods to allocate the buffer of _PyUnicodeWriter: > change the overallocation factor (0%, 25%, 50%, 100%), only > overallocate +100 characters, etc. But I failed to find something > better than the proposed patch. > > At least I can say than always disabling overallocation slows done > many cases: when there is a suffix after an argument, or when there > are more than one argument. > > Feel free to experiment other methods to estimate the size of the output > buffer. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue14744> > _______________________________________ > |
|||
| msg161789 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月28日 15:22 | |
> So, do you have any comment or complain? Or can I commit the patch? I beg your pardon, I will do a review and additional benchmarks today. So far away I have to say, it is better to use stringlib approach, than the massive macros, which are more difficult to read and edit. However, I will do a benchmark to check if we can achieve the same effect with less change code. |
|||
| msg161810 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月28日 21:44 | |
> So far away I have to say, it is better to use stringlib > approach, than the massive macros, which are more difficult > to read and edit. Ah, you don't like the two macros in longobject.c. Functions to write digits into a string may be appropriate in the stringlib. |
|||
| msg161813 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年05月28日 22:13 | |
I just sent you a patch which does not use any macros or stringlib. |
|||
| msg161814 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月28日 23:00 | |
> Functions to write digits into a string may be appropriate > in the stringlib. Oh, stringlib is specific to unicodeobject.c: it cannot be used outside. |
|||
| msg161870 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年05月29日 11:13 | |
New changeset 22b56b0b8619 by Victor Stinner in branch 'default': Issue #14744: Use the new _PyUnicodeWriter internal API to speed up str%args and str.format(args) http://hg.python.org/cpython/rev/22b56b0b8619 |
|||
| msg161899 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年05月29日 16:52 | |
New changeset 6abab1a103a6 by Victor Stinner in branch 'default': Issue #14744: Fix compilation on Windows http://hg.python.org/cpython/rev/6abab1a103a6 |
|||
| msg161900 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2012年05月29日 16:54 | |
New changeset df0144f68d76 by Victor Stinner in branch 'default': Issue #14744: Fix compilation on Windows (part 2) http://hg.python.org/cpython/rev/df0144f68d76 |
|||
| msg161916 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012年05月29日 22:35 | |
report_windows7: Comparaison of str%args and str.format() on Windows 7. * Python 2.7 (64 bits) * Python 3.2 (64 bits), narrow (UTF-16) * Python 3.3 (*32* bits), PEP 393 The benchmark is not fair because Python 3.3 is compiled in 32 bits, but there are interesting results. Example: ("{0}"*1024).format("A" * 4096) takes 1 or 2 seconds with Python 2.7/3.2 and 0.008 sec with Python 3.3! |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:29 | admin | set | github: 58949 |
| 2012年05月29日 23:15:40 | vstinner | set | status: open -> closed resolution: fixed |
| 2012年05月29日 22:35:18 | vstinner | set | files:
+ report_windows7 messages: + msg161916 |
| 2012年05月29日 16:54:02 | python-dev | set | messages: + msg161900 |
| 2012年05月29日 16:52:11 | python-dev | set | messages: + msg161899 |
| 2012年05月29日 11:13:24 | python-dev | set | messages: + msg161870 |
| 2012年05月28日 23:00:02 | vstinner | set | messages: + msg161814 |
| 2012年05月28日 22:13:29 | serhiy.storchaka | set | messages: + msg161813 |
| 2012年05月28日 21:44:37 | vstinner | set | messages: + msg161810 |
| 2012年05月28日 15:22:12 | serhiy.storchaka | set | messages: + msg161789 |
| 2012年05月28日 14:01:35 | vstinner | set | messages: + msg161784 |
| 2012年05月24日 09:57:29 | vstinner | set | messages: + msg161494 |
| 2012年05月24日 07:59:01 | serhiy.storchaka | set | messages: + msg161484 |
| 2012年05月23日 21:50:47 | vstinner | set | files:
+ faster-format.patch messages: + msg161464 |
| 2012年05月23日 21:48:00 | vstinner | set | files: - faa88c50a3d2.diff |
| 2012年05月23日 21:38:06 | vstinner | set | files:
+ REPORT_64BIT_PATCH messages: + msg161462 |
| 2012年05月23日 21:34:59 | vstinner | set | messages: + msg161461 |
| 2012年05月23日 21:26:37 | pitrou | set | messages: + msg161460 |
| 2012年05月23日 21:20:35 | vstinner | set | messages: + msg161459 |
| 2012年05月23日 21:12:47 | vstinner | set | files: + faa88c50a3d2.diff |
| 2012年05月23日 21:11:05 | vstinner | set | files: + REPORT_64BIT_3.3 |
| 2012年05月23日 21:10:54 | vstinner | set | files: + REPORT_64BIT_2.7_3.2_writer |
| 2012年05月23日 21:10:42 | vstinner | set | files: + REPORT_32BIT_3.3 |
| 2012年05月23日 21:10:27 | vstinner | set | files: + REPORT_32BIT_2.7_3.2_writer |
| 2012年05月14日 12:05:13 | vstinner | set | hgrepos:
+ hgrepo125 messages: + msg160610 |
| 2012年05月13日 21:32:26 | pitrou | set | messages: + msg160581 |
| 2012年05月13日 21:26:53 | vstinner | set | messages: + msg160579 |
| 2012年05月13日 21:18:26 | serhiy.storchaka | set | messages: + msg160578 |
| 2012年05月13日 20:41:06 | vstinner | set | messages: + msg160574 |
| 2012年05月13日 20:15:42 | pitrou | set | messages: + msg160568 |
| 2012年05月13日 20:14:33 | serhiy.storchaka | set | messages: + msg160567 |
| 2012年05月13日 20:10:28 | vstinner | set | messages: + msg160565 |
| 2012年05月13日 09:56:38 | pitrou | set | messages: + msg160513 |
| 2012年05月13日 02:27:04 | vstinner | set | files: + benchmark.py |
| 2012年05月13日 02:26:48 | vstinner | set | files:
+ dont_overallocate.patch messages: + msg160503 |
| 2012年05月09日 20:32:53 | python-dev | set | nosy:
+ python-dev messages: + msg160313 |
| 2012年05月09日 11:53:45 | vstinner | set | files:
+ inline_unicode_writer.patch messages: + msg160277 |
| 2012年05月09日 08:03:21 | serhiy.storchaka | set | files:
+ issue14744-bench-1.py messages: + msg160259 |
| 2012年05月09日 05:38:23 | pitrou | set | messages: + msg160244 |
| 2012年05月08日 23:58:41 | vstinner | set | files:
+ format_writer-2.patch messages: + msg160236 |
| 2012年05月08日 10:38:52 | serhiy.storchaka | set | messages: + msg160194 |
| 2012年05月08日 09:02:13 | mark.dickinson | set | nosy:
+ mark.dickinson messages: + msg160190 |
| 2012年05月08日 08:37:05 | vstinner | set | messages: + msg160189 |
| 2012年05月08日 05:32:09 | serhiy.storchaka | set | messages: + msg160186 |
| 2012年05月07日 22:19:41 | vstinner | set | messages: + msg160177 |
| 2012年05月07日 22:09:20 | vstinner | create | |