Flexible string representation, unicode, typography, ...
wxjmfauth at gmail.com
Wed Aug 29 07:40:46 EDT 2012
On Wednesday, August 29, 2012 at 06:16:05 UTC+2, Ian wrote:
>> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:
>> > In summary:
>> > 1. The problem is not on jmf's computer
>> > 2. It is not windows-only
>> > 3. It is not directly related to latin-1 encodable or not
>> >
>> > The only question which is not yet clear is this:
>> > Given a typical string operation that is complexity O(n), in more
>> > detail it is going to be O(a + bn)
>> > If only a is worse going from 3.2 to 3.3, it may be a small issue.
>> > If b is worse by even a tiny amount, it is likely to be a significant
>> > regression for some use-cases.
>>
>> As has been pointed out repeatedly already, this is a microbenchmark.
>> jmf is focusing on one particular area (string construction) where
>> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
>> that real code usually does lots of things other than building
>> strings, many of which are slower to begin with. In the real-world
>> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
>> Here's a much more realistic benchmark that nonetheless still focuses
>> on strings: word counting.
>>
>> Source: http://pastebin.com/RDeDsgPd
>>
>> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
>> 1000 loops, best of 3: 310 usec per loop
>>
>> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
>> 1000 loops, best of 3: 302 usec per loop
>>>> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
>> of Unicode characters that I pulled off the web. Even though this
>> program is still mostly string processing, Python 3.3 wins. Of
>> course, that's not really a very good test -- since it reads the file
>> on every pass, it probably spends more time in I/O than it does in
>> actual processing. Let's try it again with prepared string data:
>>
>> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
>> 10000 loops, best of 3: 87.3 usec per loop
>>
>> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
>> 10000 loops, best of 3: 84.6 usec per loop
>>
>> Nope, 3.3 still wins. And just for the sake of my own curiosity, I
>> decided to try it again using str.split() instead of a StringIO.
>> Since str.split() creates more strings, I expect Python 3.2 might
>> actually win this time.
>>
>> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
>> 10000 loops, best of 3: 88 usec per loop
>>
>> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
>> 10000 loops, best of 3: 76.5 usec per loop
>>
>> Interestingly, although Python 3.2 performs the splits in about the
>> same time as the StringIO operation, Python 3.3 is significantly
>> *faster* using str.split(), at least on this data set.
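For readers who do not want to chase the pastebin link: the actual source is only at the URL above, but a minimal sketch of a word-count module exposing the three entry points used in these runs (wc, wc_str, wc_split) could look like the following. This is my own guess at the shape of the code, not Ian's wc.py.

import io
from collections import Counter

def wc_str(text):
    # count words line by line through a StringIO wrapper
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts

def wc_split(text):
    # same counting, but splitting the whole string in one go
    return Counter(text.split())

def wc(filename):
    # re-read and decode the file on every call, then count
    with open(filename, 'r', encoding='utf-8') as f:
        return wc_str(f.read())

Either variant can be timed with the same timeit command lines shown above.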
>>
>> > So doing some arm-chair thinking (I don't know the code and the
>> > difficulty involved):
>> >
>> > Clearly there are 3 string-engines in the python 3 world:
>> > - 3.2 narrow
>> > - 3.2 wide
>> > - 3.3 (flexible)
>> >
>> > How difficult would it be to give the choice of string engine as a
>> > command-line flag?
>> > This would avoid the nuisance of having two binaries -- narrow and
>> > wide.
>>
>> Quite difficult. Even if we avoid having two or three separate
>> binaries, we would still have separate binary representations of the
>> string structs. It makes the maintainability of the software go down
>> instead of up.
>>
>> > And it would give the python programmer a choice of efficiency
>> > profiles.
>>
>> So instead of having just one test for my Unicode-handling code, I'll
>> now have to run that same test *three times* -- once for each possible
>> string engine option. Choice isn't always a good thing.
>>
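As an aside on the three engines: a script cannot choose between them, but it can at least report which one it is running under. A small sketch of my own, using sys.maxunicode and the version number as a rough heuristic rather than any official "engine" API:

import sys

if sys.version_info >= (3, 3):
    engine = "flexible string representation (PEP 393)"
elif sys.maxunicode == 0xFFFF:
    engine = "narrow build (UTF-16 code units)"
else:
    engine = "wide build (UCS-4)"

print(sys.version.split()[0], "->", engine, hex(sys.maxunicode))

Running a Unicode test suite under a 3.2 narrow build, a 3.2 wide build and 3.3, and logging this value with each run, is essentially the "three times" described above.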
Forget Python and all these benchmarks. The problem is on another
level: coding schemes, typography, the usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle only a sub-range of a coding
scheme without breaking that coding scheme is impossible.
If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme: cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), being just another coding scheme, does
not escape this rule.

This "Flexible String Representation" fails. Not only is it
unable to stick with a single coding scheme, it is a mixture
of coding schemes, the worst of all possible implementations.
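As a concrete illustration of that mixing (my own example; the exact byte counts vary by platform and build): under the 3.3 representation, the storage width of a whole str is set by the widest code point it contains, so a single character from a higher range changes how all the other characters are stored.

import sys

ascii_s  = 'a' * 1000                  # pure ASCII: 1 byte per character
latin_s  = 'a' * 999 + '\xe9'          # U+00E9: still 1 byte per character
bmp_s    = 'a' * 999 + '\u20ac'        # U+20AC: forces 2 bytes per character
astral_s = 'a' * 999 + '\U0001D11E'    # U+1D11E: forces 4 bytes per character

for s in (ascii_s, latin_s, bmp_s, astral_s):
    print(hex(ord(s[-1])), sys.getsizeof(s))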
jmf