Message102320
| Author |
ezio.melotti |
| Recipients |
dangra, ezio.melotti, lemburg, sjmachin, vstinner |
| Date |
2010-04-04.05:49:13 |
| Message-id |
<1270360159.99.0.657484109192.issue8271@psf.upfronthosting.co.za> |
| In-reply-to |
| Content |
This new patch (v3) should be OK.
I added a few more tests and found another corner case:
'\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd', dropping the trailing 'a': \xe1 is the start byte of a 3-byte sequence, but the string contained only two bytes, so the decoder consumed the 'a' as part of the truncated sequence. This is now fixed in the latest patch.
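To illustrate the corner case, here is a minimal sketch. It assumes Python 3, where `bytes.decode` takes the place of the Python 2 `str.decode` call quoted above, and it shows the behaviour of a fixed decoder rather than the code in the patch itself:

```python
# Assumes Python 3: bytes.decode stands in for the Python 2 str.decode call.
data = b'\xe1a'  # \xe1 starts a 3-byte sequence, but 'a' is not a continuation byte

decoded = data.decode('utf-8', 'replace')

# Only the invalid start byte is replaced with U+FFFD; the 'a' is preserved.
print(repr(decoded))
```

With the fix in place, the result has two characters: the replacement character followed by the untouched 'a'.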
I also unrolled all the loops except the first one, since I haven't found an elegant way to unroll it (yet).
Finally, I changed the error messages to make them clearer:
unexpected code byte -> invalid start byte;
invalid data -> invalid continuation byte.
(I can revert this if the old messages are better, or if this change belongs in a separate commit.)
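For reference, the two new messages can be observed with a short sketch like the one below. It assumes Python 3 and a CPython build that already uses the reworded messages; `decode_error_reason` is a hypothetical helper written for this example, not part of the patch:

```python
# Assumes Python 3 on a CPython build that uses the reworded messages.
def decode_error_reason(data):
    """Return the UnicodeDecodeError reason for data, or None if it decodes."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.reason
    return None

# \x80 is a continuation byte, so it cannot begin a sequence.
start_reason = decode_error_reason(b'\x80')
# \xe1 expects two continuation bytes, but 'a' is not one.
cont_reason = decode_error_reason(b'\xe1a')

print(start_reason)
print(cont_reason)
```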
Performance seems more or less the same: the benchmarks I ran showed no significant changes in the results. If you have better benchmarks, let me know. I used a file of 320 kB with some ASCII, ASCII mixed with some accented characters, Japanese, and a file with a sample of several different Unicode characters. |
|
History
|
|---|
| Date |
User |
Action |
Args |
| 2010-04-04 05:49:20 | ezio.melotti | set | recipients: + ezio.melotti, lemburg, sjmachin, vstinner, dangra |
| 2010-04-04 05:49:19 | ezio.melotti | set | messageid: <1270360159.99.0.657484109192.issue8271@psf.upfronthosting.co.za> |
| 2010-04-04 05:49:17 | ezio.melotti | link | issue8271 messages |
| 2010-04-04 05:49:16 | ezio.melotti | create |
|