Message 120982

| Field | Value |
|---|---|
| Author | belopolsky |
| Recipients | belopolsky, ezio.melotti, lemburg, loewis, vstinner |
| Date | 2010-11-11.23:05:52 |
| SpamBayes Score | 5.483061e-08 |
| Marked as misclassified | No |
| Message-id | <1289516754.09.0.284658362081.issue10382@psf.upfronthosting.co.za> |
| In-reply-to | |

Content
haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.

I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8-encoded text. If that is not already the case, I would consider making it so by replacing the call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String() (if that matters).

The patch still needs unit tests and may have some off-by-one issues, but I would first like to reach agreement that this is the right level at which to fix the problem.
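As a rough illustration of the character-counting idea (this is a sketch, not code from the attached patch, and the helper name is hypothetical): because UTF-8 continuation bytes always match the bit pattern 10xxxxxx, a byte offset into a UTF-8 buffer can be converted to a character (code point) offset by counting only the non-continuation bytes up to that position.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: convert a byte offset
 * into a UTF-8 buffer to a character (code point) offset. Continuation
 * bytes have the form 10xxxxxx, i.e. (b & 0xC0) == 0x80, so counting
 * every byte that is NOT a continuation byte counts code points. */
static size_t
utf8_char_offset(const char *buf, size_t byte_offset)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80) {
            chars++;  /* ASCII or lead byte: starts a new code point */
        }
    }
    return chars;
}
```

This counting is only well-defined if the buffer really is valid UTF-8, which is guaranteed when the bytes come from PyUnicode_AsUTF8String(), since that API always produces a UTF-8 encoded bytes object.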
History

| Date | User | Action | Args |
|---|---|---|---|
| 2010-11-11 23:05:54 | belopolsky | set | recipients: + belopolsky, lemburg, loewis, vstinner, ezio.melotti |
| 2010-11-11 23:05:54 | belopolsky | set | messageid: <1289516754.09.0.284658362081.issue10382@psf.upfronthosting.co.za> |
| 2010-11-11 23:05:52 | belopolsky | link | issue10382 messages |
| 2010-11-11 23:05:52 | belopolsky | create | |