This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010年03月08日 23:21 by vstinner, last changed 2022年04月11日 14:56 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| utf8_surrogate_error-3.patch | vstinner, 2010年04月20日 21:59 | |||
| Messages (8) | |||
|---|---|---|---|
| msg100678 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年03月08日 23:21 | |
utf8 encoder doesn't work in backslashreplace error handler:
>>> "\uDC80".encode("utf8", "backslashreplace")
TypeError: error handler should have returned bytes
|
|||
| msg100681 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年03月08日 23:23 | |
See also issue #6697. |
|||
| msg100716 - (view) | Author: Walter Dörwald (doerwalter) * (Python committer) | Date: 2010年03月09日 11:55 | |
After the patch the comment: /* Implementation limitations: only support error handler that return bytes, and only support up to four replacement bytes. */ no longer applies. Also I would like to see a version of this patch where the length limitation for the replacement returned from the error handler is removed (ideally for both the str and bytes case). |
|||
| msg103741 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年04月20日 19:38 | |
New version without the hardcoded limit: don't use goto encodeUCS4;, chain if to limit indentation depth: it only costs one copy of the UCS4 (5 lines are duplicated). The buffer is now reallocated each time a surrogate escape is longer than 4 bytes. I don't know if "nallocated += repsize - 4;" can overflow or not. If yes, how can I detect the overflow? I added: /* FIXME: check integer overflow? */ |
|||
| msg103766 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2010年04月20日 21:09 | |
> I don't know if "nallocated += repsize - 4;" can overflow or not.
> If yes, how can I detect the overflow?
Sure, if they are both Py_ssize_t, just use:
if (nallocated > PY_SSIZE_T_MAX - repsize + 4) {
/* handle overflow ... */
}
|
|||
| msg103777 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年04月20日 21:50 | |
Oh no :-( I realized that I removed the first message of this issue! msg100687. Copy/paste of the message: --- This issue is a regression introduced by r72208 to fix the issue #3672. Attached patch fixes PyUnicode_EncodeUTF8() if unicode_encode_call_errorhandler() returns an unicode string (eg. backslackreplace error handler). I don't know unicodeobject.c code (very well), and my patch should be far from being perfect. I suppose that the maximum length of an escaped characters is 8 bytes (xmlcharrefreplace error error for U+DFFFF). When the first lone surrogate is found, reallocate the buffer to size*8 bytes. The escaped character have to be an ASCII character or an UnicodeEncodeError is raised. Note: unicode_encode_ucs1() doesn't have hardcoded for the maximum length ot escaped string. Its code might be reused in PyUnicode_EncodeUTF8() to remove the hardcoded limits. --- |
|||
| msg103779 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年04月20日 21:59 | |
Oops, I forgot the remove the reallocation in the unicode case in the patch version 2. Patch version 3: - micro-optimization: group both surrogates cases in the same if to avoid checking 0xD800 <= ch twice - check for integer overflow - (remove the duplication reallocation introduced by version 2) I think that PyUnicode_EncodeUTF8() is more readable after my patch: there maximum if depth is 2 instead of 3, and I removed the goto. It shouldn't change anything about performances for chacters < 0x800 (ASCII and Latin-1), and I expect similar performances for characters >= 0x800. |
|||
| msg103977 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年04月22日 19:41 | |
Fixed: r80382 (py3k), r80383 (3.1). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:56:58 | admin | set | github: 52339 |
| 2010年04月22日 19:41:20 | vstinner | set | status: open -> closed resolution: fixed messages: + msg103977 |
| 2010年04月22日 10:26:00 | vstinner | set | files: - utf8_surrogate_error-2.patch |
| 2010年04月20日 21:59:51 | vstinner | set | files:
+ utf8_surrogate_error-3.patch messages: + msg103779 |
| 2010年04月20日 21:50:24 | vstinner | set | messages: + msg103777 |
| 2010年04月20日 21:09:53 | pitrou | set | nosy:
+ pitrou messages: + msg103766 |
| 2010年04月20日 21:08:37 | vstinner | set | files: - utf8_surrogate_error.patch |
| 2010年04月20日 19:38:47 | vstinner | set | files:
+ utf8_surrogate_error-2.patch messages: + msg103741 |
| 2010年04月20日 19:36:44 | vstinner | set | messages: - msg100687 |
| 2010年04月20日 11:16:16 | vstinner | link | issue8242 dependencies |
| 2010年03月09日 11:55:53 | doerwalter | set | nosy:
+ doerwalter messages: + msg100716 |
| 2010年03月09日 01:37:20 | pitrou | set | versions:
+ Python 3.2 nosy: + lemburg, loewis priority: normal type: behavior stage: patch review |
| 2010年03月09日 01:11:58 | vstinner | set | files:
+ utf8_surrogate_error.patch keywords: + patch messages: + msg100687 |
| 2010年03月08日 23:23:25 | vstinner | set | messages: + msg100681 |
| 2010年03月08日 23:21:05 | vstinner | create | |