Issue 34454: datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/78635

classification

Title:	datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure
Type:	crash	Stage:	resolved
Components:	Extension Modules	Versions:	Python 3.8, Python 3.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	belopolsky, izbyshev, miss-islington, p-ganssle, serhiy.storchaka, taleinat, vstinner
Priority:	normal	Keywords:	patch

Created on 2018年08月21日 21:38 by izbyshev, last changed 2022年04月11日 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 8850	closed	izbyshev, 2018年08月21日 21:42
PR 8859	closed	taleinat, 2018年08月22日 12:17
PR 8862	merged	p-ganssle, 2018年08月22日 16:31
PR 8877	merged	miss-islington, 2018年08月23日 15:06
PR 8878	merged	izbyshev, 2018年08月23日 17:49
PR 8959	merged	p-ganssle, 2018年08月27日 21:22
PR 10041	merged	miss-islington, 2018年10月22日 20:45

Messages (12)
msg323844 - (view)	Author: Alexey Izbyshev (izbyshev) * (Python triager)	Date: 2018年08月21日 21:38
A failure of PyUnicode_AsUTF8AndSize() in various fromisoformat() functions in Modules/_datetimemodule.c leads to NULL dereference due to the missing check, e.g.: >>> from datetime import date >>> date.fromisoformat('\ud800') Segmentation fault (core dumped) This is similar to msg123474. The missing NULL check was reported by Svace static analyzer. While preparing tests for this issue, I've discovered a deeper problem. The C datetime implementation uses PyUnicode_AsUTF8AndSize() in several places, making some functions reject strings containing surrogate code points (0xD800 - 0xDFFF) since they can't be encoded in UTF-8. On the other hand, the pure-Python datetime implementation doesn't have this restriction. For example: >>> import sys >>> sys.modules['_datetime'] = None # block C implementation >>> from datetime import time >>> time().strftime('\ud800') '\ud800' >>> del sys.modules['datetime'] >>> del sys.modules['_datetime'] >>> from datetime import time >>> time().strftime('\ud800') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed My PR (coming soon) doesn't address this difference but focuses on fixing the immediate problem instead. Suggestions are appreciated.
msg323846 - (view)	Author: Paul Ganssle (p-ganssle) * (Python committer)	Date: 2018年08月21日 22:05
So this is related to something I was actually meaning to fix. When I wrote this code I didn't understand the way PyUnicode works, there's actually no need to call `PyUnicode_AsUTF8AndSize()` on the entire unicode string. My understanding is that each glyph in a given PyUnicode object is the same size, which means that this section of the code can go: https://github.com/python/cpython/blob/master/Modules/_datetimemodule.c#L4862 Instead we can just break the string up as glyphs 0-10 and 11+ and pass them on. Since by the contract of the function glyphs 0-10 and 11+ must be ASCII, we no longer need to worry about valid use cases where a character un-representable by UTF-8 will lead to anything except an error. Obviously the null pointer error needs to be fixed since it should raise an error and not segfault. I'd be happy to do the part where the string is broken up before being passed to PyUnicode_AsUTF8AndSize() if it would make it easier to implement your PR (which seems to touch a lot of other parts of the code as well).
msg323849 - (view)	Author: Alexey Izbyshev (izbyshev) * (Python triager)	Date: 2018年08月21日 22:34
I will be glad to rebase my PR and remove the try/except from the test if you remove the dependency of separator searching code on PyUnicode_AsUTF8AndSize() as you suggest. Or we can go the other way and merge mine first -- whatever you prefer. Note that technically a difference between C and Python implementation of fromisoformat() will still remain: if a part of the input string before or after the separator contains surrogates, the C code will throw a UnicodeEncodeError while the Python code -- ValueError. But since the former error is a subclass of the latter, I guess it's OK, what do you think? Also, note that the other discovered C/Python impl difference (for strftime, handled by another try/catch in tests) won't go away, of course, unless someone is ready to fix that as well.
msg323856 - (view)	Author: Paul Ganssle (p-ganssle) * (Python committer)	Date: 2018年08月22日 00:00
> Note that technically a difference between C and Python implementation of fromisoformat() will still remain: if a part of the input string before or after the separator contains surrogates, the C code will throw a UnicodeEncodeError while the Python code -- ValueError. But since the former error is a subclass of the latter, I guess it's OK, what do you think? I think the fact that the unicode string is decoded is an implementation detail. I would suggest swallowing the decode error and raising a standard ValueError.
msg323875 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)	Date: 2018年08月22日 12:21
I agree, that raising UnicodeEncodeError instead of ValueError is acceptable, since the former is a subclass of the latter. But since the default implementation raises more specialized subclass, it is probably that the user code will catch more narrow exception type. Seems, it is not hard to make the same exception be raised in all cases. For example, in date_fromisoformat(): if (!dt_ptr \|\| len != 10 \|\| parse_isoformat_date(dt_ptr, &year, &month, &day) < 0) { PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dstr); return NULL; }
msg323954 - (view)	Author: Tal Einat (taleinat) * (Python committer)	Date: 2018年08月23日 15:06
New changeset 096329f0b2bf5e3f0a16363aa631d993ce078737 by Tal Einat (Paul Ganssle) in branch 'master': bpo-34454: fix .fromisoformat() methods crashing on inputs with surrogate code points (GH-8862) https://github.com/python/cpython/commit/096329f0b2bf5e3f0a16363aa631d993ce078737
msg323958 - (view)	Author: miss-islington (miss-islington)	Date: 2018年08月23日 15:54
New changeset 89b1654e0bc7bc69709dca86dd4c92eb7122ac7e by Miss Islington (bot) in branch '3.7': bpo-34454: fix .fromisoformat() methods crashing on inputs with surrogate code points (GH-8862) https://github.com/python/cpython/commit/89b1654e0bc7bc69709dca86dd4c92eb7122ac7e
msg323960 - (view)	Author: Tal Einat (taleinat) * (Python committer)	Date: 2018年08月23日 16:47
Thanks for the report and initial patch, Alexey! Thanks for the followup and final PR, Paul!
msg328263 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2018年10月22日 16:14
I reopen the issue since there are still 2 open PRs: PR 8878 and PR 8959.
msg328272 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2018年10月22日 19:37
New changeset 3df85404d4bf420db3362eeae1345f2cad948a71 by Victor Stinner (Paul Ganssle) in branch 'master': bpo-34454: Clean up datetime.fromisoformat surrogate handling (GH-8959) https://github.com/python/cpython/commit/3df85404d4bf420db3362eeae1345f2cad948a71
msg328278 - (view)	Author: miss-islington (miss-islington)	Date: 2018年10月22日 22:35
New changeset 18450be94d74b5c7e05a08f4aaa4792749ecda18 by Miss Islington (bot) in branch '3.7': bpo-34454: Clean up datetime.fromisoformat surrogate handling (GH-8959) https://github.com/python/cpython/commit/18450be94d74b5c7e05a08f4aaa4792749ecda18
msg328294 - (view)	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2018年10月23日 09:19
Thanks Paul Ganssle for your fixes!

History
Date	User	Action	Args
2022年04月11日 14:59:04	admin	set	github: 78635
2018年10月23日 09:19:02	vstinner	set	messages: + msg328294
2018年10月23日 07:06:56	taleinat	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2018年10月22日 22:35:20	miss-islington	set	messages: + msg328278
2018年10月22日 20:45:42	miss-islington	set	stage: resolved -> patch review pull_requests: + pull_request9380
2018年10月22日 19:37:26	vstinner	set	messages: + msg328272
2018年10月22日 16:14:13	vstinner	set	status: closed -> open nosy: + vstinner messages: + msg328263 resolution: fixed -> (no value)
2018年08月27日 21:22:12	p-ganssle	set	pull_requests: + pull_request8434
2018年08月23日 17:49:49	izbyshev	set	pull_requests: + pull_request8354
2018年08月23日 16:47:49	taleinat	set	status: open -> closed resolution: fixed messages: + msg323960 stage: patch review -> resolved
2018年08月23日 15:54:51	miss-islington	set	nosy: + miss-islington messages: + msg323958
2018年08月23日 15:06:35	miss-islington	set	pull_requests: + pull_request8352
2018年08月23日 15:06:23	taleinat	set	nosy: + taleinat messages: + msg323954
2018年08月22日 16:31:27	p-ganssle	set	pull_requests: + pull_request8336
2018年08月22日 12:21:25	serhiy.storchaka	set	messages: + msg323875
2018年08月22日 12:17:16	taleinat	set	pull_requests: + pull_request8331
2018年08月22日 00:00:12	p-ganssle	set	messages: + msg323856
2018年08月21日 22:34:32	izbyshev	set	messages: + msg323849
2018年08月21日 22:05:29	p-ganssle	set	messages: + msg323846
2018年08月21日 21:42:24	izbyshev	set	keywords: + patch stage: patch review pull_requests: + pull_request8322
2018年08月21日 21:38:29	izbyshev	create

homepage