This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Author tburke
Recipients tburke
Date 2019-03-12.20:33:23
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1552422803.93.0.420596825145.issue36274@roundup.psfhosted.org>
In-reply-to
Content
While the RFCs are rather clear that non-ASCII data would be out of spec,
* that doesn't prevent a poorly-behaved client from sending non-ASCII bytes on the wire, which means
* as an application developer, it's useful to be able to mimic such a client to verify expected behavior while still using stdlib to handle things like header parsing, particularly since
* this worked perfectly well on Python 2.
The two most-obvious ways (to me, anyway) to try to send a request for /你好 (for example) are
 # Assume it will get UTF-8 encoded, as that's the default encoding
 # for urllib.parse.quote()
 conn.putrequest('GET', '/\u4f60\u597d')
 # Assume it will get Latin-1 encoded, as
 # * that's the encoding used in http.client.parse_headers(),
 # * that's the encoding used for PEP-3333, and
 # * it has a one-to-one mapping with bytes
 conn.putrequest('GET', '/\xe4\xbd\xa0\xe5\xa5\xbd')
both fail with something like
 UnicodeEncodeError: 'ascii' codec can't encode characters in position ...
Trying to pre-encode like
 conn.putrequest('GET', b'/\xe4\xbd\xa0\xe5\xa5\xbd')
at least doesn't raise an error, but still does not do what was intended; rather than a request line like
 GET /你好 HTTP/1.1
(or
 /ä½ å¥½
depending on how you choose to interpret the bytes), the server gets
 GET b'/\xe4\xbd\xa0\xe5\xa5\xbd' HTTP/1.1
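The three outcomes above can be reproduced without a socket. This is only a rough model, assuming the request line is built with %-formatting and then ASCII-encoded, as the linked lines suggest; build_request_line and try_build are hypothetical names for illustration, not part of http.client.

```python
def build_request_line(method, url, version="HTTP/1.1"):
    """Hypothetical stand-in for the request-line step: format, then ASCII-encode."""
    request = "%s %s %s" % (method, url, version)
    return request.encode("ascii")


def try_build(url):
    """Return the encoded request line, or the name of the exception raised."""
    try:
        return build_request_line("GET", url)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# str input: both guesses contain code points above U+007F, so ASCII encoding fails.
utf8_guess = try_build("/\u4f60\u597d")
latin1_guess = try_build("/\xe4\xbd\xa0\xe5\xa5\xbd")

# bytes input: %s formats the repr() of the bytes object, which is pure ASCII,
# so nothing fails and the repr ends up in the request line.
bytes_guess = try_build(b"/\xe4\xbd\xa0\xe5\xa5\xbd")

# The same six bytes read two ways, matching the two intended request lines.
raw = b"/\xe4\xbd\xa0\xe5\xa5\xbd"
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

print(utf8_guess, latin1_guess, bytes_guess)
```

The bytes case is the surprising one: no exception is raised, so the problem only shows up on the server side.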
The trouble comes down to https://github.com/python/cpython/blob/v3.7.2/Lib/http/client.py#L1104-L1107 -- the request line is encoded as ASCII on the assumption that any non-ASCII characters were eliminated earlier, but we don't actually have any control over what the caller passes as the url (so the assumption doesn't hold), nor do we know anything about the encoding that was *intended*.
One of three fixes seems warranted:
* Switch to using Latin-1 to encode instead of ASCII (again, leaning on the precedent set in parse_headers and PEP-3333). This may make it too easy to write an out-of-spec client, however.
* Continue to use ASCII to encode, but include errors='surrogateescape' to give callers an escape hatch. This seems like a reasonably high bar to ensure that the caller actually intends to send unquoted data.
* Accept raw bytes and actually use them (rather than their repr()), allowing the caller to decide upon an appropriate encoding.
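To compare the three options side by side, here is a rough sketch of what each encoding step might look like from the caller's point of view. The function names are hypothetical and this is not a patch against http.client; it only illustrates how the caller would have to prepare the url in each case to get the same bytes on the wire.

```python
RAW_PATH = b"/\xe4\xbd\xa0\xe5\xa5\xbd"  # UTF-8 bytes for "/\u4f60\u597d"


def encode_latin1(method, url, version="HTTP/1.1"):
    # Option 1: Latin-1 maps each code point U+0000..U+00FF to exactly one byte.
    return ("%s %s %s" % (method, url, version)).encode("latin-1")


def encode_surrogateescape(method, url, version="HTTP/1.1"):
    # Option 2: still ASCII, but lone surrogates U+DC80..U+DCFF are turned
    # back into the bytes 0x80..0xFF they stand for.
    return ("%s %s %s" % (method, url, version)).encode("ascii", "surrogateescape")


def encode_bytes(method, url, version=b"HTTP/1.1"):
    # Option 3: bytes in, bytes out; the caller has already picked an encoding.
    return b"%s %s %s" % (method, url, version)


expected = b"GET " + RAW_PATH + b" HTTP/1.1"

line1 = encode_latin1("GET", RAW_PATH.decode("latin-1"))
line2 = encode_surrogateescape("GET", RAW_PATH.decode("ascii", "surrogateescape"))
line3 = encode_bytes(b"GET", RAW_PATH)

# Under option 2, an ordinary non-ASCII str is still rejected; only a caller
# who deliberately smuggled bytes in via surrogateescape gets through.
try:
    encode_surrogateescape("GET", "/\u4f60\u597d")
    still_rejected = False
except UnicodeEncodeError:
    still_rejected = True

print(line1 == expected, line2 == expected, line3 == expected, still_rejected)
```

All three produce the same request line here; the difference is in how much deliberate effort the caller has to put in before out-of-spec bytes reach the wire.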
History
Date User Action Args
2019-03-12 20:33:23 tburke set recipients: + tburke
2019-03-12 20:33:23 tburke set messageid: <1552422803.93.0.420596825145.issue36274@roundup.psfhosted.org>
2019-03-12 20:33:23 tburke link issue36274 messages
2019-03-12 20:33:23 tburke create
