[Python-Dev] urllib.quote and unquote - Unicode issues

Thu Jul 31 08:36:30 CEST 2008

Matt Giuca writes:
 > OK, for all the people who say URI encoding does not encode characters: yes
 > it does. This is not an encoding for binary data, it's an encoding for
 > character data, but it's unspecified how the strings map to octets before
 > being percent-encoded.
In other words, it's an encoding for binary data, since the octet
sequences that might be encountered are completely unrestricted. I
have to side with Bill on this. URIs are sequences of characters, but
the character set used must contain the ASCII repertoire as a subset,
and the URI delimiters in it must be mapped to the corresponding ASCII
codes. The rest of the set must be represented as sequences of octets
(which need not even be constant; you could gzip them first for all
URI-encoding cares).
URI-encoding itself is a purely mechanical process which transforms
reserved octets (not used as delimiters) to percent codes.
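To make "purely mechanical" concrete, here is a minimal sketch of a
bytes->bytes percent-encoder. The name percent_encode and the choice to
escape everything outside the RFC 3986 unreserved set are assumptions
made for illustration; this is not the urllib function under discussion.

```python
# Hypothetical helper, not part of urllib: a bytes -> bytes percent-encoder.
# Assumption: every octet outside the RFC 3986 "unreserved" set is escaped.
UNRESERVED = (
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(octets):
    """Percent-encode every octet that is not an unreserved ASCII character."""
    if not isinstance(octets, bytes):
        raise TypeError("percent_encode expects bytes")
    out = bytearray()
    for b in octets:  # iterating over bytes yields ints
        if b in UNRESERVED:
            out.append(b)
        else:
            out += b"%%%02X" % b  # e.g. 0x20 -> b"%20"
    return bytes(out)
```

Note that the function never asks what the octets mean: it neither
knows nor cares which character encoding (if any) produced them.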
 > From RFC 3986, section 1.2.1
 > <http://tools.ietf.org/html/rfc3986#section-1.2.1>:
 > > Percent-encoded octets (Section 2.1) may be used within a URI to represent
 > > characters outside the range of the US-ASCII coded character set if this
 > > representation is allowed by the scheme or by the protocol element in which
 > > the URI is referenced. Such a definition should specify the character
 > > encoding used to map those characters to octets prior to being
 > > percent-encoded for the URI.
This is kinda perverted, but suppose you have bytes which are actually a
Japanese string represented in packed EUC-JP. AFAICS the paragraph above
does *not* say you can't transcode to UTF-8 before percent-encoding, and
in fact you might be required to by the definition of the scheme.
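A small sketch of that transcoding step, using urllib.parse.quote and
unquote_to_bytes as they exist in current Python 3. The sample text and
the variable names are assumptions chosen for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

text = "日本"
euc_octets = text.encode("euc_jp")  # the bytes we were handed

# Transcode: decode from EUC-JP, re-encode as UTF-8, then percent-encode.
utf8_octets = euc_octets.decode("euc_jp").encode("utf-8")

as_euc = quote(euc_octets)    # percent-encodes the EUC-JP octets as-is
as_utf8 = quote(utf8_octets)  # "%E6%97%A5%E6%9C%AC"

# The percent-encoded forms differ, and only someone who knows the
# encoding chosen by the scheme can recover the original text.
recovered = unquote_to_bytes(as_euc).decode("euc_jp")
```

The same two characters yield two different URIs, which is why the
scheme (or protocol element) has to say which mapping is in force.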
 > So the string->string proposal is actually correct behaviour.
Ye-e-es, but. What the RFC clearly envisions is not that the
percent-encoder will be handed an unencoded string that looks like a
URI, but rather a sequence of octets representing one component
(scheme, authority, path, query, etc) of a URI.
In other words, a string->string URI encoder should only be called by
a URI builder, and never with a precomposed URI-like string.
Something like
from urllib.parse import quote

def URIBuilder(strings):
    """Return a URI built from a list of strings.

    The first string *must* be the scheme.
    If the URI follows the generic URI syntax of RFC 3986, the
    remaining components should be given in the order authority, path,
    fragment, query part [, query part ...]."""

    def uriencode(s):
        """URI encode a string per RFC 3986 Section 3."""
        # We all know what this does; urllib.parse.quote stands in here.
        return quote(s)

    if strings[0] == "http":
        # HTTP scheme, delimiters and authority
        uri = "http://" + uriencode(strings[1]) + "/"
        # path, if present
        if strings[2]:
            uri = uri + uriencode(strings[2])
        # query, if present
        if strings[4]:
            uri = uri + "?" + uriencode(strings[4])
        # further query parameters, if present (strings[4] is already used)
        for s in strings[5:]:
            uri = uri + ";" + uriencode(s)
        # fragment, if present
        if strings[3]:
            uri = uri + "#" + uriencode(strings[3])
    elif strings[0] == "mailto":
        uri = "mailto:" + uriencode(strings[1])
    # etc etc
    else:
        raise ValueError("unsupported scheme: %r" % strings[0])
    return uri
I think you'd have a much easier time enforcing this pedantically
correct usage with a bytes->bytes encoder.
Of course, it's un-Pythonic to enforce pedantry, and we pedants can
use a string->string encoder correctly.
 > You really want me to remove the encoding= named argument? And hard-code
 > UTF-8 into these functions?
A quoting function that accepts bytes *must* have an encoding
argument. There's no point to passing the quoter bytes unless the
text is represented in a non-Unicode encoding.
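For reference, a sketch of what the encoding= argument under discussion
controls, using quote as it exists in current Python 3's urllib.parse,
where the argument applies to str input. The sample character is an
assumption chosen for illustration.

```python
from urllib.parse import quote

utf8_form = quote("é")                        # "%C3%A9", default is UTF-8
latin1_form = quote("é", encoding="latin-1")  # "%E9", same character
```

The character is the same in both calls; only the mapping from
characters to octets differs, and that mapping is exactly what the
argument selects.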