For a code challenge, I'm trying to write a comprehensive URI parser in Python that handles both URIs with authority paths (ex: URLs such as http://user:[email protected]/page?key=value#fragment) and other URI schemes (ex: mailto:[email protected]?subject=Blah).
Here's my current code:
import json
import re


class Uri(object):
    """ Utility class to handle URIs """

    ESCAPE_CODES = {' ' : '%20', '<' : '%3C', '>' : '%3E', '#' : '%23', '%' : '%25', '{' : '%7B',
                    '}' : '%7D', '|' : '%7C', '\\' : '%5C', '^' : '%5E', '~' : '%7E', '[' : '%5B',
                    ']' : '%5D', '`' : '%60', ';' : '%3B', '/' : '%2F', '?' : '%3F', ':' : '%3A',
                    '@' : '%40', '=' : '%3D', '&' : '%26', '$' : '%24'}

    @staticmethod
    def encode(string):
        """ "Percent-encodes" the given string """
        return ''.join(c if not c in Uri.ESCAPE_CODES else Uri.ESCAPE_CODES[c] for c in string)

    # We could parse (most of) the URI using this regex given on the RFC 3986:
    # http://tools.ietf.org/html/rfc3986#appendix-B
    # We won't do it though because it spoils all the fun! \o/
    # We're only going to use it to detect broken URIs
    URI_REGEX = "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

    def __init__(self, uri):
        """ Parses the given URI """
        uri = uri.strip()
        if not re.match(Uri.URI_REGEX, uri):
            raise ValueError("The given URI isn't valid")

        # URI scheme is case-insensitive
        self.scheme = uri.split(':')[0].lower()
        self.path = uri[len(self.scheme) + 1:]

        # URI fragments
        self.fragment = None
        if '#' in self.path:
            self.path, self.fragment = self.path.split('#')

        # Query parameters (for instance: http://mysite.com/page?key=value&other_key=value2)
        self.parameters = dict()
        if '?' in self.path:
            separator = '&' if '&' in self.path else ';'
            query_params = self.path.split('?')[-1].split(separator)
            query_params = map(lambda p : p.split('='), query_params)
            self.parameters = {key : value for key, value in query_params}
            self.path = self.path.split('?')[0]

        # For URIs that have a path starting with '//', we try to fetch additional info:
        self.authority = None
        if self.path.startswith('//'):
            self.path = self.path.lstrip('//')
            uri_tokens = self.path.split('/')
            self.authority = uri_tokens[0]
            self.hostname = self.authority
            self.path = self.path[len(self.authority):]

            # Fetching authentication data. For instance: "http://login:[email protected]"
            self.authenticated = '@' in self.authority
            if self.authenticated:
                self.user_information, self.hostname = self.authority.split('@', 1)

            # Fetching port
            self.port = None
            if ':' in self.hostname:
                self.hostname, self.port = self.hostname.split(':')
                self.port = int(self.port)

            # Hostnames are case-insensitive
            self.hostname = self.hostname.lower()

    def serialize_parameters(self):
        """ Returns a serialized representation of the query parameters. """
        return '&'.join('{}={}'.format(key, value) for key, value in sorted(self.parameters.iteritems()))

    def __str__(self):
        """ Outputs the URI as a string """
        uri = '{}:'.format(Uri.encode(self.scheme))
        if self.authority:
            uri += '//'
            if self.authenticated:
                uri += Uri.encode(self.user_information) + '@'
            uri += self.hostname
            if self.port:
                uri += ':{}'.format(self.port)
        uri += self.path
        if self.parameters:
            uri += '?' + self.serialize_parameters()
        if self.fragment:
            uri += '#' + Uri.encode(self.fragment)
        return uri

    def json(self):
        """ JSON serialization of the URI object """
        return json.dumps(self.__dict__, sort_keys=True, indent=2)

    def summary(self):
        """ Summary of the URI object. Mostly for debug. """
        uri_repr = '{}\n'.format(self)
        uri_repr += '\n'
        uri_repr += "* Schema name: '{}'\n".format(self.scheme)
        if self.authority:
            uri_repr += "* Authority path: '{}'\n".format(self.authority)
            uri_repr += " . Hostname: '{}'\n".format(self.hostname)
            if self.authenticated:
                uri_repr += " . User information = '{}'\n".format(self.user_information)
            if self.port:
                uri_repr += " . Port = '{}'\n".format(self.port)
        uri_repr += "* Path: '{}'\n".format(self.path)
        if self.parameters:
            uri_repr += "* Query parameters: '{}'\n".format(self.parameters)
        if self.fragment:
            uri_repr += "* Fragment: '{}'\n".format(self.fragment)
        return uri_repr
Also hosted on github.
All feedback, including failure to respect PEP8 or existence of more "pythonic" methods, is welcome!
- There is no better place to look than the Python standard library: hg.python.org/cpython/file/2.7/Lib/urlparse.py – Dim, Sep 17, 2013 at 9:18
- It might seem arrogant, but I don't think urlparse's implementation is the most elegant one. (Other Python standard library modules are also horrible, like the ones handling ZIP and TAR files.) – halflings, Sep 17, 2013 at 23:33
1 Answer
You referenced RFC 3986, but I don't think you've tried to follow it.
In your constructor, you immediately lower-case everything. That is obviously wrong. RFC 3986 Sec. 6.2.2.1 says that only the scheme and host portions of URIs are case-insensitive.
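As a minimal sketch of that rule (Python 3 assumed; `normalize_case` is a hypothetical helper, not part of your class), only the scheme and the host are lower-cased, while user information, path, query and fragment pass through untouched. The regex is the one from RFC 3986 Appendix B that you already quote, used here only to locate the components:

```python
import re

# Appendix B regex from RFC 3986, used only to locate the components.
_URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def normalize_case(uri):
    """Lower-case the scheme and host only; leave every other component as-is."""
    match = _URI_RE.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    out = ''
    if scheme is not None:
        out += scheme.lower() + ':'
    if authority is not None:
        # Keep the user information verbatim; only host (and port) are lowered.
        userinfo, at, hostport = authority.rpartition('@')
        out += '//' + userinfo + at + hostport.lower()
    out += path
    if query is not None:
        out += '?' + query
    if fragment is not None:
        out += '#' + fragment
    return out
```

For a URI without an authority, such as a `mailto:` address, only the scheme changes.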
You have an escape() function, but oddly no unescape() function, which I expect would be needed for parsing URIs. Please be aware when implementing unescape() that query strings have special unescaping rules. The RFC uses the term "percent-encoding", so perhaps you should call it "encode" rather than "escape".
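A rough sketch of what the decoding side could look like, in Python 3. The names `percent_decode` and `decode_query_value` are hypothetical, UTF-8 is an assumed encoding, and I am assuming the special query rule in question is the HTML form convention where `+` stands for a space (that convention comes from form encoding, not from RFC 3986 itself):

```python
import re

# Matches one percent-encoded octet, e.g. b'%2F'.
_PCT = re.compile(rb'%([0-9A-Fa-f]{2})')


def percent_decode(text, encoding='utf-8'):
    """Replace every %XX triplet with its octet, then decode the result."""
    raw = text.encode(encoding)
    decoded = _PCT.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode(encoding, errors='replace')


def decode_query_value(text):
    """Form-style decoding: turn '+' into a space first, then percent-decode."""
    return percent_decode(text.replace('+', ' '))
```

The order matters in `decode_query_value`: replacing `+` before decoding keeps a literal plus sign that was sent as `%2B` intact. If you are not set on writing this yourself, Python 3's `urllib.parse.unquote` and `urllib.parse.unquote_plus` cover the same ground.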
Your escape() function only encodes a fixed list of characters, which is dangerous: far more characters require encoding than can safely be passed through unchanged.
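One way to invert the logic is to list what may pass through and encode everything else. The sketch below assumes Python 3 and UTF-8, and `percent_encode` with its `safe` parameter is a hypothetical name. The allowed set is the "unreserved" set from RFC 3986 Sec. 2.3 (letters, digits, `-`, `.`, `_`, `~`); note that your table currently encodes `~`, which is in that set.

```python
import string

# RFC 3986 Sec. 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + '-._~')


def percent_encode(text, safe=''):
    """Percent-encode every octet that is not unreserved or listed in `safe`."""
    allowed = UNRESERVED | set(safe)
    out = []
    for byte in text.encode('utf-8'):
        # Only ASCII octets can ever be allowed; the rest are always encoded.
        if byte < 128 and chr(byte) in allowed:
            out.append(chr(byte))
        else:
            out.append('%{:02X}'.format(byte))
    return ''.join(out)
```

The `safe` argument lets each component keep its own delimiters, for example `percent_encode(path, safe='/')` for a path.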
Be careful when calling split() where you expect at most one separator. You should use split(':', 1), split('@', 1) and split('#', 1) instead.
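To illustrate the difference, here is a small sketch in Python 3. The URI and the `split_first` helper are made up for the example; `str.partition` is used because it also copes with a missing separator.

```python
def split_first(text, separator):
    """Split on the first separator only; the tail is None when it is absent."""
    head, found, tail = text.partition(separator)
    return head, (tail if found else None)


# Hypothetical input with a second '#' and a ':' inside the query string.
uri = 'http://example.com/page?next=a:b#top#bottom'

# A bare split() yields three pieces here, so unpacking into two names fails.
pieces = uri.split('#')

# Limiting the split keeps everything after the first separator together.
rest, fragment = uri.split('#', 1)
scheme, rest = split_first(rest, ':')
no_head, no_tail = split_first('mailto', ':')
```

With the unbounded `split('#')`, `self.path, self.fragment = ...` would raise a `ValueError` on this input instead of parsing it.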
Better yet, don't try to split at all. Instead, consistently use regular expression capturing for identifying all parts of the URI. You should be able to make one huge regular expression. All complex regular expressions, including the one you are using now, should have embedded comments.
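As a starting point, this is roughly what the Appendix B regex looks like with named groups and embedded comments, using `re.VERBOSE` (Python 3 assumed; the group names and the `split_uri` helper are my own choices, not anything the RFC prescribes). It only separates the five top-level components; the authority would still need a second pattern for user information, host and port.

```python
import re

URI_PATTERN = re.compile(r"""
    ^
    (?: (?P<scheme>       [^:/?#]+ ) : )?   # scheme: everything before the first ':'
    (?: // (?P<authority> [^/?#]*  )   )?   # authority: introduced by '//'
    (?P<path>             [^?#]*   )        # path: runs until '?' or '#'
    (?: \? (?P<query>     [^#]*    )   )?   # query: between '?' and '#'
    (?: \# (?P<fragment>  .*       )   )?   # fragment: everything after '#'
    $
""", re.VERBOSE)


def split_uri(uri):
    """Return the five top-level components as a dict (None when absent)."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError("The given URI isn't valid")
    return match.groupdict()
```

In verbose mode a literal `#` outside a character class has to be escaped, which is why the fragment delimiter is written `\#`.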
- I quickly skimmed over RFC 3986 and, as you showed, missed many things. I'm going to try to fix those ASAP! I tried to avoid using one big pile of regex (even commented) because I think the goal of this programming exercise was to parse strings in various ways and show good coding style. Using one big regex wouldn't be relevant, IMHO. – halflings, Sep 15, 2013 at 20:43
- Regarding "percent-encoding": from what I've understood, only reserved characters must be percent-encoded in the URI, and I think I already have all of those covered. – halflings, Sep 15, 2013 at 20:57
- Sec. 2.1: "A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component." The reserved set contains a list of common delimiters. There are also plenty of characters outside the allowed set. Appendix A suggests that anything outside the unreserved set should be percent-encoded if it's not being used for a syntactically significant role in the URL. – 200_success, Sep 16, 2013 at 0:18
- OK! Forgot about non-ASCII letters and other exotic characters. – halflings, Sep 16, 2013 at 1:07