A (comprehensive) URI parser for Python

Question 1

For a code challenge, I'm trying to write a comprehensive URI parser in Python that handles both URIs with authority paths (ex: URLs such as http://user:[email protected]/page?key=value#fragment) and other URI schemes (ex: mailto:[email protected]?subject=Blah).

Here's my current code:

import json
import re


class Uri(object):
    """ Utility class to handle URIs """

    ESCAPE_CODES = {' ' : '%20', '<' : '%3C', '>' : '%3E', '#' : '%23', '%' : '%25', '{' : '%7B',
                    '}' : '%7D', '|' : '%7C', '\\' : '%5C', '^' : '%5E', '~' : '%7E', '[' : '%5B',
                    ']' : '%5D', '`' : '%60', ';' : '%3B', '/' : '%2F', '?' : '%3F', ':' : '%3A',
                    '@' : '%40', '=' : '%3D', '&' : '%26', '$' : '%24'}

    @staticmethod
    def encode(string):
        """ "Percent-encodes" the given string """
        return ''.join(c if not c in Uri.ESCAPE_CODES else Uri.ESCAPE_CODES[c] for c in string)

    # We could parse (most of) the URI using this regex given on the RFC 3986:
    # http://tools.ietf.org/html/rfc3986#appendix-B
    # We won't do it though because it spoils all the fun! \o/
    # We're only going to use it to detect broken URIs
    URI_REGEX = "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

    def __init__(self, uri):
        """ Parses the given URI """
        uri = uri.strip()
        if not re.match(Uri.URI_REGEX, uri):
            raise ValueError("The given URI isn't valid")

        # URI scheme is case-insensitive
        self.scheme = uri.split(':')[0].lower()
        self.path = uri[len(self.scheme) + 1:]

        # URI fragments
        self.fragment = None
        if '#' in self.path:
            self.path, self.fragment = self.path.split('#')

        # Query parameters (for instance: http://mysite.com/page?key=value&other_key=value2)
        self.parameters = dict()
        if '?' in self.path:
            separator = '&' if '&' in self.path else ';'
            query_params = self.path.split('?')[-1].split(separator)
            query_params = map(lambda p : p.split('='), query_params)
            self.parameters = {key : value for key, value in query_params}
            self.path = self.path.split('?')[0]

        # For URIs that have a path starting with '//', we try to fetch additional info:
        self.authority = None
        if self.path.startswith('//'):
            self.path = self.path.lstrip('//')
            uri_tokens = self.path.split('/')
            self.authority = uri_tokens[0]
            self.hostname = self.authority
            self.path = self.path[len(self.authority):]

            # Fetching authentication data. For instance: "http://login:[email protected]"
            self.authenticated = '@' in self.authority
            if self.authenticated:
                self.user_information, self.hostname = self.authority.split('@', 1)

            # Fetching port
            self.port = None
            if ':' in self.hostname:
                self.hostname, self.port = self.hostname.split(':')
                self.port = int(self.port)

            # Hostnames are case-insensitive
            self.hostname = self.hostname.lower()

    def serialize_parameters(self):
        """ Returns a serialized representation of the query parameters. """
        return '&'.join('{}={}'.format(key, value) for key, value in sorted(self.parameters.iteritems()))

    def __str__(self):
        """ Outputs the URI as a string """
        uri = '{}:'.format(Uri.encode(self.scheme))
        if self.authority:
            uri += '//'
            if self.authenticated:
                uri += Uri.encode(self.user_information) + '@'
            uri += self.hostname
            if self.port:
                uri += ':{}'.format(self.port)
        uri += self.path
        if self.parameters:
            uri += '?' + self.serialize_parameters()
        if self.fragment:
            uri += '#' + Uri.encode(self.fragment)
        return uri

    def json(self):
        """ JSON serialization of the URI object """
        return json.dumps(self.__dict__, sort_keys=True, indent=2)

    def summary(self):
        """ Summary of the URI object. Mostly for debug. """
        uri_repr = '{}\n'.format(self)
        uri_repr += '\n'
        uri_repr += "* Schema name: '{}'\n".format(self.scheme)
        if self.authority:
            uri_repr += "* Authority path: '{}'\n".format(self.authority)
            uri_repr += " . Hostname: '{}'\n".format(self.hostname)
            if self.authenticated:
                uri_repr += " . User information = '{}'\n".format(self.user_information)
            if self.port:
                uri_repr += " . Port = '{}'\n".format(self.port)
        uri_repr += "* Path: '{}'\n".format(self.path)
        if self.parameters:
            uri_repr += "* Query parameters: '{}'\n".format(self.parameters)
        if self.fragment:
            uri_repr += "* Fragment: '{}'\n".format(self.fragment)
        return uri_repr

Also hosted on GitHub.

All feedback, including failure to respect PEP8 or existence of more "pythonic" methods, is welcome!

Question 2

There is no better place to look than the Python standard library: hg.python.org/cpython/file/2.7/Lib/urlparse.py
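For comparison, here is a minimal sketch of what the standard library offers out of the box. It assumes Python 3, where the urlparse module linked above lives on as urllib.parse; the example URL and its credentials are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration.
parts = urlsplit("http://user:secret@Example.COM:8080/page?key=value&other=2#frag")

print(parts.scheme)            # scheme, lower-cased by urlsplit
print(parts.username)          # user information before the ':'
print(parts.password)          # user information after the ':'
print(parts.hostname)          # host, lower-cased
print(parts.port)              # port, converted to an int
print(parts.path)              # path, case preserved
print(parse_qs(parts.query))   # query string parsed into a dict of lists
print(parts.fragment)          # fragment, without the leading '#'
```

Note that urlsplit() leaves the query as a raw string; parsing it into key/value pairs is a separate step (parse_qs here).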

Question 3

It might seem arrogant, but I don't think urlparse's implementation is the most elegant one. (Other modules in the Python standard library are also horrible, like the ones handling ZIP and TAR files.)

score 5 · Accepted Answer · 2013-09-15 18:38:27Z

You referenced RFC 3986, but I don't think you've tried to follow it.

In your constructor, you immediately lower-case everything. That is obviously wrong. RFC 3986 Sec. 6.2.2.1 says that only the scheme and host portions of URIs are case-insensitive.
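To illustrate the distinction, here is a rough sketch (Python 3, standard library only) that lower-cases the scheme and host while leaving user information, path, query and fragment untouched. normalize_case is a hypothetical helper name, not something from the question's code.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(uri):
    """Lower-case only the scheme and the host[:port] part of a URI."""
    parts = urlsplit(uri)
    # Split off the user information so that its case is preserved.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_case("HTTP://User:PassWord@Example.COM/Some/Path?Key=Value#Frag"))
```

The password, path, query and fragment keep their original case, because those components may be case-sensitive.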

You have an escape() function, but oddly no unescape() function, which I expect would be needed for parsing URIs. Please be aware when implementing unescape() that query strings have special unescaping rules. The RFC uses the term "percent-encoding", so perhaps you should call it "encode" rather than "escape".
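As a small illustration of why query strings need their own decoding step, here is a sketch using Python 3's urllib.parse. It assumes the "special rules" in question are the form-encoding convention in which "+" stands for a space; the sample string is made up.

```python
from urllib.parse import unquote, unquote_plus

encoded = "a%20b+c%2Bd"  # hypothetical sample value

# Generic percent-decoding: '+' is an ordinary character and is kept.
print(unquote(encoded))       # a b+c+d

# Form-style query decoding: '+' is turned into a space before decoding.
print(unquote_plus(encoded))  # a b c+d
```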

Your escape() function only encodes a fixed list of characters, which is dangerous: far more characters require encoding than may be passed through unencoded.
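One way to invert the logic is to whitelist the characters that may pass through and encode everything else. The following is a minimal sketch in Python 3, assuming the input is text that should be encoded as UTF-8; percent_encode and its safe parameter are hypothetical names chosen for this example.

```python
import string

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")


def percent_encode(text, safe=""):
    """Percent-encode every UTF-8 octet that is neither unreserved nor in `safe`."""
    allowed = UNRESERVED | frozenset(safe)
    pieces = []
    for byte in text.encode("utf-8"):
        # Only ASCII octets can ever be allowed through unencoded.
        if byte < 128 and chr(byte) in allowed:
            pieces.append(chr(byte))
        else:
            pieces.append("%{:02X}".format(byte))
    return "".join(pieces)


print(percent_encode("a b"))            # a%20b
print(percent_encode('"quoted"'))       # %22quoted%22
print(percent_encode("café"))           # caf%C3%A9
print(percent_encode("/a/b", safe="/")) # /a/b
```

The double quote and the non-ASCII "é" are both absent from the ESCAPE_CODES table in the question, so a blacklist would pass them through unencoded.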

Be careful when calling split() where you expect at most one separator. You should use split(':', 1), split('@', 1) and split('#', 1) instead.
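A short sketch of the failure mode (Python 3; the input string is a made-up example containing a second '#'):

```python
rest = "//example.com/page#section#2"  # hypothetical input with two '#'

# Without a limit, split() returns three pieces, so two-name unpacking fails.
try:
    path, fragment = rest.split('#')
except ValueError as error:
    print("unpacking failed:", error)

# With maxsplit=1, everything after the first '#' stays in the fragment.
path, fragment = rest.split('#', 1)
print(path)      # //example.com/page
print(fragment)  # section#2
```

str.partition() is another option when the separator may be missing, since it always returns exactly three items.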

Better yet, don't try to split at all. Instead, consistently use regular expression capturing for identifying all parts of the URI. You should be able to make one huge regular expression. All complex regular expressions, including the one you are using now, should have embedded comments.
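As a rough sketch of what such a commented expression could look like, here is the RFC 3986 Appendix B regex rewritten with re.VERBOSE and named groups (Python 3). The group names and the split_uri helper are choices made for this example, not part of the RFC or of the question's code.

```python
import re

URI_PATTERN = re.compile(r"""
    ^
    (?: (?P<scheme> [^:/?#]+ ) : )?        # scheme: everything before the first ':'
    (?: // (?P<authority> [^/?#]* ) )?     # authority: introduced by '//'
    (?P<path> [^?#]* )                     # path: runs up to '?' or '#'
    (?: \? (?P<query> [^#]* ) )?           # query: introduced by '?'
    (?: \# (?P<fragment> .* ) )?           # fragment: introduced by '#'
    $
""", re.VERBOSE)


def split_uri(uri):
    """Return the five generic URI components as a dict (None when absent)."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError("not a single-line URI reference: {!r}".format(uri))
    return match.groupdict()


print(split_uri("http://user:pw@example.com:8080/page?key=value#frag"))
print(split_uri("mailto:bob@example.com?subject=Blah"))
```

Keep in mind that this expression matches almost any single-line string, so it splits a URI into components rather than validating it; the authority would still need a second expression for user information, host and port.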

I quickly skimmed over RFC 3986 and, as you proved, missed many things. I'm going to try to fix those ASAP! I tried to avoid using one big pile of regex (even commented) because I think the goal of this programming exercise was to parse strings in various ways and show good coding style. Using one big regex wouldn't be relevant, IMHO.

Regarding the "percent-encoding": from what I've understood, only reserved characters must be percent-encoded in the URI, and I think I already have all of those covered.

Sec 2.1: "A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component." The reserved set contains a list of common delimiters. There are also plenty of characters outside the allowed set. Appendix A suggests that anything outside the unreserved set should be percent-encoded if it's not being used for a syntactically significant role in the URL.

OK! I forgot about non-ASCII letters and other exotic characters.
