[Python-Dev] urlparse brokenness

Wed Nov 23 06:04:55 CET 2005

It is my assertion that urlparse is currently broken. Specifically, I 
think that urlparse breaks an abstraction boundary with ill effect.
In writing a mail client, I wished to allow my users to specify their
IMAP server as a URL, such as 'imap://user:password@host:port/', which
worked fine. I then thought that the natural extension to support
configuration of imapssl would be 'imaps://user:password@host:port/'...
which failed: user:password@host:port got parsed as the *path* of
the URL instead of the network location. It turns out that urlparse
keeps a table of URL schemes that 'use netloc', that is to say,
that have a 'user:password@host:port' part to their URL.

I think this 'special knowledge' about particular schemes 1) breaks an
abstraction boundary by having a function whose charter is to pull apart a
particularly-formatted string behave differently based on the meaning of
the string instead of the structure of it and 2) fails to be extensible
or forward compatible due to hardcoded 'magic' strings. If schemes were
somehow 'registerable' as 'netloc using' or not, then the second objection
might be nullified, but the first would still stand.
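
To make the complaint concrete, here is a deliberately simplified model of a
table-driven splitter. It is not the urlparse source: KNOWN_NETLOC_SCHEMES and
table_split are hypothetical names, and the scheme list is an assumption chosen
only to reproduce the behaviour described above.

```python
# Hypothetical, simplified model of a splitter that consults a scheme table.
# KNOWN_NETLOC_SCHEMES and table_split are illustrative names only; they are
# not taken from urlparse.
KNOWN_NETLOC_SCHEMES = {"http", "https", "ftp", "imap"}


def table_split(url):
    """Return (scheme, netloc, rest); netloc is extracted only for listed schemes."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        # No scheme at all: treat the whole string as the path.
        return "", "", url
    netloc = ""
    if scheme in KNOWN_NETLOC_SCHEMES and rest.startswith("//"):
        # The netloc runs from just after '//' up to the next '/'.
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        netloc, rest = rest[2:end], rest[end:]
    return scheme, netloc, rest


print(table_split("imap://user:password@host:143/"))
print(table_split("imaps://user:password@host:993/"))
```

The two URLs have exactly the same shape, yet only the first gets a netloc;
the second keeps everything after the colon as its path, purely because
'imaps' is missing from the table.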
So I propose that urlsplit, the main offender, be replaced with something
that looks like:

def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    key = url, scheme, allow_fragments, default
    cached = _parse_cache.get(key, None)
    if cached:
        return cached
    if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
        clear_cache()
    if "://" in url:
        uscheme, npqf = url.split("://", 1)
    else:
        uscheme = scheme
        if not uscheme:
            uscheme = default[0]
        npqf = url
    pathidx = npqf.find('/')
    if pathidx == -1: # not found
        netloc = npqf
        # no path at all, so path, query and fragment take their defaults
        path, query, fragment = default[2:5]
    else:
        netloc = npqf[:pathidx]
        pqf = npqf[pathidx:]
        if '?' in pqf:
            path, qf = pqf.split('?', 1)
            if allow_fragments and '#' in qf:
                query, fragment = qf.split('#', 1)
            else:
                # keep the query even when there is no fragment
                query, fragment = qf, default[4]
        else:
            path = pqf
            query, fragment = default[3:5]
    result = (uscheme, netloc, path, query, fragment)
    _parse_cache[key] = result
    return result

Note that I'm not sold on the _parse_cache, but I'm assuming it was there
for a reason so I'm leaving that functionality as-is.
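
For comparison, the same structure-only idea can be sketched without the cache
or the defaults tuple. This is a condensed illustration rather than the
proposed patch: structural_split is a hypothetical name, and unlike the
function above it strips the fragment before the query.

```python
def structural_split(url, default_scheme=""):
    """Split a URL by structure alone; no scheme table is consulted."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No '://' present: fall back to the caller-supplied scheme.
        scheme, rest = default_scheme, url
    # Peel components off from the right: fragment, then query.
    rest, _, fragment = rest.partition("#")
    rest, _, query = rest.partition("?")
    # Whatever precedes the first '/' is the network location.
    netloc, slash, path = rest.partition("/")
    return scheme, netloc, slash + path, query, fragment


print(structural_split("imap://user:password@host:143/"))
print(structural_split("imaps://user:password@host:993/inbox?x=1#top"))
```

Both schemes are handled identically here, because nothing in the function
depends on what the scheme means.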
If this isn't the right forum for this discussion, or the right place to 
submit code, please let me know. Also, please cc: me directly on responses
as I'm not subscribed to the firehose that is python-dev.
 --pj