I'm a little frustrated with the state of URL parsing in Python, although I sympathize with the challenges. Today I just needed a tool to join path parts and normalize slashes without accidentally losing other parts of the URL, so I wrote this:
from urlparse import urlsplit, urlunsplit

def url_path_join(*parts):
    """Join and normalize url path parts with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    # Use the first value for everything but path. Join the path on '/'
    scheme = next((x for x in schemes if x), '')
    netloc = next((x for x in netlocs if x), '')
    path = '/'.join(x.strip('/') for x in paths if x)
    query = next((x for x in queries if x), '')
    fragment = next((x for x in fragments if x), '')
    return urlunsplit((scheme, netloc, path, query, fragment))
As you can see, it's not very DRY, but it does do what I need, which is this:
>>> url_path_join('https://example.org/fizz', 'buzz')
'https://example.org/fizz/buzz'
Another example:
>>> parts=['https://', 'http://www.example.org', '?fuzz=buzz']
>>> '/'.join([x.strip('/') for x in parts]) # Not sufficient
'https:/http://www.example.org/?fuzz=buzz'
>>> url_path_join(*parts)
'https://www.example.org?fuzz=buzz'
Can you recommend an approach that is readable without being even more repetitive and verbose?
2 Answers
I'd suggest the following improvements (in descending order of importance):
- Extract your redundant generator expression to a function so it only occurs once. To preserve flexibility, introduce default as an optional parameter.
- This makes the comment redundant because first is a self-documenting name (you could call it first_or_default if you want to be more explicit), so you can remove that.
- Rephrase your docstring to make it more readable: "normalize" and "with a slash" don't make sense together.
- PEP 8 suggests not to align variable assignments, and so does Clean Code by Robert C. Martin. However, it's more important to be consistent within your project.
def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    scheme = first(schemes)
    netloc = first(netlocs)
    path = '/'.join(x.strip('/') for x in paths if x)
    query = first(queries)
    fragment = first(fragments)
    return urlunsplit((scheme, netloc, path, query, fragment))

def first(sequence, default=''):
    return next((x for x in sequence if x), default)
If you're looking for something a bit more radical in nature, why not let first handle several sequences at once? (Note that unfortunately, you cannot combine a default parameter with *args in Python 2.7: a parameter declared after *sequences is a syntax error there. Python 3 allows it as a keyword-only argument.)
def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    scheme, netloc, query, fragment = first_of_each(schemes, netlocs, queries, fragments)
    path = '/'.join(x.strip('/') for x in paths if x)
    return urlunsplit((scheme, netloc, path, query, fragment))

def first_of_each(*sequences):
    return (next((x for x in sequence if x), '') for sequence in sequences)
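For readers on Python 3, the parenthetical note above can be sketched as follows. This is only an illustration, not the answerer's code: the keyword-only default parameter on first_of_each is an assumption added here, and the import is switched to urllib.parse, which is where urlsplit and urlunsplit live in Python 3.

```python
# Python 3 only: a parameter after *sequences is a syntax error in 2.7.
from urllib.parse import urlsplit, urlunsplit


def first_of_each(*sequences, default=''):
    """Yield the first truthy item of each sequence, or `default` if none.

    `default` is keyword-only (an assumption of this sketch), so it can
    never be mistaken for one more sequence.
    """
    return (next((x for x in sequence if x), default) for sequence in sequences)


def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    scheme, netloc, query, fragment = first_of_each(schemes, netlocs, queries, fragments)
    path = '/'.join(x.strip('/') for x in paths if x)
    return urlunsplit((scheme, netloc, path, query, fragment))


print(url_path_join('https://example.org/fizz', 'buzz'))
# https://example.org/fizz/buzz
print(list(first_of_each(['', ''], ['', 'a'], default=None)))
# [None, 'a']
```

Because first_of_each returns a generator, it has to be consumed exactly once, which the tuple unpacking in url_path_join does.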
I fully agree with Blender. Just use the os.path module. It provides a function to join paths, and it also has functions to normalize pathnames (e.g. os.path.normpath(pathname)), so it can be used on every OS, whatever separator that OS uses.
Comments:
os.path.normpath does not use forward slashes on Windows by default, and it's not intelligent about keeping the double slashes after http://. (kojiro, Jul 18, 2012 at 14:49)
@kojiro Yes, but you can import posixpath instead of import os.path to get the correct slashes. I do believe you are correct about the http:// though. (Matt, Oct 16, 2012 at 14:15)
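The two comments above can be checked with a short sketch. It only calls posixpath.join and posixpath.normpath from the standard library; the example URLs are made up for illustration.

```python
import posixpath

# posixpath always uses forward slashes, regardless of the host OS.
print(posixpath.join('https://example.org/fizz', 'buzz'))
# https://example.org/fizz/buzz

# A part that starts with '/' is treated as absolute and discards
# everything before it, including the scheme and host.
print(posixpath.join('https://example.org/fizz', '/buzz'))
# /buzz

# normpath collapses repeated slashes, so the '//' after the scheme is lost.
print(posixpath.normpath('http://www.example.org//fizz/../buzz'))
# http:/www.example.org/buzz
```

So posixpath fixes the separator concern, but it still treats the whole URL as a file path, which is the problem kojiro points out.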
[...] os.path.join for taking care of the joining and such?
[...] return '/'.join([x.strip('/') for x in parts])
[...] parts=['https://', 'http://www.example.org', '?fuzz=buzz']?
[...] urlparse.urljoin today because it culls any existing paths on the first parameter. Who in the hell thought that was a great idea?
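The urljoin behaviour that the last comment complains about can be reproduced with a small sketch. The URLs are illustrative; the import fallback is there only so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Without a trailing slash, the last path segment of the base is replaced.
print(urljoin('https://example.org/fizz', 'buzz'))
# https://example.org/buzz

# With a trailing slash, the relative part is appended instead.
print(urljoin('https://example.org/fizz/', 'buzz'))
# https://example.org/fizz/buzz

# A reference starting with '/' replaces the whole base path.
print(urljoin('https://example.org/fizz/', '/buzz'))
# https://example.org/buzz
```

This follows the relative-reference resolution rules that browsers apply to links, which is why urljoin is a poor fit when the goal is simply to concatenate path parts.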