Joining url path components intelligently

Question 1

I'm a little frustrated with the state of URL parsing in Python, although I sympathize with the challenges. Today I just needed a tool to join path parts and normalize slashes without accidentally losing other parts of the URL, so I wrote this:

from urlparse import urlsplit, urlunsplit

def url_path_join(*parts):
    """Join and normalize url path parts with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    # Use the first value for everything but path. Join the path on '/'
    scheme = next((x for x in schemes if x), '')
    netloc = next((x for x in netlocs if x), '')
    path = '/'.join(x.strip('/') for x in paths if x)
    query = next((x for x in queries if x), '')
    fragment = next((x for x in fragments if x), '')
    return urlunsplit((scheme, netloc, path, query, fragment))

As you can see, it's not very DRY, but it does do what I need, which is this:

>>> url_path_join('https://example.org/fizz', 'buzz')
'https://example.org/fizz/buzz'

Another example:

>>> parts=['https://', 'http://www.example.org', '?fuzz=buzz']
>>> '/'.join([x.strip('/') for x in parts]) # Not sufficient
'https:/http://www.example.org/?fuzz=buzz'
>>> url_path_join(*parts)
'https://www.example.org?fuzz=buzz'

Can you recommend an approach that is readable without being even more repetitive and verbose?

Question 2

Why can't you just use os.path.join for taking care of the joining and such?

Question 3

@Blender what if someone runs this code on Windows?

Question 4

Could you please give more examples of the kind of input that requires the code above (i.e. some test cases that must pass for a solution to be acceptable)? What about something simple, such as return '/'.join([x.strip('/') for x in parts])?

Question 5

Sure, how about parts=['https://', 'http://www.example.org', '?fuzz=buzz']?
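
To see why that input defeats a plain '/'.join, it may help to look at how urlsplit decomposes each part. The sketch below is only an illustration: the try/except import is an assumption added so it runs on both Python 2 and Python 3, and split_parts is a hypothetical name.

```python
try:
    from urllib.parse import urlsplit  # Python 3 (assumption: not in the original)
except ImportError:
    from urlparse import urlsplit  # Python 2, as in the question

parts = ['https://', 'http://www.example.org', '?fuzz=buzz']

# Each entry is (scheme, netloc, path, query, fragment) for one part.
split_parts = [tuple(urlsplit(part)) for part in parts]

for part, fields in zip(parts, split_parts):
    print('%r -> %r' % (part, fields))
```

None of the three parts contributes a path, which is why url_path_join returns 'https://www.example.org?fuzz=buzz' with no extra slashes.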

Question 6

Just got burned by urlparse.urljoin today because it culls any existing paths on the first parameter. Who in the hell thought that was a great idea?
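
For anyone who has not hit this yet, here is a minimal sketch of the urljoin behaviour described above. The variable names are hypothetical, and the try/except import is an assumption so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3 (assumption: not in the original)
except ImportError:
    from urlparse import urljoin  # Python 2

# Without a trailing slash, the last path segment of the base is replaced.
no_slash = urljoin('https://example.org/fizz', 'buzz')

# With a trailing slash, the base path is kept.
with_slash = urljoin('https://example.org/fizz/', 'buzz')

# A leading slash on the second argument replaces the whole base path.
absolute = urljoin('https://example.org/fizz/', '/buzz')

print(no_slash)    # https://example.org/buzz
print(with_slash)  # https://example.org/fizz/buzz
print(absolute)    # https://example.org/buzz
```

This follows the relative-reference resolution rules that browsers use for links, which is presumably why it behaves this way, but it is not what you want when simply appending path parts.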

Question 8

I fully agree with Blender. Just use the os.path module. It provides a function to join paths, and it also has functions to normalize pathnames (e.g. os.path.normpath(pathname)), so it can be used on every OS regardless of the path separator.

Question 9

os.path.normpath does not use forward slashes on Windows by default, and it's not intelligent about keeping the double slashes after http://.

Question 10

@kojiro Yes, but you can import posixpath instead of os.path to get the correct slashes. I do believe you are correct about the http://, though.
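
A quick sketch of what this exchange is about, using posixpath so the result does not depend on the OS. The variable names are hypothetical and the URLs are only illustrative.

```python
import posixpath

# Joining works for the simple case...
joined = posixpath.join('https://example.org/fizz', 'buzz')

# ...but a part with a leading slash discards everything before it.
reset = posixpath.join('https://example.org/fizz', '/buzz')

# normpath collapses repeated slashes, including the '//' after the scheme.
normalized = posixpath.normpath('https://example.org//fizz/')

print(joined)      # https://example.org/fizz/buzz
print(reset)       # /buzz
print(normalized)  # https:/example.org/fizz
```

So posixpath fixes the separator problem on Windows, but it still knows nothing about schemes, queries or fragments.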

Adam · Accepted Answer · 2013-03-27 12:16:14Z

I'd suggest the following improvements (in descending order of importance):

1. Extract your redundant generator expression to a function so it only occurs once. To preserve flexibility, introduce default as an optional parameter.
2. This makes the comment redundant because first is a self-documenting name (you could call it first_or_default if you want to be more explicit), so you can remove it.
3. Rephrase your docstring to make it more readable: "normalize" and "with a slash" don't make sense together.
4. PEP 8 advises against aligning variable assignments, as does Clean Code by Robert C. Martin. However, it's more important to be consistent within your project.

def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    scheme = first(schemes)
    netloc = first(netlocs)
    path = '/'.join(x.strip('/') for x in paths if x)
    query = first(queries)
    fragment = first(fragments)
    return urlunsplit((scheme, netloc, path, query, fragment))

def first(sequence, default=''):
    return next((x for x in sequence if x), default)
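
To illustrate what the optional default buys you, here is a small sketch of first on its own. The sample tuples are made up for the example; they mimic what zip(*...) would produce for the scheme and fragment columns.

```python
def first(sequence, default=''):
    """Return the first truthy item of sequence, or default if there is none."""
    return next((x for x in sequence if x), default)

# Hypothetical columns, shaped like the ones url_path_join builds.
schemes = ('', 'https', 'http')
fragments = ('', '', '')

print(repr(first(schemes)))                  # 'https'
print(repr(first(fragments)))                # ''
print(repr(first(fragments, default=None)))  # None
```

url_path_join only ever needs the '' default, since urlunsplit expects strings, but the parameter keeps first usable elsewhere.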

If you're looking for something a bit more radical, why not let first handle several sequences at once? (Unfortunately, Python 2.7 does not let you declare a default parameter after *sequences; Python 3 allows it through keyword-only arguments.)

def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(*(urlsplit(part) for part in parts))
    scheme, netloc, query, fragment = first_of_each(schemes, netlocs, queries, fragments)
    path = '/'.join(x.strip('/') for x in paths if x)
    return urlunsplit((scheme, netloc, path, query, fragment))

def first_of_each(*sequences):
    return (next((x for x in sequence if x), '') for sequence in sequences)
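
On Python 3, the default could come back as a keyword-only argument. The following is a sketch under that assumption, not part of the code above: it imports from urllib.parse instead of urlparse, and it returns a tuple rather than a generator so the result can be inspected more than once.

```python
from urllib.parse import urlsplit, urlunsplit  # Python 3 only


def first_of_each(*sequences, default=''):
    """Return the first truthy item of each sequence, or default if none."""
    return tuple(next((x for x in sequence if x), default)
                 for sequence in sequences)


def url_path_join(*parts):
    """Normalize url parts and join them with a slash."""
    schemes, netlocs, paths, queries, fragments = zip(
        *(urlsplit(part) for part in parts))
    scheme, netloc, query, fragment = first_of_each(
        schemes, netlocs, queries, fragments)
    path = '/'.join(x.strip('/') for x in paths if x)
    return urlunsplit((scheme, netloc, path, query, fragment))


print(url_path_join('https://example.org/fizz', 'buzz'))
print(url_path_join('https://', 'http://www.example.org', '?fuzz=buzz'))
```

Both calls reproduce the outputs given in the question: 'https://example.org/fizz/buzz' and 'https://www.example.org?fuzz=buzz'.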
