I wrote a function to check whether a string is a path or a URL:
    import os

    def isPath(s):
        """
        @param s string containing a path or url
        @return True if it's a path, False if it's an url'
        """
        if os.path.exists(s): # if a file with name s exists, we don't check any further and just return True
            return True
        elif s.startswith("/"): # clearly a path, urls never start with a slash
            return True
        elif "://" in s.split(".")[0]: # if a protocol is present, it's an url
            return False
        elif "localhost" in s: # special case for localhost domain name where splits on . would fail
            return False
        elif len(s.split("/")[0].split(".")) > 1: # dots before the first slash, normally separating TLD and domain name
            return False
        elif len(s.split("/")[0].split(":")) > 1: # if colons are present, either it's a IPv6 adress or there is a port number
            return False
        else: # all possible cases of an url checked, so it must be a path
            return True

Did I miss any cases of a URL or a path, and can the function be improved in some way?
2 Answers
It seems to me that the approach can be simplified. You could check whether the string begins with http:// or https:// (plus ftp:// and others, if you wish).
FYI, the startswith method also accepts a tuple of values, which is useful for testing several prefixes in one call. So I would start by checking this condition (protocol presence).
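As a rough sketch of that first check, something like the following could work. The names URL_PREFIXES and has_url_prefix are made up for illustration, and the list of prefixes is only an assumption about which protocols matter to you.

```python
# Hypothetical prefix list; extend it with whatever protocols you need.
URL_PREFIXES = ("http://", "https://", "ftp://")


def has_url_prefix(s):
    """Return True if s starts with one of the known protocol prefixes."""
    # str.startswith accepts a tuple and returns True if any entry matches.
    # Lower-casing first makes the check case-insensitive.
    return s.lower().startswith(URL_PREFIXES)


print(has_url_prefix("https://example.com/index.html"))  # True
print(has_url_prefix("/usr/local/bin"))                  # False
print(has_url_prefix("www.somesite.com"))                # False
```

Note that the last call returns False: a host name without a protocol is not caught by this check.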
However, my impression is that you also want to handle URLs that are not prefixed by a protocol, e.g. www.somesite.com. But this could just as well be a directory or a file; you never know. Tools like wget or httrack will even create subdirectories named after the host names being crawled. Sqlmap too.
You could check with os.path.exists whether www.somesite.com actually exists on your system, but that check would be made against a relative path. Relative to what? The current working directory?
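To illustrate the relative-path issue, here is a small sketch that makes the base directory explicit instead of relying on the current working directory. The helper name exists_relative_to is hypothetical, and the temporary directory only stands in for wherever a crawler might have written its output.

```python
import os
import tempfile


def exists_relative_to(name, base_dir):
    """Check for name inside an explicit base_dir, not the working directory."""
    return os.path.exists(os.path.join(base_dir, name))


with tempfile.TemporaryDirectory() as tmp:
    # Create a directory whose name looks like a host name, as a crawler might.
    os.mkdir(os.path.join(tmp, "www.somesite.com"))

    # Found, because we look inside tmp explicitly.
    print(exists_relative_to("www.somesite.com", tmp))  # True

    # This result depends entirely on the current working directory.
    print(os.path.exists("www.somesite.com"))
```

The same string can therefore be classified differently depending on where the script is started from.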
Some of your assumptions are not safe, for example:
if colons are present, either it's a IPv6 adress or there is a port number
On Linux at least, file names can perfectly well contain colons. On Windows this character may be invalid in file names; I'm not sure and can't test right now. In any case, operating systems, and even file systems, differ in this regard.
"Some of your assumptions are not safe, for example: if colons are present, either it's a IPv6 adress or there is a port number" - It is safe, because I check earlier in my code whether a file with name s exists, in which case my code immediately returns True (is a path). – TheEagle, Apr 3, 2021 at 11:05
Some high-level comments:
- Checking whether a string is a URL (or a path) is usually done with regular expressions that can be found online. I'm not sure what the scope of this function is, but it can be seen as reinventing the wheel, and it might miss some uncommon cases.
- The method name gives the impression that returning True means it's a path and False means it's not, whereas in fact False means the string is a URL. I would split this function into two: isPath, where False just means that the given string is not a path, and isUrl, where False means it's not a URL. Or, at the very least, rename it to isPathOrUrl.
- It's a personal preference, but I'd change each elif to a plain if: every conditional body returns something, so there is no need for else.
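The last two points could look roughly like the sketch below. It uses snake_case names (is_url, is_path) in place of isUrl and isPath, and the rules inside each function are only assumptions for illustration: is_url relies on urllib.parse.urlsplit finding both a scheme and a host part, which is a swapped-in technique rather than the original checks, and is_path keeps just the first two checks from the question.

```python
import os
from urllib.parse import urlsplit


def is_url(s):
    """Return True only if s has both a scheme and a host part."""
    try:
        parts = urlsplit(s)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # e.g. an unclosed IPv6 bracket.
        return False
    return bool(parts.scheme) and bool(parts.netloc)


def is_path(s):
    """Return True if s exists on disk or starts with a slash."""
    if os.path.exists(s):
        return True
    if s.startswith("/"):
        return True
    # Falling through means "not recognised as a path", not "is a URL".
    return False


print(is_url("https://example.com/index.html"))  # True
print(is_url("www.somesite.com"))                # False
print(is_path("/usr/local/bin"))                 # True
```

With two functions, a string such as www.somesite.com can be reported as neither a URL nor a path, and the caller decides how to treat it.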
file:///path/to/file is a perfectly valid path, as is .dir/file