I want to locate URLs without protocols in the text, and then add the protocol before them. This means I don't want URLs that begin with `http(s)://` or `http(s)://www.`, only the kind of `example.com`. I'm aware that I might accidentally match any `text1.text2` if I forgot to add a space after a period, so I came up with some rules to make it more like an actual URL:
`(?<=^|\s)(\w*-?\w+\.[a-z]{2,}\S*)`
`(?<=^|\s)`: The URL should be after the newline or a space.
`\w*-?\w+`: The domain part, could have a dash (-) or not. Since it's after a newline or space, it removes the protocol.
`[a-z]{2,}`: The extension, should be more than 2 letters.
`\S*`: The rest of the URL.
It works well to match `example.com` or `example.com/x1/x2` and not `https://example.com`. But I think it's a bit clumsy, and it fails if there is a `.` or `,` after the URL, because the trailing punctuation becomes part of the match.
How can I achieve the same goal more elegantly? I don't need to match URLs like `1.1.1.1`. Are there some loopholes in the above rules that I haven't yet considered?
1 Answer
For elegance, I would put at least one line of comment pointing to this: https://datatracker.ietf.org/doc/html/rfc1034#section-3.5
You will notice that there is no length limit on the top-level domain (Belgian, Dutch, and French domains would like to have a word with .be, .nl, and .fr).
It's unclear whether your regex deals well with subdomains.
Personally, I would break out the regex into its components, following the URL I provided.
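A minimal sketch of what such a breakdown could look like in Python, not a definitive implementation. The names `LABEL`, `TLD`, `REST`, and `BARE_URL` are made up for illustration, and the alphabetic-TLD rule, the optional path/query/fragment tail, and the trailing `.`/`,` guard are assumptions layered on top of the RFC 1034 label syntax:

```python
import re

# RFC 1034, section 3.5: a label starts with a letter, ends with a letter
# or digit, and has only letters, digits, or hyphens in between.
LABEL = r"[a-z](?:[a-z0-9-]*[a-z0-9])?"
TLD = r"[a-z]{2,}"        # assumption: alphabetic TLD, two letters or more
REST = r"(?:[/?#]\S*)?"   # assumption: optional path, query, or fragment

BARE_URL = re.compile(
    rf"""
    (?<!\S)               # not preceded by a non-space character
    (?:{LABEL}\.)+        # one or more labels, so subdomains are covered
    {TLD}                 # top-level domain
    (?![a-z0-9-])         # the TLD must not continue (rejects text1.text2)
    {REST}                # whatever follows the host name
    (?<![.,])             # do not end on sentence punctuation
    """,
    re.VERBOSE | re.IGNORECASE,
)

print(BARE_URL.findall("see www.example.com/x1 now"))
print(BARE_URL.search("https://example.com"))
```

Naming the pieces keeps the comment that points to the RFC next to the rule it justifies, and each piece can be tightened or loosened on its own.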
`(?<!\S)` in place of `(?<=^|\s)` (with this simple negation you avoid the alternation). If you want to avoid a dot (that ends a sentence), change the last `\S*` to `\S*(?<![.])`. (But whatever you do, don't dream: it can't be perfect even if your pattern fully and precisely describes the URL syntax.)
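As a rough illustration, here is the question's pattern with those two changes applied, used to prepend a protocol in Python. The helper name `add_protocol` and the default `https://` scheme are assumptions, since the question names neither. Python's `re` module only accepts fixed-width lookbehind, so `(?<=^|\s)` does not compile there, while `(?<!\S)` does:

```python
import re

# The question's pattern with (?<!\S) as the left boundary and
# (?<![.]) so the match cannot end on a dot.
PATTERN = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*(?<![.]))")


def add_protocol(text: str, scheme: str = "https://") -> str:
    """Prefix every bare URL found in text with scheme (assumed https)."""
    return PATTERN.sub(lambda m: scheme + m.group(1), text)


print(add_protocol("Go to example.com/x1. Or https://example.com"))
```

To handle a trailing comma as well, the final lookbehind could be widened to `(?<![.,])`.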