\$\begingroup\$

I want to locate URLs without protocols in a text and then add the protocol in front of them. That means I don't want URLs that begin with http(s):// or http(s)://www., only the bare example.com kind. I'm aware that I might accidentally match any text1.text2 where the space after a period was forgotten, so I came up with some rules to make a match look more like an actual URL:

(?<=^|\s)(\w*-?\w+\.[a-z]{2,}\S*)

  • (?<=^|\s) The URL must follow the start of a line or a whitespace character.
  • \w*-?\w+ The domain part, which may contain a dash (-). Because the match has to begin right after the start of a line or whitespace, URLs that begin with a protocol are excluded.
  • [a-z]{2,} The extension, which must be two or more letters.
  • \S* The rest of the URL.

It matches example.com and example.com/x1/x2 and skips https://example.com, as intended. But I think it's a bit clumsy, and it fails when a . or , directly follows the URL, because the trailing \S* swallows that punctuation into the match.
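The behaviour described above can be reproduced with a short sketch. The question does not name a regex flavour, so Python is an assumption here. Python's re module rejects the variable-width lookbehind (?<=^|\s), so the sketch substitutes (?<!\S), which accepts the same positions. The helper name find_bare_urls is made up for illustration.

```python
import re

# Assumption: Python's `re` only allows fixed-width lookbehinds, so the
# original (?<=^|\s) is replaced by the equivalent (?<!\S).
PATTERN = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*)")


def find_bare_urls(text: str) -> list[str]:
    """Return every substring the pattern treats as a protocol-less URL."""
    return PATTERN.findall(text)


samples = [
    "visit example.com today",
    "see example.com/x1/x2 for details",
    "already https://example.com here",
    "I like example.com, really.",  # the trailing comma ends up in the match
]
for sample in samples:
    print(repr(sample), "->", find_bare_urls(sample))
```

The last sample shows the punctuation problem: the comma is part of the returned match.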

How can I achieve the same goal more elegantly? I don't need to match URLs like 1.1.1.1. Are there loopholes in the above rules that I haven't considered yet?

mdfst13
asked Aug 31, 2021 at 1:45
\$\endgroup\$
  • Duplicate of this Stack Overflow question: stackoverflow.com/questions/3809401/… Commented Sep 1, 2021 at 8:00
  • Use (?<!\S) in place of (?<=^|\s); this simple negation avoids the alternation. If you want to exclude a dot that ends a sentence, change the last \S* to \S*(?<![.]). But whatever you do, don't dream: it can't be perfect, even if your pattern fully and precisely describes the URL syntax. Commented Sep 2, 2021 at 17:29

1 Answer

\$\begingroup\$

For elegance, I would add at least one comment line pointing to this: https://datatracker.ietf.org/doc/html/rfc1034#section-3.5

You will notice that the RFC puts no length limit on the top-level domain, and two-letter ones are common (Belgian, Dutch, and French domains would like to have a word: .be, .nl, and .fr).

It's unclear whether your regex deals well with subdomains.

Personally, I would break the regex out into its components, following the RFC section I linked.
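As a rough sketch of what breaking the regex into components could look like, here is one possibility in Python (the language is an assumption, since the question names none). The label rule follows the grammar in RFC 1034 section 3.5. The alphabetic two-or-more-letter TLD, the https:// default, and the names LABEL, TLD, BARE_URL, and add_protocol are illustrative assumptions rather than anything the RFC or the question prescribes. The leading (?<!\S) and the trailing punctuation lookbehind are borrowed from the comment under the question.

```python
import re

# One DNS label per RFC 1034 section 3.5: starts with a letter, ends with a
# letter or digit, hyphens only in the interior, at most 63 characters.
LABEL = r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# Assumption (not from the RFC): the last label is alphabetic, 2+ letters.
TLD = r"[A-Za-z]{2,}"

BARE_URL = re.compile(
    rf"""
    (?<!\S)             # start of text, or preceded by whitespace
    (                   # group 1: the whole protocol-less URL
      (?:{LABEL}\.)+    # one or more labels, each followed by a dot
      {TLD}             # top-level domain
      (?:/\S*)?         # optional path
    )
    (?<![.,])           # do not end the match on sentence punctuation
    """,
    re.VERBOSE,
)


def add_protocol(text: str, scheme: str = "https://") -> str:
    """Prefix every bare URL in `text` with `scheme` (assumed default)."""
    return BARE_URL.sub(lambda m: scheme + m.group(1), text)


print(add_protocol("See example.com, or sub.example.co.uk/x1/x2."))
print(add_protocol("Keep https://example.com and 1.1.1.1 as they are."))
```

Because every label must start with a letter, 1.1.1.1 is skipped, and subdomains are handled by repeating the label component.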

answered Aug 31, 2021 at 9:37
\$\endgroup\$
