and www


I want to locate URLs without protocols in a text and then add the protocol in front of them. That means I don't want URLs that begin with http(s):// or http(s)://www., only the bare kind such as example.com. I'm aware that I might accidentally match any text1.text2 where a space is missing after a period, so I came up with some rules to make the matches look more like actual URLs:

(?<=^|\s)(\w*-?\w+\.[a-z]{2,}\S*)

(?<=^|\s) The URL must start the line or follow whitespace.
\w*-?\w+ The domain part, which may or may not contain a dash (-). Because the match has to start at the beginning of a line or after whitespace, URLs with a protocol are excluded.
[a-z]{2,} The extension: at least 2 lowercase letters.
\S* The rest of the URL.

It correctly matches example.com and example.com/x1/x2 but not https://example.com. Still, I think it's a bit clumsy, and it fails when a . or , directly follows the URL, because the punctuation becomes part of the match.
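The behavior described above can be reproduced with a minimal sketch. Python is an assumption here, since the question names no language, and the helper name `find_bare_urls` is hypothetical. Python's `re` module rejects `(?<=^|\s)` because its lookbehinds must have a fixed width, so the sketch substitutes `(?<!\S)`, which accepts the same positions.

```python
import re

# Python's re refuses (?<=^|\s) (variable-width lookbehind),
# so (?<!\S) stands in for it; the rest is the pattern from the question.
PATTERN = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*)")

def find_bare_urls(text):
    """Return every protocol-less URL candidate found in text."""
    return [m.group(1) for m in PATTERN.finditer(text)]

sample = "Visit example.com/x1/x2 or https://example.com, then example.org."
print(find_bare_urls(sample))
# → ['example.com/x1/x2', 'example.org.']
```

The second result shows the trailing-punctuation problem: the final dot of the sentence is swallowed by `\S*`.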

How can I achieve the same goal more elegantly? I don't need to match URLs like 1.1.1.1. Are there loopholes in the rules above that I haven't considered yet?

edited Aug 31, 2021 at 1:53

mdfst13

asked Aug 31, 2021 at 1:45

tnthpp66


Duplicate of this Stack Overflow question: stackoverflow.com/questions/3809401/…

– Zachary Vance
Commented Sep 1, 2021 at 8:00

Use (?<!\S) in place of (?<=^|\s); this simple negation avoids the alternation. If you want to avoid a dot that ends a sentence, change the last \S* to \S*(?<![.]). But whatever you do, don't dream: it can't be perfect, even if your pattern fully and precisely describes the URL syntax.

– Casimir et Hippolyte
Commented Sep 2, 2021 at 17:29
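The two changes suggested in this comment can be tried out with a short sketch, again assuming Python. The names `NO_TRAILING_DOT`, `NO_TRAILING_PUNCT`, and `bare_urls` are hypothetical, and the second pattern, which also rejects a trailing comma, is an assumption that goes beyond what the comment proposes.

```python
import re

# As suggested in the comment: (?<!\S) replaces (?<=^|\s), and a negative
# lookbehind after \S* stops the match from ending on a dot.
NO_TRAILING_DOT = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*(?<![.]))")

# Assumption beyond the comment: refuse a trailing comma as well.
NO_TRAILING_PUNCT = re.compile(r"(?<!\S)(\w*-?\w+\.[a-z]{2,}\S*(?<![.,]))")

def bare_urls(pattern, text):
    """Return every match of pattern in text."""
    return [m.group(1) for m in pattern.finditer(text)]

text = "Try example.org. Or example.com/x1, maybe."
print(bare_urls(NO_TRAILING_DOT, text))
# → ['example.org', 'example.com/x1,']
print(bare_urls(NO_TRAILING_PUNCT, text))
# → ['example.org', 'example.com/x1']
```

The lookbehind forces `\S*` to backtrack until the match no longer ends on an excluded character, so any number of trailing dots is dropped.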


1 Answer


For elegance, I would put at least one line of comment pointing to this: https://datatracker.ietf.org/doc/html/rfc1034#section-3.5

You will notice that it places no length limit on the top-level domain; a rule requiring more than two letters would exclude Belgian, Dutch, and French domains (.be, .nl, and .fr).

It's unclear whether your regex deals well with subdomains.

Personally, I would break the regex out into its components, following the URL I provided.
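One way to read this advice is sketched below, assuming Python. The component names (`LABEL`, `DOMAIN`, `REST`, `BARE_URL`) and the helper `add_protocol` are hypothetical, the label rule is modeled only loosely on the RFC 1034 section 3.5 grammar, and the handling of paths and trailing punctuation is an assumption that the RFC does not cover.

```python
import re

# Loosely after RFC 1034 section 3.5: a label starts with a letter, ends with
# a letter or digit, may contain hyphens in between, and has at most 63 characters.
LABEL = r"[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?"
# Two or more dot-separated labels, so subdomains and two-letter TLDs both match.
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"
# Assumption: an optional path, query, or fragment that may not end on . or ,
REST = r"(?:[/?#]\S*)?(?<![.,])"

# (?<!\S) keeps the match at the start of the text or right after whitespace,
# which also skips URLs that already carry a protocol.
BARE_URL = re.compile(rf"(?<!\S)({DOMAIN}{REST})", re.IGNORECASE)

def add_protocol(text, protocol="https://"):
    """Prefix every protocol-less URL candidate in text with protocol."""
    return BARE_URL.sub(lambda m: protocol + m.group(1), text)

print(add_protocol("Go to sub.example.be. Or https://example.com stays."))
# → Go to https://sub.example.be. Or https://example.com stays.
```

Naming the parts makes each rule easy to check against the RFC on its own. The sketch still treats any word.word run such as file.txt as a domain, so false positives remain possible.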

answered Aug 31, 2021 at 9:37

konijn
