Regular expression to find URLs within a string

Question 1

Does anyone know of a regular expression I could use to find URLs within a string? I've found a lot of regular expressions on Google for determining if an entire string is a URL but I need to be able to search an entire string for URLs. For example, I would like to be able to find www.google.com and http://yahoo.com in the following string:

Hello www.google.com World http://yahoo.com

I am not looking for specific URLs in the string. I am looking for ALL of the URLs in the string which is why I need a regular expression.

Question 2

For PHP: preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match); from stackoverflow.com/q/910912/1066234

Question 3

Your example misses the case where the protocol is not set: //www.google.fr

Question 4

This is the one I use

(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])

Works for me, should work for you too.
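
For reference, here is a minimal Python sketch of running this pattern over the sample string from the question. The names URL_RE and find_urls are illustrative and not part of the answer. The pattern contains capturing groups, so re.findall would return tuples; the sketch uses finditer and takes the whole match instead.

```python
import re

# Pattern copied from the answer above; URL_RE and find_urls are illustrative names.
URL_RE = re.compile(
    r"(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))"
    r"([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])"
)


def find_urls(text):
    """Return every full match of URL_RE found in text."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(find_urls("Hello www.google.com World http://yahoo.com"))
```

On the question's sample string this finds http://yahoo.com only, because the pattern requires a protocol prefix.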

Question 5

Don't forget to escape the forward slashes.

Question 6

It's 2017, and unicode domain names are all over the place. \w may not match international symbols (depends on regex engine), the range is needed instead: a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF.
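
Whether \w covers international characters does depend on the engine. As a small illustration in Python 3, where \w is Unicode-aware for str patterns unless re.ASCII is passed (the sample text is made up):

```python
import re

# Made-up sample containing non-ASCII letters.
text = "http://münchen.example/straße"

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

In engines where \w is ASCII-only by default, an explicit range like the one in the comment above is one way to get similar coverage.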

Question 7

This is fine for general purposes, but there are many cases that it doesn't catch. It enforces that your links are prefixed with a protocol. If you choose to ignore protocols, the endings of email addresses are matched too, as is the case with [email protected].

Question 8

Shouldn't [\w_-] be [\w-]? \w already matches _, per the Mozilla docs.
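
A quick way to check this in Python, where \w also includes the underscore (the sample string is made up):

```python
import re

with_underscore = re.compile(r"[\w_-]+")
without_underscore = re.compile(r"[\w-]+")

# Made-up sample containing an underscore and a hyphen.
sample = "my_host-name"

# Both classes match the same text because \w already covers "_".
same = with_underscore.findall(sample) == without_underscore.findall(sample)
print(same)
```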

Question 9

Upvoted, but this answer does not handle what the question asks for, e.g. www.yahoo.com:

"""(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?""".r.findAllIn("www.google.com").toList

It also lacks an explanation.

Question 10

Guess no regex is perfect for this use. I found a pretty solid one here

(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])

Some differences / advantages compared to the other ones posted here:

It does not match email addresses
It does match localhost:12345
It won't detect something like moo.com without http or www
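
A rough Python sketch of the three points above. The pattern only lists A-Z, so I am assuming it is meant to run case-insensitively (re.IGNORECASE here). The sample text and the name URL_RE are made up for illustration.

```python
import re

# Pattern copied from the answer; re.IGNORECASE is an assumption because
# the character classes only list upper-case letters.
URL_RE = re.compile(
    r"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
    re.IGNORECASE,
)

# Made-up sample: an email address, a localhost URL, a www URL, a bare domain.
text = "Mail bob@example.com, open http://localhost:12345 or www.google.com. Not moo.com."

found = URL_RE.findall(text)
print(found)
```

The pattern has no capturing groups, so findall returns the full matches directly. The email address and the bare moo.com are skipped, and the trailing period after www.google.com is left out of the match.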

See here for examples

Question 11

It matches www.e, which is not a valid URL.

Question 12

The g option isn't valid in all regular expression implementations (e.g. Ruby's built-in implementation).

Question 13

Your regex misses the case where the protocol is not set: //www.google.fr

Question 14

import re

text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd
The code below catches all urls in text and returns urls in list."""
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)

Output:

[
 'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string', 
 'www.google.com', 
 'facebook.com',
 'http://test.com/method?param=wasd',
 'http://test.com/method?param=wasd&params2=kjhdkjshd'
]

Question 15

Kotlin: val urlRegex = "(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-?=%.]+"

Question 16

Misses & parameters in the url. e.g. http://test.com/method?param=wasd&param2=wasd2 misses param2

Question 17

also lacks support for URLs with #

Question 18

@TrophyGeek I think you just copied the regex from the first comment, and Akshay forgot to include the &. The right version would be: val urlRegex = "(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-&?=%.]+"
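
To make the difference concrete, here is a small Python comparison of the two character classes, with and without & in the second one. The names WITHOUT_AMP and WITH_AMP are illustrative.

```python
import re

# Second character class without "&" (as in the Kotlin comment above).
WITHOUT_AMP = re.compile(r"(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+")
# Second character class with "&" (as in the Python answer).
WITH_AMP = re.compile(r"(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+")

url = "http://test.com/method?param=wasd&param2=wasd2"

print(WITHOUT_AMP.findall(url))  # stops at the first "&"
print(WITH_AMP.findall(url))     # keeps the second parameter
```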

Question 19

This also thinks hello... is a URL
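
This is easy to reproduce in Python with the pattern from the answer. The sample sentence is made up.

```python
import re

# Pattern copied from the Python answer above.
URL_RE = re.compile(r"(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+")

# An ellipsis satisfies "chars, a dot, more chars" because "." is in both classes.
matches = URL_RE.findall("hello... world")
print(matches)
```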

Question 20

Wrote one up myself:

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?/gm

It works on ALL of the following domains:

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
shop.facebook.org/derf.html

You can see how it performs here on regex101 and adjust as needed

Question 21

Your regex missed this when I tested it. It only caught part of the URL: shop.facebook.org/derf.html

Question 22

@DavidRector Thanks! You are absolutely correct. I have updated the regex string and regex101 url based on your feedback. Added a \. at the end of the second last pair of square brackets [ ]

Question 23

This also matches any string of the form alphanumeric_char.alphanumeric_char, for example a.r, b.4, 7.e, etc. These aren't valid URLs.

Question 24

Unfortunately this also matches times - 09:00
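
A short Python check of the false positives mentioned in the last two comments. The pattern is the JavaScript one from the answer with the delimiters and flags dropped, and I am assuming it behaves the same way under Python's re for these inputs.

```python
import re

# Ported from the JavaScript literal in the answer (delimiters and /gm removed).
URL_RE = re.compile(
    r"([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?"
)

# Strings that are clearly not URLs but still match in full.
candidates = ["09:00", "a.r", "b.4", "7.e"]
false_positives = [s for s in candidates if URL_RE.fullmatch(s)]
print(false_positives)
```

Requiring a protocol, or a minimum length for the last label, would be one way to cut these down, at the cost of missing bare domains.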

Question 25

Your regex misses the case where the protocol is not specified, as in //www.leboncoin.fr

Question 26

None of the solutions provided here solved the problems/use-cases I had.

What I have provided here, is the best I have found/made so far. I will update it when I find new edge-cases that it doesn't handle.

\b
 #Word cannot begin with special characters
 (?<![@.,%&#-])
 #Protocols are optional, but take them with us if they are present
 (?<protocol>\w{2,10}:\/\/)?
 #Domains have to be of a length of 1 chars or greater
 ((?:\w|\&\#\d{1,5};)[.-]?)+
 #The domain ending has to be between 2 to 15 characters
 (\.([a-z]{2,15})
 #If no domain ending we want a port, only if a protocol is specified
 |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])
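
For anyone trying this in Python: a sketch of the same pattern under re.VERBOSE. The only change I made is the named group syntax, (?P<protocol>...) instead of (?<protocol>...), since Python's re does not accept the latter. The name URL_RE and the sample strings are made up.

```python
import re

URL_RE = re.compile(
    r"""
    \b
    (?<![@.,%&#-])                     # word cannot begin with special characters
    (?P<protocol>\w{2,10}:\/\/)?       # protocol is optional
    ((?:\w|\&\#\d{1,5};)[.-]?)+        # domain, one or more characters
    (\.([a-z]{2,15})                   # domain ending of 2 to 15 letters
    |(?(protocol)(?:\:\d{1,6})|(?!)))  # otherwise a port, only with a protocol
    \b
    (?![@])                            # word cannot end with @ (catches emails)
    (\/)?
    (?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
    (?<![.,?!-])                       # last char cannot be one of . , ? ! -
    """,
    re.VERBOSE,
)


def find_urls(text):
    """Return the full text of every match (the pattern has capturing groups)."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(find_urls("see http://example.com today"))
print(find_urls("mail me@example.org now"))
print(find_urls("dev server at http://localhost:3000"))
```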

Question 27

Is there any way to make this JavaScript-friendly? Named capturing groups are not fully supported there, so the conditional check on the protocol group does not work.

Question 28

@einord, I know this is way late, but you can just remove the named portion of the capturing group and it works fine in JS.

/\b(?<![@.,%&#-])(\w{2,10}:\/\/)?((?:\w|&#\d{1,5};)[.-]?)+(\.([a-z]{2,15})|((?::\d{1,6})|(?!)))\b(?![@])(\/)?(?:([\w\d?\-=#:%@&.;])+(?:\/(?:([\w\d?\-=#:%@&;.])+))*)?(?<![.,?!-])/g

Question 29

I think this regex (regular expression) pattern handles precisely what you want:

(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?

And this is an example snippet to extract URLs:

// The Regular Expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";
// Find all URLs in the text; matches are collected in $matches
preg_match_all($reg_exUrl, $text, $matches);
var_dump($matches);

Question 30

Your regex misses the case where the protocol is not set: //www.google.fr

Question 31

If you have to be strict on selecting links, I would go for:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

For more information, read this:

An Improved Liberal, Accurate Regex Pattern for Matching URLs

Question 32

Don't do that. regular-expressions.info/catastrophic.html It'll kill your app...

Question 33

None of the above answers match Unicode characters in a URL, for example: http://google.com?query=đức+filan+đã+search

For the solution, this one should work:

(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9\u00a1-\uffff0-]{2,}\.[a-zA-Z0-9\u00a1-\uffff0-]{2,}(\S*)

Question 34

Unicode characters were forbidden as per the RFC 1738 on URLs (faqs.org/rfcs/rfc1738.html). They would have to be percent encoded to be standards compliant - although I think it may have changed more recently - worth reading w3.org/International/articles/idn-and-iri
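
As a small illustration of what percent-encoding looks like, here is a sketch using Python's standard urllib.parse. The query value is taken from the example URL in the answer above, and the variable names are made up.

```python
from urllib.parse import quote, unquote

# Non-ASCII query value from the example URL in the answer.
raw = "đức"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(raw)
print(encoded)

# unquote() reverses the encoding.
print(unquote(encoded) == raw)
```

A pattern that only allows ASCII plus % would match the encoded form but not the raw one.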

Question 35

@mrswadge I just cover the cases. We're not sure if all people care about the standard. Thank you for your info.

Question 36

Only this one worked perfectly for me, with URLs such as "example.com" "www.exmaple.com" "example.com" "example.co.in" "exmaple.com/?q='me'"

Question 37

Your regex misses the case where the protocol is not set: //www.google.fr

Question 38

@Adrien Parrochia I don't think that's valid, is it?

Question 39

I found this which covers most sample links, including subdirectory parts.

Regex is:

(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))?

Question 40

When I tried this, the ends of sentences were marked as a match. In the above sentence, the last word "match" and the period were matched.

Question 41

I used the regular expression below to find the url in a string:

(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?

Question 42

[a-zA-Z]{2,3} is really poor for matching TLD, see official list: data.iana.org/TLD/tlds-alpha-by-domain.txt
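
To show what goes wrong with a longer TLD, here is a small Python check using the pattern from the answer above. The example hostname is made up; .museum is a real TLD.

```python
import re

# Pattern copied from the answer; the TLD part allows only 2 or 3 letters.
NARROW_TLD = re.compile(r"(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?")

# Made-up URL on a TLD longer than three letters.
match = NARROW_TLD.search("http://example.museum/exhibits")

# The match is cut off after three letters of the TLD, and the path is lost.
truncated = match.group(0)
print(truncated)
```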

Question 43

Your regex misses the case where the protocol is not set: //www.google.fr

Question 44

IMPROVED

Detects Urls like these:

https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
http://www.site.com:8008

Regex:

/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm

Please note that working with URLs and domain validation can be complex, and regex alone may not cover all edge cases. For more comprehensive URL validation, it's recommended to use specialized libraries or built-in URL validation functions provided by your programming language or framework.

Question 45

This will detect some expressions such as "A.D." or "B.C." as URLs, though.

Question 46

Your regex misses the case where the protocol is not set: //www.google.fr

Question 47

Short and simple. I have not tested it in JavaScript code yet, but it looks like it will work:

((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))

Code on regex101.com


Question 48

I liked your regex because it was exactly what I was looking for: I needed to identify and strip URLs out of some text, not validate. Worked in rails.

Question 49

@Dagmar I am glad to hear that :)

Question 50

Your regex misses the case where the protocol is not set: //www.google.fr

Question 51

You are right @AdrienParrochia. When I posted this, I didn't check for it. Maybe I can do it later.

Question 52

The regex provided by @JustinLevene did not have the proper escape sequences on the backslashes. This version corrects that and adds a condition to match the FTP protocol as well. It will match all URLs with or without a protocol, and with or without "www."

Code: ^((http|ftp|https):\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?

Example: https://regex101.com/r/uQ9aL4/65

Question 53

Here is a slightly more optimized regexp:

(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&amp;:\/~\+#]*[A-Z\-\@?^=%&amp;\/~\+#]){2,6}?

Here is test with data: https://regex101.com/r/sFzzpY/6


Question 54

Your test shows some of your URLs are not being detected fully. This entire string should be marked as a match: stackoverflow.com/questions/60619430/…

Question 55

Your regex misses the case where the protocol is not set: //www.google.fr

Question 56

It wasn't an easy one, but I managed to compose a short and efficient regex pattern to match URLs; it also captures email addresses. Hope it works for you.

((\bhttp(|s)|ftp|file):\/\/)|\bwww[ ]*\.[ ]*([a-zA-Z0-9%:?#@\/=_-]*)|([a-zA-Z0-9%:.?#@\/=_-]*)[ ]*\.[ ]*(com|eu|org|co|uk|pdf|etc)

This can be tested here regexr.com

Question 57

Your regex misses the case where the protocol is not set: //www.google.fr

Question 58

If you have the URL pattern, you should be able to search for it in your string. Just make sure that the pattern doesn't have ^ and $ marking the beginning and end of the URL string. So if P is the pattern for a URL, look for matches for P.
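
A minimal Python sketch of that idea. The pattern here is a deliberately simple stand-in for P, not one of the patterns from this thread, and the names ANCHORED and UNANCHORED are illustrative.

```python
import re

# Simple stand-in for "P"; a real URL pattern would be more involved.
ANCHORED = re.compile(r"^https?://\S+$")    # validates a whole string
UNANCHORED = re.compile(r"https?://\S+")    # same pattern without ^ and $

text = "Hello www.google.com World http://yahoo.com"

print(ANCHORED.findall(text))    # nothing: the whole string is not a URL
print(UNANCHORED.findall(text))  # finds the URL embedded in the text
```

Note that dropping the anchors is not always enough on its own: a pattern written for validation may also end with a class that rejects the character following the URL in running text.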

Question 59

This is the regex I found that verifies if an entire string is a URL. I took out the ^ at the beginning and the $ at the end like you said and it still didn't work. What am I doing wrong?

^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&amp;%\$#\=~])*[^\.\,\)\(\s]$

Question 60

It might help if you showed what language you're using. Either way, be sure to check http://regexpal.com/; there you can test different expressions against your string until you get it right.

Question 61

@user758263 - do you really need such a complex regex for the URL? It depends on which URLs you might actually encounter. Also see gskinner.com/RegExr for trying out regexes. They also have hundreds of samples on the right under the Community tab, including ones for URLs.

Question 62

I'm trying to look for all possible URLs and I'm using C++. Thanks for the links entonio and manojlds. The gskinner site was especially helpful since it had samples.

Question 63

A probably too simplistic, but working method might be:

[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+

I tested it in Python, and as long as the string being parsed has a space before and after the URL and none inside it (which I have never seen), it should be fine.

Here is an online ide demonstrating it

However here are some benefits of using it:

It recognises file: and localhost as well as ip addresses
It will never match without them
It does not mind unusual characters such as # or - (see url of this post)
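
One caveat worth checking: the square brackets at the start make a character class, not a list of alternatives, so any run of those letters before :// is accepted. A short Python sketch follows. GROUPED is my own guess at the intent, not the author's pattern, and the test strings are made up.

```python
import re

# Pattern copied from the answer: the leading [...] is a character class.
ORIGINAL = re.compile(r"[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+")

# Hypothetical rewrite using a group of alternatives (an assumption about intent).
GROUPED = re.compile(r"(?:https?|ftp|file)://\S+")

bogus = "tooth://abc"            # made-up scheme built from the class letters
real = "http://localhost:8000/x"

print(bool(ORIGINAL.fullmatch(bogus)))  # the character class accepts it
print(bool(GROUPED.search(bogus)))      # the grouped version rejects it
print(GROUPED.findall(real))
```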

Question 64

Your regex misses the case where the protocol is not set: //www.google.fr

Rajeev · Accepted Answer · 2011-05-18 08:37:53Z

311

This is the one I use

(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])

Works for me, should work for you too.

edited Oct 3, 2021 at 13:11 by Adam · answered May 18, 2011 at 8:37 by Rajeev

Don't forget to escape the forward slashes.

– Mark, Jul 8, 2017 at 6:53

It's 2017, and unicode domain names are all over the place. \w may not match international symbols (depends on regex engine), the range is needed instead: a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF.

– Michael Antipin, Aug 29, 2017 at 13:34

This is fine for general purposes, but there are many cases that it doesn't catch. It enforces that your links are prefixed with a protocol. If you choose to ignore protocols, the endings of email addresses are matched too, as is the case with [email protected].

– Squazz, Sep 7, 2017 at 8:09

Shouldn't [\w_-] be [\w-]? \w already matches _, per the Mozilla docs.

– Sang, Nov 4, 2017 at 7:19

Upvoted, but this answer does not handle what the question asks for, e.g. www.yahoo.com: """(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?""".r.findAllIn("www.google.com").toList. It also lacks an explanation.

– prayagupadhyay, Nov 11, 2017 at 23:58
