Does anyone know of a regular expression I could use to find URLs within a string? I've found a lot of regular expressions on Google for determining if an entire string is a URL but I need to be able to search an entire string for URLs. For example, I would like to be able to find www.google.com
and http://yahoo.com
in the following string:
Hello www.google.com World http://yahoo.com
I am not looking for specific URLs in the string. I am looking for ALL of the URLs in the string which is why I need a regular expression.
35 Answers
This is the one I use
(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])
Works for me, should work for you too.
-
Don't forget to escape the forward slashes. – Mark, Jul 8, 2017 at 6:53
-
It's 2017, and Unicode domain names are all over the place. \w may not match international symbols (depending on the regex engine); this range is needed instead: a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF. – Michael Antipin, Aug 29, 2017 at 13:34
-
This is fine for general purposes, but there are many cases that it doesn't catch. It requires that your links are prefixed with a protocol. If you choose to ignore protocols, the endings of email addresses are accepted, as is the case with [email protected]. – Squazz, Sep 7, 2017 at 8:09
-
Shouldn't [\w_-] be [\w-]? \w already matches _, per the Mozilla docs. – Sang, Nov 4, 2017 at 7:19
-
Upvoted, but this answer does not do what the question asks: it fails to match www.yahoo.com. For example, """(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?""".r.findAllIn("www.google.com").toList returns no matches. The answer also lacks an explanation. – prayagupadhyay, Nov 11, 2017 at 23:58
Guess no regex is perfect for this use. I found a pretty solid one here
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])
Some differences / advantages compared to the other ones posted here:
- It does not match email addresses
- It does match localhost:12345
- It won't detect something like moo.com without http or www
See here for examples
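A minimal Python sketch of searching a string with the pattern above. The character classes only list A-Z, so this assumes the pattern is meant to run case-insensitively (re.IGNORECASE here); the sample strings are illustrative, not from the linked examples.

```python
import re

# Pattern from the answer above. Its classes only contain A-Z, so the
# IGNORECASE flag is an assumption made here to cover lowercase URLs.
URL_RE = re.compile(
    r"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
    re.IGNORECASE,
)

sample = "Hello www.google.com World http://yahoo.com"
found = URL_RE.findall(sample)  # every group is non-capturing, so this lists whole matches
bare_domain = URL_RE.findall("visit moo.com today")  # no scheme, no www.

print(found)
print(bare_domain)
```

The bare domain is skipped because every alternative at the start of the pattern requires a scheme, www. or ftp.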
-
It matches www.e, which is not a valid URL. – Ihor Herasymchuk, Dec 20, 2016 at 22:46
-
The g option isn't valid in all regular expression implementations (e.g. Ruby's built-in implementation). – Huliax, Jan 17, 2020 at 13:23
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:49
import re

text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd
The code below catches all urls in text and returns urls in list."""
# Raw string so the regex backslashes reach the re module unchanged
urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)
Output:
[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string',
'www.google.com',
'facebook.com',
'http://test.com/method?param=wasd',
'http://test.com/method?param=wasd&params2=kjhdkjshd'
]
-
Kotlin: val urlRegex = "(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-?=%.]+" – Akshay Nandwana, Feb 27, 2019 at 6:32
-
Misses & parameters in the URL, e.g. for http://test.com/method?param=wasd&param2=wasd2 it misses param2. – TrophyGeek, May 18, 2019 at 21:38
-
Also lacks support for URLs with # – nicolasassi, Dec 22, 2020 at 19:53
-
@TrophyGeek I think you just copied the regex from the first comment, and Akshay forgot to include the &. The right version would be: val urlRegex = "(?:(?:https?|ftp):\\/\\/)?[\\w/\\-?=%.]+\\.[\\w/\\-&?=%.]+" – Alec, Jan 21, 2022 at 17:16
-
This also thinks hello... is a URL. – lukasniessen, Mar 23, 2022 at 8:12
Wrote one up myself:
let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?/gm
It works on ALL of the following domains:
https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
shop.facebook.org/derf.html
You can see how it performs here on regex101 and adjust as needed
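For anyone who wants to try this outside regex101, here is a rough Python sketch. It assumes the JavaScript pattern carries over to Python's re module unchanged; the helper name find_urls and the second sample string are made up for illustration.

```python
import re

# JavaScript pattern from the answer, carried over to Python as-is.
URL_RE = re.compile(
    r"([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?"
)

def find_urls(text):
    """Return whole matches; findall would return tuples of capture groups."""
    return [m.group(0) for m in URL_RE.finditer(text)]

urls = find_urls("Hello www.google.com World http://yahoo.com")
loose = find_urls("Meet at 09:00")  # anything shaped like word[.:]word also matches

print(urls)
print(loose)
```

The second call shows the trade-off of accepting URLs without a protocol: the pattern has no way to tell a time of day from host:port.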
-
Your regex missed this when I tested it. It only caught part of the URL: shop.facebook.org/derf.html – David Rector, Feb 23, 2021 at 7:50
-
@DavidRector Thanks! You are absolutely correct. I have updated the regex string and regex101 URL based on your feedback. Added a \. at the end of the second-to-last pair of square brackets [ ] – wongx, Feb 24, 2021 at 2:39
-
This also matches any string of the form alphanum_char.alphanum_char, for example a.r, b.4, 7.e, etc. These aren't valid URLs. – Princy, Jun 17, 2021 at 23:24
-
Unfortunately this also matches times, e.g. 09:00 – Mike Kaply, Jun 21, 2021 at 15:26
-
Your regex misses the case where the protocol is not specified, as in //www.leboncoin.fr – Adrien Parrochia, Jan 5, 2024 at 10:43
None of the solutions provided here solved the problems/use-cases I had.
What I have provided here, is the best I have found/made so far. I will update it when I find new edge-cases that it doesn't handle.
\b
#Word cannot begin with special characters
(?<![@.,%&#-])
#Protocols are optional, but take them with us if they are present
(?<protocol>\w{2,10}:\/\/)?
#Domains have to be of a length of 1 chars or greater
((?:\w|\&\#\d{1,5};)[.-]?)+
#The domain ending has to be between 2 to 15 characters
(\.([a-z]{2,15})
#If no domain ending we want a port, only if a protocol is specified
|(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])
-
Is there any way to make this JavaScript-friendly? Named capturing groups are not fully functional there, so the protocol value check does not validate. – einord, Feb 7, 2020 at 7:26
-
@einord, I know this is way late, but you can just remove the named portion of the capturing group and it works fine in JS.
/\b(?<![@.,%&#-])(\w{2,10}:\/\/)?((?:\w|&#\d{1,5};)[.-]?)+(\.([a-z]{2,15})|((?::\d{1,6})|(?!)))\b(?![@])(\/)?(?:([\w\d?\-=#:%@&.;])+(?:\/(?:([\w\d?\-=#:%@&;.])+))*)?(?<![.,?!-])/g
– ThePuzzleMaster, Oct 24, 2024 at 18:04
I think this regex (regular expression) pattern handles precisely what you want
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
and this is an example snippet to extract URLs:
// The regular expression filter
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The text you want to filter for URLs
$text = "The text you want https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string to filter goes here.";
// Collect every URL in the text; $matches[0] holds the full matches
preg_match_all($reg_exUrl, $text, $matches);
var_dump($matches);
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:48
If you have to be strict on selecting links, I would go for:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»""‘’]))
For more infos, read this:
An Improved Liberal, Accurate Regex Pattern for Matching URLs
-
Don't do that. See regular-expressions.info/catastrophic.html. It'll kill your app... – Auric, Nov 28, 2017 at 19:22
None of the above answers match Unicode characters in URLs, for example: http://google.com?query=đức+filan+đã+search
As a solution, this one should work:
(ftp:\/\/|www\.|https?:\/\/){1}[a-zA-Z0-9u00a1-\uffff0-]{2,}\.[a-zA-Z0-9u00a1-\uffff0-]{2,}(\S*)
-
Unicode characters were forbidden by RFC 1738 on URLs (faqs.org/rfcs/rfc1738.html). They would have to be percent-encoded to be standards compliant, although I think that may have changed more recently. Worth reading: w3.org/International/articles/idn-and-iri – mrswadge, Sep 7, 2016 at 9:41
-
@mrswadge I just cover the cases. We're not sure if all people care about the standard. Thank you for your info. – Duc Filan, Sep 12, 2016 at 2:54
-
Only this one worked perfectly for me with URLs such as "example.com" "www.exmaple.com" "example.com" "example.co.in" "exmaple.com/?q='me'" – Krissh, Jan 30, 2020 at 7:43
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:48
-
@Adrien Parrochia I don't think it's valid, is it? – Duc Filan, Jan 8, 2024 at 10:12
I found this which covers most sample links, including subdirectory parts.
Regex is:
(?:(?:https?|ftp):\/\/|\b(?:[a-z\d]+\.))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))?\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))?\)|[^\s`!()\[\]{};:'".,<>?«»""‘’]))?
-
When I tried this, the ends of sentences were marked as a match. In the above sentence, the last word "match" and the period were matched. – David Rector, Feb 23, 2021 at 7:55
I used the regular expression below to find the url in a string:
(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
-
[a-zA-Z]{2,3} is really poor for matching TLDs; see the official list: data.iana.org/TLD/tlds-alpha-by-domain.txt – Toto, Jan 19, 2015 at 11:04
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:48
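To make the TLD point concrete, here is a small Python sketch. SHORT_TLD is the pattern from the answer; OPEN_TLD is a hypothetical variant with the upper bound on the TLD length removed, and the sample URL is made up.

```python
import re

# Pattern from the answer above, with its {2,3} limit on the TLD.
SHORT_TLD = re.compile(r"(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?")
# Hypothetical variant: same pattern, TLD length left open-ended.
OPEN_TLD = re.compile(r"(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(\/\S*)?")

text = "Docs live at http://example.info/page for now"

short_match = SHORT_TLD.search(text).group(0)  # cut off inside the 4-letter TLD
open_match = OPEN_TLD.search(text).group(0)    # reaches the end of the path

print(short_match)
print(open_match)
```

With the {2,3} limit the match stops after "inf" and the path is lost, because the optional path group can only start right after the TLD.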
IMPROVED
Detects Urls like these:
- https://www.example.pl
- http://www.example.com
- www.example.pl
- example.com
- http://blog.example.com
- http://www.example.com/product
- http://www.example.com/products?id=1&page=2
- http://www.example.com#up
- http://255.255.255.255
- 255.255.255.255
- http:// www.site.com:8008
Regex:
/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/gm
Please note that working with URLs and domain validation can be complex, and regex alone may not cover all edge cases. For more comprehensive URL validation, it's recommended to use specialized libraries or built-in URL validation functions provided by your programming language or framework.
-
This will detect some expressions such as "A.D." or "B.C." as URLs though. – Guillaume F., Dec 17, 2023 at 10:20
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:47
Short and simple. I have not tested it in JavaScript code yet, but it looks like it will work:
((http|ftp|https):\/\/)?(([\w.-]*)\.([\w]*))
-
I liked your regex because it was exactly what I was looking for: I needed to identify and strip URLs out of some text, not validate them. Worked in Rails. – Dagmar, Aug 16, 2019 at 5:17
-
@Dagmar I am glad to hear that :) – bafsar, Aug 17, 2019 at 2:18
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:46
-
You are right @AdrienParrochia. When I posted this, I didn't check for it. Maybe I can do it later. – bafsar, Jan 6, 2024 at 22:53
The regex provided by @JustinLevene did not have the proper backslash escape sequences. I have corrected that and added a condition to match the FTP protocol as well. It will match all URLs with or without protocols, and with or without "www."
Code: ^((http|ftp|https):\/\/)?([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?
Example: https://regex101.com/r/uQ9aL4/65
Here a little bit more optimized regexp:
(?:(?:(https?|ftp|file):\/\/|www\.|ftp\.)|([\w\-_]+(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:\/~\+#]*[A-Z\-\@?^=%&\/~\+#]){2,6}?
Here is test with data: https://regex101.com/r/sFzzpY/6
-
Your test shows some of your URLs are not being detected fully. This entire string should be marked as a match: stackoverflow.com/questions/60619430/… – David Rector, Feb 23, 2021 at 7:53
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:47
It wasn't easy, but I managed to compose a short and efficient regex pattern to match URLs; it also captures email addresses. Hope that works for you.
((\bhttp(|s)|ftp|file):\/\/)|\bwww[ ]*\.[ ]*([a-zA-Z0-9%:?#@\/=_-]*)|([a-zA-Z0-9%:.?#@\/=_-]*)[ ]*\.[ ]*(com|eu|org|co|uk|pdf|etc)
This can be tested here regexr.com
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:47
If you have the URL pattern, you should be able to search for it in your string. Just make sure that the pattern doesn't have ^ and $ marking the beginning and end of the URL string. So if P is the pattern for a URL, look for matches for P.
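A small Python sketch of that idea. ANCHORED is a made-up validation-style pattern (not one from this thread), and strip_anchors is a hypothetical helper that only handles the simple case of a leading ^ and a trailing unescaped $.

```python
import re

# Hypothetical validation-style pattern, anchored to the whole string.
ANCHORED = r"^(?:https?|ftp)://[\w.-]+\.[a-zA-Z]{2,}(?:/\S*)?$"

def strip_anchors(pattern: str) -> str:
    """Drop a leading ^ and a trailing $ (simple check: the $ is not preceded by a backslash)."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    return pattern

text = "Hello www.google.com World http://yahoo.com"

anchored_hits = re.findall(ANCHORED, text)               # must span the whole string
search_hits = re.findall(strip_anchors(ANCHORED), text)  # may match anywhere

print(anchored_hits)
print(search_hits)
```

Note that removing the anchors only changes where the pattern may match. This example pattern still requires a scheme, so www.google.com is not picked up.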
-
This is the regex I found that verifies whether an entire string is a URL. I took out the ^ at the beginning and the $ at the end like you said, and it still didn't work. What am I doing wrong?
^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$
– user758263, May 17, 2011 at 23:19
-
It might help if you showed what language you're using. Either way, be sure to check http://regexpal.com/; there you can test different expressions against your string until you get it right. – entonio, May 17, 2011 at 23:37
-
@user758263 - do you really need such a complex regex for the URL? It depends on which URLs you might actually find. Also see gskinner.com/RegExr for trying out regexes. They also have hundreds of samples on the right under the Community tab, including ones for URLs. – manojlds, May 18, 2011 at 0:06
-
I'm trying to look for all possible URLs and I'm using C++. Thanks for the links, entonio and manojlds. The gskinner site was especially helpful since it had samples. – user758263, May 18, 2011 at 15:11
A probably too simplistic, but working method might be:
[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+
I tested it in Python; as long as the URL has a space before and after it in the string and none inside it (which I have never seen), it should be fine.
Here is an online ide demonstrating it
However here are some benefits of using it:
- It recognises file: and localhost as well as IP addresses
- It will never match without them
- It does not mind unusual characters such as # or - (see url of this post)
-
Your regex misses the case where the protocol is not set: //www.google.fr – Adrien Parrochia, Jan 5, 2024 at 10:47
-
[localhost|http|https|ftp|file]+ is a character class. I guess you wanted to use a group: (?:localhost|http|https|ftp|file). Further, \S (non-whitespace) already includes \w, ., :, / and even ( and ). You can as well use \b(?:https?|localhost|ftp|file)://\S+; it would match at least what was probably meant to match. – bobble bubble, Jun 4, 2024 at 18:10
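A quick Python sketch of the difference between the two forms. The sample text and the made-up shell:// scheme are only there for illustration.

```python
import re

# Original pattern: the square brackets make a character class, so any run of
# the characters in "localhost|http|https|ftp|file" is accepted as a scheme.
CLASS_RE = re.compile(r"[localhost|http|https|ftp|file]+://[\w\S(\.|:|/)]+")
# Group-based form suggested in the comment above.
GROUP_RE = re.compile(r"\b(?:https?|localhost|ftp|file)://\S+")

text = "see shell://foo and http://yahoo.com"

class_hits = CLASS_RE.findall(text)
group_hits = GROUP_RE.findall(text)

print(class_hits)
print(group_hits)
```

"shell" is built entirely from letters that appear inside the brackets, so the character-class version accepts it as a scheme, while the group version only accepts the listed words.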
I liked Stefan Henze's solution, but it would pick up 34.56. It's too general and I have unparsed HTML. There are 4 anchors for a URL:
www ,
http:\ (and co) ,
. followed by letters and then / ,
or letters . and one of these: https://ftp.isc.org/www/survey/reports/current/bynum.txt .
I used lots of info from this thread. Thank you all.
"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
Above solves just about everything except a string like "eurls:www.google.com,facebook.com,http://test.com/", which it returns as a single string. Tbh idk why I added gopher etc. Proof R code
library(stringr) # provides str_extract_all
if(T){
wierdurl<-vector()
wierdurl[1]<-"https://JP納豆.例.jp/dir1/納豆 "
wierdurl[2]<-"xn--jp-cd2fp15c.xn--fsq.jp "
wierdurl[3]<-"http://52.221.161.242/2018/11/23/biofourmis-collab"
wierdurl[4]<-"https://12000.org/ "
wierdurl[5]<-" https://vg-1.com/?page_id=1002 "
wierdurl[6]<-"https://3dnews.ru/822878"
wierdurl[7]<-"The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd
The code below catches all urls in text and returns urls in list. "
wierdurl[8]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Alsotherearesomeurls:www.google.com,facebook.com,http://test.com/method?param=wasd
Thecodebelowcatchesallurlsintextandreturnsurlsinlist. "
wierdurl[9]<-"Thelinkofthisquestion:https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-stringAlsotherearesomeurlsZwww.google.com,facebook.com,http://test.com/method?param=wasdThecodebelowcatchesallurlsintextandreturnsurlsinlist."
wierdurl[10]<-"1facebook.com/1res"
wierdurl[11]<-"1facebook.com/1res/wat.txt"
wierdurl[12]<-"www.e "
wierdurl[13]<-"is this the file.txt i need"
wierdurl[14]<-"xn--jp-cd2fp15c.xn--fsq.jpinspiredby "
wierdurl[15]<-"[xn--jp-cd2fp15c.xn--fsq.jp/inspiredby "
wierdurl[16]<-"xnto--jpto-cd2fp15c.xnto--fsq.jpinspiredby "
wierdurl[17]<-"fsety--fwdvg-gertu56.ffuoiw--ffwsx.3dinspiredby "
wierdurl[18]<-"://3dnews.ru/822878 "
wierdurl[19]<-" http://mywebsite.com/msn.co.uk "
wierdurl[20]<-" 2.0http://www.abe.hip "
wierdurl[21]<-"www.abe.hip"
wierdurl[22]<-"hardware/software/data"
regexstring<-vector()
regexstring[2]<-"(http|ftp|https)://([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[3]<-"/(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#\\/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#\\/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#\\/%=~_|$])/igm"
regexstring[4]<-"[a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]?"
regexstring[5]<-"((http|ftp|https)\\:\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[6]<-"((http|ftp|https):\\/\\/)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?"
regexstring[7]<-"(http|ftp|https)(:\\/\\/)([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
regexstring[8]<-"(?:(?:https?|ftp|file):\\/\\/|www\\.|ftp\\.)(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\\([-A-Z0-9+&@#/%=~_|$?!:,.]*\\)|[A-Z0-9+&@#/%=~_|$])"
regexstring[10]<-"((http[s]?|ftp):\\/)?\\/?([^:\\/\\s]+)((\\/\\w+)*\\/)([\\w\\-\\.]+[^#?\\s]+)(.*)?(#[\\w\\-]+)?"
regexstring[12]<-"http[s:/]+[[:alnum:]./]+"
regexstring[9]<-"http[s:/]+[[:alnum:]./]+" #in DLpages 230
regexstring[1]<-"[[:alnum:]-]+?[.][:alnum:]+?(?=[/ :])" #in link_graphs 50
regexstring[13]<-"^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$"
regexstring[14]<-"(((((http|ftp|https):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]+(?:(?:\\.[\\w_-]+)*))((\\.((org|com|net|edu|gov|mil|int)|(([:alpha:]{2})(?=[, ]))))|([\\/]([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
regexstring[15]<-"(((((http|ftp|https|gopher|telnet|file|localhost):\\/\\/)|(www\\.)|(xn--)){1}([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(([\\w_-]{2,200}(?:(?:\\.[\\w_-]+)*))((\\.[\\w_-]+\\/([\\w.,@?^=%&:\\/~+#-]*[\\w@?^=%&\\/~+#-])?)|(\\.((org|com|net|edu|gov|mil|int|arpa|biz|info|unknown|one|ninja|network|host|coop|tech)|(jp|br|it|cn|mx|ar|nl|pl|ru|tr|tw|za|be|uk|eg|es|fi|pt|th|nz|cz|hu|gr|dk|il|sg|uy|lt|ua|ie|ir|ve|kz|ec|rs|sk|py|bg|hk|eu|ee|md|is|my|lv|gt|pk|ni|by|ae|kr|su|vn|cy|am|ke))))))(?!(((ttp|tp|ttps):\\/\\/)|(ww\\.)|(n--)))"
}
for(i in wierdurl){#c(7,22)
for(c in regexstring[c(15)]) {
print(paste(i,which(regexstring==c)))
print(str_extract_all(i,c))
}
}
I use this Regex:-
((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]
It works fine for many URLs, including: http://google.com, https://dev-site.io:8080/home?val=1&count=100, www.regexr.com, localhost:8080/path, ...
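A rough Python sketch of how this pattern behaves on a sentence with trailing punctuation. The sample text is made up; the pattern is the one above, unchanged.

```python
import re

URL_RE = re.compile(r"((\w+:\/\/\S+)|(\w+[\.:]\w+\S+))[^\s,\.]")

text = "Hello www.google.com World http://yahoo.com, or localhost:8080/path."

# The pattern has capture groups, so take group(0) to get the whole match.
urls = [m.group(0) for m in URL_RE.finditer(text)]

print(urls)
```

The final [^\s,\.] is what drops the comma and the full stop that follow a URL in running text.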
I have used the C# Uri class, and it works well with IP addresses and localhost:
public static bool CheckURLIsValid(string url)
{
Uri returnURL;
return (Uri.TryCreate(url, UriKind.Absolute, out returnURL)
&& (returnURL.Scheme == Uri.UriSchemeHttp || returnURL.Scheme == Uri.UriSchemeHttps));
}
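For readers not on .NET, a loosely equivalent sketch in Python using the standard library's urllib.parse. This is an analogue written for illustration, not a port of Uri.TryCreate, and its rules differ in detail; the function name is made up. Like the C# version, it checks one candidate string rather than searching a text.

```python
from urllib.parse import urlsplit

def check_url_is_valid(url: str) -> bool:
    """Accept only absolute http/https URLs that have a host part."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(check_url_is_valid("http://localhost:3000"))
print(check_url_is_valid("www.google.com"))
```

A check like this can be paired with a permissive regex: use the regex to pull candidates out of the text, then keep only the candidates that pass.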
This regex is perfectly working for me, should work for you too
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
This is a slight improvement on/adjustment to (depending on what you need) Rajeev's answer:
([\w\-_]+(?:(?:\.|\s*\[dot\]\s*[A-Z\-_]+)+))([A-Z\-\.,@?^=%&:/~\+#]*[A-Z\-\@?^=%&/~\+#]){2,6}?
See here for an example of what it does and does not match.
I got rid of the check for "http" etc. as I wanted to catch URLs without it. I added slightly to the regex to catch some obfuscated URLs (i.e. where users write [dot] instead of a "."). Finally I replaced "\w" with "A-Z" and "{2,3}" to reduce false positives like v2.0 and "moo.0dd".
Any improvements on this welcome.
-
[a-zA-Z]{2,3} is really poor for matching TLDs; see the official list: data.iana.org/TLD/tlds-alpha-by-domain.txt. Also, your regex matches _.........&&&&&& and I'm not sure that's a valid URL. – Toto, Jan 19, 2015 at 11:06
-
Thanks for that JE SUIS CHAELIE, any suggestions for improvement (especially for the false positive)? – avjaarsveld, Jan 19, 2015 at 16:31
I use the logic of finding text between two dots or periods
the regex below works fine with python
(?<=\.)[^}]*(?=\.)
Matching a URL in a text should not be so complex
(?:(?:(?:ftp|http)[s]*:\/\/|www\.)[^\.]+\.[^ \n]+)
I used this
^(https?:\\/\\/([a-zA-Z0-9]+)(\\.[a-zA-Z0-9]+)(\\.[a-zA-Z0-9\\/\\=\\-\\_\\?]+)?)$
(?:vnc|s3|ssh|scp|sftp|ftp|http|https)\:\/\/[\w\.]+(?:\:?\d{0,5})|(?:mailto|)\:[\w\.]+\@[\w\.]+
If you want an explanation of each part, try in regexr[.]com where you will get a great explanation of every character.
This is split by an "|" or "OR" because not all useable URI have "//" so this is where you can create a list of schemes as or conditions that you would be interested in matching.
How about this one?
(http:\/\/|ftp:\/\/|https:\/\/|www\.)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?
It matches both in the question.
This slightly simpler version of GooDeeJAY's answer serves me well (and supports e.g. # and other characters at the expense of increasing 'false positives'):
import re
text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd#changed
The code below catches all urls in text and returns urls in list."""
regex = r"(?i)(https?://|www.|\w+\.)[^\s]+"
urls = [match.group() for match in re.finditer(regex, text)]
print(urls)
and outputs
[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string',
'www.google.com,',
'facebook.com,',
'http://test.com/method?param=wasd,',
'http://test.com/method?param=wasd&params2=kjhdkjshd#changed'
]
This expression also finds paths like: /path/text.html
(https?\:\/[^\"\'\n\<\>\;\)\s]*)|(www?\.[^\"\'\n\<\>\;\s]*)|([^\s\&\=\;\,\<\<\>\"\'\(\)]+\/[\w\/])([^\"\'\n\;\s]*)|((?<!\<)[\/]+[\w]+[^\'\"\s\<\>]*)
^(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
This will verify the url link....
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
from stackoverflow.com/q/910912/1066234