I'm working on a project that requires POS tagging of paragraphs. The text contains a lot of URLs, which contain various punctuation marks such as "." and "?". This affects the accuracy of the sentence tokenization.
So I decided to clean the data by removing or replacing all the URLs, and thought regular expressions would be handy for that. Note that I want to match the complete URL, not just the domain name.
For example, in the text below:
Hey Mayur, We have successfully credited the cashback amount of INR 600 in your CITRUS wallet linked to mobile - 9130977755. Please download and rate our apps here - http://smarturl.it/ixigoapps Happy travelling! Team ixigo
I want the regex to match http://smarturl.it/ixigoapps
Here's what I have come up with so far:
https?:\/\/?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9@:%_\+.~#?&=\/]*
Right now, I'm not concerned about protocols other than http and https, but I would like to match URLs that omit http or https and start with www. Example: www.acttv.in
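As a quick check of the behaviour described here, a minimal Python sketch follows. It assumes the standard re module; the names URL_PATTERN and find_urls are made up for illustration. It applies the pattern above unchanged to the http example and to a www example.

```python
import re

# The pattern from the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
URL_PATTERN = re.compile(
    r"https?:\/\/?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"[-a-zA-Z0-9@:%_\+.~#?&=\/]*"
)


def find_urls(text):
    """Return every substring of `text` that the pattern matches."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


sample = (
    "Please download and rate our apps here - "
    "http://smarturl.it/ixigoapps Happy travelling! Team ixigo"
)

print(find_urls(sample))                      # ['http://smarturl.it/ixigoapps']
print(find_urls("Visit www.acttv.in today"))  # []
```

The second call returns an empty list because the pattern requires the http or https prefix, which is the gap with www-only URLs.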
What do you think of my approach so far, and the regular expression?
- See also: mathiasbynens.be/demo/url-regex – Richard Neumann, Nov 23, 2017 at 15:00
- @RichardNeumann Because they're using Python's regex engine. @close-voters this question is on-topic. – Peilonrayz ♦, Nov 23, 2017 at 15:10
- @RichardNeumann, thanks for the link. I'm testing a few of them out with my text corpus. – Thirupathi Thangavel, Nov 23, 2017 at 16:26
1 Answer
Have you considered:
re.compile(r'(www|http)\S+')
I tested it on your example string and it worked well (for both www and http). This does assume that anything starting with www or http really is a link, and that your data is fairly clean/consistent.
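One way this could be used for the cleaning step is sketched below. The replace_urls helper and the "URL" placeholder are assumptions for illustration, not something from your code; the pattern itself is the one suggested above.

```python
import re

# Pattern suggested above: anything starting with "www" or "http",
# followed by a run of non-whitespace characters.
URL_RE = re.compile(r'(www|http)\S+')


def replace_urls(text, placeholder="URL"):
    """Replace every match of URL_RE in `text` with `placeholder`."""
    return URL_RE.sub(placeholder, text)


text = ("Please download and rate our apps here - "
        "http://smarturl.it/ixigoapps Happy travelling!")

print(replace_urls(text))
# Please download and rate our apps here - URL Happy travelling!
print(replace_urls("Example: www.acttv.in"))
# Example: URL
```

One caveat of matching up to the next whitespace: punctuation that directly follows a link is swallowed too, so replace_urls("See www.acttv.in.") gives "See URL" without the final period. If sentence boundaries matter for the tokenizer, trailing punctuation may need to be handled separately.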