Identify and extract URLs from text corpus


\$\begingroup\$

I'm working on a project that requires part-of-speech (POS) tagging of paragraphs. The text contains a lot of URLs, which include punctuation marks such as "." and "?". This hurts the accuracy of sentence tokenization.

So I decided to clean the data by removing or replacing all the URLs, and thought regular expressions would be handy for that. Note that I want to match the complete URL, not just the domain name.

For example, in the text below,

Hey Mayur, We have successfully credited the cashback amount of INR 600 in your CITRUS wallet linked to mobile - 9130977755. Please download and rate our apps here - http://smarturl.it/ixigoapps Happy travelling! Team ixigo

I want the regex to match http://smarturl.it/ixigoapps

Here's what I have come up with so far:

https?:\/\/?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9@:%_\+.~#?&=\/]*
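A minimal sketch of how the pattern above could be checked against the sample text with Python's re module. The names URL_PATTERN, sample, matches and www_matches are illustrative assumptions, not taken from the project itself.

```python
import re

# The pattern from the question, written as raw strings so the backslashes
# reach the regex engine unchanged.
URL_PATTERN = re.compile(
    r"https?:\/\/?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"[-a-zA-Z0-9@:%_\+.~#?&=\/]*"
)

sample = (
    "Hey Mayur, We have successfully credited the cashback amount of INR 600 "
    "in your CITRUS wallet linked to mobile - 9130977755. Please download and "
    "rate our apps here - http://smarturl.it/ixigoapps Happy travelling! "
    "Team ixigo"
)

# The pattern has no capturing groups, so findall returns the full matches.
matches = URL_PATTERN.findall(sample)
print(matches)

# A bare www address is not matched, because the pattern requires a scheme.
www_matches = URL_PATTERN.findall("Example: www.acttv.in")
print(www_matches)
```

On the sample text this finds the single full URL, while the bare www address produces no match because the pattern insists on an http or https prefix.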

Right now, I'm not concerned with protocols other than http and https, but I would also like to match URLs that have no http or https prefix and start with www. Example: www.acttv.in
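One possible way to cover both cases is to make the prefix an alternation between the scheme and a literal "www.". The sketch below is an assumption about how that might look, not a vetted URL matcher; URL_OR_WWW, replace_urls and the "URL" placeholder are hypothetical names chosen for illustration.

```python
import re

# Hypothetical variant: accept either an http(s) scheme or a leading "www.".
# The rest of the pattern follows the same character classes as the original.
URL_OR_WWW = re.compile(
    r"(?:https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b"
    r"[-a-zA-Z0-9@:%_+.~#?&=/]*"
)


def replace_urls(text, placeholder="URL"):
    """Replace every matched URL with a placeholder token."""
    return URL_OR_WWW.sub(placeholder, text)


cleaned = replace_urls(
    "Rate our apps here - http://smarturl.it/ixigoapps or visit www.acttv.in today"
)
print(cleaned)
```

Replacing each URL with a single token, rather than deleting it, keeps the sentence shape intact for the tokenizer while removing the punctuation inside the URL.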

What do you think of my approach so far, and the regular expression?

edited Nov 23, 2017 at 15:11 by Peilonrayz

asked Nov 23, 2017 at 12:10 by Thirupathi Thangavel

\$\endgroup\$

See also: mathiasbynens.be/demo/url-regex – Richard Neumann, Nov 23, 2017 at 15:00

@RichardNeumann Because they're using Python's regex engine. @close-voters this question is on-topic. – Peilonrayz, Nov 23, 2017 at 15:10

@RichardNeumann, thanks for the link. I'm testing a few of them against my text corpus. – Thirupathi Thangavel, Nov 23, 2017 at 16:26


1 Answer

\$\begingroup\$

Have you considered:

re.compile(r'(www|http)\S+')

I tested it on your example string and it worked well (for both www and http). This does assume that anything starting with www or http really is a link, and that your data is fairly clean and consistent.
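A short sketch of how this simpler pattern behaves, under two assumptions that are not in the answer: a non-capturing group is used so that findall returns whole matches, and the test sentence (with a trailing period after the www address) is made up for illustration.

```python
import re

# Non-capturing group, so findall returns the whole match rather than
# just the "www" or "http" prefix.
SIMPLE_URL = re.compile(r"(?:www|http)\S+")

text = "Rate our apps here - http://smarturl.it/ixigoapps or see www.acttv.in."

found = SIMPLE_URL.findall(text)
print(found)

# With a capturing group, findall returns only the captured prefix.
grouped = re.findall(r"(www|http)\S+", text)
print(grouped)
```

Two things are worth noting: with the capturing group, findall yields only the prefixes, and \S+ swallows the sentence-final period after www.acttv.in, which matters if the goal is cleaner sentence tokenization.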

answered Jul 25, 2018 at 5:00 by Alex L

\$\endgroup\$
