I have a small shell script that uses sed to take an input file of URLs with hand-written notes around them, strip the notes, and put each URL on its own line. For example:
INPUT:
note: http://www.example.com/Beat-Poetry + ??? http://www.example.com/beat+poetry
http://www.example.com/17th+cent. + http://www.example.com/17th+century + http://www.example.com/17th+c.
http://www.example.com/18th+century
https://www.example.com/C19th-C20th + http://www.example.com/19th-20th+century (note)
note:
http://www.example.com/18th+cent. note http://www.example.com/18th+century
Note: the URLs will always either have leading/trailing space or start or end the line.
DESIRED OUTPUT:
http://www.example.com/Beat-Poetry
http://www.example.com/beat+poetry
http://www.example.com/17th+cent.
http://www.example.com/17th+century
http://www.example.com/17th+c.
http://www.example.com/18th+century
http://www.example.com/C19th-C20th
http://www.example.com/19th-20th+century
http://www.example.com/18th+cent.
http://www.example.com/18th+century
I have this code, which does the job by adding some delimiters around each URL and removing stuff based on where the delimiters are found, but I am a newbie with this stuff, and it doesn't quite feel right. If nothing else, it's not robust enough to withstand potential usage of á and é characters in the "notes".
#!/bin/bash
# squash out all the extra text that isn't URL (notes to self) and put each URL on a new line
# hackish steps to achieve this:
# - change urls from http://url to áhttp://urlé
# - put each one on a new line
# - remove leading space/words
# - remove trailing space/words
# - change any https to http
sed -re 's/(https?:[^ ]*)( |$)/á\1é /g' \
    -e 's/é[^á]*á/\n/g' \
    -e 's/(^[^á]*)(á[^é]*é)/\2/g' \
    -e 's/é[^á]*$//' \
    -e 's/https:/http:/g' 1ドル |
tr -d 'áé\r' |
sed -rn 's/(http:\/\/www.example.com\/.*)$/\1/p'
I assume that there is a more proper way to do this? (Again, the URLs will always have whitespace or ^ or $ around them.) I'd appreciate any improvement suggestions. Thanks.
2 Answers
What about this?
cat input | tr ' ' '\n' | grep -i "^http"
- Put every string on a new line.
- Filter the URLs.
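One small difference from the desired output: this doesn't normalize https: to http: (see the C19th-C20th URL in the example). A sketch of the same pipeline with that step added, assuming the input file is named input as above and dropping the unneeded cat:

tr ' ' '\n' < input | grep -i '^http' | sed 's/^https:/http:/'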
Yowzers, so much cleaner than what I had. I knew I was overthinking it. Thanks! – Max Starkenburg, Jun 25, 2016 at 20:48
This is essentially a word-splitting problem: something that Bash is actually quite good at doing, using only built-in commands. You certainly don't need to insert and then strip out á and é as marker characters.
# split each input line into words and keep only those that look like URLs,
# normalizing the https: scheme to http: on the way out
while read line ; do
    for word in $line ; do
        case "$word" in http*) echo "${word/#https:/http:}" ;; esac
    done
done
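If the loop is saved to a file, say extract-urls.sh (a name used here only for illustration), it reads from standard input and can be run as

bash extract-urls.sh < input

Alternatively, redirecting the file into the loop with done < "1ドル" keeps the same calling convention as the sed script in the question.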
This one took my newbie brain a little longer to wrap my head around (and also my machine ... it looks like it's a bit slower than Jan's script ... 10 seconds vs. 2 seconds for nearly 25000 lines, but who's counting?), but kudos for making it all-bash, since IIRC Cygwin required me to explicitly install tr and grep. – Max Starkenburg, Jun 25, 2016 at 20:56
Why replace https: with http:?