\$\begingroup\$

I have a small shell script that uses sed to process an input file of URLs with hand-written notes around them: it strips the notes and puts each URL on its own line. For example:

INPUT:

note: http://www.example.com/Beat-Poetry + ??? http://www.example.com/beat+poetry
http://www.example.com/17th+cent. + http://www.example.com/17th+century + http://www.example.com/17th+c.
http://www.example.com/18th+century
https://www.example.com/C19th-C20th + http://www.example.com/19th-20th+century (note)
note:
http://www.example.com/18th+cent. note http://www.example.com/18th+century

Note: each URL is always preceded by whitespace or the start of the line, and followed by whitespace or the end of the line.

DESIRED OUTPUT:

http://www.example.com/Beat-Poetry
http://www.example.com/beat+poetry
http://www.example.com/17th+cent.
http://www.example.com/17th+century
http://www.example.com/17th+c.
http://www.example.com/18th+century
http://www.example.com/C19th-C20th
http://www.example.com/19th-20th+century
http://www.example.com/18th+cent.
http://www.example.com/18th+century

I have this code, which does the job by adding delimiters around each URL and removing text based on where the delimiters are found, but I am a newbie with this stuff, and it doesn't quite feel right. If nothing else, it's not robust enough to cope with á or é characters appearing in the "notes".

#!/bin/bash
 # squash out all the extra text that isn't URL (notes to self) and put each URL on a new line
 # hackish steps to achieve this:
 # - change urls from http://url to áhttp://urlé
 # - put each one on a new line
 # - remove leading space/words
 # - remove trailing space/words
 # - change any https to http
sed -re 's/(https?:[^ ]*)( |$)/á\1é /g' \
 -e 's/é[^á]*á/\n/g' \
 -e 's/(^[^á]*)(á[^é]*é)/\2/g' \
 -e 's/é[^á]*$//' \
 -e 's/https:/http:/g' $1 |
tr -d 'áé\r' |
sed -rn 's/(http:\/\/www.example.com\/.*)$/\1/p'

I assume that there is a more proper way to do this? (Again, the URLs will always have whitespace or ^ or $ around them.) I'd appreciate any improvement suggestions. Thanks.

200_success
asked Jun 23, 2016 at 16:02
\$\endgroup\$
  • Why are you replacing https: with http:? Commented Jun 23, 2016 at 23:53
  • Probably should have just left that line out of the question, as it's just for not having to deal with the differences later in another script (though I appreciate you including it in your answer). Commented Jun 25, 2016 at 20:46

2 Answers

\$\begingroup\$

What about this?

cat input | tr ' ' '\n' | grep -i "^http"
  1. Put every space-separated word on its own line.
  2. Keep only the lines that start with http (case-insensitively).
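
The same split-then-filter idea can be made a little more defensive. The sketch below is not the answerer's code: it assumes the notes may also contain tabs and Windows line endings, and it folds in the https-to-http normalisation from the question. The function name extract_urls is hypothetical.

```shell
#!/bin/bash
# extract_urls: hypothetical helper; reads notes on stdin, prints one URL per line.
extract_urls() {
  tr -d '\r' |              # drop carriage returns from CRLF files
  tr -s ' \t' '\n' |        # put each space- or tab-separated word on its own line
  grep -E '^https?://' |    # keep only words that start with http:// or https://
  sed 's/^https:/http:/'    # normalise the scheme, as the question's script does
}
```

It could be called as extract_urls < notes.txt, where notes.txt stands in for the real input file.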
answered Jun 23, 2016 at 22:23
\$\endgroup\$
  • Yowzers, so much cleaner than what I had. I knew I was overthinking it. Thanks! Commented Jun 25, 2016 at 20:48
\$\begingroup\$

This is essentially a word-splitting problem, something that Bash is actually quite good at, using only built-in commands. You certainly don't need to insert and then strip out á and é as marker characters.

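# read -r keeps backslashes literal; -a splits the line into an array,
# so note words such as ??? are never expanded as filename globs
while read -r -a words ; do
 for word in "${words[@]}" ; do
  case "$word" in http*) printf '%s\n' "${word/#https:/http:}" ;; esac
 done
done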
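
For larger inputs, the same field-by-field idea can also be sketched in awk, which splits each line into whitespace-separated fields by itself and runs as a single process. This is an alternative that gives up the builtins-only property, not part of the answer above; the function name extract_urls_awk is hypothetical, and the handling of a trailing carriage return is an added assumption about the input.

```shell
#!/bin/bash
# extract_urls_awk: hypothetical helper; reads the named files, or stdin if none are given.
extract_urls_awk() {
  awk '{
    sub(/\r$/, "")                      # tolerate CRLF line endings
    for (i = 1; i <= NF; i++)           # visit every whitespace-separated word
      if ($i ~ /^https?:\/\//) {        # keep words that start like a URL
        sub(/^https:/, "http:", $i)     # normalise the scheme to http
        print $i
      }
  }' "$@"
}
```

It could be called as extract_urls_awk notes.txt, where notes.txt stands in for the real input file.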
answered Jun 24, 2016 at 20:01
\$\endgroup\$
  • This one took my newbie brain a little longer to wrap my head around (and also my machine ... it looks like it's a bit slower than Jan's script ... 10 seconds vs. 2 seconds for nearly 25000 lines, but who's counting?), but kudos for making it all-bash, since IIRC cygwin required me to explicitly install tr and grep. Commented Jun 25, 2016 at 20:56
