I have a small shell script that uses sed to take an input file of URLs with hand-written notes around them, strip the notes, and put each URL on its own line. For example:
INPUT:
note: http://www.example.com/Beat-Poetry + ??? http://www.example.com/beat+poetry
http://www.example.com/17th+cent. + http://www.example.com/17th+century + http://www.example.com/17th+c.
http://www.example.com/18th+century
https://www.example.com/C19th-C20th + http://www.example.com/19th-20th+century (note)
note:
http://www.example.com/18th+cent. note http://www.example.com/18th+century
Note: the URLs will always either have leading/trailing space or start or end the line.
DESIRED OUTPUT:
http://www.example.com/Beat-Poetry
http://www.example.com/beat+poetry
http://www.example.com/17th+cent.
http://www.example.com/17th+century
http://www.example.com/17th+c.
http://www.example.com/18th+century
http://www.example.com/C19th-C20th
http://www.example.com/19th-20th+century
http://www.example.com/18th+cent.
http://www.example.com/18th+century
I have this code, which does the job by adding some delimiters around each URL and removing stuff based on where the delimiters are found, but I am a newbie with this stuff, and it doesn't quite feel right. If nothing else, it's not robust enough to withstand potential usage of á and é characters in the "notes".
#!/bin/bash
# squash out all the extra text that isn't URL (notes to self) and put each URL on a new line
# hackish steps to achieve this:
# - change urls from http://url to áhttp://urlé
# - put each one on a new line
# - remove leading space/words
# - remove trailing space/words
# - change any https to http
sed -re 's/(https?:[^ ]*)( |$)/á\1é /g' \
    -e 's/é[^á]*á/\n/g' \
    -e 's/(^[^á]*)(á[^é]*é)/\2/g' \
    -e 's/é[^á]*$//' \
    -e 's/https:/http:/g' 1ドル |
tr -d 'áé\r' |
sed -rn 's/(http:\/\/www.example.com\/.*)$/\1/p'
I assume that there is a more proper way to do this? (Again, the URLs will always have whitespace or ^ or $ around them.) I'd appreciate any improvement suggestions. Thanks.
2 Answers
What about this?
cat input | tr ' ' '\n' | grep -i "^http"
- Put every string on a new line.
- Filter the URLs.
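One small difference from the desired output: this doesn't normalize https: to http: (see the C19th-C20th URL in the example). A sketch of the same pipeline with that step added, assuming the input file is named input as above and dropping the unneeded cat:

tr ' ' '\n' < input | grep -i '^http' | sed 's/^https:/http:/'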
Yowzers, so much cleaner than what I had. I knew I was overthinking it. Thanks! – Max Starkenburg, Jun 25, 2016 at 20:48
This is essentially a word-splitting problem: something that Bash is actually quite good at doing, using only built-in commands. You certainly don't need to insert and then strip out á and é as marker characters.
# split each input line into words and keep only those that look like URLs,
# normalizing the https: scheme to http: on the way out
while read line ; do
    for word in $line ; do
        case "$word" in http*) echo "${word/#https:/http:}" ;; esac
    done
done
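If the loop is saved to a file, say extract-urls.sh (a name used here only for illustration), it reads from standard input and can be run as

bash extract-urls.sh < input

Alternatively, redirecting the file into the loop with done < "1ドル" keeps the same calling convention as the sed script in the question.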
This one took my newbie brain a little longer to wrap my head around (and also my machine ... it looks like it's a bit slower than Jan's script ... 10 seconds vs. 2 seconds for nearly 25000 lines, but who's counting?), but kudos for making it all-bash, since IIRC Cygwin required me to explicitly install tr and grep. – Max Starkenburg, Jun 25, 2016 at 20:56
Why replace https: with http:?