This is a script I created that downloads Project Euler webpages and combines them into a single PDF. The script also downloads animated files.
#!/bin/sh
for i in $(seq -f "%03g" $1 $2); do
    URL="https://projecteuler.net/problem=$i"
    # chromium print to PDF, wait for rendering https://stackoverflow.com/a/49789027
    chromium-browser --headless --disable-gpu --run-all-compositor-stages-before-draw --virtual-time-budget=10000 --print-to-pdf-no-header --print-to-pdf=$i.pdf $URL
    # Distill PDFs to workaround Ghostscript skipped character problem https://stackoverflow.com/questions/12806911
    gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -o ${i}_gs.pdf $i.pdf
    # download extra txt and GIF files if available, printing links to shell
    curl -s $URL | pup 'a attr{href}' | grep '\.txt$' | tee /dev/tty | sed 's/^/https:\/\/projecteuler.net\//' | xargs -r -n1 curl -O
    curl -s $URL | pup 'img attr{src}' | grep '\.gif$' | tee /dev/tty | sed 's/^/https:\/\/projecteuler.net\//' | xargs -r -n1 curl -O
done
# remove non-animated GIFs
for i in *.gif; do
    [ $(identify "$i" | wc -l) -le 1 ] && rm -v "$i"
done
# combine all PDFs using gs
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -sOutputFile=problems.pdf *_gs.pdf
# create final zip
zip problems.zip problems.pdf *.txt *.gif
- Is there a way to get rid of the two nearly duplicate lines which only differ in pup and grep?
- Is it possible to write the curl lines so that an image is saved only if ImageMagick `identify` identifies it as a GIF, instead of cleaning up afterwards? Is this cleaner?
- I'm not very experienced writing shell scripts so let me know if there are any style issues or better ways to use sed, grep, xargs, etc.
1 Answer
Here's a refactoring with various fixes.
- Generally quote shell variables.
- Don't read lines with `for`.
- I switched the `URL` variable to lower case, in accordance with recommended convention.
- This uses a temporary file for the `curl` output, and a `trap` to clean it up.
- The logic to process this file was refactored to a function `pupcurl`. I switched the regex separator in the `sed` script to `%` to reduce the need for backslashes.
- Use `./` prefix for glob expressions so as to avoid having file names which start with dashes be interpreted as (presumably invalid) options.
- `$i` in the URL is not defined outside the loop. Move the assignment inside the loop.
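To illustrate the point about `for`, here is a minimal sketch showing why looping over unquoted command output behaves differently from reading lines. The sample data and the function names (`sample`, `count_with_for`, `count_with_read`) are made up for this illustration and are not part of the script under review.

```sh
#!/bin/sh
# Hypothetical sample input: two lines, the first containing a space.
sample () {
    printf '%s\n' 'first line' 'second'
}

# Counts iterations when looping with for over unquoted command output.
count_with_for () {
    n=0
    for item in $(sample); do
        n=$((n + 1))
    done
    echo "$n"
}

# Counts iterations when reading the same output line by line.
count_with_read () {
    sample | {
        n=0
        while IFS= read -r item; do
            n=$((n + 1))
        done
        echo "$n"
    }
}

count_with_for    # prints 3: the output was split into "first", "line", "second"
count_with_read   # prints 2: one iteration per line
```

With `seq` the output contains no whitespace inside a line, so both loops would iterate the same number of times there; the difference shows up as soon as the input is less predictable.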
For the record, there is no way to run `identify` on an image without downloading it first.
#!/bin/sh

tmp=$(mktemp -t pdfstitcher.XXXXXXXX) || exit
trap 'rm -f "$tmp"' EXIT

pupcurl () {
    pup "$1" | grep "$2" |
    tee /dev/tty |
    sed 's%^%https://projecteuler.net/%' |
    xargs -r -n1 curl -O
}

seq -f "%03g" "$1" "$2" |
while read -r i; do
    url="https://projecteuler.net/problem=$i"
    chromium-browser --headless --disable-gpu \
        --run-all-compositor-stages-before-draw \
        --virtual-time-budget=10000 \
        --print-to-pdf-no-header \
        --print-to-pdf="$i.pdf" "$url"
    gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -o "${i}_gs.pdf" "$i.pdf"
    curl -s "$url" >"$tmp"
    pupcurl 'a attr{href}' '\.txt$' <"$tmp"
    pupcurl 'img attr{src}' '\.gif$' <"$tmp"
done

for i in ./*.gif; do
    [ $(identify "$i" | wc -l) -le 1 ] && rm -v "$i"
done

gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
    -dPDFSETTINGS=/ebook \
    -sOutputFile=problems.pdf ./*_gs.pdf

zip problems.zip problems.pdf ./*.txt ./*.gif
Depending on what you want to accomplish, perhaps all of this should run in a temporary directory which you remove when you're done (`mktemp -d`).
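A rough sketch of that idea follows. The `pdfstitcher` template prefix, the `workdir` variable, and the `example.txt` placeholder are illustrative assumptions; the real download and conversion steps would go where the placeholder is.

```sh
#!/bin/sh
# Sketch: do all the work inside a throwaway directory.
# "pdfstitcher" is just an illustrative template prefix.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/pdfstitcher.XXXXXXXX") || exit
# Remove the whole directory when the script exits.
trap 'rm -rf "$workdir"' EXIT

# Run the steps in a subshell so the caller's working directory
# is left alone.
(
    cd "$workdir" || exit
    : > example.txt    # placeholder for the real downloads
)

# Copy the final artifact out before the trap removes the directory,
# for example:
# cp "$workdir/problems.zip" .
```

The intermediate PDFs, text files, and GIFs then never clutter the directory the script was started from, and the single `trap` cleans up everything at once.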
I'm not entirely happy with the `identify` call. I assume the code looks for GIF files where `identify` outputs more than one line, and removes the rest. This looks suspiciously like a useless use of `wc` but I'm drawing a blank when it comes to actually improving it. Perhaps it could be broken out into a separate function with some comments to explain it, or maybe use Awk to postprocess the result:
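keep_only_animated_gifs () {
    for i in ./*.gif; do
        # awk succeeds only if identify printed a second line,
        # i.e. the GIF has more than one frame; otherwise remove it
        identify "$i" |
        awk 'NR == 2 { found = 1; exit } END { exit !found }' ||
        rm -v "$i"
    done
}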
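The "more than one line" test can also be written with two `read` calls, as one of the comments on this answer suggests. The sketch below isolates that test in a helper so it can be tried without ImageMagick; the name `has_two_lines` is hypothetical, and the intended use assumes that `identify` prints one line per frame.

```sh
#!/bin/sh
# Succeeds only if standard input contains at least two lines.
# (Hypothetical helper name.)
has_two_lines () {
    read -r _first && read -r _second
}

# Intended use, assuming identify prints one line per frame:
#   identify "$i" | has_two_lines || rm -v "$i"

# Simulated identify output for a two-frame and a one-frame image.
printf 'frame0\nframe1\n' | has_two_lines && echo "animated"
printf 'frame0\n' | has_two_lines || echo "static"
```

The helper stops reading after the second line, so it does not need to count every frame of a long animation.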
If you are willing to change from a `sh` script to a Bash script, the temporary file (and thus also the `trap`) could be avoided with something like
curl -s "$url" |
    tee >(pupcurl 'a attr{href}' '\.txt$') |
    pupcurl 'img attr{src}' '\.gif$'
The `>(...)` is a process substitution which is a Bash extension. Perhaps see also "Difference between sh and bash".
The `tee /dev/tty` is slightly dubious but I left it in. If the intent is to display the output to the user as a progress message, perhaps `tee /dev/stderr` instead, though then ideally the output should have a bit of an explanation, too. (A common convention is to include the generating script's name in all diagnostic messages, so you can see which script is emitting it when you have scripts calling scripts calling scripts etc.)
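A minimal sketch of that convention, assuming a small helper is acceptable: the helper name `note`, the variable `me`, and the message text are all made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: prefix diagnostics with the script's name
# and send them to standard error instead of /dev/tty.
me=${0##*/}

note () {
    printf '%s: %s\n' "$me" "$*" >&2
}

note "fetching attachments for problem 001"
```

Because the message goes to standard error, it stays visible on the terminal but can be silenced or redirected with `2>/dev/null` or `2>log.txt` without touching the data on standard output.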
Based on feedback in the comments, I removed the Bashism to `trap ... ERR` too (good catch! And insidious of me to recommend ShellCheck when clearly I hadn't run it myself :-) which means this could leave temporary files behind if `curl` fails to connect, for example. Maybe add `set -e` at the top to cover that case, too; but I haven't combed over the script to check whether it's otherwise `set -e`-safe. (In particular, could `gs` fail spuriously?)
Finally, probably try http://shellcheck.net/ before asking for human assistance. It can suggest several of the changes here automatically.
- ShellCheck says: In POSIX sh, trapping ERR is undefined. [SC3047]. – Léa Gris, May 29, 2022 at 22:20
- `identify "$i" | { read -r id1 && read -r id2;} || rm -v "$i"` – Léa Gris, May 29, 2022 at 23:10
- Does "don't read lines with for" apply mainly to arbitrary input files? Because we know what seq should output. – qwr, May 30, 2022 at 19:35
- There's also the issue of needlessly having the shell collect a list into memory to loop over. Perhaps see also the `ARG_MAX` discussion on the "useless use of ..." page. – tripleee, May 31, 2022 at 4:25
- I found out later that chromium itself has some bug where about 75% of the time the page doesn't render at all. I will have to look into alternatives that can still handle rendering mathjax with js (main challenge in rendering). – qwr, Aug 4, 2022 at 7:11
`/bin/sh`. The near-duplicated lines would be easy to refactor with a process substitution but that requires the shebang to be changed to invoke `bash` instead. Please edit to clarify whether this is acceptable or perhaps even intended by the conflicting tags.