\$\begingroup\$

Before I learned that GoAccess was a thing, I wanted an analytics solution that I could use locally on my web server. My solution was to write a bash script that would give me some basic info on how my blog is doing. What it attempts to do is to:

  1. Find all views of a given page (as determined by lines in the access.log)
  2. Filter out all bots and crawlers
  3. Filter myself out (as I tend to look over my own site quite often)
  4. Determine some counts based off the remaining number of lines

What I'm curious about is: are there any faulty assumptions, and are there any glaring inefficiencies in this process?

#!/bin/bash
# Initialize some values
PAGE="$1"
LOG="/var/log/apache2/home-site/access.log"
# Words in user agents that hint that the page view is not a person
BOTS=("bot" "facebookexternalhit" "crawler")
# There are other filtered IP's but I've removed them for privacy reasons
FILTERED_IPS=("$(curl -s https://icanhazip.com)")
# Most pages that I care about are <url>/blog/<post name>
# But not all of them are
if [ "${PAGE:0:1}" != "/" ]
then
 PAGE="/blog/$PAGE"
fi
echo "$PAGE"
# Get all views for our page
BLOG_VIEWS=$(grep -a "GET $PAGE" $LOG)
BLOG_VIEW_COUNT=$(echo "$BLOG_VIEWS" | wc -l)
BLOG_VIEWS_FILTERED="$BLOG_VIEWS"
# Clear out the bots
for BOT in "${BOTS[@]}"
do
 TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -avi "$BOT")
 BLOG_VIEWS_FILTERED="$TEMP"
done
# Clear out any filtered IP addresses
# Usually they're just me
for IP in "${FILTERED_IPS[@]}"
do
 TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -av "$IP")
 BLOG_VIEWS_FILTERED="$TEMP"
done
TOTAL_VIEWS=$(echo "$BLOG_VIEWS_FILTERED" | wc -l)
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print $1}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
echo "Total Legitimate Views: $TOTAL_VIEWS"
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
echo "Across $UNIQUE_IP_COUNT unique IP addresses"

An example of use looks like the following

:~$ .scripts/analytics hello-world
/blog/hello-world
Total Legitimate Views: 64
Legitimate View Percentage: 47.06
Across 36 unique IP addresses
asked Feb 13, 2023 at 15:27
\$\endgroup\$

1 Answer

\$\begingroup\$

This script is not too long, but it's getting long enough that you really ought to be looking for opportunities to break out helper functions.


You have a pair of for loops.

for BOT in "${BOTS[@]}"
...
for IP in "${FILTERED_IPS[@]}"

Each would make a lovely helper function.
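
As a minimal sketch of what such a helper might look like: both loops do the same job, so one function taking the patterns as arguments could serve both. The name drop_matching, the sample log lines, and the 203.0.113.7 address are all made up for illustration. Note one deliberate change from the original: it adds -F (fixed-string matching) so that the dots in an IP address are not treated as regex wildcards.

```bash
#!/bin/bash
# Hypothetical helper: read log lines on stdin and drop every line that
# contains any of the given strings (case-insensitive, fixed-string match).
drop_matching() {
    local lines pattern
    lines=$(cat)
    for pattern in "$@"; do
        # grep exits non-zero when nothing is left; that is not an error here.
        lines=$(grep -aviF -- "$pattern" <<<"$lines" || true)
    done
    # Print nothing at all (rather than an empty line) when no lines survive.
    if [ -n "$lines" ]; then
        printf '%s\n' "$lines"
    fi
}

BOTS=("bot" "facebookexternalhit" "crawler")
FILTERED_IPS=("203.0.113.7")   # made-up address from a documentation range

# Invented sample lines in Apache combined log format.
SAMPLE='198.51.100.1 - - [13/Feb/2023:15:00:00 +0000] "GET /blog/hello-world HTTP/1.1" 200 512 "-" "Mozilla/5.0"
198.51.100.2 - - [13/Feb/2023:15:00:01 +0000] "GET /blog/hello-world HTTP/1.1" 200 512 "-" "Googlebot/2.1"
203.0.113.7 - - [13/Feb/2023:15:00:02 +0000] "GET /blog/hello-world HTTP/1.1" 200 512 "-" "Mozilla/5.0"'

FILTERED=$(printf '%s\n' "$SAMPLE" | drop_matching "${BOTS[@]}" | drop_matching "${FILTERED_IPS[@]}")
printf '%s\n' "$FILTERED"
```

With the sample above, only the first line should survive: the second is dropped as a bot and the third as a filtered IP.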

Running grep -v N times is one way to remove N patterns from the input. Consider building up a single regex with N alternatives, e.g. grep -Evi "bot|facebookexternalhit|crawler", to save on repeated reads of a bunch of log entries. Note the -E: without it, grep treats | as a literal character rather than as alternation.
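
A possible way to build that single regex from the existing BOTS array, sketched under the assumption that none of the entries contain regex metacharacters of their own. The names BOT_REGEX and drop_bots are hypothetical.

```bash
#!/bin/bash
BOTS=("bot" "facebookexternalhit" "crawler")

# Join the array elements with "|" to form one extended regex.
# The IFS change is confined to the subshell of the command substitution.
BOT_REGEX=$(IFS='|'; printf '%s' "${BOTS[*]}")

# Hypothetical helper: one pass over stdin instead of one pass per pattern.
# -E enables alternation with "|", -v inverts, -i ignores case,
# -a is kept from the original script.
drop_bots() {
    grep -aEvi -- "$BOT_REGEX" || true   # "no lines left" is not an error
}

printf '%s\n' "$BOT_REGEX"
printf 'Mozilla/5.0\nGoogleBot/2.1\nsome-crawler\n' | drop_bots
```

The first printf shows the joined pattern; the second pipes three made-up user-agent strings through the filter, of which only Mozilla/5.0 should remain.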

It is unclear why you specified --binary-files=text (grep -a). It merits at least a # comment, and maybe in a separate file a unit test that demonstrates what effect it has.

The grep filtering here is done at a rather coarse level. Consider handing the regex to awk instead, so that you can match a specific field in the log line. This prevents spurious matches on strings like GET /blog/a-tale-of-two-bots.html.
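
A rough sketch of the field-based approach, assuming the log uses Apache's combined format, so that splitting each line on double quotes puts the request line in field 2 and the user agent in field 6. The function name page_views and the sample lines are invented. It matches the path exactly, which is stricter than the original prefix match, and it would need more care if a request or user agent contained escaped quotes.

```bash
#!/bin/bash
# Hypothetical helper: print lines whose request is exactly "GET <page>"
# and whose user agent does not match the (lowercase) bot regex.
page_views() {
    local page=$1 bot_regex=$2
    awk -F'"' -v page="$page" -v bots="$bot_regex" '
        {
            split($2, req, " ")   # req[1] = method, req[2] = path
            if (req[1] == "GET" && req[2] == page && tolower($6) !~ bots)
                print
        }'
}

# Invented sample lines in Apache combined log format.
SAMPLE='198.51.100.1 - - [13/Feb/2023:15:00:00 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
198.51.100.2 - - [13/Feb/2023:15:00:01 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"
198.51.100.3 - - [13/Feb/2023:15:00:02 +0000] "GET /blog/other HTTP/1.1" 200 512 "-" "Mozilla/5.0"'

VIEWS=$(printf '%s\n' "$SAMPLE" | page_views "/blog/a-tale-of-two-bots.html" "bot|facebookexternalhit|crawler")
printf '%s\n' "$VIEWS"
```

Here the first line is kept even though its URL contains "bots", because the bot regex is only tested against the user-agent field.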


UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print $1}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)

Given that we discard UNIQUE_IPS, consider putting wc in that same pipeline, right after uniq.
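
For instance, the single pipeline could look roughly like this. The function name unique_ip_count is hypothetical, and the input lines are made up.

```bash
#!/bin/bash
# Hypothetical helper: count distinct client IPs (first field) on stdin.
unique_ip_count() {
    awk '{print $1}' | sort | uniq | wc -l
}

# Three made-up log lines from two distinct addresses.
COUNT=$(printf '198.51.100.1 a\n198.51.100.2 b\n198.51.100.1 c\n' | unique_ip_count)
echo "Across $((COUNT)) unique IP addresses"
```

The $((COUNT)) expansion only strips the leading padding that some wc implementations emit. sort -u would be an equivalent, slightly shorter spelling of sort | uniq.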


echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)

There are many things in this script that we might have pressed perl into service for. But this one seems odd: bash can do (integer) arithmetic, and either the builtin printf or /usr/bin/printf can format numbers. Spawning perl for this seems a bit heavyweight.

Also: what about division by zero, when the page has no views at all? Maybe append a single blog view line to avoid that?
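
One way the perl call might be replaced, as a sketch only: it scales the ratio to hundredths of a percent so that bash's integer arithmetic is enough, and it uses an explicit zero check rather than the append-a-line idea. The function name percentage and the "n/a" fallback are arbitrary choices, and the total of 136 is an assumed value that is merely consistent with the 47.06 shown in the question.

```bash
#!/bin/bash
# Hypothetical helper: print part/total as a percentage with two decimals,
# using only bash integer arithmetic and the builtin printf.
percentage() {
    local part=$1 total=$2 hundredths
    if [ "$total" -eq 0 ]; then
        # Guard against division by zero.
        printf 'n/a\n'
        return
    fi
    # Work in hundredths of a percent, rounding to nearest.
    hundredths=$(( (part * 10000 + total / 2) / total ))
    printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

echo "Legitimate View Percentage: $(percentage 64 136)"   # prints 47.06
echo "Legitimate View Percentage: $(percentage 0 0)"      # prints n/a
```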

answered Feb 13, 2023 at 16:36
\$\endgroup\$
