Before I learned that GoAccess was a thing, I wanted an analytics solution that I could use locally on my web server. My solution was to write a bash script that would give me some basic info on how my blog is doing. What it attempts to do is to:
- Find all views of a given page (as determined by lines in the access.log)
- Filter out all bots and crawlers
- Filter myself out (as I tend to look over my own site quite often)
- Determine some counts based on the remaining number of lines
What I'm curious about is: are there any faulty assumptions, and are there any glaring inefficiencies in this process?
#!/bin/bash
# Initialize some values
PAGE="1ドル"
LOG="/var/log/apache2/home-site/access.log"
# Words in user agents that hint that the page view is not a person
BOTS=("bot" "facebookexternalhit" "crawler")
# There are other filtered IP's but I've removed them for privacy reasons
FILTERED_IPS=("$(curl -s https://icanhazip.com)")
# Most pages that I care about are <url>/blog/<post name>
# But not all of them are
if [ "${PAGE:0:1}" != "/" ]
then
PAGE="/blog/$PAGE"
fi
echo "$PAGE"
# Get all views for our page
BLOG_VIEWS=$(grep -a "GET $PAGE" "$LOG")
BLOG_VIEW_COUNT=$(echo "$BLOG_VIEWS" | wc -l)
BLOG_VIEWS_FILTERED="$BLOG_VIEWS"
# Clear out the bots
for BOT in "${BOTS[@]}"
do
TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -avi "$BOT")
BLOG_VIEWS_FILTERED="$TEMP"
done
# Clear out any filtered IP addresses
# Usually they're just me
for IP in "${FILTERED_IPS[@]}"
do
TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -av "$IP")
BLOG_VIEWS_FILTERED="$TEMP"
done
TOTAL_VIEWS=$(echo "$BLOG_VIEWS_FILTERED" | wc -l)
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print 1ドル}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
echo "Total Legitimate Views: $TOTAL_VIEWS"
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
echo "Across $UNIQUE_IP_COUNT unique IP addresses"
An example of use looks like the following:
:~$ .scripts/analytics hello-world
/blog/hello-world
Total Legitimate Views: 64
Legitimate View Percentage: 47.06
Across 36 unique IP addresses
1 Answer
This script is not too long, but it's getting long enough that you really ought to be looking for opportunities to break out helper functions.
You have a pair of for loops.
for BOT in "${BOTS[@]}"
...
for IP in "${FILTERED_IPS[@]}"
Each would make a lovely helper function.
Running grep -v N times is one way to remove N patterns from the input. Consider building up a single regex with N alternatives, e.g. grep -Ev "bot|facebookexternalhit|crawler", to save on repeated reads of a bunch of log entries. (Note the -E: plain grep treats an unescaped | as a literal character.)
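Sketching both ideas at once: a small helper that joins the patterns into one alternation and filters stdin in a single pass. The filter_out name and the sample log lines are invented here, and -E assumes GNU or POSIX extended grep:

```shell
#!/bin/bash
# Hypothetical helper: drop stdin lines matching ANY of the given patterns.
# Joining them with | means the input is read once, not once per pattern.
filter_out() {
    local IFS='|'            # "$*" joins its arguments with the first char of IFS
    grep -Eaiv "$*"          # -E for alternation; -a and -i mirror the original flags
}

BOTS=("bot" "facebookexternalhit" "crawler")

# Demo on two fabricated lines: the Googlebot one should be dropped.
printf '%s\n' '1.2.3.4 "Mozilla/5.0"' '5.6.7.8 "Googlebot/2.1"' \
    | filter_out "${BOTS[@]}"
# → 1.2.3.4 "Mozilla/5.0"
```

The same helper serves both loops: pass it "${FILTERED_IPS[@]}" to drop your own IP addresses.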
It is unclear why you specified --binary-files=text (grep -a). It merits at least a # comment, and maybe, in a separate file, a unit test that demonstrates what effect it has.
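For instance, with GNU grep, a NUL byte makes a file count as binary, and matches are then reported with a one-line summary instead of being printed; -a disables that. A throwaway demonstration (the file path is arbitrary):

```shell
# Create a throwaway log whose second line contains a stray NUL byte
printf 'GET /blog/hello HTTP/1.1\n\0garbage\n' > /tmp/demo.log

grep 'GET' /tmp/demo.log     # GNU grep: "Binary file /tmp/demo.log matches"
grep -a 'GET' /tmp/demo.log  # prints the matching line itself
```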
The grep filtering here is done at a rather coarse level. Consider handing the regex to awk instead, so that you can match a specific field in the log line. This prevents spurious matches on strings like GET /blog/a-tale-of-two-bots.html.
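A sketch of that, assuming the Apache "combined" log format, where splitting on double quotes puts the request line in 2ドル and the user agent in 6ドル (the sample entries are fabricated):

```shell
# Two made-up visits to the same post: one human, one bot
log='1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [01/Jan/2024:00:00:01 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"'

# A whole-line grep -v "bot" would discard BOTH entries (the URL contains
# "bots"); testing only the user-agent field keeps the human visit.
printf '%s\n' "$log" | awk -F'"' '$6 !~ /bot|facebookexternalhit|crawler/'
```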
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print 1ドル}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
Given that we discard UNIQUE_IPS, consider putting wc in that same pipeline, right after uniq.
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
There are many things in this script that we might have pressed perl into service for. But this one seems odd, given that bash can do arithmetic and format numbers, or we could use /usr/bin/printf or the builtin printf. This seems a bit heavyweight.
Also: what about division by zero? Maybe append a single blog view line to avoid that?
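A pure-bash sketch using scaled integer arithmetic, guarded against an empty denominator; the variable names are illustrative:

```shell
total=64
count=136

if (( count > 0 )); then
    # Scale by 10000 so two decimal places survive integer division;
    # add count/2 before dividing to round rather than truncate.
    pct=$(( (total * 10000 + count / 2) / count ))
    printf 'Legitimate View Percentage: %d.%02d\n' $(( pct / 100 )) $(( pct % 100 ))
    # → Legitimate View Percentage: 47.06
fi
```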