Before I learned that GoAccess was a thing, I wanted an analytics solution that I could use locally on my web server. My solution was to write a bash script that would give me some basic info on how my blog is doing. What it attempts to do is to:
- Find all views of a given page (as determined by lines in the access.log)
- Filter out all bots and crawlers
- Filter myself out (as I tend to look over my own site quite often)
- Determine some counts based on the remaining number of lines
What I'm curious about is: are there any faulty assumptions, and are there any glaring inefficiencies in this process?
#!/bin/bash
# Initialize some values
PAGE="1ドル"
LOG="/var/log/apache2/home-site/access.log"
# Words in user agents that hint that the page view is not a person
BOTS=("bot" "facebookexternalhit" "crawler")
# There are other filtered IP's but I've removed them for privacy reasons
FILTERED_IPS=("$(curl -s https://icanhazip.com)")
# Most pages that I care about are <url>/blog/<post name>
# But not all of them are
if [ "${PAGE:0:1}" != "/" ]
then
PAGE="/blog/$PAGE"
fi
echo "$PAGE"
# Get all views for our page
BLOG_VIEWS=$(grep -a "GET $PAGE" "$LOG")
BLOG_VIEW_COUNT=$(echo "$BLOG_VIEWS" | wc -l)
BLOG_VIEWS_FILTERED="$BLOG_VIEWS"
# Clear out the bots
for BOT in "${BOTS[@]}"
do
TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -avi "$BOT")
BLOG_VIEWS_FILTERED="$TEMP"
done
# Clear out any filtered IP addresses
# Usually they're just me
for IP in "${FILTERED_IPS[@]}"
do
TEMP=$(echo "$BLOG_VIEWS_FILTERED" | grep -av "$IP")
BLOG_VIEWS_FILTERED="$TEMP"
done
TOTAL_VIEWS=$(echo "$BLOG_VIEWS_FILTERED" | wc -l)
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print 1ドル}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
echo "Total Legitimate Views: $TOTAL_VIEWS"
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
echo "Across $UNIQUE_IP_COUNT unique IP addresses"
An example of use looks like the following:
:~$ .scripts/analytics hello-world
/blog/hello-world
Total Legitimate Views: 64
Legitimate View Percentage: 47.06
Across 36 unique IP addresses
1 Answer
This script is not too long, but it's getting long enough that you really ought to be looking for opportunities to break out helper functions.
You have a pair of for loops.
for BOT in "${BOTS[@]}"
...
for IP in "${FILTERED_IPS[@]}"
Each would make a lovely helper function.
Running grep -v N times is one way to remove N patterns from the input. Consider building up a single regex with N alternatives, e.g. grep -Ev "bot|facebookexternalhit|crawler", to save on repeated reads of a bunch of log entries. (Note the -E: plain grep treats an unescaped | as a literal character.)
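Sketching both ideas at once: a small helper that joins the patterns into one alternation and filters stdin in a single pass. The filter_out name and the sample log lines are invented here, and -E assumes GNU or POSIX extended grep:

```shell
#!/bin/bash
# Hypothetical helper: drop stdin lines matching ANY of the given patterns.
# Joining them with | means the input is read once, not once per pattern.
filter_out() {
    local IFS='|'            # "$*" joins its arguments with the first char of IFS
    grep -Eaiv "$*"          # -E for alternation; -a and -i mirror the original flags
}

BOTS=("bot" "facebookexternalhit" "crawler")

# Demo on two fabricated lines: the Googlebot one should be dropped.
printf '%s\n' '1.2.3.4 "Mozilla/5.0"' '5.6.7.8 "Googlebot/2.1"' \
    | filter_out "${BOTS[@]}"
# → 1.2.3.4 "Mozilla/5.0"
```

The same helper serves both loops: pass it "${FILTERED_IPS[@]}" to drop your own IP addresses.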
It is unclear why you specified --binary-files=text (grep -a). It merits at least a # comment, and maybe, in a separate file, a unit test that demonstrates what effect it has.
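For instance, with GNU grep, a NUL byte makes a file count as binary, and matches are then reported with a one-line summary instead of being printed; -a disables that. A throwaway demonstration (the file path is arbitrary):

```shell
# Create a throwaway log whose second line contains a stray NUL byte
printf 'GET /blog/hello HTTP/1.1\n\0garbage\n' > /tmp/demo.log

grep 'GET' /tmp/demo.log     # GNU grep: "Binary file /tmp/demo.log matches"
grep -a 'GET' /tmp/demo.log  # prints the matching line itself
```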
The grep filtering here is done at a rather coarse level. Consider handing the regex to awk instead, so that you can match a specific field in the log line. This prevents spurious matches on strings like GET /blog/a-tale-of-two-bots.html.
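A sketch of that, assuming the Apache "combined" log format, where splitting on double quotes puts the request line in 2ドル and the user agent in 6ドル (the sample entries are fabricated):

```shell
# Two made-up visits to the same post: one human, one bot
log='1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [01/Jan/2024:00:00:01 +0000] "GET /blog/a-tale-of-two-bots.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"'

# A whole-line grep -v "bot" would discard BOTH entries (the URL contains
# "bots"); testing only the user-agent field keeps the human visit.
printf '%s\n' "$log" | awk -F'"' '$6 !~ /bot|facebookexternalhit|crawler/'
```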
UNIQUE_IPS=$(echo "$BLOG_VIEWS_FILTERED" | awk '{print 1ドル}' | sort | uniq)
UNIQUE_IP_COUNT=$(echo "$UNIQUE_IPS" | wc -l)
Given that we discard UNIQUE_IPS, consider putting wc in that same pipeline, right after uniq.
echo "Legitimate View Percentage:" $( echo "printf('%.2f', ($TOTAL_VIEWS/$BLOG_VIEW_COUNT) * 100)" | perl)
There are many things in this script that we might have pressed perl into service for. But this one seems odd, given that bash can do arithmetic and format numbers, or we could use /usr/bin/printf or the builtin printf. This seems a bit heavyweight.
Also: what about division by zero? Maybe append a single blog view line to avoid that?
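A pure-bash sketch using scaled integer arithmetic, guarded against an empty denominator; the variable names are illustrative:

```shell
total=64
count=136

if (( count > 0 )); then
    # Scale by 10000 so two decimal places survive integer division;
    # add count/2 before dividing to round rather than truncate.
    pct=$(( (total * 10000 + count / 2) / count ))
    printf 'Legitimate View Percentage: %d.%02d\n' $(( pct / 100 )) $(( pct % 100 ))
    # → Legitimate View Percentage: 47.06
fi
```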