Web Log Tools
As in tools for web server logs, not the web logs commonly called "blogs".
In the early 2000s, I was doing a lot of very specific log analysis. At the
time I was "webmaster" for a site with ads. To justify ad sales, the company
paid for a web server log audit service. This provided the main log reports
looked at by the company, but sometimes I'd be called on to explain things.
So I had to dive into the logs and examine them myself.
For that I wrote logprint. Today this tool is not going to be widely
useful; instead people will use an ELK stack and define a grok rule in
the Logstash part of ELK. But the initial release of logstash was 2010,
long after I wrote logprint. What logprint does is parse log files of
various formats (I defined four that I've had to work with; adding more
is an exercise in regular expression writing, same as with grok) into
columns.
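To make "regular expression into columns" concrete, here is a minimal Python sketch for the Apache "common" format. The pattern and the column names are my assumptions for illustration, not the rules logprint actually ships with.

```python
import re

# Assumed pattern for Apache "common" format; not logprint's own rule.
COMMON = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common(line):
    """Return a dict of column name -> text, or None if the line does not match."""
    m = COMMON.match(line)
    return m.groupdict() if m else None

line = ('192.0.2.1 - - [10/Apr/2016:13:55:36 -0700] '
        '"GET /i/logo.png HTTP/1.1" 200 2326')
row = parse_common(line)
print(row["date"], row["status"], row["bytes"])
```

Supporting another format would then mean writing another pattern with the same named groups. Request lines containing escaped quotes are not handled by this simple version.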
Some of those columns can be sub-parsed. For example, the Apache request
line column can be broken down into a method ("GET", "POST", "HEAD",
etc), a URI (the actual requested resource) and an optional protocol
(not present for HTTP 0.9 or present as "HTTP/1.0" or "HTTP/1.1").
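The request line split can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `split_request` and the handling of malformed lines are assumptions, not how logprint does it.

```python
def split_request(request):
    """Split an Apache request line into (method, uri, protocol).

    protocol is None when the client sent an HTTP/0.9 style request
    with no protocol field.
    """
    parts = request.split()
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return parts[0], parts[1], None
    # Malformed request line: keep the raw text as the URI (assumed policy).
    return None, request, None

print(split_request("GET /index.html HTTP/1.1"))
print(split_request("GET /index.html"))
```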
After parsing the line, it can be filtered: only consider requests that
succeeded (2xx), and were over 200,000 bytes; then selectively print
some of the columns for that entry, say date, referer, URI.
# Apache "combined" has referer as a column ("common" does not)
# status >= 200 and status <= 299 is any 2xx response
# @uri will only be the local file name, discarding a full hostname
# on the request line and CGI parameters
logprint -f combined \
-F 'status>=200' -F 'status<=299' -F 'bytes>200000' \
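The filter-then-select step can be sketched in Python as below. The expression syntax mirrors the -F arguments shown above, but the parsing, the helper names `make_filter` and `select`, and the numbers-only comparison are assumptions made for this sketch.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt, "=": operator.eq}

def make_filter(expr):
    """Turn a string such as 'status>=200' into a predicate over a row dict."""
    for symbol in (">=", "<=", ">", "<", "="):  # two-character operators first
        name, found, value = expr.partition(symbol)
        if found:
            # Assumes both sides are integers; real columns may need more care.
            return lambda row: OPS[symbol](int(row[name]), int(value))
    raise ValueError("no operator in " + repr(expr))

def select(rows, exprs, columns):
    """Keep rows passing every filter, then keep only the named columns."""
    filters = [make_filter(e) for e in exprs]
    return [[row[c] for c in columns]
            for row in rows if all(f(row) for f in filters)]

rows = [
    {"date": "10/Apr/2016", "status": "200", "bytes": "250000", "uri": "/big.zip"},
    {"date": "10/Apr/2016", "status": "404", "bytes": "300000", "uri": "/missing"},
    {"date": "11/Apr/2016", "status": "200", "bytes": "1200", "uri": "/i/logo.png"},
]
print(select(rows, ["status>=200", "status<=299", "bytes>200000"], ["date", "uri"]))
```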
Parsing the file part out of the URI when a request has the full URL on
the GET line is an unusual need, but I needed it then and it is still
useful now. The same parsing rules for a full URL are also available
for parsing Referer: headers, which was once useful for pulling out
search terms used from referring search engines.
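As a rough Python equivalent of both uses, the standard library's `urllib.parse` can pull the file part out of a full URL and the search phrase out of a referer. The query parameter name `q` is an assumption (different search engines used different names), and the example hostnames are placeholders.

```python
from urllib.parse import urlsplit, parse_qs

def file_part(uri):
    """Return only the path of a request URI, dropping scheme, host and query."""
    return urlsplit(uri).path

def search_terms(referer, param="q"):
    """Return the search phrase from a referer URL, or None if absent.

    param is the query parameter holding the phrase; "q" is an assumed default.
    """
    values = parse_qs(urlsplit(referer).query).get(param)
    return values[0] if values else None

print(file_part("http://www.example.com/i/logo.png?v=2"))
print(search_terms("http://search.example/search?q=web+log+tools&hl=en"))
```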
logprint is a very handy slice and dice tool for web logs. It can
be combined with another tool I wrote, adder, which aims to be a
full-featured "add values up" tool. You can feed in columns of numbers
and get columns of sums. You can feed in columns of numbers and get a
sum per line. You can feed in value and keyword and get sums per
keyword. That last one is rather useful in combination with logprint:
# using Apache "common" format, find lines with status 200,
# print bytes used and the first directory component of the URI file part
# pipe that to adder,
# suppress column headers,
# use column 0 as value to add,
# and column 1 as label
logprint -f common --filter status=200 -c bytes,file:@path1 $log |
adder -n -b 1 -r 0
That gets output like this (although this was sorted):
So simple to see where the bytes are coming from. Looking at that, I
decided I really should better compress the "apple-touch-icon.png". I'm
not sure I can get "favicon.ico" smaller, at least not with the
features it has now. And the CSS and other icons in /i/ also got some
Then I looked at bytes per day to see if adding a sitemap helped. It
does, but the difference is slight, easy to lose in the weekly cycle.
Usage really picked up in April, didn't it?
$ cat by-day-usage
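log=$1   # assumed: log file is the first argument (assignment missing from the listing)
if [ ! -f "$log" ] ; then echo "$0: usage by-day-usage LOGFILE[.gz]" >&2; exit 2 ; fi
shift
logprint -f common --filter status=200 -c bytes,date "$log" "$@" |
adder -n -b 1 -r 0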
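For anyone without adder at hand, the sum-per-keyword mode used in these pipelines can be approximated with a short Python sketch. The function name, the whitespace splitting, and the choice to skip non-numeric values (such as "-" for bytes) are assumptions; adder's real behavior may differ.

```python
from collections import defaultdict

def sum_by_label(lines, value_col=0, label_col=1):
    """Add up the value column per label, skipping lines that don't parse."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) <= max(value_col, label_col):
            continue  # not enough columns
        try:
            value = int(fields[value_col])
        except ValueError:
            continue  # e.g. "-" where a byte count should be
        totals[fields[label_col]] += value
    return dict(totals)

sample = ["2326 /i", "512 /blog", "1024 /i", "- /blog"]
for label, total in sorted(sum_by_label(sample).items()):
    print(total, label)
```

The defaults match the column roles in the commands above: column 0 is the value and column 1 is the label.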
I'm publishing these log tool scripts for anyone interested in similar
simple log slicing and dicing. It's not webalizer, but it's not trying
to be either.