Web Log Tools
As in tools for web server logs, not the web logs commonly called "blogs".
In the early 2000s, I was doing a lot of very specific log analysis. At the time I was "webmaster" for a site with ads, and to justify ad sales the company paid for a web server log audit service. That service provided the main log reports the company looked at, but sometimes I'd be called on to explain things, so I had to dive into the logs and examine them myself.
logprint. Today this tool is not going to be widely useful; instead, people will use an ELK stack and define a grok rule in the Logstash part of ELK. But the initial release of logstash was 2010, long after I wrote logprint. What logprint does is parse log files of various formats into columns. I defined four formats that I've had to work with; adding more is an exercise in regular expression writing, same as with grok.
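For reference, here is what a single Apache "combined" format log line looks like; logprint's job is to turn lines like this into addressable columns (the line itself is a made-up example):

    # host ident user [date] "request" status bytes "referer" "user-agent"
    127.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0"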
Some of those columns can be sub-parsed. For example, the Apache request line column can be broken down into a method ("GET", "POST", "HEAD", etc.), a URI (the actual requested resource), and an optional protocol (not present for HTTP/0.9, or present as "HTTP/1.0" or "HTTP/1.1").
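That split is easy to express as a plain regular expression, which is all grok rules are, too. A rough sketch of the idea in shell (the pattern is mine, for illustration, and not logprint's actual parser):

    # break a request line into method, URI, and optional protocol
    echo 'GET /index.html HTTP/1.1' |
        sed -E 's|^([A-Z]+) ([^ ]+)( (HTTP/[0-9.]+))?$|method=\1 uri=\2 proto=\4|'
    # prints: method=GET uri=/index.html proto=HTTP/1.1
    # an HTTP/0.9 request ("GET /foo") leaves proto= empty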
After parsing the line, it can be filtered: only consider requests that succeeded (2xx) and were over 200,000 bytes; then selectively print some of the columns for each matching entry, say date, referer, and URI.
# Apache "combined" has referer as a column ("common" does not) # status >= 200 and status <= 299 is any 2xx response # @uri will only be the local file name, discarding a full hostname # on the request line and CGI parameters logprint -f combined \ -F 'status>=200' -F 'status<=299' -F 'bytes>200000' \ -c date,referer,@uri
Parsing out the local file part of the URI when a request has the full URL on the GET line is an unusual need, but I needed it then, and it is still useful now. The same parsing rules for a full URL are also available for parsing Referer: headers, which was once useful for pulling out the search terms used on referring search engines.
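As a sketch of that idea (not logprint's implementation), a search engine referer once looked something like the URL below, and the query could be fished out with little more than a regular expression:

    # pull the q= parameter out of a referring search URL
    # (illustrative only; ignores %XX escapes)
    echo 'http://www.google.com/search?q=web+log+tools&hl=en' |
        sed -E 's|.*[?&]q=([^&]*).*|\1|' | tr '+' ' '
    # prints: web log tools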
logprint is a very handy slice-and-dice tool for web logs. It can be combined with another tool I wrote, adder, which aims to be a full-featured "add values up" tool. You can feed in columns of numbers and get columns of sums. You can feed in columns of numbers and get a sum per line. You can feed in value and keyword pairs and get sums per keyword.
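A tiny illustration of that last mode, using the same adder flags as the example below (the output order is whatever adder produces; I have not sorted it here):

    # column 0 is the value to add, column 1 is the keyword label
    printf '10 a\n5 b\n7 a\n' | adder -n -b 1 -r 0
    # sums per keyword, e.g.:
    # a 17
    # b 5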
That keyword mode is rather useful in combination with logprint:
# using Apache "common" format, find lines with status 200, # print bytes used and the first directory component of the URI file part # pipe that to adder, # suppress column headers, # use column 0 as value to add, # and column 1 as label logprint -f common --filter status=200 -c bytes,file:@path1 $log | adder -n -b 1 -r 0
That gets output like this (though I sorted the output here):
    /u 14415354750
    /favicon.ico 3311323662
    /i 655750249
    /qz 272329622
    /apple-touch-icon.png 218913277
    /index.html 62583501
    /jpeg 49580565
    /qz.css 38188009
It's simple to see where the bytes are coming from. Looking at that, I decided I really should better compress the "apple-touch-icon.png". I'm not sure I can get "favicon.ico" smaller, at least not with the features it has now. The CSS and other icons under /i/ also got some compression.
Then I looked at bytes per day to see if adding a sitemap helped. It does, but the difference is slight and easy to lose in the weekly cycle. Usage really picked up in April, didn't it?
    $ cat by-day-usage
    #!/bin/sh
    log="$1"
    if [ ! -f "$log" ]; then
        echo "$0: usage: by-day-usage LOGFILE[.gz]"
        exit 2
    fi
    shift
    logprint -f common --filter status=200 -c bytes,date "$log" "$@" \
        | adder -n -b 1 -r 0
And graphed with gnuplot.
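Something along these lines turns the by-day-usage output into a picture. This is a sketch that assumes logprint renders the date column as YYYY-MM-DD; the actual format may differ, in which case timefmt needs to match:

    # sum bytes per day, sort by date, and plot the result
    by-day-usage access_log | sort > by-day.dat
    gnuplot <<'EOF'
    set xdata time
    set timefmt "%Y-%m-%d"
    set format x "%b %d"
    set terminal png size 800,480
    set output "by-day.png"
    plot "by-day.dat" using 1:2 with lines title "bytes per day"
    EOF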
I'm publishing these log tool scripts for anyone interested in similar simple log slicing and dicing. It's not webalizer, but it's not trying to be, either.