
Web Log Tools


As in tools for web server logs, not the web logs commonly called "blogs".

In the early 2000s, I was doing a lot of very specific log analysis. At the time I was "webmaster" for a site with ads. To justify ad sales, the company paid for a web server log audit service. This provided the main log reports looked at by the company, but sometimes I'd be called on to explain things. So I had to dive into the logs and examine them myself.

Enter logprint. Today this tool is not going to be widely useful; instead people will use an ELK stack and define a grok rule in the logstash part of ELK. But the initial release of logstash was in 2010, long after I wrote logprint.

What logprint does is parse log files of various formats into columns. I defined four formats that I've had to work with; adding more is an exercise in regular expression writing, same as with grok. Some of those columns can be sub-parsed. For example, the Apache request line column can be broken down into a method ("GET", "POST", "HEAD", etc.), a URI (the actual requested resource), and an optional protocol (absent for HTTP 0.9, or present as "HTTP/1.0" or "HTTP/1.1"). After parsing the line, entries can be filtered (for example, only requests that succeeded with a 2xx status and were over 200,000 bytes) and then selected columns printed for each matching entry, say date, referer, and URI.

# Apache "combined" has referer as a column ("common" does not)
# status >= 200 and status <= 299 is any 2xx response
# @uri will only be the local file name, discarding a full hostname
#      on the request line and CGI parameters
logprint -f combined \
	-F 'status>=200' -F 'status<=299' -F 'bytes>200000' \
	-c date,referer,@uri

Parsing out the file part of a request that puts a full URL on the GET line is an unusual need, but I needed it then and it is still useful now. The same full-URL parsing rules are also available for the Referer: header, which was once useful for pulling out the search terms used at referring search engines.

So logprint is a very handy slice-and-dice tool for web logs. It can be combined with another tool I wrote, adder, which aims to be a full-featured "add values up" tool. You can feed in columns of numbers and get columns of sums. You can feed in columns of numbers and get a sum per line. You can feed in values paired with keywords and get sums per keyword. That last one is rather useful in combination with logprint.

# using Apache "common" format, find lines with status 200,
# print bytes used and the first directory component of the URI file part
#   pipe that to adder,
#     suppress column headers,
#     use column 0 as the value to add,
#     and column 1 as the label
logprint -f common --filter status=200 -c bytes,file:@path1 "$log" |
    adder -n -b 1 -r 0

That gets output like this (sorted by byte count for readability; the sort one-liner follows the listing):

/u      14415354750
/favicon.ico    3311323662
/i      655750249
/qz     272329622
/apple-touch-icon.png   218913277
/index.html     62583501
/jpeg   49580565
/qz.css 38188009
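
The sort is just one more pipe on the end, numeric and descending on the byte column. A sketch, assuming adder keeps the whitespace-separated label-then-value output shown above:

logprint -f common --filter status=200 -c bytes,file:@path1 "$log" |
    adder -n -b 1 -r 0 | sort -k2,2 -rn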

So simple to see where the bytes are coming from. Looking at that, I decided I really should do a better job compressing "apple-touch-icon.png". I'm not sure I can get "favicon.ico" smaller, at least not with the features it has now. And the CSS and other icons in /i/ got some compression too.
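
For the PNG, a lossless recompression pass is the obvious first step; any of optipng, zopflipng, or pngcrush can do that sort of thing. For example:

# losslessly recompress the icon; optipng at its highest optimization
# level is one of several tools that can do this
optipng -o7 apple-touch-icon.png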

Then I looked at bytes per day to see if adding a sitemap helped. It does, but the difference is slight and easy to lose in the weekly cycle. Usage really picked up in April, didn't it?

[graph: bytes per day]
$ cat by-day-usage
#!/bin/sh
# sum the bytes of all status-200 responses, per day, across the given logs
log="$1"
if [ ! -f "$log" ] ; then echo "usage: $0 LOGFILE[.gz] [LOGFILE...]" >&2; exit 2 ; fi
shift
logprint -f common --filter status=200 -c bytes,date "$log" "$@" |
    adder -n -b 1 -r 0

And graphed it with gnuplot.
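
Roughly like this; only a sketch, since it assumes the by-day-usage output has the date label first (in Apache's DD/Mon/YYYY form) and the byte total second, already in date order, with access_log standing in for the real log file:

# assumed: by-day.dat holds "DD/Mon/YYYY bytes" lines, one per day, in order
./by-day-usage access_log > by-day.dat
gnuplot <<'EOF'
set terminal png size 800,400
set output 'bytes-per-day.png'
set xdata time
set timefmt "%d/%b/%Y"
set format x "%b %d"
set ylabel "bytes"
plot 'by-day.dat' using 1:2 with lines title 'bytes per day'
EOF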

So I'm publishing these log tools for anyone interested in similar simple log slicing and dicing. They're not awstats or webalizer, but they're not trying to be either.