QZ qz thoughts
a blog from Eli the Bearded
Tag search results for administrivia Page 1 of 3

Web Log Tools

As in tools for web server logs, not the web logs commonly called "blogs".

In the early 2000s, I was doing a lot of very specific log analysis. At the time I was "webmaster" for a site with ads. To justify ad sales, the company paid for a web server log audit service. This provided the main log reports looked at by the company, but sometimes I'd be called on to explain things. So I had to dive into the logs and examine them myself.

Enter logprint. Today this tool is not going to be widely useful, instead people will use an ELK stack and define a grok rule in the logstash part of ELK. But initial release of logstash was 2010, long after I wrote logprint.

What logprint does is parse log files of various formats — I defined four that I've had to work with, adding more is an excerise regular expression writing, same as with grok — into columns. Some of those columns can be sub-parsed. For example, the Apache request line column can be broken down into a method ("GET", "POST", "HEAD", etc), a URI (the actual requested resource) and an optional protocol (not present for HTTP 0.9 or present as "HTTP/1.0" or "HTTP/1.1"). After parsing the line, it can be filtered: only consider requests that succeeded (2xx), and were over 200,000 bytes; then selectively print some of the columns for that entry, say date, referer, URI.

# Apache "combined" has referer as a column ("common" does not)
# status >= 200 and status <= 299 is any 2xx response
# @uri will only be the local file name, discarding a full hostname
#      on the request line and CGI parameters
logprint -f combined \
	-F 'status>=200' -F 'status<=299' -F 'bytes>200000' \
	-c date,referer,@uri

Things like parsing the file part into URI when you get a request with the full URL on the GET line is an unusual need, but I needed it then and it is still useful now. The same parsing rules for a full URL there are also available for parsing Referer: headers, which was once useful for pulling out search terms used from referring search engines.

So logprint is a very handy slice and dice tool for web logs. It can be combined with another tool I wrote, adder which aims to be a full featured "add values up" tool. You can feed in columns of numbers and get columns of sums. You can feed in columns of numbers and get a sum per line. You can feed in value and keyword and get sums per keyword. That last one is rather useful in combination with logprint.

# using Apache "common" format, find lines with status 200,
# print bytes used and the first directory component of the URI file part
#   pipe that to adder,
#       suppress column headers, 
#	use column 0 as value to add,
#	 and column 1 as label
logprint -f common --filter status=200 -c bytes,file:@path1 $log |
    adder -n -b 1 -r 0

That gets output like this (although this was sorted):

/u      14415354750
/favicon.ico    3311323662
/i      655750249
/qz     272329622
/apple-touch-icon.png   218913277
/index.html     62583501
/jpeg   49580565
/qz.css 38188009

So simple to see where the bytes are coming from. Looking at that, I decied I really should better compress the "apple-touch-icon.png". I'm not sure I can get "faveicon.ico" smaller, at least not with the features it has now. And the CSS and other icons in /i/ also got some compression.

Then I looked at bytes per day to see if adding a sitemap helped. It does, but the difference is slight, easy to lose in the weekly cycle. Usage really picked up in April, didn't it?

bytes per day graph
$ cat by-day-usage
if [ ! -f "$log" ] ; then echo "$0: usage by-day-usage LOGFILE[.gz]"; exit 2 ; fi
logprint -f common --filter status=200 -c bytes,date $log $@ |
    adder -n -b 1 -r 0

And graphed with gnuplot

So I'm publishing these log tool scripts for anyone interested in similar simple log slicing and dicing. It's not awstats or webalizer but it's not trying to be either.

Sitemap Plugin

About a month ago I posted about finding a lot of Blosxom plugins on github. I've been looking at some them. There are a family of them, one original and a few modifications to the original, for enabling comments. I have not gotten those to work: best I've gotten is I can leave comments, but not see them on pages. I may end up writing my own, so that it follows my idea of what is needed for comments.

But also in that batch of plugins was one for Google Sitemap. The documentation is non-existant in the repo. Searching the web I did find blog posts from the author, in Japanese. From those I gather the version in the github repo is the old, memory intensive way to build a sitemap. I didn't find the new improved version.

I decided to make do. The gsitemap plugin is dead simple. It just sets a few variables to use in templates, and then when the sitemap flavor is desired, disables pagination. The rest of the magic happens in the flavour templates.

As part of making do, I'm not going to reference or link to the URL that generates the output (if you are reading and want to use this for your own site, the gsitemap plugin in the original configuration would generate it for https://example.com/blosxom/index.xml, assuming /blosxom/ is the root of your Blosxom blog).

Instead I've reconfigured the templates to generate a fragment of a sitemap XML file, changed the flavour to sitemap from xml, and have scripted up a sitemap builder for the whole qaz.wtf site that curls the proper URL and includes() the xml fragment. Blosxom then can remain the source of truth for blog permalinks while find and some per-directory configuration can build URLs for other parts of the site.

I decided I should run that script from SAVE-DATES.sh under the theory that any time I save post timestamps is a likely time I want to rebuild the sitemap. This works for qaz.wtf because the blog is the only thing updating more frequently than monthly, and I typically run SAVE-DATES.sh shortly after posting an entry.

This is all prompted by looking (again) at just how much bot traffic the site gets. I figure a sitemap will stop well-behaved bots from crawling as much or as frequently. And for non-well-behaved bots, I've belt and suspendered things by adding entries to robots.txt and more user-agents to my browser_block plugin.

Similarly in the name of improving search engine interaction, I've got a new (trivial) plugin called extrameta that gets used by other plugins, namely the newly modified tags plugin and pagination plugin to add a <meta name="robots" content="noindex"> header (in a naive way) to search result pages, to avoid duplicated content.

May Updates

A bunch of things of small changes this first week of May.

  1. The qzpostfilt tool has a bug fix with inline <code> and a new close & open paragraph operator: .pp
  2. The is a search box on the blog to run tag searches
  3. The aaa_tags and paginateqz plugins have slight changes to make search results and later pages marked as such
  4. There are flavour template and css changes to go with that
  5. I have finished going through all old blog posts and fixing the easily fixable links.
  6. I have gone through all the old blog posts and fixed non-UTF-8 characters and title formatting.

Twitter Redux

To go along with the revived blog, I'm going to try to revive my Twitter account, also long dormant. I've added it to the side bar under the "contact" heading.

I've been working on blog stuff almost every day for the last week, but there's not a lot to show for it. Most of the work has been going through old posts, looking for dead links to fix or update. That's slow and not very visible work, yet afterwards I feel like I've done stuff for the blog and don't need to do more.

The other thing I've been doing is figuring out why UTF-8 works in tags but not in body text. And the answer seems to be the Perl module FileHandle used by blosxom, which is a wrapper around IO::File these days, doesn't respect the use open IO => ':utf8' pragma. I've filed a bug report but I'll probably have to modify the code to fix that in a more timely fashion. sigh I was hoping to not have to make so many updates to the blosxom program itself.