QZ qz thoughts
a blog from Eli the Bearded

A Protein Alphabet


Mark Howarth, having looked at a lot of protein visualizations, realized that there is enough diversity to easily create an alphabet of protein shapes.

This is all the more impressive since the proteins are all three-dimensional shapes and only look like letters from certain angles.

I used the 3-D visualizer at the protein database he links to in order to create my own images of the Q (3SZV, Pseudomonas aeruginosa OccK3) and Z (4BTA, Peptide(pro-pro-gly)3 bound complex of N-terminal domain and peptide substrate binding domain of prolyl-4 hydroxylase (residues 1-244) type I) letters.

QZ protein images

And then I made a QZ logo for the blog out of them. It's the first new logo in a while, and currently the largest one in the collection (by file size). Very, very few of the more than one thousand QZ images in the logo collection are in color, which makes keeping them small much easier.

Web Log Tools


As in tools for web server logs, not the web logs commonly called "blogs".

In the early 2000s, I was doing a lot of very specific log analysis. At the time I was "webmaster" for a site with ads. To justify ad sales, the company paid for a web server log audit service. This provided the main log reports looked at by the company, but sometimes I'd be called on to explain things. So I had to dive into the logs and examine them myself.

Enter logprint. Today this tool is not going to be widely useful; instead, people will use an ELK stack and define a grok rule in the logstash part of ELK. But the initial release of logstash was in 2010, long after I wrote logprint.

What logprint does is parse log files of various formats — I defined four that I've had to work with; adding more is an exercise in regular expression writing, same as with grok — into columns. Some of those columns can be sub-parsed. For example, the Apache request line column can be broken down into a method ("GET", "POST", "HEAD", etc.), a URI (the actual requested resource), and an optional protocol (not present for HTTP/0.9, or present as "HTTP/1.0" or "HTTP/1.1"). After parsing the line, it can be filtered: only consider requests that succeeded (2xx) and were over 200,000 bytes; then selectively print some of the columns for that entry, say date, referer, and URI.

# Apache "combined" has referer as a column ("common" does not)
# status >= 200 and status <= 299 is any 2xx response
# @uri will only be the local file name, discarding a full hostname
#      on the request line and CGI parameters
logprint -f combined \
	-F 'status>=200' -F 'status<=299' -F 'bytes>200000' \
	-c date,referer,@uri

Parsing the file part into a URI when you get a request with the full URL on the GET line is an unusual need, but I needed it then and it is still useful now. The same parsing rules for a full URL are also available for the Referer: header, which was once useful for pulling out the search terms sent by referring search engines.
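
As a rough sketch of that search-term idea: the logprint invocation below uses only flags shown above (-f, -F, -c), and everything after the first pipe is ordinary Unix text mangling, so the q= parameter name and the sed expressions are illustrative rather than anything logprint itself knows about (the terms also come out still URL-encoded).

# pull Referer: values for successful requests, keep the ones that look
# like search results, and tally the q= query terms
logprint -f combined -F 'status>=200' -F 'status<=299' -c referer $log |
    grep -i '[?&]q=' |
    sed -e 's/.*[?&]q=//' -e 's/[&#].*//' |
    sort | uniq -c | sort -rn | head -20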

So logprint is a very handy slice and dice tool for web logs. It can be combined with another tool I wrote, adder, which aims to be a full-featured "add values up" tool. You can feed in columns of numbers and get columns of sums. You can feed in columns of numbers and get a sum per line. You can feed in value and keyword pairs and get sums per keyword. That last one is rather useful in combination with logprint.
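
To make that last mode concrete before combining the two tools, here is a tiny hedged example of adder on its own, using the same flags as the pipeline below (-n, -b, -r); the exact spacing of the output is a guess based on the listing further down.

# column 1 is the label (-b 1), column 0 is the value to add (-r 0),
# and -n suppresses column headers
printf '120 gif\n80 css\n300 gif\n50 css\n' | adder -n -b 1 -r 0
# expected shape of the result:
# gif     420
# css     130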

# using Apache "common" format, find lines with status 200,
# print bytes used and the first directory component of the URI file part
#   pipe that to adder,
#     suppress column headers,
#     use column 0 as value to add,
#     and column 1 as label
logprint -f common --filter status=200 -c bytes,file:@path1 $log |
    adder -n -b 1 -r 0

That gets output like this (although this was sorted):

/u      14415354750
/favicon.ico    3311323662
/i      655750249
/qz     272329622
/apple-touch-icon.png   218913277
/index.html     62583501
/jpeg   49580565
/qz.css 38188009

So simple to see where the bytes are coming from. Looking at that, I decided I really should compress the "apple-touch-icon.png" better. I'm not sure I can get "favicon.ico" smaller, at least not with the features it has now. The CSS and other icons in /i/ also got some compression.
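
A plausible way to do that compression (not necessarily what was actually used), assuming the files live under the document root with the names from the listing above: optipng is lossless, while the commented-out pngquant line is lossy but usually much smaller.

# recompress the touch icon and the other PNGs under /i/ in place
optipng -o7 apple-touch-icon.png
find i -name '*.png' -exec optipng -o7 {} +
# lossy alternative for the touch icon:
# pngquant --skip-if-larger --force --ext .png apple-touch-icon.png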

Then I looked at bytes per day to see if adding a sitemap helped. It does, but the difference is slight, easy to lose in the weekly cycle. Usage really picked up in April, didn't it?

bytes per day graph
$ cat by-day-usage
#!/bin/sh
log="$1"
if [ ! -f "$log" ] ; then echo "usage: $0 LOGFILE[.gz]" >&2 ; exit 2 ; fi
shift
logprint -f common --filter status=200 -c bytes,date "$log" "$@" |
    adder -n -b 1 -r 0

And graphed it with gnuplot.
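
The actual plot commands aren't shown above, so here is a minimal sketch of feeding the by-day-usage output to gnuplot; the file names are placeholders, it assumes logprint emits dates in Apache's DD/Mon/YYYY form (adjust timefmt if not), and it assumes the output has been sorted by date.

./by-day-usage access.log > by-day.dat
gnuplot <<'EOF'
set terminal png size 900,400
set output 'bytes-per-day.png'
set xdata time
set timefmt "%d/%b/%Y"
set format x "%b %d"
set ylabel "bytes per day"
plot 'by-day.dat' using 1:2 with lines title 'status 200 bytes'
EOF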

So I'm publishing these log tool scripts for anyone interested in similar simple log slicing and dicing. It's not awstats or webalizer but it's not trying to be either.

So, Why Blosxom?


Although it was moderately popular when new, calling Blosxom a dead blog tool now is fairly accurate. No one is using it for new sites, and many former power users (people dedicated and involved enough to write plugins) have abandoned the platform. So it probably bears answering "Why do I still want it?"

Here are some of the things Blosxom has going against it:

  1. No active development to lean on for community improvements.
  2. Somewhat simplistic hook model for plugins.
  3. Very rudimentary interpolation engine.
  4. Very easy to accidentally change posting time on posts.
  5. Without plugins, lacks many features considered standard now:
    • Comments
    • Post composer
    • Search
    • Search engine tools like sitemap support
    • Cookies for analytics, user preferences, and/or user logins

The main selling points Rael Dornfest had for Blosxom, as I remember them, were:

  1. Edit posts without a post composer.
  2. Import and export of posts is trivial since they are all just individual text files.
  3. Small code base with easy install on your own server.
  4. Simple to create plugins.

Most of those are not things I think people appreciate. GUI composers are very common these days, some more WYSIWYG than others, but having buttons for bolding, dialogs for links, etc., seems to be a thing people want. And maintaining code and installing things on an Internet server seem to be things people don't want. You can get started on Tumblr in seconds after getting an email address and an Internet connection. Finding somewhere to install a Perl script, configuring it to work with a web server, and then figuring out "how do you add images?" is too much.

So you've got deliberate features people don't care about, and drawbacks people will quickly notice. Blosxom is a hard sell these days.

But for me, it is what I have always done. My first forays into web page construction were composed in vi, served by NCSA HTTPd, and viewed in Mosaic on university computers. From there I moved to a Unix shell account on an ISP by 1996. I had my own personal colocated server serving content on my own domain name by 1997, and was saturating a T1 at times by 1998. All of that original work was 100% my own HTML and CGI coding. I wrote a CGI library in 1999 that I still use for some personal projects.

For work, I've used blogging tools like Movable Type and WordPress. I've used content management systems like Plone and Drupal. I've stored content in Berkeley DB files, MySQL, and Postgres. I've worked with content accelerators like Varnish, Fastly (which is basically Varnish managed by someone else), and Akamai. I understand how and why to scale horizontally or vertically in Internet deployments.

I like coming back to the simplicity of knowing how every bit of the page gets transmitted, from the first line of the HTTP request to the closing </BODY> tag in the HTML. That's what I get out of Blosxom. It is tiny and knowable. I have to do more work to enable things, but I know what that work is, or how to find out. Until I did it for the sitemap tool last week, I had never actually built a sitemap, only parsed them for site scraping. It wasn't a hard task from deciding to do it to having it completed, even if it was a task that wouldn't be necessary with other tools.

Last month, May 2020, qaz.wtf moved 21,650,398,268 bytes in web content (not counting headers), an average of 8,083 bytes per second all month long. Most of that (68.9%) was from the Unicode Toys, which are configured to send compressed-on-the-fly, 100% text HTML generated by CGI scripts I've 100% written. The second biggest top-level item was /favicon.ico, sadly a binary file with a crappy name because Microsoft invented that concept. Third was this blog (including images and CSS). If I were using WordPress, I doubt the pages would be as small, and if I were using Medium, the JavaScript alone per page would be more than all of my blog HTML content put together.

I'm happy controlling it all and knowing where my byte budget goes.

Sitemap Plugin


About a month ago I posted about finding a lot of Blosxom plugins on github. I've been looking at some of them. There is a family of them, one original and a few modifications to it, for enabling comments. I have not gotten those to work: the best I've gotten is that I can leave comments, but not see them on pages. I may end up writing my own, so that it follows my idea of what is needed for comments.

But also in that batch of plugins was one for Google Sitemap. The documentation in the repo is non-existent. Searching the web, I did find blog posts from the author, in Japanese. From those I gather the version in the github repo is the old, memory-intensive way to build a sitemap. I didn't find the new improved version.

I decided to make do. The gsitemap plugin is dead simple. It just sets a few variables to use in templates, and then, when the sitemap flavour is requested, disables pagination. The rest of the magic happens in the flavour templates.

As part of making do, I'm not going to reference or link to the URL that generates the output (if you are reading and want to use this for your own site, the gsitemap plugin in the original configuration would generate it for https://example.com/blosxom/index.xml, assuming /blosxom/ is the root of your Blosxom blog).

Instead, I've reconfigured the templates to generate a fragment of a sitemap XML file, changed the flavour from xml to sitemap, and scripted up a sitemap builder for the whole qaz.wtf site that curls the proper URL and includes the XML fragment. Blosxom can then remain the source of truth for blog permalinks while find and some per-directory configuration build URLs for other parts of the site.
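
The actual builder isn't reproduced here, but a sketch of the idea might look like the following; every path and URL is a placeholder (including the index.sitemap guess for the flavour URL), and the real per-directory configuration is more involved than a bare find.

#!/bin/sh
# sketch of a site-wide sitemap builder: wrap the Blosxom-generated
# fragment plus one <url> entry per static HTML file
site=https://example.com
docroot=/var/www/htdocs

{
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'

    # Blosxom remains the source of truth for blog permalinks
    curl -s "$site/blosxom/index.sitemap"

    # everything else: one entry per HTML file under the document root
    # (-printf is GNU find)
    find "$docroot" -name '*.html' -printf '%P\n' |
    while read -r path ; do
        printf '<url><loc>%s/%s</loc></url>\n' "$site" "$path"
    done

    echo '</urlset>'
} > "$docroot/sitemap.xml"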

I decided I should run that script from SAVE-DATES.sh under the theory that any time I save post timestamps is a likely time I want to rebuild the sitemap. This works for qaz.wtf because the blog is the only thing updating more frequently than monthly, and I typically run SAVE-DATES.sh shortly after posting an entry.

This is all prompted by looking (again) at just how much bot traffic the site gets. I figure a sitemap will stop well-behaved bots from crawling as much or as frequently. And for non-well-behaved bots, I've belt-and-suspendered things by adding entries to robots.txt and more user-agents to my browser_block plugin.
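
The robots.txt side of that amounts to a Sitemap pointer plus per-crawler blocks, along these lines; the document root path and the crawler name are placeholders.

# append a sitemap pointer and a block for one misbehaving crawler
cat >> /var/www/htdocs/robots.txt <<'EOF'
Sitemap: https://example.com/sitemap.xml

User-agent: ExampleBadBot
Disallow: /
EOF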

Similarly, in the name of improving search engine interaction, I've got a new (trivial) plugin called extrameta that gets used by other plugins, namely the newly modified tags plugin and the pagination plugin, to add a <meta name="robots" content="noindex"> header (in a naive way) to search result pages, to avoid duplicated content.