QZ qz thoughts
a blog from Eli the Bearded

Bot Traffic, Again

One of the annoying things I had happen the last time this blog was in active use was getting hammered by a rogue bot. It has happened again.

blog hits from 12am March 1st to 2pm March 9th: 35121
blog hits in that time not from bots: 528

Hits by bot:

27543 "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
4998 "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
1001 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
449 "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
216 "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"
114 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
110 "istellabot/t.1.13"
74 "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
65 "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)"
37 "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
34 "Mozilla/5.0 (compatible; SemrushBot/1.0~bm; +http://www.semrush.com/bot.html)"
32 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot)"
22 "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)"
16 "PHP-Curl-Class/8.0.1 (+https://github.com/php-curl-class/php-curl-class) PHP/7.0.33-0ubuntu0.16.04.12 curl/7.47.0"
16 "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
16 "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
16 "SearchAtlas.com SEO Crawler"
13 "Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
12 "Mozilla/5.0 (compatible; Linespider/1.1; +https://lin.ee/4dwXkTH)"
11 "Jigsaw/2.3.0 W3C_CSS_Validator_JFouffa/2.0 (See <http://validator.w3.org/services>)"
10 "Validator.nu/LV http://validator.w3.org/services"
10 "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; AspiegelBot)"
10 "Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)"
9 "Mozilla/5.0 (compatible; SEOkicks; +https://www.seokicks.de/robot.html)"
7 "Googlebot-Image/1.0"
6 "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106"
4 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
4 "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
2 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)"
2 "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebit/53.7.36 (KHTML, like Gecko) Chrome/63.0.3239.0 Safari/537.36 (compatible; Linespider/1.1; +https://lin.ee/4dwXkTH)"
2 "Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)"
2 "Mozilla/5.0 (compatible;AspiegelBot)"
2 "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)"
2 "ltx71 - (http://ltx71.com/)"
1 "W3C_Validator/1.3 http://validator.w3.org/services"
1 "DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot)"

Hits by non-bots: 38 unique User-Agents (across ~500 hits)
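
Counts like these can be pulled out of a web server log with a short pipeline. This is a minimal sketch, assuming an Apache-style "combined" access log where the User-Agent is the last double-quoted field on each line; the function name and the log path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: tally hits per User-Agent from a "combined" format access log.
# Assumption: splitting a line on double quotes puts the User-Agent in
# field 6 (request is field 2, referer is field 4).
hits_by_agent() {
    awk -F'"' '{ print "\"" $6 "\"" }' "$1" |
        sort | uniq -c | sort -rn
}

# Usage (hypothetical log path):
#   hits_by_agent /var/log/apache2/access.log
```

A request line containing an escaped double quote would throw the field count off, so treat the numbers as approximate.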

One user agent really stands out. And one other is suspicious. I'm talking about the two that hit my site more than world-famous Google.

I don't know everything MJ12bot does, but I do know one thing it does is power paid access to "incoming" links reports via "Majestic Site Explorer": "Access raw exports from £79.99 a month". So let me get this straight: you crawl sites to sell people lists of who links to them? Why should I waste my bandwidth giving you pages?

But clearly it is Megaindex that is abusive. At the .com version of the site I read "MegaIndex is a powerful and versatile competitive intelligence suite for online marketing, from SEO and PPC to social media and advertising research." Again, this is a bullshit use of my resources (bandwidth, web server CPU) for some commercial enterprise that cannot benefit me.

So: another new plugin is born, browser_block. Goodbye Megaindex. Goodbye Majestic.
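
The plugin itself is Perl, like every Blosxom plugin, but the idea fits in a few lines. Here is a rough sketch in shell, assuming a CGI environment (where the web server passes the client's User-Agent header as HTTP_USER_AGENT). The function name and the 403 response are illustrative choices, not a description of what browser_block actually does.

```shell
#!/bin/sh
# Sketch only: decide whether a User-Agent string belongs to an unwanted
# crawler. is_blocked_agent is a hypothetical name; the two patterns are
# the bots named in this post.
is_blocked_agent() {
    case "$1" in
        *MegaIndex*|*MJ12bot*) return 0 ;;   # blocked
        *)                     return 1 ;;   # allowed
    esac
}

# Usage in a CGI wrapper (illustrative):
#   if is_blocked_agent "${HTTP_USER_AGENT:-}"; then
#       printf 'Status: 403 Forbidden\r\n'
#       printf 'Content-Type: text/plain\r\n\r\n'
#       printf 'Not for you.\n'
#       exit 0
#   fi
```

Matching on a substring keeps working when the bot bumps its version number.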

New Plugins!

I've been thinking about writing a tag plug-in, and making that combined pagination / permalinks plug-in I mentioned earlier. But I put that off in order to write some smaller plugins to get myself more familiar with the code.

Both of the new plugins are very similar to the sample one in the plugin developer documentation. One of them makes blosxom work the way I think it should: it keeps the templates out of the blog data directory. (Blosxom keeps them with the data so that you can have different templates in different subdirectories if you want. I don't want.) The other replaces the template reader to substitute new templates, but only on April Fool's Day (April 1st).

plugins/flavour_dir takes a single configuration parameter to hold the location for the flavours. Now my git tree will match how the blog actually is. (I do not keep the blog posts in git.)

notused-plugins/april_fools takes a single configuration parameter to hold the name of the new flavour to use on April Fool's Day.
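
The real plugin is Perl and replaces Blosxom's template reader, but the date test at its core is simple enough to sketch in shell. Everything named here (pick_flavour, the flavour names in the usage comment) is hypothetical.

```shell
#!/bin/sh
# Sketch: choose a flavour name by date. The optional third argument
# (MMDD) exists only so the logic can be exercised on any day; without
# it, today's date is used.
pick_flavour() {
    normal=$1
    fools=$2
    today=${3:-$(date +%m%d)}
    if [ "$today" = "0401" ]; then
        printf '%s\n' "$fools"
    else
        printf '%s\n' "$normal"
    fi
}

# Usage (hypothetical flavour names):
#   flavour=$(pick_flavour html upsidedown)
```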

And while I was at it, I've updated the css file to make <code> and <pre> tags much more visible; fixed the styles to be more phone friendly (but there's more to do); added a copyright notice; and removed all of the HTML validation errors (there are still some warnings).

Once I get tags and tag searching working, I think this will be much better. But it's coming along.

About Post Ordering

One thing that is tricky with Blosxom is that posts are (unless changed by plugin) ordered strictly by modification time. At first blush, this seems very reasonable. But eventually one finds problems.

A naïve backup and restore can totally trash file modification times, and leave everything in a random, unpredictable order.

# don't backup Blosxom like this, restore will be broken
cp blog/* /mnt/backup-disk/blog/
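
The trouble with that cp is that, without options, the copies get the time of the copy as their modification time. Here is a sketch of a gentler version, using the standard -p option (which preserves modification times, among other things). The function name is made up and the paths in the usage comment are placeholders; rsync -a or tar would do the job just as well.

```shell
#!/bin/sh
# Sketch: copy posts while keeping their modification times, so that a
# restore leaves Blosxom's post order intact.
# backup_posts is a hypothetical name. Like the cp above, it only copies
# the files directly inside the source directory.
backup_posts() {
    src=$1
    dest=$2
    mkdir -p "$dest" &&
        cp -p "$src"/* "$dest"/
}

# Usage (placeholder paths):
#   backup_posts blog /mnt/backup-disk/blog
```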

Then there are subtler things. In my previous use of this blog, I replaced the sorting plugin to show most recent posts on the first page, and alphabetical posts on all category pages. That suited my "bookmark blog" usage very well. It doesn't suit my current usage, and I disabled that sort when I restarted.

But when I restarted I also moved the hostname. And I edited the old posts that had references to the hostname. I edited the old posts.... I lost the original timestamps on those two posts, and I don't know when they are from. I probably have the timestamps in a tar backup somewhere, but that's not handy since it's old. (Panix has backups, too, but not in a format easy to get an old timestamp from.) I "solved" that by using touch to change the dates back to the same month as my last posts. It's not right, but it's a lot more right than February 2020.

Enter my quick fix for now, a tool to save time stamps in a way that is very easy to restore from. I call it SAVE-DATES and it's a wrapper around a tool I have not used for anything else in many years: super ls (sls). Quoting the README I wrote for it:

In 1989 this program was released to the world in the form of being posted to the Usenet group comp.sources.unix. It was in volume 18 of the archives, which probably exist out on some dusty ftp server still. (But a quick search did not find such an ftp server in October 2017. The links run cold at the Internet Systems Consortium's sources.isc.org.)

Many of the features of sls were new at the time, but are less radical now, with the growth of interpreted languages for system admin use. Languages like Perl (described as a "replacement" for awk and sed when posted to c.s.u, and archived in volume 13) and Python provide native access to stat(2). And now there's a straight-out stat(1) in GNU coreutils.

I found this program in the archives somewhere around 1993 and immediately took a liking to it, and patched it many times for my own needs. Today I barely use it, sls having fallen by the wayside due to several factors: a non-standard interface (the printf-style output control isn't really printf(3)), the rampant buffer overrun problems in the code, and the more powerful all-in-one output features available with Perl.

The heart of sls that I still really like is the easy way I can turn a simple directory listing into usable code. And that's how my quick fix works. SAVE-DATES uses sls to create a shell script of touch commands that will restore dates. The whole thing can be run, or lines can be grepped out to be used selectively (and more speedily).

#              [[CC]YY]MMDDhhmm[.ss]
export SLS_DATEFMT=%Y%m%d%H%M.%S
# use sls to make shell commands to reset date/time on files
slsoutstyle='touch -t %m %n'
find "$datadir" -type f -exec sls -p "$slsoutstyle" {} + > "$savefile"

The environment variable sets the date format to the one touch -t wants. Then I have the output format of sls include the fixed strings needed for touch to run, plus the modification time (%m) and name (%n). Note that the names are raw, so it relies on me picking "safe" names for post files, but that's my habit anyway. The output looks like this:

touch -t 202002222054.00 /net/u/3/e/eli/public_html/qz/data/blog/feb-2020-a-reboot.txt
touch -t 202002230325.45 /net/u/3/e/eli/public_html/qz/data/blog/how-its-built.txt
touch -t 202002240050.59 /net/u/3/e/eli/public_html/qz/data/blog/haircuts.txt
touch -t 202002250217.49 /net/u/3/e/eli/public_html/qz/data/blog/first-patch.txt
touch -t 202002270203.01 /net/u/3/e/eli/public_html/qz/data/blog/very-impressive-phone.txt
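
For anyone without sls installed, roughly the same output can be produced with more common tools. This is a sketch, assuming GNU date (whose -r FILE option prints a file's modification time). save_dates_nosls is a made-up name, and like the original it does no quoting of file names.

```shell
#!/bin/sh
# Sketch: print one "touch -t" command per file, without sls.
# Assumes GNU date, where "-r FILE" means "use FILE's modification time".
# File names are emitted raw, so this relies on "safe" names too.
save_dates_nosls() {
    find "$1" -type f -exec sh -c '
        for f do
            printf "touch -t %s %s\n" "$(date -r "$f" +%Y%m%d%H%M.%S)" "$f"
        done
    ' sh {} +
}

# Usage (variables as in the SAVE-DATES excerpt):
#   save_dates_nosls "$datadir" > "$savefile"
```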

Date restore is easy-peasy now.
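
Since the save file is just a shell script of touch commands, restoring means running it, whole or in part. This is a sketch with a hypothetical helper name; the save file is whatever SAVE-DATES wrote.

```shell
#!/bin/sh
# Sketch: restore saved dates from a file of "touch -t" lines.
# restore_dates is a hypothetical helper. With one argument it replays
# every line; with a second argument it replays only the lines that
# contain that fixed string.
restore_dates() {
    savefile=$1
    pattern=${2:-}
    if [ -n "$pattern" ]; then
        grep -F -- "$pattern" "$savefile" | sh
    else
        sh "$savefile"
    fi
}

# Usage:
#   restore_dates "$savefile"            # everything
#   restore_dates "$savefile" haircuts   # just the matching posts
```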

First patch!

A couple of things I want here: permalinks to posts and pagination to browse through older posts. I had the cooluri plugin for permalinks and the pagination plugin for, well, pagination. Each worked on its own, but installed with all the others I was having issues.

I dug in and figured it out. I've renamed cooluri to nowcooluri, which forces it to run after menu (otherwise on permalink pages the side menu counts are all broken), and have it set a global $use_permalink when its filtering is in place. And if $use_permalink is set, pagination is disabled in paginationqz (renamed just to show it's a local version).

And when $use_permalink is set, the category filtering in blosxom itself is disabled, too. See 2020-02-24-permalinks.patch

Longer term, I think I will combine the permalink and pagination into a single plug-in, so that on a page for a single entry I have next/previous links to newer and older single entries.

PS: github.com/Eli-the-Bearded/qaz-blog/