Sitemap Plugin
About a month ago
I posted about finding a lot of Blosxom plugins
on github. I've been looking at some of them. There is a family of them, one
original and a few modifications of it, for enabling comments. I have not
gotten those to work: the best I've managed is that I can leave comments but
not see them on pages. I may end up writing my own, so that it follows my
idea of what is needed for comments.
But also in that batch of plugins was one for Google Sitemap. The
documentation is non-existent in the repo. Searching the web I did find
blog posts from the author, in Japanese. From those I gather that the version
in the github repo is the old, memory-intensive way to build a sitemap.
I didn't find the new, improved version.
I decided to make do. The
gsitemap
plugin
is dead simple. It just sets a few variables for use in templates and then,
when the sitemap flavour is requested, disables pagination. The rest of the
magic happens in the
flavour templates.
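To give a feel for how little such a plugin needs to do, here is a minimal
sketch. This is not the original gsitemap code: the story() date handling
follows stock Blosxom plugin conventions, but $pagination::off is a made-up
knob, since how you actually switch pagination off depends on which
pagination plugin you run.

    # gsitemap -- illustrative sketch, not the original plugin
    package gsitemap;
    use POSIX qw(strftime);

    # set per story; flavour templates interpolate it as $gsitemap::lastmod
    our $lastmod = '';

    sub start { 1 }

    sub head {
        if ($blosxom::flavour eq 'sitemap') {
            # hypothetical knob; the real one depends on your
            # pagination plugin
            $pagination::off = 1;
        }
        1;
    }

    sub story {
        my ($pkg, $path, $filename, $story_ref, $title_ref, $body_ref) = @_;
        # look up this entry's timestamp for use as <lastmod>
        my $file = "$blosxom::datadir$path/$filename.$blosxom::file_extension";
        $lastmod = strftime('%Y-%m-%d',
                            localtime($blosxom::files{$file} || time));
        1;
    }

    1;

With a plugin like that loaded, a story.sitemap flavour template along these
lines turns each post into one <url> element ($url, $path, and $fn are stock
Blosxom template variables; the .html extension here stands in for whatever
your permalink flavour is):

    <url>
    <loc>$url$path/$fn.html</loc>
    <lastmod>$gsitemap::lastmod</lastmod>
    </url>

Leaving the head.sitemap and foot.sitemap templates empty is what makes the
output a fragment rather than a complete sitemap file.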
As part of making do, I'm not going to reference or link to the URL that
generates the output (if you are reading this and want to use it for your own
site, the gsitemap plugin in the original configuration would generate it
at https://example.com/blosxom/index.xml, assuming /blosxom/ is the root
of your Blosxom blog). Instead I've reconfigured the templates to generate
a fragment of a sitemap XML file, changed the flavour from xml to sitemap,
and have scripted up a sitemap builder for the whole qaz.wtf site that
curls the proper URL and includes the XML fragment. Blosxom can then
remain the source of truth for blog permalinks while find and some
per-directory configuration build URLs for other parts of the site.
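The real builder isn't published; the sketch below just shows the shape of
the idea, with an assumed document root, the example.com URL from above, and
the per-directory configuration reduced to a single find exclusion.

    #!/usr/bin/perl
    # hypothetical whole-site sitemap builder, shape only
    use strict;
    use warnings;

    my $root = '/var/www/htdocs';        # assumed document root
    my $site = 'https://example.com';

    open my $out, '>', "$root/sitemap.xml" or die "sitemap.xml: $!";
    print $out qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    print $out qq{<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n};

    # Blosxom stays the source of truth for blog permalinks
    print $out scalar `curl -s $site/blosxom/index.sitemap`;

    # everything else comes from find(1); the real per-directory
    # configuration (skip lists, etc.) is omitted here
    for my $file (`find $root -name '*.html' -not -path '$root/blosxom/*'`) {
        chomp $file;
        (my $loc = $file) =~ s/^\Q$root\E/$site/;
        print $out "<url><loc>$loc</loc></url>\n";
    }

    print $out "</urlset>\n";
    close $out or die $!;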
I decided I should run that
script from SAVE-DATES.sh
under the theory that any time I save post
timestamps is a likely time I want to rebuild the sitemap. This works for
qaz.wtf because the blog is the only thing updating more frequently than
monthly, and I typically run SAVE-DATES.sh
shortly after posting an entry.
This is all prompted by looking (again) at just how much
bot traffic
the site gets. I figure a sitemap will stop well-behaved bots from crawling
as much or as frequently. And for badly behaved bots, I've gone
belt-and-suspenders by adding entries to
robots.txt
and more user-agents to my
browser_block
plugin.
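Advertising the sitemap to well-behaved bots is one line in robots.txt, and
blocking a misbehaving crawler is a couple more. These are generic examples,
not my actual entries:

    Sitemap: https://example.com/sitemap.xml

    User-agent: SomeBadBot
    Disallow: /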
Similarly, in the name of improving search engine interaction, I've got a new
(trivial) plugin called
extrameta
that gets used by other plugins, namely the newly modified
tags plugin
and
pagination plugin,
to add a <meta name="robots" content="noindex">
header (in a naive way)
to search result pages, to avoid duplicated content.
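I'm not quoting the plugin here, but one plausible shape for that kind of
glue is a package variable that other plugins append to and that the head
flavour template interpolates; a sketch under that assumption:

    # extrameta -- sketch of one plausible design, not the real code
    package extrameta;

    # head templates interpolate this as $extrameta::meta
    our $meta = '';

    sub start { 1 }

    1;

A cooperating plugin would then do something like

    # $showing_search_results is a hypothetical condition
    $extrameta::meta .= qq{<meta name="robots" content="noindex">\n}
        if $showing_search_results;

when it detects it is rendering a search result page.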