Bots are dominating the web

Over the last few years, web traffic from bots has increased rapidly. Some sources[1] claim that automated activity now accounts for over half of all web traffic. The effect is even more pronounced on small websites like mine, since I don't receive many human visitors to begin with.

The table below shows traffic on this website for the last few days:

 #  Path                                      Hits
 1  /wp-login.php                             2751
 2  /wp-admin/index.php                       2707
 3  /wp-admin/edit.php                        2354
 4  /wp-admin/profile.php                     2353
 5  /wp-admin/plugins.php                     2351
 6  /robots.txt                               1131
 7  /                                          994
 8  /articles/feed.rss                         989
 9  /articles/how-i-almost-lost-my-backups/    314
10  /wp-content/themes/vic/style.css           299

This clearly shows two things:

  • The first useful page for a real visitor is in sixth place[2].
  • My stylesheet has a low number of hits compared to the front page and my most recently published article. One would expect each unique visitor to download the stylesheet at least once.

That is to say, my own statistics show that I'm mainly serving our robot overlords.
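For what it's worth, producing a table like this takes nothing fancy; counting request paths in the access log gets you most of the way. A minimal sketch in Python, assuming a combined-format log (the file path and field positions are assumptions about a typical setup, not a description of mine):

    from collections import Counter

    # Tally hits per request path from a combined-format access log.
    hits = Counter()
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            try:
                # The request line is the first quoted field: "GET /path HTTP/1.1"
                method, path, protocol = line.split('"')[1].split()
            except (IndexError, ValueError):
                continue  # skip malformed lines
            hits[path] += 1

    for path, count in hits.most_common(10):
        print(f"{count:>6}  {path}")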

The situation is worse on other kinds of web apps, like my Forgejo instance. It's a single-user instance, and most projects are private, but I do publish a bunch of open-source projects there. Recent bot traffic on applications like these is causing significant load on my server.

Keeping the bots out

Previously, the majority of crawlers were relatively decent. They would crawl your website slowly, making sure not to cause too much load. Their main purpose was serving search results to people looking for the content you publish. Bringing down your website would only result in their own users landing on error pages.

The recent Generative AI (GenAI) craze changed this completely. Multiple companies are now racing to collect as much data as possible. They no longer care what happens to the websites they're crawling. Once they have collected enough content, their GenAI agents provide users with answers directly. The source website is treated as irrelevant. Obviously, as a website owner, I do not want these thankless robots on my property[3]!

So I went through the standard ways of blocking them (sketched below):

  1. Asking nicely in robots.txt.
  2. Denying requests based on the User-Agent header.
  3. Blocking IP ranges from known data centers and cloud providers.
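The first one is just a plain-text file at the web root; the other two end up as a couple of rules in the web server. The logic behind them is trivial, which is part of the appeal. A Python sketch of methods 2 and 3 (the agent names and the IP range are illustrative, not a real blocklist; in practice this lives in server config rather than application code):

    import ipaddress

    # Illustrative blocklists; real ones are much longer and change constantly.
    BLOCKED_AGENTS = ("gptbot", "ccbot")
    BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

    def is_blocked(user_agent: str, client_ip: str) -> bool:
        # Method 2: deny based on the User-Agent header.
        if any(bot in user_agent.lower() for bot in BLOCKED_AGENTS):
            return True
        # Method 3: deny based on known data-center IP ranges.
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_RANGES)

    print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)", "198.51.100.7"))  # True
    print(is_blocked("Mozilla/5.0 Firefox/126.0", "203.0.113.42"))             # True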

None of these methods work.

GenAI companies frequently ignore robots.txt. They also usually disguise themselves as normal browsers instead of identifying as crawlers. More recently, I've noticed them moving into consumer IP space as well; combined with rapidly changing addresses, that makes it impossible to block them by IP address alone.

There are some promising new tools like Anubis that work[4], for now.
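As I understand it, tools in this class force every new client to burn a little CPU before getting content: negligible for one human visitor, expensive for a crawler hammering millions of pages. A generic proof-of-work sketch of that idea (the general concept only, not Anubis's actual implementation):

    import hashlib
    import os

    DIFFICULTY = 16  # required leading zero bits; illustrative

    def meets_difficulty(challenge: bytes, nonce: int) -> bool:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: bytes) -> int:
        # The client grinds nonces until the hash clears the difficulty bar.
        nonce = 0
        while not meets_difficulty(challenge, nonce):
            nonce += 1
        return nonce

    challenge = os.urandom(16)                 # issued by the server per visitor
    nonce = solve(challenge)                   # paid by the client, once
    assert meets_difficulty(challenge, nonce)  # cheap for the server to verify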

Giving up

Making sure websites stay online is part of my job. I don't want to have to go through the same hassle for my own web apps when I get home. I'm tired, and the few things I'm hosting just aren't worth it.

So, I decided to take two different approaches:

  1. Websites that are meant to be public, like the one you're reading now. I've always optimized these for high traffic, and will continue to do so. These are built to withstand much more than the AI crawlers I'm currently seeing. I'm no longer blocking any bots on public websites.

  2. Websites (and web apps) that are primarily meant for myself. These are now geoblocked and only accept traffic from Belgium (the lookup is sketched below). As these are mainly open-source applications that I don't maintain myself, they are harder to optimize. This includes my Forgejo instance. It's sad that part of my open-source code disappears behind a firewall. I hope to open it up again once AI companies start behaving more responsibly.

My geoblock page contains contact information. So if people really want to go through the hassle of requesting access (which I very much doubt), they can be allowed in.
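The geoblock itself boils down to one country lookup per connection. A sketch of that lookup in Python using the geoip2 package (the database path is illustrative, and in practice this usually happens at the firewall or web-server level rather than in application code):

    import geoip2.database
    import geoip2.errors

    ALLOWED_COUNTRIES = {"BE"}  # Belgium only

    # Path to a MaxMind GeoLite2 country database; illustrative.
    reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")

    def is_allowed(client_ip: str) -> bool:
        try:
            country = reader.country(client_ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            return False  # unknown origin: keep it out
        return country in ALLOWED_COUNTRIES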

  1. Every article I've found claiming this reads like marketing material for a web security firm. For that reason, I decided not to link to any of them. Based on my own traffic stats, though, I don't doubt their numbers.

  2. I do read robots.txt on some sites I visit. It's a treasure trove of information about the website you're visiting. But that's probably not normal human behaviour.

  3. I don't really mind AI crawlers, as long as they behave. They usually don't.

  4. Anubis is great! Especially since it lets through the bots that clearly identify themselves as bots.