Handling web crawlers

With the advent of large language models (LLMs), a growing number of companies have been aggressively scraping the internet for model training data, often with no regard for directives such as robots.txt files. MediaWiki sites often provide open access to domain-specific knowledge in site structures that are well defined and easy to scrape, which makes them a prime target for automated data acquisition. Additionally, there has been an increase in truly malicious denial-of-service (DoS) attacks, many appearing to originate from compromised internet of things (IoT) devices. This page is an attempt to capture some of the current discussions about these topics and potential solutions.

Restricting access to functionality

One relatively easy solution is to restrict certain types of pages to logged-in users or another privileged user group, because these pages are expensive to serve, provide crawlers with too many links, or both. Of course, such restrictions present a tradeoff between performance and usability.
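
At its bluntest, this means requiring login to read the wiki at all, using MediaWiki's built-in permission settings rather than any extension. A minimal LocalSettings.php sketch (the whitelist entries shown are just the pages needed to log in; adjust to taste):

# Require login to read; anonymous users can still reach the login and account creation pages
$wgGroupPermissions['*']['read'] = false;
$wgWhitelistRead = [ 'Special:UserLogin', 'Special:CreateAccount' ];

The targeted options described in the rest of this page are usually preferable to locking down the whole wiki.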

The main features that can cause problems are:

  • Page diffs - poorly written scrapers will navigate to the page history and then follow every combination of available diffs. The volume and pace of such a crawl are likely to tax the server.
  • MediaWiki special pages - various special pages can cause severe performance issues if accessed by attackers or out-of-control web crawlers. "Recent changes" (Special:RecentChanges), "Related changes" (Special:RecentChangesLinked), "What links here" (Special:WhatLinksHere) and "Main public logs" (Special:Log) are all expensive to generate. The "Users" page (Special:ListUsers) can also be expensive, but more importantly, attackers might access it in order to harass the listed users or target them with spear phishing.
  • Extension special pages - some extensions provide special pages that let users execute near-arbitrary queries on the data stored by those extensions, which web crawlers (or malicious users) can abuse: Cargo has Special:CargoQuery and Special:Drilldown, Semantic Drilldown has Special:BrowseData, and Semantic MediaWiki has Special:Ask.

Extensions

CrawlerProtection

The CrawlerProtection extension blocks anonymous users from accessing the most expensive wiki features. It automatically blocks access to the pages Special:RecentChangesLinked and Special:WhatLinksHere, as well as to history pages, page diffs, and the viewing of old revisions.
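
According to its description, the extension works automatically once enabled, with no further configuration. Assuming it has been downloaded into the wiki's extensions/ directory, enabling it is the standard load line in LocalSettings.php:

wfLoadExtension( 'CrawlerProtection' );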

Lockdown

The Lockdown extension provides fine-grained access control that can be used to block anonymous users from accessing specific special pages and actions.

As an example, with Lockdown installed, here is how to restrict access to Special:RecentChangesLinked to registered users:

$wgSpecialPageLockdown['Recentchangeslinked'] = [ 'user' ];

And here is how to block anonymous access to history pages, which in turn should keep crawlers from reaching diffs:

$wgActionLockdown['history'] = [ 'user' ];
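
The same mechanism can be applied to the other expensive special pages mentioned above; the array keys should be the canonical (English) special page names. For example, again assuming Lockdown is loaded:

$wgSpecialPageLockdown['Whatlinkshere'] = [ 'user' ];
$wgSpecialPageLockdown['Log'] = [ 'user' ];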

Cargo

The Cargo extension defines several "expensive" special pages, including Special:CargoQuery and Special:Drilldown. You can use the runcargoqueries permission to restrict access to them. Here is code that can be added to LocalSettings.php to prevent anonymous access to these special pages:

$wgGroupPermissions['*']['runcargoqueries'] = false;
$wgGroupPermissions['user']['runcargoqueries'] = true;

Custom PHP

Unfortunately, there is no MediaWiki extension that blocks anonymous users only from viewing diffs, rather than from the entire history page. However, this can be accomplished fairly simply by adding the following code to LocalSettings.php:

$wgHooks['MediaWikiPerformAction'][] = function ( $output, $article, $title, $user, $request, $wiki ) {
    // Diff URLs include type=revision in the query string
    $type = $request->getVal( 'type' );
    if ( $type === 'revision' && !$user->isRegistered() ) {
        $output->addWikiTextAsInterface( "'''You must be logged in to view revisions.'''" );
        $output->setStatusCode( 403 );
        // Returning false stops MediaWiki from performing the requested action
        return false;
    }
};

Apache

Another approach is to block access to these pages at the web server level, so that no MediaWiki code needs to run. With Apache, the following configuration allows a request if it carries a *UserID login cookie, or if its query string contains none of the strings Special:Log, Special:RecentChangesLinked or Special:WhatLinksHere, none of several expensive actions (edit, history, info, pagevalues, purge, formedit), and no oldid= parameter; all other requests are denied. This also works for rewritten short URLs, since the rewritten request carries these values in its query string:

<If "%{HTTP_COOKIE} =~ /[-a-zA-Z_]+UserID=/ || ! %{QUERY_STRING} =~ /(Special%3a(Log|RecentChangesLinked|WhatLinksHere)|action=(edit|history|info|pagevalues|purge|formedit)|oldid=)/">
    Require all granted
</If>
<Else>
    Require all denied
</Else>

Varnish

Similarly to the Apache solution, access can also be blocked for anonymous users at the Varnish level, in the vcl_recv subroutine. The following example covers oldid= (the ldid= string is used on purpose, to avoid false hits on old), action=history, action=info, Special:RecentChangesLinked, Special:WhatLinksHere and Special:ExportRDF.

    if ((req.http.Cookie ~ "(cpPosIndex|UseCDNCache|UseDC|vw_wiki_session|vw_wikiToken)") ||
        (req.http.Cookie ~ "(vw_wikiUserID|vw_wikiUserName|wikiEditor-0-toolbar-section)")) {
        # logged-in user - do nothing
    } else {
        # anonymous user
        if (req.url ~ "(ldid=|Special:RecentChangesLinked|Special:WhatLinksHere|Special:ExportRDF|action=history|action=info)") {
            return(synth(403, "You must be logged in to view this page"));
        }
    }

Blocking unwanted bots

Blocking outdated browsers and operating systems

The following Apache configuration can be used for identifying and blocking outdated browsers and operating systems using the BrowserMatch directive:[1]

# Systematically ban old browser versions.
# Ban very old versions of Chrome. For now we're banning up to Chrome 125.
BrowserMatch "Chrome\/[0-9]{1,2}\." bad_browser
BrowserMatch "Chrome\/1([0-1][0-9]|2[0-5])\." bad_browser
# Ban very old versions of Firefox. For now we're banning up to Firefox 99.
BrowserMatch "Firefox\/[0-9]{1,2}\." bad_browser

# Systematically ban very old reported OS versions.
# Ban very old versions of Windows. For now we're banning up to Windows 7.
BrowserMatch "Windows\ NT\ [0-5]\." bad_browser
BrowserMatch "Windows\ NT\ 6\.[01]" bad_browser

# Ban very old versions of Mac OS X. For now we're banning up to Mac OS X 10.5.
# Match the OS version format reported by Firefox (dot-separated)
BrowserMatch "Mac\ OS\ X\ 10\.[0-5];" bad_browser
# Match the OS version format reported by Safari (underscore-separated)
BrowserMatch "Mac\ OS\ X\ 10\_[0-5]_" bad_browser

# Implement the block
<IfModule mod_authz_core.c>
    # Apache 2.4+ (negated Require directives must be placed inside <RequireAll>)
    <RequireAll>
        Require all granted
        Require not env bad_browser
    </RequireAll>
</IfModule>
<IfModule !mod_authz_core.c>
    # Apache 2.2
    Order allow,deny
    Allow from all
    Deny from env bad_browser
</IfModule>

Other blocking tools

  • Fail2ban - open-source software that scans log files and bans IP addresses showing suspicious behavior, using firewall rules (iptables, firewalld, etc.); also available via the MediaWiki extension Wiki2Ban
  • go-away - open-source software that performs abuse detection and rule enforcement against low-effort mass AI scraping and bots
  • HAProxy Enterprise - inspects HTTP requests after the TCP connection is made but before the request is forwarded to the application[2]
  • mod_evasive - an open-source Apache module that blocks IPs that make too many requests in a short period
  • open-appsec - a paid service (with a free tier available) that uses machine learning to learn how users normally interact with the web application, detects requests that fall outside of normal operations, and conducts further analysis to decide whether the request is malicious

The Wikimedia Foundation has had success with a (still-unreleased) script that counts requests in the access logs by IP address/prefix, extracts the most frequent IPs/prefixes, and does a whois lookup to find the associated IP allocations and CIDR blocks, which are then added to Puppet; see this Phabricator ticket.

A project currently under development at the Wikimedia Foundation, named Edge Uniques, plans to implement a first-party cookie, WMF-Uniq, managed solely at the CDN edge, to track minimal visit data in order to distinguish attackers from legitimate traffic (which is sometimes on the same network as the attackers).

Slowing down unwanted bots

Throttling anonymous users

Another approach is to aggressively throttle requests to expensive URLs from anonymous users, while letting registered users use the wiki normally.

Some crawlers use a large pool of IP addresses, even from different ISPs, to crawl sites without any sort of delay between requests, or even using parallel requests. Your server may also be hit with more than one crawler at the same time.

For this situation, it may be possible to impose a quota on expensive requests for all unregistered users as a whole group. This allows legitimate unregistered users to use your site normally when no bots are hammering the web server, but blocks their access to the expensive URLs when the overall rate of expensive requests is too high.

For this approach, you will need to identify:

  • Logged-in / logged-out users: Look at the UserID cookie.
  • Expensive URLs: match the URL against regular expressions. This is easier if your wiki uses Short URLs.

With this classification, it should be possible to put logged-out users requesting expensive URLs into a throttling bucket. When the bucket fills up, further requests are rejected. The bucket is reset periodically, allowing such requests again until the limit is reached.

nginx

Here is an example for the nginx web server.

This code goes inside the http block, usually before a server block.

# Aggressive rate limiting for expensive URLs (non-pretty URLs and special pages), applied only to anonymous users
# 20r/m means 20 requests per minute. Adjust to a value you (or your web server) are comfortable with
limit_req_zone $wikiglobal_limit_key zone=wikiglobal:10m rate=20r/m;
# If the request is blocked, return an HTTP 429 error
limit_req_status 429;

# wikiglobal_limit_key is the key that will classify each request in a bucket. Once the bucket gets full, it will block the request
map "$wiki_is_loggedin:$wiki_url_is_expensive" $wikiglobal_limit_key {
  default ""; # ignore not matching. An empty string will not be rate limited
  "0:1" "${request_method}${host}"; # not logged-in and expensive. Vary by request method and host (for wiki families)
}

# Cookies are like dbnameUserID=12345
map $http_cookie $wiki_is_loggedin {
  default 0;
  # presence of *UserID cookie
  "~[a-z]UserID=" 1;
}

# Classification of expensive URLs using Regular Expressions
map $request_uri $wiki_url_is_expensive {
  default 0;
  "~ctype=text/" 0; # ignore javascript and css includes
  "~returnto=" 0; # Allow Special:CreateAccount or Special:Userlogin. Links usually contain the returnto= param. We can't target by name because each language has its own Special page name localized
  "~search=" 0; # Search
  "~offset=" 0; # DPL uses offset for pagination
  # Modify the path prefix depending on your $wgScriptPath
  "~^/w/index.php" 1; # index.php URLs: potentially expensive things. This assumes your wiki uses Short URLs, and normal pages won't be accessed from this route
  # Example of Special page exempt from throttling. It uses translations in several languages. Adapt to the language of your wiki.
  # Modify the path prefix depending on your $wgArticlePath
  "~*^/wiki/(Special|Spezial|Especial|Sp%C3%A9cial|Speci%C3%A1ln%C3%AD|%D7%9E%D7%99%D7%95%D7%97%D7%93|%D7%91%D7%90%D6%B7%D7%96%D7%95%D7%A0%D7%93%D7%A2%D7%A8):(Random|Aleatori|Zuf%C3%A4llig|Page_au_hasard|N%C3%A1hodn%C3%A1_str%C3%A1nka|%D7%A6%D7%95%D7%A4%D7%A2%D7%9C%D7%99%D7%92|%D7%93%D7%A3_%D7%90%D7%A7%D7%A8%D7%90%D7%99|%D7%90%D7%A7%D7%A8%D7%90%D7%99|%D7%93%D7%A3)" 0; # Exclude Special:Random and Special:RandomInCategory
  # Example of Special page throttling. It uses translations in several languages. Adapt to the language of your wiki.
  # Modify the path prefix depending on your $wgArticlePath
  "~*^/wiki/(Special|Spezial|Especial|Sp%C3%A9cial|Speci%C3%A1ln%C3%AD|%D7%9E%D7%99%D7%95%D7%97%D7%93|%D7%91%D7%90%D6%B7%D7%96%D7%95%D7%A0%D7%93%D7%A2%D7%A8):" 1; # All special pages: potentially expensive things
}

For this limit to be applied, you need to set limit_req within the relevant server or location blocks:

limit_req zone=wikiglobal burst=3;

See the nginx documentation on rate limiting.

Other throttling tools

  • AI Labyrinth - a service provided by Cloudflare (with a free tier available) that uses AI-generated content to slow down, confuse, and waste the resources of crawlers and bots that don’t respect "no crawl" directives
  • Nepenthes - open-source software that generates an endless sequence of pages, each with dozens of links that simply lead back into the tarpit

Proof-of-work

Another approach is to use proof-of-work to force every browser to do a small computation before viewing any page; this causes negligible inconvenience for regular users, but (presumably) significant inconvenience for large-scale bots and scrapers. This is a somewhat controversial solution, because it is overall taxing on compute power, with associated environmental harms.
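
The idea is straightforward: the server hands the client a random challenge and a difficulty target, the client must find a nonce such that hashing the challenge together with the nonce meets the target, and the server can check the answer with a single hash. A minimal sketch in PHP of what this looks like conceptually (the function names and the leading-zeros difficulty scheme are illustrative, not taken from any particular tool; real tools such as Anubis run the solving loop in the visitor's browser in JavaScript):

// Issue a challenge: a random hex string plus a required number of leading zero hex digits.
function makeChallenge( int $difficulty = 4 ): array {
    return [
        'challenge' => bin2hex( random_bytes( 16 ) ),
        'difficulty' => $difficulty,
    ];
}

// Verify a submitted nonce: sha256( challenge . nonce ) must start with $difficulty zero hex digits.
function verifyProofOfWork( string $challenge, string $nonce, int $difficulty ): bool {
    $hash = hash( 'sha256', $challenge . $nonce );
    return strncmp( $hash, str_repeat( '0', $difficulty ), $difficulty ) === 0;
}

// The client-side work: try nonces until one passes. Each extra zero digit multiplies the
// expected work by 16, while server-side verification stays a single hash.
function solveProofOfWork( string $challenge, int $difficulty ): string {
    for ( $nonce = 0; ; $nonce++ ) {
        if ( verifyProofOfWork( $challenge, (string)$nonce, $difficulty ) ) {
            return (string)$nonce;
        }
    }
}

A successful solution is typically remembered in a cookie, so a legitimate visitor pays the cost only occasionally, while a scraper opening thousands of fresh sessions pays it over and over.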

Tools that perform proof-of-work:

  • Anubis - open-source software that uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

List of tools

The following is the complete list of usable tools mentioned above.

Tool | Developer/Maintainer | Open Source | Free | Implementation | Notes
AI Labyrinth | Cloudflare | No | Free tier available | Web server | Uses AI-generated content to slow down, confuse, and waste the resources of crawlers and other bots that don’t respect “no crawl” directives.
Anubis | Xe Iaso | Yes | Yes | Web server | Uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums.
CrawlerProtection | Jeffrey Wang (MyWikis) | Yes | Yes | MW Extension | Blocks anonymous users from accessing the most expensive wiki features.
Fail2ban | Sergey G. Brester | Yes | Yes | Web server | Scans log files and bans IP addresses showing suspicious behavior, using firewall rules (iptables, firewalld, etc.).
go-away | DataHoarder | Yes | Yes | Web server | Self-hosted abuse detection and rule enforcement against low-effort mass AI scraping and bots.
HAProxy Enterprise | HAProxy | No | No | Web server | Inspects HTTP requests after the TCP connection is made but before the request is forwarded to the application.
Lockdown | Daniel Kinzler | Yes | Yes | MW Extension | Can be used to block anonymous users from accessing specific special pages and actions.
mod_evasive (Apache) | Jonathan Zdziarski | Yes | Yes | Web server | Blocks IPs that make too many requests in a short period.
Nepenthes | | Yes | Yes | Web server | Generates an endless sequence of pages, each with dozens of links that simply lead back into the tarpit.
open-appsec | | Yes | Free tier available | Web server | Uses machine learning to learn how users normally interact with the web application, detects requests that fall outside of normal operations, and conducts further analysis to decide whether the request is malicious.
Wiki2Ban | Luca Mauri | Yes | Yes | MW Extension | Blocks IP addresses that have repeatedly failed authentication, using the Fail2ban library.

Identifying the crawlers

Many of the crawlers come from the known AI companies listed below, gathering training data for their latest frontier models. Some of the crawlers come from compromised IoT devices (botnets), which makes blocking difficult, since the traffic can appear organic and arrive from unique IPs. Most of the traffic is from crawlers, sometimes overzealous ones, but some of it could be DDoS attacks that also utilize botnets.

AI companies

Some of the largest AI companies deploying crawlers, with their publicly disclosed user-agents (a sketch for blocking requests based on these strings follows the list):

  1. Alibaba
  2. Amazon
  3. Anthropic
  4. ByteDance (Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com))
  5. Cohere
  6. Common Crawl (CCBot/2.0 (+https://commoncrawl.org/faq/))
  7. DeepSeek
  8. Diffbot (Mozilla/5.0 (compatible; Diffbot/3.0; +http://www.diffbot.com))
  9. Google (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html))
  10. Hive AI
  11. LAION
  12. Meta (meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) & meta-externalfetcher/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler))
  13. Microsoft (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm))
  14. Mistral (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots))
  15. OpenAI (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot))
  16. Perplexity (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot))
  17. Skrape.ai
  18. Stability AI
  19. xAI
  20. 01.AI
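
If you decide to turn these crawlers away outright, the disclosed user-agent strings above can be matched before MediaWiki does any real work. Below is a minimal sketch that could be placed near the top of LocalSettings.php; the substring list is illustrative (it deliberately omits Googlebot and bingbot, which most wikis want to keep for search indexing) and has to be maintained by hand:

// Illustrative denylist of user-agent substrings; adjust to your needs
$crawlerDenylist = [ 'Bytespider', 'CCBot', 'Diffbot', 'GPTBot', 'PerplexityBot', 'MistralAI-User', 'meta-externalagent', 'meta-externalfetcher' ];
$crawlerUserAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
foreach ( $crawlerDenylist as $crawlerNeedle ) {
    if ( stripos( $crawlerUserAgent, $crawlerNeedle ) !== false ) {
        header( 'HTTP/1.1 403 Forbidden' );
        echo 'Automated crawling is not permitted on this wiki.';
        exit;
    }
}

Doing the same match at the web server level (for example with Apache's BrowserMatch directive, as in the outdated-browser example above) avoids starting PHP at all and is cheaper under heavy load.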

References
