Manual:Combating spam

Every wiki, like all dynamic websites today, is a frequent target of spammers promoting products or websites.

MediaWiki offers a number of features designed to combat vandalism in general. This page deals specifically with wiki spam, which is usually posted automatically.

Overview

The tools commonly used to combat spam generally fall into the following categories:

  • Requiring login and/or a CAPTCHA for certain actions, such as editing, adding external links, or creating new accounts
  • Blocking edits from IP addresses on known blacklists or from IPs running open proxies
  • Blocking edits that add specific unwanted keywords or external links
  • Blocking specific username and page-title patterns commonly used by spammers
  • Preventing new or anonymous users from editing specific, frequently targeted pages
  • Whitelisting known good editors (e.g. administrators, regular contributors) while imposing restrictions on new or anonymous users
  • Running cleanup scripts or mass deletion of existing posts from recently blocked spambots (Extension:Nuke)

A combination of approaches is usually used, to minimize the amount of spam, bot and open-proxy editing while limiting the disruption caused to the site's legitimate users.

Note that many of these features are not activated by default. If you are running a MediaWiki installation on your own server or host, you are the only person who can make the necessary configuration changes! By all means ask your users to help watch out for wiki spam (and do so yourself), but these days spam can easily overwhelm a small wiki community, so it helps to raise the bar a little. Be aware, however, that none of these solutions can be considered completely spam-proof. Regularly checking "Recent changes" (Special:RecentChanges) is good practice.

Try the quickest solutions first

Combating spam is not terribly difficult. For a quick, substantial reduction in spam, try these steps first.

If you are still having problems, read the rest of this page for further solutions, and post to mediawiki-l for help.

Basic anti-spam setup

CAPTCHA

A fairly common way of countering automated submissions is the CAPTCHA, a system that tries to distinguish humans from automated submitters by requiring users to solve a simple challenge. MediaWiki's ConfirmEdit extension provides an extensible CAPTCHA framework that can be triggered on a range of events, including:

  • all edits
  • edits that add new, unrecognized external links
  • user registration

The extension ships with a default test, but this is only a reference implementation and is not intended for production use. Wiki operators installing ConfirmEdit on a public wiki are advised to use one of the CAPTCHA modules included with the extension (there are five).

The strongest CAPTCHA at the moment is your own set of QuestyCaptcha questions, provided you tailor them closely to your wiki's audience and update them frequently. ReCaptcha is nowadays defeated by most spammers;[1] the Asirra CAPTCHA, which asks users to distinguish cats from dogs, is particularly annoying for users but may be effective.
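
As an illustration, a minimal LocalSettings.php sketch for enabling ConfirmEdit with QuestyCaptcha might look like the following. The question/answer pair is a placeholder, and the exact loading lines and trigger keys can vary between ConfirmEdit versions, so check the extension's documentation for your release.

wfLoadExtensions( [ 'ConfirmEdit', 'ConfirmEdit/QuestyCaptcha' ] );
$wgCaptchaClass = 'QuestyCaptcha';
// Placeholder question; write several of your own, tailored to your audience.
$wgCaptchaQuestions[] = [
    'question' => 'What is the name of this wiki?',
    'answer'   => 'ExampleWiki',
];
// Only challenge the actions that spambots actually abuse on your wiki.
$wgCaptchaTriggers['edit']          = false;
$wgCaptchaTriggers['addurl']        = true;
$wgCaptchaTriggers['createaccount'] = true;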

It is worth noting that CAPTCHAs do not only block unwelcome bots: if a script cannot pass the CAPTCHA, neither can the screen readers or other assistive software used by blind or visually impaired people. One of the CAPTCHA options, the "reCAPTCHA" widget, includes an alternative audio CAPTCHA for this situation, but some computer users will fail both the hearing test and the reading test, so this is not a complete solution. You should consider the impact of such barriers and, where possible, provide affected users with an alternative way of creating an account and contributing; in some jurisdictions this is a legal requirement.[2]

Nor will CAPTCHAs fully protect your wiki from spam: spammers pay companies that employ human solvers in Bangladesh, China, India and many other developing countries $0.80 to $1.20 for each 1,000 CAPTCHAs solved.[3] For this reason, CAPTCHAs should be combined with other mechanisms.

rel="nofollow"

In its default configuration, MediaWiki adds rel="nofollow" to external links in wiki pages, to indicate that these are user-supplied, may contain spam, and should therefore not be used to influence page-ranking algorithms. Popular search engines such as Google honour this attribute.

You can turn this behaviour off site-wide with the $wgNoFollowLinks configuration variable, or per namespace with $wgNoFollowNsExceptions.
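
For example, the following LocalSettings.php lines keep the default behaviour everywhere except one namespace; this is only a sketch, and the namespace chosen here is purely illustrative.

$wgNoFollowLinks = true;                  // site-wide default: mark external links rel="nofollow"
$wgNoFollowNsExceptions = [ NS_PROJECT ]; // illustrative exception: trust links in the Project namespace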

Using the rel="nofollow" attribute alone will not stop spammers from trying to add their marketing to your pages, but it at least prevents them from benefiting through improved page rank; we can be sure that some of them check for it. It should never be relied on as the primary means of controlling spam, though, as its effectiveness is inherently limited: it does not keep spam off your site.

See also NoIndexHistory. Note that applying this option to all external links is a fairly blunt anti-spam tactic, and you may decide not to use it (i.e. to turn the rel=nofollow option off). For the debate on this, see Nofollow. It is still a sensible installation default, though, since it means that lazy administrators who never think about the spam problem will tend to have the option enabled. For more information, see the manual page on the costs and benefits of using nofollow.

Anti-spam routines: targeted measures

Every spammer is different, even though they all look very much alike. If the general countermeasures are not enough, there are tools you can use to address the specific problems you are experiencing before resorting to drastic measures.

Protection of individual pages

Often the same page is hit repeatedly by spambots. Common patterns observed in the names of pages created by spambots include talk pages outside the main namespace (for example, Category_talk: is rarely used and is therefore a common target) and other discussion pages.

Since most abusive edits on wikis that allow editing without registration come from anonymous sources, preventing anyone other than registered users from editing these specific pages stops deleted spam pages from being recreated. In general, any page that is already a frequent entry in your wiki's special:log/delete is a good candidate for page protection.

  • Semi-protection of individual pages
    • This can additionally be combined with changing MediaWiki's minimum requirements for treating a user as "autoconfirmed" (see the sketch after this list).
  • Cascading protection can be applied to one or more pages that link to the most frequently spammed pages. This trick can also be used to set up a handy list for administrators.
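
A minimal sketch of the "autoconfirmed" tuning mentioned above; the thresholds of four days and ten edits are purely illustrative assumptions, so adjust them to your wiki.

// Accounts must be at least this old (in seconds) and have made this many edits
// before they are treated as "autoconfirmed" and may edit semi-protected pages.
$wgAutoConfirmAge   = 4 * 86400;
$wgAutoConfirmCount = 10;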

AbuseFilter

Extension:AbuseFilter allows privileged users to create rules that automatically prevent actions and/or block users, targeted at the specific kinds of spam your wiki receives.

It can examine many properties of an edit, such as the username, the age of the account, the text added, the links added, and so on. It works best when one or more skilled administrators are willing to help you fight spam. Abuse filters can be effective even against human-assisted spammers, but they require ongoing maintenance to respond to new kinds of attacks.
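
A minimal sketch of enabling the extension and granting filter rights in LocalSettings.php; the permission names shown are the ones documented by the extension, but verify them against your AbuseFilter version before relying on them.

wfLoadExtension( 'AbuseFilter' );
// Let sysops create and edit filters, and see detailed log entries.
$wgGroupPermissions['sysop']['abusefilter-modify']     = true;
$wgGroupPermissions['sysop']['abusefilter-log-detail'] = true;
// Let everyone view the public filters.
$wgGroupPermissions['*']['abusefilter-view']           = true;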

Examples for fighting automated spam can be found at Manual:Combating spam/AbuseFilter examples.

Spam blacklists

If the number of spam URLs you are trying to block grows at all large, the approach above becomes too cumbersome. A better option is to maintain a long blacklist that identifies many known spam URLs.

A popular MediaWiki extension is the SpamBlacklist extension, which blocks edits that add blacklisted URLs to pages: it allows such a list to be built on the wiki with the help of privileged users, and it can also use lists obtained from external sources (by default it uses the fairly popular m:Spam blacklist).

The TitleBlacklist extension is also useful: it prevents the recreation of specific groups of pages that bots are using to post spam links.
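
A minimal sketch of enabling both extensions in LocalSettings.php. Once loaded, privileged users can maintain the lists on-wiki (conventionally at MediaWiki:Spam-blacklist and MediaWiki:Titleblacklist), and additional external list sources can be configured as described in each extension's documentation.

wfLoadExtension( 'SpamBlacklist' );  // blocks edits that add blacklisted URLs
wfLoadExtension( 'TitleBlacklist' ); // blocks creation of blacklisted page titles and usernames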

Open proxies

Open proxies are a danger mostly because they are used as a way to circumvent countermeasures targeted at specific abusers; see also No open proxies.

Some bots on Wikimedia Foundation wikis detect and block open proxy IPs, but their code is generally not public. Most such blocks are performed manually, when the abuse is noticed. It is therefore important to be able to tell whether an abusing IP is an open proxy or something else, in order to decide how to deal with it; even more so if it is an IP used by a registered user, retrieved with the CheckUser extension.

Several extensions, particularly the Tor block extension, block ranges of open proxies.

Since version 1.22, $wgApplyIpBlocksToXff can be used to make blocks more effective.
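
A minimal LocalSettings.php sketch combining the two measures above, assuming the TorBlock extension is installed on your wiki.

wfLoadExtension( 'TorBlock' );   // refuse edits coming from known Tor exit nodes
$wgApplyIpBlocksToXff = true;    // MediaWiki 1.22+: also match IP blocks against X-Forwarded-For addresses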

Hardcore measures

The following measures are for more technically savvy sysadmins who know what they are doing: they are harder to set up properly and to monitor, and if implemented badly they may be too outdated to still be effective, or even counterproductive for your wiki.

$wgSpamRegex

MediaWiki provides a means to filter the text of edits in order to block undesirable additions, through the $wgSpamRegex configuration variable. You can use this to block additional snippets of text or markup associated with common spam attacks.

Typically it is used to exclude URLs (or parts of URLs) which you do not want to allow users to link to. Users are presented with an explanatory message indicating which part of their edit text is not allowed. Extension:SpamRegex allows this variable to be edited on-wiki.

$wgSpamRegex = "/online-casino|buy-viagra|adipex|phentermine|adult-website\.com|display:none|overflow:\s*auto;\s*height:\s*[0-4]px;/i";

This prevents any mention of 'online-casino' or 'buy-viagra' or 'adipex' or 'phentermine'. The '/i' at the end makes the search case insensitive. It will also block edits which attempt to add hidden or overflowing elements, which is a common "trick" used in a lot of mass-edit attacks to attempt to hide the spam from viewers.

Apache configuration changes

In addition to changing your MediaWiki configuration, if you are running MediaWiki on Apache, you can make changes to your Apache web server configuration to help stop spam. These settings are generally either placed in your virtual host configuration file, or in a file called .htaccess in the same location as LocalSettings.php (note that if you have a shared web host, they must enable AllowOverride to allow you to use an .htaccess file).

Filtering by user agent

When you block a spammer on your wiki, search your site's access log by IP to determine which user agent string that IP supplied. For example:

grep ^195.230.18.188 /var/log/apache2/access.log

The access log location for your virtual host is generally set using the CustomLog directive. Once you find the accesses, you'll see some lines like this:

195.230.18.188 - - [16/Apr/2012:16:50:44 +0000] "POST /index.php?title=FlemmingCoakley601&action=submit HTTP/1.1" 200 24093 "-" ""

The user agent is the last quoted string on the line, in this case an empty string. Some spammers will use user agent strings used by real browsers, while others will use malformed or blank user agent strings. If they are in the latter category, you can block them by adding this to your .htaccess file (adapted from this page):

SetEnvIf User-Agent ^regular expression matching user agent string goes here$ spammer=yes

Order allow,deny
allow from all           
deny from env=spammer

This will return a 403 Forbidden error to any IP connecting with a user agent matching the specified regular expression. Take care to escape all necessary regexp characters in the user agent string such as . ( ) - with backslashes (\). To match blank user agents, just use "^$".

Even if the spammer's user agent string is used by real browsers, if it is old or rarely encountered, you can use rewrite rules to redirect users to an error page, advising them to upgrade their browser:

RewriteCond %{HTTP_USER_AGENT} "Mozilla/5\.0 \(Windows; U; Windows NT 5\.1; en\-US; rv:1\.9\.0\.14\) Gecko/2009082707 Firefox/3\.0\.14 \(\.NET CLR 3\.5\.30729\)"
RewriteCond %{REQUEST_URI} !^/forbidden/pleaseupgrade.html
RewriteRule ^(.*)$ /forbidden/pleaseupgrade.html [L]

Preventing blocked spammers from consuming resources

A persistent spammer or one with a broken script may continue to try to spam your wiki after they have been blocked, needlessly consuming resources.

By adding a Deny from directive such as the following to your .htaccess file, you can prevent them from loading pages at all, returning a 403 Forbidden error instead:

Order allow,deny
allow from all
deny from 195.230.18.188

IP address blacklists

Much of the most problematic spam received on MediaWiki sites comes from addresses long known by other webmasters as bot or open proxy sites, though there's only anecdotal evidence for this. These bots typically generate large numbers of automated registrations to forum sites, comment spam to blogs and page vandalism to wikis: most often linkspam, although existing content is sometimes blanked, prepended with random gibberish characters or edited in such a way as to break existing Unicode text.

A relatively simple CAPTCHA may significantly reduce the problem, as may blocking the creation of certain often-spammed pages. These measures do not eliminate the problem, however, and at some point tightening security for all users will inconvenience legitimate contributors.

It may be preferable, instead of relying solely on CAPTCHAs or other precautions which affect all users, to specifically target those IPs already known by other site masters to be havens of net abuse. Many lists are already available; for instance, stopforumspam.com has a list of "All IPs in CSV" which (as of Feb. 2012) contains about 200,000 IPs of known spambots.

CPU usage and overload

Note that, when many checks are performed on attempted edits or pageviews, bots can easily overload your wiki, disrupting it more than they would if it were unprotected. Keep an eye on the resource cost of your protections.

DNSBL

You can set MediaWiki to check each editing IP address against one or more DNSBLs (DNS-based blacklists), which requires no maintenance but slightly increases edit latency. For example, you can add this line to your LocalSettings.php to block many open proxies and known forum spammers:

$wgEnableDnsBlacklist = true;
$wgDnsBlacklistUrls = array( 'xbl.spamhaus.org', 'dnsbl.tornevall.org' );

For details of these DNSBLs, see Spamhaus: XBL and dnsbl.tornevall.org. For a list of DNSBLs, see Comparison of DNS blacklists. See also Manual:$wgEnableDnsBlacklist and Manual:$wgDnsBlacklistUrls.

$wgProxyList

  Warning: This particular technique will substantially increase page load time and server load if the IP list is large. Use with caution.

You can set the variable $wgProxyList to a list of IPs to ban. This can be populated periodically from an external source using a cron script such as the following:

#!/bin/bash
# Rebuild bannedips.php from the stopforumspam.com 30-day IP list.
cd /your/web/root
wget https://www.stopforumspam.com/downloads/listed_ip_30_ipv46.gz
gzip -d listed_ip_30_ipv46.gz
# Start the PHP file that defines $wgProxyList.
cat > bannedips.php << 'EOF'
<?php
$wgProxyList = array(
EOF
# Quote each IP address and append it as an array entry.
sed -e 's/^/  "/; s/$/",/' < listed_ip_30_ipv46 >> bannedips.php
# Close the array.
printf '%s\n' ');' >> bannedips.php
rm -f listed_ip_30_ipv46

You then set in your LocalSettings.php:

require_once "$IP/bannedips.php";

You may want to save these commands in a file called e.g. updateBannedIPs.sh, so you can run it periodically.

You can also use a PHP-only solution to download the IP list from stopforumspam. To do so, check the PHP script available here.

If you do this and you use APC cache for caching, you may need to increase apc.shm_size in your php.ini to accommodate such a large list.

You have just banned one hundred forty thousand spammers, all hopefully without any disruptive effect on your legitimate users, and said «adieu» to a lot of the worst of the known spammers on the Internet. Good riddance! That should make things a wee bit quieter, at least for a while…

Honeypots, DNS BLs and HTTP BLs

140,000 dead spammers. Not bad, but any proper BOFH at this point would be bored and eagerly looking for the 140,001st spam IP to randomly block. And why not?

Fortunately, dynamically-updated lists of spambots, open proxies and other problem IPs are widely available. Many also allow usernames or email addresses (for logged-in users) to be automatically checked against the same blacklists.

One form of blacklist which may be familiar to MediaWiki administrators is the DNS BL. Hosted on a domain name server, a DNS blacklist is a database of IP addresses. An address lookup determines if an IP attempting to register or edit is an already-known source of net abuse.

The $wgEnableDnsBlacklist and $wgDnsBlacklistUrls options in MediaWiki provide a primitive example of access to a DNS blacklist. Set the following settings in LocalSettings.php and IP addresses listed as HTTP spam are blocked:

$wgEnableDnsBlacklist = true;
$wgDnsBlacklistUrls = array( 'xbl.spamhaus.org', 'opm.tornevall.org' );

The DNS blacklist operates as follows:

  • A wiki gets an edit or new-user registration request from some random IP address (for example, in the format '123.45.67.89')
  • The four IP address bytes are placed into reverse order, then followed by the name of the desired DNS blacklist server
  • The resulting address is requested from the domain name server (in this example, '89.67.45.123.xbl.spamhaus.org.' and '89.67.45.123.opm.tornevall.org.')
  • The server returns not found (NXDOMAIN) if the address is not on the blacklist. If it is on either blacklist, the edit is blocked.
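
The steps above amount to a simple reversed-octet DNS query. The following standalone PHP sketch is for illustration only (MediaWiki already performs this check internally when $wgEnableDnsBlacklist is enabled) and shows the same logic for an IPv4 address.

<?php
// Return true if $ip is listed in the DNS blacklist zone $dnsbl.
function isListedInDnsbl( string $ip, string $dnsbl ): bool {
    // Reverse the four octets and append the blacklist zone name.
    $reversed = implode( '.', array_reverse( explode( '.', $ip ) ) );
    // An A record means "listed"; an unlisted IP yields NXDOMAIN, hence false.
    return checkdnsrr( "$reversed.$dnsbl.", 'A' );
}

var_dump( isListedInDnsbl( '123.45.67.89', 'xbl.spamhaus.org' ) );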

The lookup in an externally-hosted blacklist typically adds no more than a few seconds to the time taken to save an edit. Unlike the $wgProxyList settings, which must be loaded on each page read or write, the DNS blacklist lookup only takes place during registration or page edits. This leaves the speed at which the system can service page read requests (the bulk of your traffic) unaffected.

While the original SORBS was primarily intended for dealing with open web proxies and email spam, there are other lists specific to web spam (forums, blog comments, wiki edits) which therefore may be more suitable:

  • .opm.tornevall.org. operates in a very similar manner to SORBS DNSBL, but targets open proxies and web-form spamming.

Much of its content is consolidated from other existing lists of abusive IPs.

  • .dnsbl.httpbl.org. specifically targets bots which harvest email addresses from web pages for bulk mail lists, leave comment spam or attempt to steal passwords using dictionary attacks.

It requires the user register with projecthoneypot.org for a 12-character API key. If this key (for example) were 'myapitestkey', a lookup which would otherwise look like '89.67.45.123.http.dnsbl.sorbs.net.' or '89.67.45.123.opm.tornevall.org.' would need to be 'myapitestkey.89.67.45.123.dnsbl.httpbl.org.'

  • Web-based blacklists can identify spammers' email addresses and user information beyond a simple IP address, but there is no standard format for the reply from an HTTP blacklist server.

For instance, a request for http://botscout.com/test/?ip=123.45.67.89 would return "Y|IP|4" if the address is blacklisted ('N' or blank if OK), while a web request for http://www.stopforumspam.com/api?ip=123.45.67.89 would return "ip yes 2009-04-16 23:11:19 41" if the address is blacklisted (the time, date and count can be ignored) or blank if the address is good.
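
For illustration, here is a small PHP sketch that queries the stopforumspam.com endpoint mentioned above and parses its plain-text reply as described. The live API also offers XML/JSON output, so treat the parsing here as an assumption about the legacy text format rather than a definitive client.

<?php
// Return true if stopforumspam.com reports $ip as a known spam source.
// Assumes allow_url_fopen is enabled; otherwise use cURL instead.
function isStopForumSpamListed( string $ip ): bool {
    $url = 'https://www.stopforumspam.com/api?ip=' . urlencode( $ip );
    $response = @file_get_contents( $url ); // network failure => false below
    // Per the format described above ("ip yes <date> <count>"),
    // a " yes " in the reply means the address is blacklisted.
    return $response !== false && strpos( $response, ' yes ' ) !== false;
}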

With no one standard format by which a blacklist server responds to an enquiry, no built-in support for most on-line lists of known spambots exists in the stock MediaWiki package. Since rev:58061, MediaWiki has been able to check multiple DNSBLs by defining $wgDnsBlacklistUrls as an array.

Most blacklist operators provide very limited software support (often targeted to non-wiki applications, such as phpBB or Wordpress). As the same spambots create similar problems on most open-content websites, the worst offenders attacking MediaWiki sites will also be busily targeting thousands of non-wiki sites with spam in blog comments, forum posts and guestbook entries.

Automatic querying of multiple blacklist sites is therefore already in widespread use protecting various other forms of open-content sites, and the spambots' names, ranks and IP addresses are by now all too well known. A relatively small number of spambots appear to be behind a large percentage of the overall problem. Even where admins take no prisoners, the pattern is well established: the same spambot IP that posted linkspam to your wiki a second ago is spamming blog comments somewhere else right now, and will be spamming forum posts on a site half a world away a few seconds from now. One shared external blacklist entry can silence one problematic bot from posting on thousands of sites.

This greatly reduces the number of individual IPs which need to be manually blocked, one wiki and one forum at a time, by local administrators.

But what's this about honeypots?

Some anti-spam sites, such as projecthoneypot.org, provide code which you are invited to include in your own website pages.

Typically, the pages contain one or more unique, randomised and hidden email addresses or links, intended not for your human visitors but for spambots. Each time the page is served, the embedded addresses are automatically changed, allowing individual pieces of spam to be directly and conclusively matched to the IP address of bots which harvested the addresses from your sites. The IP address which the bot used to view your site is automatically submitted to the operators of the blacklist service. Often a link to a fake 'comment' or 'guest book' is also hidden as a trap to bots which post spam to web forms. See Honeypot (computing).

Once the address of the spammer is known, it is added to the blacklists (see above) so that you and others will in future have one less unwanted robotic visitor to your sites.

While honeypot scripts and blacklist servers can automate much of the task of identifying and dealing with spambot IPs, most blacklist sites do provide links to web pages on which one can manually search for information about an IP address or report an abusive IP as a spambot. It may be advisable to include some of these links on the special:blockip pages of your wiki for the convenience of your site's administrators.

More lists of proxy and spambot IPs

Typically, feeding the address of any bot or open proxy into a search engine will return many lists on which these abusive IPs have already been reported.

In some cases, the lists will be part of anti-spam sites, in others a site advocating the use of open proxies will list not only the proxy which has been being abused to spam your wiki installation but hundreds of other proxies like it which are also open for abuse. It is also possible to block wiki registrations from anonymised sources such as Tor proxies (Tor Project - torproject.org), from bugmenot fake account users or from email addresses (listed by undisposable.net) intended solely for one-time use.

See also Blacklists Compared - 1 March 2008 and spamfaq.net for lists of blacklists. Do keep in mind that lists intended for spam email abatement will generate many false positives if installed to block comment spam on wikis or other web forms. Automated use of a list that blacklists all known dynamic user IP address blocks, for instance, could render your wiki all but unusable.

To link to IP blacklist sites from the Special:Blockip page of your wiki (as a convenience to admins wishing to manually check if a problem address is an already-known bot):

  1. Add one line to LocalSettings.php to set: $wgNamespacesWithSubpages[NS_SPECIAL] = true;
  2. Add the following text in MediaWiki:Blockiptext to display:
"Check this IP at [http://whois.domaintools.com/{{SUBPAGENAME}} Domain Tools], [http://openrbl.org/?i={{SUBPAGENAME}} OpenRBL], [http://www.projecthoneypot.org/ip_{{SUBPAGENAME}} Project Honeypot], [http://www.spamcop.net/w3m?action=checkblock&ip={{SUBPAGENAME}} Spam Cop], [http://www.spamhaus.org/query/bl?ip={{SUBPAGENAME}} Spamhaus], [http://www.stopforumspam.com/ipcheck/{{SUBPAGENAME}} Stop Forum Spam]."

This will add an invitation to "check this IP at: Domain Tools, OpenRBL, Project Honeypot, Spam Cop, Spamhaus, Stop Forum Spam" to the page from which admins ask to block an IP. An IP address is sufficient information to make comments on Project Honeypot against spambots; Stop Forum Spam is less suited to reporting anonymous-IP problems, as it requires the username, IP and email address under which a problem bot is attempting to register on your sites. The policies and capabilities of other blacklist-related websites may vary.

Note that blocking the address of the spambot posting to your site is not the same as blocking the URLs of specific external links being spammed in the edited text. Do both. Both approaches used in combination, as a means to supplement (but not replace) other anti-spam tools such as title or username blacklists and tests which attempt to determine whether an edit is made by a human or a robot (captchas or akismet) can be a very effective means to separate spambots from real, live human visitors.

If spam has won the battle

You can still win the war! MediaWiki offers you the tools to do so; just consolidate your positions until you're ready to attack again. See Manual:Combating vandalism, and in particular Cleaning up and Restrict editing.

See External links for other tools without MediaWiki support.

Other ideas

This page lists features which are currently included, or available as patches, but on the discussion page you will find many other ideas for anti-spam features which could be added to MediaWiki, or which are under development.

See also

Extensions

  • AbuseFilter allows edit prevention and blocking based on a variety of criteria
  • A slimmed-down ConfirmAccount can be used to moderate new user registrations (does not require CAPTCHAs).
  • CheckUser allows, among other things, the checking of the underlying IP addresses of account spammers to block them. Allows mass-blocking of spammers from similar locations.
  • FlaggedRevs
  • HoneyPot
  • SpamRegex allows basic blocking of edits containing spam domains with a single regex
  • StopForumSpam allows for checking edits against the StopForumSpam service and allows for submitting data back to it when blocking users.
  • Category:Spam management extensions, a category exhaustively listing spam management extensions
  • Moderation doesn't show edits to normal users until they have been approved by a moderator. This extension has the advantage that spam links are never shown to the public, so there is no incentive to post spam.

Useful only on some wiki farms:

Commercial services:

Bundled in the installer

The standard tarball available for download now contains most of the main anti-spam extensions, including the following:

  • ConfirmEdit adds various types of CAPTCHAs to your wiki
  • Nuke removes all contributions by a user or IP
  • SpamBlacklist prevents edits containing spam domains, list is editable on-wiki by privileged users


Settings

External links


References

  1. For example, user Senukexcr says: «Automated CAPTCHA solving: GSA Captcha Breaker + Mega Ocr (Solves Recaptcha!)».
  2. For example, the Section 508 Electronic and Information Technology Standards.
  3. The New York Times, 25 April 2010: "Spammers Pay Others to Answer Security Tests", by Vikas Bajaj.