Manual:robots.txt/zh

robots.txt机器人例外标准的一部分,并可帮助搜索引擎最优化。它描述网络爬虫如何对一个站点建立索引。robots.txt必须被放置在网站根目录下。

Examples

Prevent all indexing

The following code prevents all bots from indexing any pages on your site:

User-agent: *
Disallow: /

If you only want to block certain spiders, replace the asterisk with the user-agent of the crawler in question.
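For example, a minimal sketch that restricts only Googlebot (any other user-agent string works the same way) while leaving all other crawlers unrestricted:

User-agent: Googlebot
Disallow: /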

Prevent indexing of non-article pages

MediaWiki generates many pages that are only useful to live humans: old revisions and diffs, for example, tend to duplicate the content found in articles. Edit pages and most special pages are generated dynamically, which makes them useful only to human editors and relatively costly to serve and maintain. Without instructions to the contrary, crawlers may try to index thousands of such similar pages, overloading the web server.

With short URLs

If you are using Wikipedia-style short URLs, it is easy to prevent spiders from crawling non-article pages. Suppose articles are accessible through /wiki/Some_title, while everything else is available through /w/index.php?title=Some_title&someoption=blah:

User-agent: *
Disallow: /w/

Be careful, though! If you accidentally put in this line instead:

Disallow: /w

you will block access to the /wiki directory, and search engines will drop your wiki!

Note that this solution will also cause CSS, JavaScript, and image files to be blocked, so search engines such as Google will not be able to render previews of wiki articles. To work around this, instead of disallowing the entire /w directory, disallow only index.php:

User-agent: *
Disallow: /w/index.php?

This works because CSS and JavaScript are retrieved via /w/load.php. Alternatively, you can do it as on the Wikimedia site matrix:

User-agent: *
Allow: /w/load.php?
Disallow: /w/

Without short URLs

If you are not using short URLs, restricting crawlers is more difficult. If you are running PHP as CGI and have not beautified your URLs, so that articles are accessible through /index.php?title=Some_title:

User-agent: *
Disallow: /index.php?diff=
Disallow: /index.php?oldid=
Disallow: /index.php?title=Help
Disallow: /index.php?title=Image
Disallow: /index.php?title=MediaWiki
Disallow: /index.php?title=Special:
Disallow: /index.php?title=Template
Disallow: /skins/

If you are running PHP as an Apache module and you have not beautified URLs, so that articles are accessible through /index.php/Some_title:

User-agent: *
Disallow: /index.php?
Disallow: /index.php/Help
Disallow: /index.php/MediaWiki
Disallow: /index.php/Special:
Disallow: /index.php/Template
Disallow: /skins/

The lines without the colons (:) at the end restrict those namespaces' talk pages.

Wikis in languages other than English may need to add translated versions of the above lines.
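For instance, on a German-language wiki, where the Special and Template namespaces are called Spezial and Vorlage, the corresponding lines might look like the following sketch (adjust the names for your wiki's language):

Disallow: /index.php?title=Spezial:
Disallow: /index.php?title=Vorlage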

You may wish to omit the /skins/ restriction, as this will prevent images belonging to the skin from being accessed. Search engines which render preview images, such as Google, will show articles with missing images if they cannot access the /skins/ directory.

You can also try:

Disallow: /*&

because some robots, such as Googlebot, accept this wildcard extension to the robots.txt standard, which stops most of what we don't want robots sifting through, just like the /w/ solution above. This does, however, suffer from the same limitation in that it blocks access to CSS, preventing search engines from correctly rendering preview images. It may be possible to solve this by adding another line, Allow: /load.php, but at the time of writing this is untested.
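A combined ruleset along those lines might look like the following sketch; as noted above it is untested, and it relies on non-standard Allow and wildcard extensions that not every crawler supports:

User-agent: *
Allow: /load.php
Disallow: /*&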

Allow indexing of raw pages by the Internet Archiver

You may wish to allow the Internet Archiver to index raw pages so that the raw wikitext of pages will be on permanent record. This way, it will be easier, in the event the wiki goes down, for people to put the content on another wiki. You would use:

# Allow the Internet Archiver to index action=raw and thereby store the raw wikitext of pages
User-agent: ia_archiver
Allow: /*&action=raw

Problems

Rate control

You can only specify which paths a bot is allowed to spider. Even allowing just the plain article pages can be a huge burden when a single spider requests two or three pages per second across two hundred thousand pages.

Some bots have a custom specification for this; Inktomi responds to a "Crawl-delay" line which can specify the minimum delay in seconds between hits. (Their default is 15 seconds.)
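For crawlers that honour this non-standard directive, a sketch asking them to wait at least 30 seconds between requests (the value here is only an example; choose whatever suits your server) might look like this:

User-agent: *
Crawl-delay: 30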

Non-compliant bots

Sometimes a custom-written bot isn't very smart or is outright malicious and doesn't obey robots.txt at all (or obeys the path restrictions but spiders very fast, bogging down the site). It may be necessary to block specific user-agent strings or individual IPs of offenders.

More generally, request throttling can stop such bots without requiring your repeated intervention.

An alternative or complementary strategy is to deploy a spider trap.

Crawling vs. indexing

While robots.txt stops (non-evil) bots from downloading the URL, it does not stop them from indexing it. This means that they might still show up in the results of Google and other search engines, as long as there are external links pointing to them. (What's worse, since the bots do not download such pages, noindex meta tags placed in them will have no effect.) For single wiki pages, the __NOINDEX__ magic word might be a more reliable option for keeping them out of search results.
