Archived Pages
Goals
- The Internet Archive wants to help fix broken outlinks on Wikipedia and make citations more reliable. As of October 2015, the English Wikipedia alone contains 734k links to web.archive.org.[1] Are there members of the community who can help build tools to get archived pages in appropriate places? If you would like to help, please discuss, annotate this page, and/or email alexis@archive.org.
- More information is now available at http://blog.archive.org/2013/10/25/fixing-broken-links/. Legoktm is currently not working on the project due to time constraints.
- Note: as Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive.[1]
- Readers should not click external links and see 404s.
- All references should include a permanent link to avoid content drift.
Wayback API
To this end, we developed a new Wayback Availability API that answers whether a given URL is archived and currently accessible in the Wayback Machine. The API also has a timestamp option that returns the closest good capture to that date. For example,
GET http://archive.org/wayback/available?url=example.com
might return
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612",
      "status": "200"
    }
  }
}
Please visit the API documentation page for details.
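For tool builders, here is a minimal sketch of calling this API from Python with the requests library; the closest_snapshot helper name is ours, and the timestamp value reuses the capture date from the example above.

import requests

def closest_snapshot(url, timestamp=None):
    # Ask the Wayback Availability API for the closest archived capture.
    params = {"url": url}
    if timestamp:
        # Timestamp format is YYYYMMDDhhmmss; shorter prefixes also work.
        params["timestamp"] = timestamp
    resp = requests.get("https://archive.org/wayback/available", params=params)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None

# Matches the example response above.
print(closest_snapshot("example.com", timestamp="20130919"))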
IA is crawling Wikimedia outlinks
We are running specialized crawls to make this API more useful for the Wikipedia community:
- As of 2019, the Internet Archive crawls all external links from all Wikimedia projects as soon as they are reported by EventStreams; this includes new external links, citations, and embeds (previously it followed the feeds on hundreds of IRC channels). A minimal consumer sketch follows below.
- IA has been bulk-crawling external links periodically since 2011/2012. At several points, all links existing at that time on some wikis, including the English Wikipedia, were archived.
Newly crawled URLs are generally available through the Wayback Machine within a few hours. 87% of the dead links found by the Internet Archive crawler on Wikipedia are archived.
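For reference, a minimal consumer of these link events might look like the sketch below, using the sseclient Python package. The page-links-change stream name and the added_links/external fields follow the public EventStreams documentation, but treat them as assumptions to verify against the current schema.

import json
from sseclient import SSEClient as EventSource

# Follow link changes across Wikimedia projects (assumed stream name).
STREAM = "https://stream.wikimedia.org/v2/stream/page-links-change"

for event in EventSource(STREAM):
    if event.event != "message" or not event.data:
        continue
    try:
        change = json.loads(event.data)
    except ValueError:
        continue
    for link in change.get("added_links", []):
        if link.get("external"):
            # An external link was just added: a crawl candidate.
            print(change["meta"]["domain"], link["link"])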
Implementation ideas
What useful tools/services can we develop on top of this? Please help come up with ideas and implementations. For instance:
- Create a visual format/style to include an archived link next to an external link. This could be a small icon, similar to and placed next to the "external link" icon. This is most helpful for links that are often offline or totally dead (see 2.), and even for external links that are not dead, to provide a snapshot in time (see 3.).
- Run bots to fix broken external links. When an external link is dead, query the Wayback Availability API to discover whether there is a working archived version of the page. If the page is available in the Wayback Machine, either a) rewrite the link to point directly to the archived version, or b) annotate the link to indicate that an archived version is available, per 1. A bot sketch follows this list.
- Make citations more time-specific. When someone cites content on the web, they are citing that URL as it exists at that moment in time. Best practice on the English Wikipedia is to include a "retrieved on <date>" field in the cite. It would be useful to update all citations to include an estimated date, guessing "retrieved on [revision-date]" when the editor failed to include it. This lets readers find the version of the page that was cited, even if it changes later on. For new citations, Wayback should have an archived version close to that date/time. For older citations, IA may or may not have one. But if an archived version does exist in the Wayback Machine, we could update the archive-link for that URL to point to the older version.
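A minimal sketch of idea 2, combined with the closest-capture lookup from idea 3; the is_dead and archived_version helpers are hypothetical, and a production bot such as InternetArchiveBot handles far more edge cases (retries, soft 404s, paywalls).

import requests

AVAILABILITY_API = "https://archive.org/wayback/available"

def is_dead(url, timeout=10):
    # Crude liveness check; a real bot retries and inspects specific
    # status codes before declaring a link dead.
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True

def archived_version(url, retrieved):
    # Ask for the capture closest to the cited retrieval date (idea 3).
    resp = requests.get(AVAILABILITY_API, params={"url": url, "timestamp": retrieved})
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None

# Hypothetical input: (external link, "retrieved on" date) pairs from a page.
for url, retrieved in [("http://example.com/old-page", "20120101")]:
    if is_dead(url):
        archive_url = archived_version(url, retrieved)
        if archive_url:
            print("rewrite or annotate:", url, "->", archive_url)
        else:
            print("no snapshot; tag as dead link:", url)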
See also
- Scripts
  - weblinkchecker.py
  - reflinks.py
  - Legoktm's proof-of-concept script
  - m:InternetArchiveBot, formerly deadlink.php used by Cyberbot II
  - fixDeadLinks.php - Maintenance script to replace dead links with the latest archive.org snapshot
  - fixArchivedLinks.php - Maintenance script to replace archive.org links with live links
- English Wikipedia
  - w:Wikipedia:Link rot - Overview (and many links in the See also / External links section there)
  - w:Wikipedia:Using the Wayback Machine - How-to guide
  - Cyberbot II has run since 2015, including on talk pages; see toollabs:deadlinks for a graph
  - WaybackMedic 1 and WaybackMedic 2, which verified and fixed all Wayback links on enwiki as of 20 August 2016 (~1 million Wayback links)
- All projects
  - w:User:Dispenser/Checklinks - A tool to query, classify, and fix all external links in a page. Includes Wayback Machine integration.
  - w:de:Wikipedia:Defekte Weblinks/Botmeldung - GiftBot has been running on the German Wikipedia since November 2015: it identifies broken links and posts notices on talk pages and user talk pages, but as of December 2015 makes no edits in the main namespace
  - https://archive.org/details/wikipediaoutlinks, the collection holding the WARCs
- Other archives
  - w:fr:Utilisateur:Pmartin/Cache - Wikiwix seems to archive all fr.wiki links, and its copies are linked next to each link via site JavaScript
- Proposals