Parsoid/DumpGrepper

The dumpgrepper utility searches XML dumps for specific regexp patterns. With a simple regexp, an enwiki dump can be grepped in about 20 minutes.

The grepper operates on actual wikitext (with XML encoding removed), so there is no need to complicate regexps with entities. It supports JavaScript RegExps.
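To illustrate why matching against decoded wikitext keeps patterns simple, here is a minimal sketch in JavaScript. The helper `decodeXmlEntities`, the sample string, and the pattern are hypothetical and chosen only for illustration; this is not dumpgrepper's own code, and the tool may decode the dump differently.

```javascript
// Hypothetical helper: decode the five predefined XML entities.
// This only illustrates the idea; it is not dumpgrepper's implementation.
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // decode &amp; last to avoid double-decoding
}

// Made-up text as it might appear inside a <text> element of an XML dump.
const encoded = 'Some claim.&lt;ref name=&quot;smith&quot; /&gt;';
const wikitext = decodeXmlEntities(encoded);

// A pattern written against plain wikitext, with no entities in it.
const pattern = /<ref name="[^"]*" *\/>/;

console.log(pattern.test(encoded));  // false
console.log(pattern.test(wikitext)); // true
```

Without decoding, the same search would need a pattern built around `&lt;`, `&quot;` and `&gt;`, which is harder to read and easier to get wrong.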

Installation

npm install -g dumpgrepper

Usage

bzcat /path/to/enwiki-latest-pages-articles.xml.bz2 | dumpgrepper '\| *link *='
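The pattern in the command above matches a pipe followed by `link=`, with optional spaces around `link`. As a rough sketch, the same expression can be tried as a JavaScript RegExp before starting a long run over a dump. The sample strings below are made up, and the sketch says nothing about how dumpgrepper itself splits or reports matches.

```javascript
// The pattern from the command above, written as a JavaScript RegExp.
// The backslash is doubled because this is a string literal.
const linkParam = new RegExp('\\| *link *=');

// Made-up wikitext fragments, for illustration only.
const samples = [
  '[[File:Example.png|link=Main Page]]',   // parameter without spaces
  '{{Icon | link = }}',                    // parameter with spaces
  '[[Main Page|a link to the main page]]', // "link" only in the label
];

// Keep the fragments that the pattern matches.
const matching = samples.filter((line) => linkParam.test(line));
console.log(matching);
```

The first two fragments match; the third does not, because the text right after the pipe is not `link` followed by an equals sign.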

See also

  • New 'insource' regexp search on wikitext of WMF wikis: Example query, Bug.
  • User:cscott made a hacked variant that lets you chain conditions, so you can say "pages with this but not that (optionally, on the same line)". See https://github.com/cscott/dumpgrepper. It was written as a one-off for a particular wikitext migration; if it proves more generally useful, it could be cleaned up and merged.