Parsoid/DumpGrepper
The dumpgrepper utility is useful for searching XML dumps for specific regexp patterns. With a simple regexp, an enwiki dump can be grepped in ~20 minutes.
The grepper operates on actual wikitext (with XML encoding removed), so there is no need to complicate regexps with entities. It supports JavaScript RegExps.
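To illustrate why searching decoded wikitext keeps patterns simple, here is a minimal sketch in JavaScript. The `decodeXmlEntities` helper and the sample text are hypothetical, made up for this illustration; they are not dumpgrepper's own code, which may decode the dump differently.

```javascript
// Hypothetical helper, not part of dumpgrepper: decode the five predefined
// XML entities in a single pass, as an XML parser would for element text.
function decodeXmlEntities(text) {
  const entities = { lt: '<', gt: '>', quot: '"', apos: "'", amp: '&' };
  return text.replace(/&(lt|gt|quot|apos|amp);/g, (match, name) => entities[name]);
}

// Made-up sample of how a <ref> tag is stored inside a dump's <text> element.
const rawXmlText = 'Paris&lt;ref name=&quot;insee&quot;&gt;INSEE&lt;/ref&gt; is a city.';
const wikitext = decodeXmlEntities(rawXmlText);

// A pattern written against plain wikitext, with no entities in it.
const pattern = /<ref name="[^"]*">/;

console.log(pattern.test(rawXmlText)); // false: the raw XML contains &lt; rather than <
console.log(pattern.test(wikitext));   // true: the decoded wikitext contains the literal tag
```

Against the raw XML, the same search would need a pattern such as `&lt;ref name=&quot;[^&]*&quot;&gt;`, which is harder to read and easier to get wrong.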
Installation
npm install -g dumpgrepper
Usage
bzcat /path/to/enwiki-latest-pages-articles.xml.bz2 | dumpgrepper '\| *link *='
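The pattern in the command above can be tried out in Node.js before starting a long run over a full dump. This is only a sketch: it assumes the command-line argument reaches the RegExp constructor unchanged, which is a guess about dumpgrepper's internals, and the sample lines are invented for illustration.

```javascript
// The shell's single quotes pass \| *link *= through literally; inside a
// JavaScript string literal the backslash has to be doubled.
const linkPattern = new RegExp('\\| *link *=');

// Invented sample lines of wikitext.
const sampleLines = [
  '[[File:Example.jpg|thumb|link=Main Page|Caption]]', // pipe directly before link=
  '{{Infobox | link = Foo}}',                          // spaces around "link" are allowed
  'See the link=here text',                            // no pipe before "link"
  '[[File:A.png|alt=link]]',                           // "link" is a value, not a parameter
];

for (const line of sampleLines) {
  console.log(linkPattern.test(line), line);
}
```

The first two lines match and the last two do not, because the pattern requires a literal pipe, optional spaces, `link`, optional spaces, and then `=`.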
See also
- New 'insource' regexp search on wikitext of WMF wikis: Example query, Bug.
- User:cscott made a hacked variant that lets you chain conditions, so you can say "pages with this but not that (optionally, on the same line)". See https://github.com/cscott/dumpgrepper. This was just a one-off for a particular wikitext migration; if it is more generally useful it could be cleaned up and merged.
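The "this but not that" idea can be sketched in a few lines of JavaScript. This is one possible reading of the description above, not the code from cscott's variant; the `pageMatches` function, its parameters, and the sample page are all hypothetical.

```javascript
// Hypothetical sketch: does a page contain `include` but not `exclude`?
// With sameLine set, the test is applied to each line separately: the page
// matches if at least one line has `include` without `exclude` on that line.
// Both regexps must be created without the g flag, so that test() keeps no
// state between calls.
function pageMatches(wikitext, include, exclude, sameLine = false) {
  if (sameLine) {
    return wikitext
      .split('\n')
      .some((line) => include.test(line) && !exclude.test(line));
  }
  return include.test(wikitext) && !exclude.test(wikitext);
}

// Invented two-line page: both lines use link=, only the second uses alt=.
const page = '[[File:A.png|link=Foo]]\n[[File:B.png|link=|alt=B]]';
const hasLink = /\| *link *=/;
const hasAlt = /\| *alt *=/;

console.log(pageMatches(page, hasLink, hasAlt));       // false: alt= occurs somewhere on the page
console.log(pageMatches(page, hasLink, hasAlt, true)); // true: the first line has link= without alt=
```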