Offline content generator/2013 Code Sprint
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
PDF rendering
Development sprint to replace MediaWiki's aging PDF rendering pipeline
See also: Offline content generator/2013 Code Sprint/Brainstorming and Offline content generator/Architecture
IRC Channel: #mediawiki-pdfhack on irc.freenode.net
Git Repository: https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator.git
Labs Test Instance: http://mwalker-enwikinews.instance-proxy.wmflabs.org/
Primary participants
- Matt Walker
- Max Semenik
- Brad Jorsch
- C. Scott Ananian
- Jeff Green (potentially)
- E. Engelhart
Goals
- Primary goal
- Resolve the dependency on mwlib and the PediaPress rendering setup for PDF generation of single-page documents and collections via the Collection extension. The minimum viable release is an additional PDF output option available through the Collection extension, with the aim of eventually phasing out the old rendering pipeline. Rationale: the current service/architecture is not maintainable, and we would like to clean things up before moving services out of our old Tampa data center into our new primary DC.
- Stretch goals
- Fully replace the old pipeline in the course of the sprint.
- Support for other formats; highest priority is ZIM for offline use.
- Improvements to PDF layout.
- Discarded goals
- Continued support for the PediaPress print-on-demand service, which is completely separate from Wikimedia's internal service.
Requirements
- Functional requirements
- Needs to integrate with Collection extension.
- Needs to append legally required licensing information.
- Needs to include images in print-level resolution (see the API sketch below).
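For the image and licensing requirements above, here is a minimal sketch of pulling the original-resolution URL, and, where the API exposes it, license metadata, for a file page via the standard MediaWiki API's prop=imageinfo. The helper name and the example wiki URL are illustrative only and are not part of the Collection extension.

<syntaxhighlight lang="javascript">
// Illustrative only: fetch the original-resolution URL and (where available)
// license metadata for a file page via the MediaWiki API (prop=imageinfo).
var request = require('request');

function fetchImageInfo(apiUrl, fileTitle, callback) {
    request({
        url: apiUrl,
        json: true,
        qs: {
            action: 'query',
            format: 'json',
            titles: fileTitle,
            prop: 'imageinfo',
            // url/size give the full-resolution original; extmetadata (where the
            // wiki provides it) carries fields such as LicenseShortName and Artist.
            iiprop: 'url|size|extmetadata'
        }
    }, function (err, res, body) {
        if (err) { return callback(err); }
        var pages = body.query.pages;
        var page = pages[Object.keys(pages)[0]];
        callback(null, page.imageinfo && page.imageinfo[0]);
    });
}

// Example usage against English Wikipedia
fetchImageInfo('https://en.wikipedia.org/w/api.php', 'File:Example.jpg',
    function (err, info) {
        if (err) { throw err; }
        console.log(info.url, info.width + 'x' + info.height);
        if (info.extmetadata && info.extmetadata.LicenseShortName) {
            console.log('License:', info.extmetadata.LicenseShortName.value);
        }
    });
</syntaxhighlight>

Whether license and attribution data ultimately comes from imageinfo metadata, from machine-readable templates on Commons, or from the Collection extension's existing credits code is still an open question.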
- Architectural questions
- Use Parsoid HTML5 output and PhantomJS for PDF generation? (Spec here: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec) See the pipeline sketch after this list.
- Need to parse collections (lists of articles in plaintext format; example) to aggregate potentially multiple Parsoid HTML files into one.
- Apply some nice transformations.
- Ideally get author names and image credits.
- Prepend/append some front and back matter (maybe a TOC).
- PhantomJS with its rasterize.js example can do basic PDF output from HTML input.
- Serve the generated files through MediaWiki, as Collection currently does?
- Hardware requirements: what is the current system load? Provision VMs in Labs for testing?
- How to make the service fully puppetized and deployable? Key dependencies? Security aspects, e.g. private wikis?
- Caching strategies for PDFs once generated?
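A rough, end-to-end sketch of the Parsoid + PhantomJS idea discussed in the list above: read a plaintext collection list (one title per line is assumed here), fetch Parsoid HTML for each title, concatenate the page bodies into one document, and shell out to PhantomJS's stock rasterize.js example to write a PDF. The Parsoid base URL, file paths and helper names are placeholders, not the real configuration; a production renderer would also add a title page, TOC, author names and image credits, plus error handling and caching.

<syntaxhighlight lang="javascript">
// Sketch of the proposed pipeline: collection list -> Parsoid HTML -> one
// combined HTML file -> PDF via PhantomJS' stock rasterize.js example.
var fs = require('fs');
var request = require('request');
var async = require('async');
var execFile = require('child_process').execFile;

var PARSOID_BASE = 'http://localhost:8000/localhost/';  // assumed Parsoid endpoint

// A collection file is treated here as one article title per line,
// ignoring blank lines and comment lines starting with '#'.
function parseCollection(path) {
    return fs.readFileSync(path, 'utf8')
        .split('\n')
        .map(function (l) { return l.trim(); })
        .filter(function (l) { return l && l[0] !== '#'; });
}

function fetchParsoidHtml(title, cb) {
    request(PARSOID_BASE + encodeURIComponent(title), function (err, res, body) {
        cb(err, body);
    });
}

function buildBundle(collectionPath, htmlOut, pdfOut) {
    var titles = parseCollection(collectionPath);
    async.mapSeries(titles, fetchParsoidHtml, function (err, pages) {
        if (err) { throw err; }
        // Naive aggregation: concatenate the page bodies; a real renderer would
        // also prepend a title page / TOC and append author and image credits.
        var html = '<!DOCTYPE html><html><body>' +
            pages.map(function (p) {
                var m = /<body[^>]*>([\s\S]*)<\/body>/.exec(p);
                return m ? m[1] : p;
            }).join('\n<div style="page-break-before: always"></div>\n') +
            '</body></html>';
        fs.writeFileSync(htmlOut, html);
        // rasterize.js ships with PhantomJS' examples directory (assumed to be
        // available locally); usage: phantomjs rasterize.js URL filename format
        execFile('phantomjs', ['rasterize.js', htmlOut, pdfOut, 'A4'],
            function (e) { if (e) { throw e; } console.log('wrote', pdfOut); });
    });
}

buildBundle('collection.txt', '/tmp/bundle.html', '/tmp/bundle.pdf');
</syntaxhighlight>

Whether PhantomJS's print output is typographically good enough for the "improvements to PDF layout" stretch goal is one of the questions the sprint should answer.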
Longer term questions
- Issues ownership
- Follow-up sprint(s) consistent with the Tampa (TPA) data center migration timeline
- Cross-datacenter failover?
Docs
- Little bits of documentation of the old setup here: wikitech:PDF Servers
- Jeff's prior cleanup effort: wikitech:mwlib
- mwoffline, a Node.js-based solution that uses Parsoid output to generate ZIM files
- Tentative Architecture
- /Brainstorming
See also
- ElectronPdfService: provides access to the Electron service for browser-based PDF rendering