Requests for comment/Text extraction
Currently, Wikimedia sites have an API, action=query&prop=extracts, that can be used when someone wants a text-only or limited-HTML extract of page content. On November 5, 2014, this API was used over 11 million times; more than half of those requests were for plain-text extracts. This RFC discusses the future of this API.
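For illustration, a typical request for a plain-text extract of a page's introduction looks like this (exintro and explaintext are parameters of the TextExtracts extension; the title is an arbitrary example):

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=1&explaintext=1&titles=Albert%20Einstein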
Text extraction | |
---|---|
Component | General |
Creation date | |
Author(s) | Max Semenik (talk) |
Document status | Accepted by Tim Starling in IRC. See Phabricator. |
Core integration vs. separate extension
This part of the RFC has been withdrawn as no longer relevant.
Initially, the extract functionality was located in MobileFrontend for practical reasons: it already had an HTML manipulation framework. However, now that this framework has been cleaned up and integrated into core (includes/HtmlFormatter.php), there's no reason the extraction code shouldn't be moved to a more appropriate location. Arguments for integration into core:
Arguments for creating a separate extension:
WMF-specific extraction
Currently, text extraction consists of two steps (see the sketch after the list):
- Manipulate the DOM to remove some tags, based on their name, id or class, along with their contents. This is needed, for example, to remove infoboxes or navboxes.
- Remove some tags but keep their content ("flattening"). If a plain-text extract is needed, all tags are flattened; otherwise only some tags, such as <a>, are.
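A rough sketch of these two steps, assuming the HtmlFormatter API now in core (remove(), flatten(), flattenAllTags(), filterContent(), getText()); the selectors and the $html/$plainText inputs are illustrative, not the extension's actual configuration:

// $html: rendered page HTML; $plainText: whether a plain-text extract was requested.
$formatter = new HtmlFormatter( $html );

// Step 1: strip unwanted elements together with their contents (example selectors).
$formatter->remove( array( 'table', '.infobox', '.navbox', '#coordinates' ) );

// Step 2: flatten formatting tags, keeping their text content.
if ( $plainText ) {
	$formatter->flattenAllTags();
} else {
	$formatter->flatten( array( 'a', 'span' ) );
}

$formatter->filterContent();
$extract = $formatter->getText();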
This is adequate for Wikipedia and many other uses that deal mostly with large chunks of text; however, it breaks for sites like Wiktionary that need more elaborate formatting.
// We already have this:
class ExtractFormatter extends HtmlFormatter {
	// ...
}

// But how about this:
class WiktionaryExtractFormatter extends ExtractFormatter {
	// ...
}
If extracts are integrated into core, custom extraction classes could go into a separate extension (e.g. WikimediaTextExtraction); otherwise they could be part of the main extraction extension.
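One way to wire this up would be a per-wiki configuration setting selecting the formatter class; the setting name below is purely hypothetical and shown only to illustrate the idea:

// Hypothetical setting; defaults to the generic formatter.
$wgTextExtractsFormatterClass = 'ExtractFormatter';

// Wikimedia's per-wiki configuration could then override it for Wiktionaries:
$wgTextExtractsFormatterClass = 'WiktionaryExtractFormatter';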
Extract storage
Currently, extracts are generated on demand and cached in memcached; however, this results in bad worst-case behaviour when many extracts are needed at once, for example for queries over several pages or for action=opensearch, which returns 10 results by default. Text extraction involves DOM manipulation and text processing (tens of milliseconds) and, on a cache miss, potentially a wikitext parse (which can easily take seconds or even tens of seconds).
Such timing is less than optimal, so I propose extracting text during LinksUpdate and storing it in a separate table. This will allow efficient batch retrieval and 100% immediate availability.
CREATE TABLE text_extracts (
-- key to page_id
te_page INT NOT NULL,
-- Limited-HTML extract
te_html MEDIUMBLOB NOT NULL,
-- Plain text extract
te_plain MEDIUMBLOB NOT NULL,
-- Timestamp for looking up rows needing an update due to code or configuration change
te_touched BINARY(14) NOT NULL,
PRIMARY KEY (te_page)
);
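With such a table, the batch case (for example ten action=opensearch results) becomes a single indexed read. A minimal sketch of that lookup, assuming 2014-era MediaWiki database access (wfGetDB(), Database::select()):

// Fetch plain-text extracts for a batch of page IDs in one query.
// $pageIds: array of page IDs collected from the API request.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'text_extracts',
	array( 'te_page', 'te_plain' ),
	array( 'te_page' => $pageIds ),
	__METHOD__
);
$extracts = array();
foreach ( $res as $row ) {
	$extracts[$row->te_page] = $row->te_plain;
}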
The new table can be on an extension store, as it doesn't need to be strictly in the same DB as the wiki's other tables. To decrease storage requirements, extracts should be generated only for pages in certain namespaces (the current extraction algorithm was tailored for the Wikipedia main namespace anyway).
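A rough sketch of the write path, assuming the LinksUpdateComplete hook and the schema above; makeExtracts() stands in for the ExtractFormatter-based generation step and is hypothetical:

// Rough sketch only: regenerate both extract variants whenever a page is re-rendered.
public static function onLinksUpdateComplete( LinksUpdate $linksUpdate ) {
	$title = $linksUpdate->getTitle();
	// Only extract configured namespaces (mainspace here, as an example).
	if ( $title->getNamespace() !== NS_MAIN ) {
		return true;
	}
	$html = $linksUpdate->getParserOutput()->getText();
	// Hypothetical helper returning the limited-HTML and plain-text variants.
	list( $limitedHtml, $plainText ) = self::makeExtracts( $html );

	$dbw = wfGetDB( DB_MASTER );
	// Upsert; relies on te_page being a unique key.
	$dbw->replace(
		'text_extracts',
		array( 'te_page' ),
		array(
			'te_page' => $title->getArticleID(),
			'te_html' => $limitedHtml,
			'te_plain' => $plainText,
			'te_touched' => $dbw->timestamp(),
		),
		__METHOD__
	);
	return true;
}

Rows whose te_touched predates a code or configuration change can then be found and refreshed by a maintenance script, as the column comment above suggests.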