Extension:TikaAllTheFiles

MediaWiki extensions manual
TikaAllTheFiles
Release status: beta
Implementation: Media, Search
Description: Using Apache Tika, provides text and metadata extraction for thousands of file types, enabling full-text search of almost any uploaded file
Author(s): Matt Marjanovic (CtapMaddog)
Maintainer(s): Center for Transparent Analysis and Policy
Latest version: 2.0.0 (2024-04-20)
Compatibility policy: Master maintains backward compatibility.
MediaWiki: 1.37+
PHP: 8.1+
Database changes: No
Composer: centertap/tika-all-the-files
License: GNU General Public License 3.0 or later
Download
README.md
RELEASE-NOTES.md

The TikaAllTheFiles (TATF) extension facilitates full-text search over uploaded files, by using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types".

In practical terms: if you already have Extension:CirrusSearch set up and working on your wiki, TATF lets you run full-text searches over the contents of almost any uploaded file, not just PDFs.

TATF's features and capabilities:

  • extract embedded digital text from any type of uploaded file so that it can be indexed for full-text search;
  • extract and index printed text from bitmap image files and from images embedded in document files, e.g., image-only PDFs (requires Tesseract OCR);
  • extract metadata from any type of uploaded file for display on File: pages;
  • index metadata properties along with text, to enable simple searching for properties within full-text search.
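
Once file contents are indexed, the full-text search described above can be exercised through the standard MediaWiki search API as well as the search box. The following is a minimal sketch, not part of TATF itself: the wiki URL and the `search_files` helper are hypothetical, and it assumes CirrusSearch is already serving search results for the File namespace (namespace 6).

```shell
#!/bin/sh
# Sketch only: queries the stock MediaWiki search API, restricted to File: pages.

# Hypothetical API endpoint; set WIKI_API to point at your own wiki.
WIKI_API="${WIKI_API:-https://wiki.example.org/w/api.php}"

# search_files TERM
# Sends a full-text search for TERM limited to the File namespace (6)
# and prints the raw JSON response.
search_files() {
    curl --silent --get "$WIKI_API" \
        --data-urlencode "action=query" \
        --data-urlencode "list=search" \
        --data-urlencode "srnamespace=6" \
        --data-urlencode "srsearch=$1" \
        --data-urlencode "format=json"
}
```

For example, `search_files "quarterly budget"` would return File: pages whose indexed text or metadata matches the phrase, provided the wiki has such files.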

Installation

This extension can be installed using Composer.

The complete installation and configuration instructions can be found in README.md.
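
As a rough illustration of the Composer step, the sketch below follows the usual MediaWiki convention of recording site-specific packages in `composer.local.json`. It is an assumption about a typical setup, not a substitute for README.md: the `MW_DIR` path and the `install_tatf` helper are hypothetical, and only the package name comes from the infobox above.

```shell
#!/bin/sh
# Sketch only: README.md is the authoritative installation guide.

# Hypothetical MediaWiki install location; set MW_DIR to match your site.
MW_DIR="${MW_DIR:-/var/www/mediawiki}"

# Composer package name, as listed in the infobox above.
TATF_PACKAGE="centertap/tika-all-the-files"

install_tatf() {
    cd "$MW_DIR" || return 1
    # Record the requirement in composer.local.json without resolving
    # or changing any other dependency yet.
    COMPOSER=composer.local.json composer require --no-update "$TATF_PACKAGE" || return 1
    # Download and install just the newly required package.
    composer update "$TATF_PACKAGE" --no-dev --optimize-autoloader
}
```

Calling `install_tatf` runs both Composer steps. Enabling the extension in `LocalSettings.php` and any Tika-related settings are separate steps; see README.md for those.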

Configuration parameters

The complete description of configuration parameters can be found in README.md.