Hello guys,
Lately I've been working on deploying a local MediaWiki. Everything went smoothly until it came to indexing the text inside PDF files that contain characters other than US-ASCII. Running '?action=cirrusDump' and looking at the 'file_text' field shows that all Cyrillic characters are dropped while Latin characters are preserved. The folks at ru.wikipedia.org have somehow managed to get this working, but I couldn't find a solution online. I would be very thankful if somebody could point out why this happens and how I could solve it.
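In case it helps anyone reproduce the check, here is a rough Python sketch of how the 'file_text' inspection could be scripted. It assumes the cirrusDump output is a JSON list of documents, each with a '_source' object holding the indexed fields; the function names and the URL in the comment are made up for illustration and are not part of CirrusSearch.

```python
import json
import unicodedata
from urllib.request import urlopen


def count_scripts(text):
    """Count Cyrillic and Latin letters in text, based on Unicode character names."""
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
    return counts


def file_text_from_dump(dump):
    """Extract file_text from parsed cirrusDump JSON.

    Assumption: the dump is a list of documents, each with a "_source" dict.
    A missing or null file_text is returned as an empty string.
    """
    return [doc.get("_source", {}).get("file_text") or "" for doc in dump]


def check_page(url):
    """Fetch a cirrusDump URL and report letter counts per indexed document."""
    with urlopen(url) as resp:
        dump = json.load(resp)
    return [count_scripts(text) for text in file_text_from_dump(dump)]


# Hypothetical usage (adjust host, script path and file title to your wiki):
# check_page("http://localhost/index.php?title=File:Example.pdf&action=cirrusDump")
```

If the Cyrillic count comes back as zero for a PDF that clearly contains Russian text, the characters were lost before or during indexing rather than at search time.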
My configuration is:
MediaWiki - 1.36.1
PHP - 7.4.22 (apache2handler)
PostgreSQL - 13.3
ICU - 66.1
Elasticsearch - 6.5.4
PDF Handler - c9705a8
AdvancedSearch - c8a42b8
CirrusSearch - 6.5.4 (ab802b7)
Elastica - 6.1.3 (9f6e66a)
My Elasticsearch configuration (plugins):
analysis-icu
extra MediaWiki plugin
ingest-attachment