Wikimedia Search Platform/Decision Records/Search backend replacement technology
This decision record format is based upon Decision_records#Decision_record_template_v1 and select parts of Data Platform Engineering/Technical Decision Record.
Summary
Search Platform plans to migrate from Elasticsearch 7.10 to OpenSearch v2.
Three years ago, as licensing changed for Elasticsearch, Search Platform settled on OpenSearch as its eventual replacement. A concern was recently raised that this could involve loss of language analyzer support for one or more languages.
A cursory review of language analyzer support suggests that this won't be a big problem.
The original decision to migrate to OpenSearch stands.
This decision is in support of the Wikimedia content wikis first and foremost, but is inherently entangled with MediaWiki installations (and local development environments for those who run a backend search engine in their development environment). It's anticipated that some third parties who run MediaWiki with an Elasticsearch backend will need to expend effort to migrate their backend to OpenSearch to ensure the smoothest future compatibility; MediaWiki installations may encounter incompatibilities with future versions of Elasticsearch.
Decision record basics
Title: Migration from Elasticsearch 7.10 to a different search backend replacement technology
Status: Proposed
Authors: Adam Baso, Brian King, David Causse, Erik Bernhardson, Guillaume Lederrey, Peter Fischer, Ryan Kemper, Trey Jones
Deciders: Search Platform team
Consulted: Site Reliability Engineering, Director of Data Platform Engineering and VP overseeing unit
Informed: End users (third-party MediaWiki and technically-inclined Wikimedia wiki users), via wikitech-l and mediawiki-announce and/or similar lists
Date authored: 23-July-2024 first rough draft (see Status below)
Date decided: 10-August-2024
Context:
Previous discussion from three years prior suggested that Search Platform would migrate from Elasticsearch 7.10 to OpenSearch (OpenSearch v1, followed by OpenSearch v2), as opposed to migrating to Elasticsearch 8. Recently there was some question of whether there would be complications (e.g., language analyzer support degradation) in this migration that may necessitate entertaining Elasticsearch 8 as an option.
Between Elasticsearch 7.10 and Elasticsearch 8, Elastic changed its licensing from an Apache license to the SSPL license, which is not OSI compatible. The OpenSearch fork of the software, brought about by Amazon, still utilizes an Apache license. Most Wikimedia software powering Wikimedia content projects (Wikipedia and its sister projects) utilizes OSI-compatibly licensed software, driven in part by a desire to avoid encumbering derivative software, and exceptions to this practice typically involve deeper inspection.
Additional keywords
FLOSS, FOSS, Elasticsearch, Elastic Search, Open Search, ADR, language support, multilingual search, vector search
Status
- 22-July-2024: This decision record begins drafting in response to recent discussion.
- 23-July-2024: Initial first rough draft done.
- 30-July-2024: Most drafting done, moved page under Search Platform decision records path.
- 5-August-2024: Out for review with SRE-at-large.
- 10-August-2024: Review period for SRE-at-large closed.
- 15-August-2024: Moving out of draft (removal of draft template).
Decision-making process
This is still a work in progress, but the initial decision is to migrate from Elasticsearch 7.10 to OpenSearch, instead of migrating from Elasticsearch 7.10 to Elasticsearch 8. This is subject to further stakeholder review, as well as the possibility of some unforeseen challenge surfacing during migration work that would require further investigation.
In the event that Elasticsearch 8 (or potentially Elasticsearch 9, depending on Elastic's release timing) needs to be considered once again as an option, further license analysis, competitive forces modeling, and an evaluation of broader alternatives will be required to understand potential tradeoffs.
Search Platform team members have read web pages and looked through source code, documentation, and other public-facing commentary about Elasticsearch 7/8 and OpenSearch, and Trey Jones has performed an initial check of the analyzer / plugin artifacts required for an upgrade to identify risks that could threaten search user experience for Wikimedia project users.
The team discussed stakeholders in a couple of meetings. It has notified Wikimedia hosting stakeholders and will notify users via public mailing lists.
Stakeholders
The following stakeholders have been considered primarily:
- Wikimedia wiki end users. It is hoped that the end user experience does not degrade in the course of an upgrade or migration.
- Search Platform team (Search Platform engineering and Data Platform Engineering's SRE unit). It is responsible for implementation and operations of the search engine powering Wikimedia wikis.
- Data Platform Engineering ("DPE") Director and Vice President. They oversee a broader portfolio and are able to assess longer range tradeoffs.
- Site Reliability Engineering. They are responsible for critical support of the servers providing user-facing access to the Wikimedia wikis.
- Third-party MediaWiki installation users. Although the operation of the search stack is largely a Wikimedia wiki matter, there are third-party users of the MediaWiki CirrusSearch extension who rely upon Elasticsearch function indirectly or directly. Any upgrade or migration is likely to require effort by these open source technology users. One notable third-party MediaWiki installation which powers a lot of internationalization and localization work for the Wikimedia wikis and MediaWiki software itself is translatewiki.net.
The Search Platform team is believed to be best suited to making the determination of the technology choice. Given that the technology choice is not at an impasse, this decision appears to not rise to the level of a more formalized decision brief. Nevertheless, this decision record was circulated with the DPE Director and VP, as well as Site Reliability Engineering VP and appropriate staff members, for review, and it is intended to also help contextualize for technologists and end users some of the more salient tradeoffs and projected risks.
Context and problem statement
Search Platform has been maintaining Elasticsearch 7.10-based search engine technology, and has held off upgrading this software.
Over two years have passed since the release of Elasticsearch 8, and although Elasticsearch 7.10 has not yet gone out of support, according to Elastic it is slated to reach end of life ("EOL") upon the release of Elasticsearch 9 (Elasticsearch 8 will in turn reach EOL 18 months after the release of Elasticsearch 9). Although major software vendors (or open source forkers) sometimes provide critical security patches affecting confidentiality and integrity for EOL software, and may sometimes backport patches for availability issues, the delivery of such patches and the general attention paid to EOL software tend to wane.
It is now time to replace Elasticsearch 7.10 in order to preempt EOL issues and to afford improved search capabilities for the future.
As OSI-compatible licensing is preferred, and as Elasticsearch from version 7.11 is no longer OSI-compatible, and as Elastic increasingly requires payment for more advanced features, OpenSearch was thought to be the natural replacement. But discussion in July 2024 surfaced the idea that there could be significant challenges in switching to OpenSearch as contrasted with Elasticsearch 8.
Risks and mitigations
The chief anticipated possible challenges for migration to OpenSearch include:
- A need to migrate analyzers to OpenSearch
- The possibility of certain language support not being present in OpenSearch, necessitating the regression of language analysis capability or the creation of replacement analyzers
- Migration difficulties for third-party MediaWiki installations wishing to run the CirrusSearch MediaWiki plugin in a manner similar to the Wikimedia wiki solution supported by Search Platform
- Migration from the Elastica PHP interface wireup
Analyzer migration
It is believed that the migration of analyzers is comparatively low risk, although it entails effort to ensure that any existing analyzers operate at parity on OpenSearch as compared with Elasticsearch 7.10.
Language support
A brief analysis by Trey suggests that language support should be approximately unchanged in OpenSearch. The following is a copy-paste of most of Trey's analysis from an email, with an editor's note of some copied URLs.
Full list of our installed plugins:
Editor's note: material copied from this URL as of 23-July-2024 (with commented out lines removed and blank lines added for readability):
https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v$ELASTICSEARCH_VERSION/elasticsearch-analysis-stconvert-$ELASTICSEARCH_VERSION.zip,none
https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4
https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-stempel/analysis-stempel-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4
https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4
https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-ukrainian/analysis-ukrainian-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4
https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-nori/analysis-nori-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4
https://repo1.maven.org/maven2/org/wikimedia/search/highlighter/experimental-highlighter-elasticsearch-plugin/$ELASTICSEARCH_VERSION/experimental-highlighter-elasticsearch-plugin-$ELASTICSEARCH_VERSION.zip,F684F0EC24A878FD
https://oss.sonatype.org/service/local/repositories/releases/content/org/wikimedia/search/extra/$ELASTICSEARCH_VERSION-wmf12/extra-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-homoglyph/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-homoglyph-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-ukrainian/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-ukrainian-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-khmer/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-khmer-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-slovak/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-slovak-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-serbian/$ELASTICSEARCH_VERSION/extra-analysis-serbian-$ELASTICSEARCH_VERSION.zip,F684F0EC24A878FD
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-esperanto/$ELASTICSEARCH_VERSION/extra-analysis-esperanto-$ELASTICSEARCH_VERSION.zip,F684F0EC24A878FD
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-turkish/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-turkish-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-textify/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-textify-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a
https://people.wikimedia.org/~ebernhardson/ltr-1.5.4-wmf1-es7.10.2.zip,none
https://people.wikimedia.org/~dcausse/analysis-hebrew-7.10.2.zip,none
https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-s3/repository-s3-7.10.2.zip,none
OpenSearch core plugins are here:
https://github.com/opensearch-project/OpenSearch/tree/main/plugins
This includes the following plugins:
- analysis-icu (general Unicode support libraries)
- analysis-nori (Korean)
- analysis-smartcn (Chinese)
- analysis-stempel (Polish)
- analysis-ukrainian
- repository-s3 (enables storing snapshots on AWS S3)
I was only a little worried about Nori, SmartCN, Stempel, and Ukrainian, since they were developed by third parties but are in the same place in the Elasticsearch repo.
Note that OpenSearch also has analysis-kuromoji (Japanese), which we don't currently use, but which we do have configured, and which I would like to return to.
These are ours:
- experimental-highlighter
- extra
- extra-analysis-esperanto
- extra-analysis-homoglyph (confusable Latin/Cyrillic letters)
- extra-analysis-khmer
- extra-analysis-serbian
- extra-analysis-slovak
- extra-analysis-textify (general text: acronyms, camelCase, etc.)
- extra-analysis-turkish
- extra-analysis-ukrainian (unpacked version of analysis-ukrainian)
This is ours, but also theirs:
- ltr (Learning to Rank)
That leaves these:
- analysis-hebrew
Support for this plugin stopped years ago (it went commercial, IIRC) and we've been updating a fork ever since. David has made it available here, for example: https://people.wikimedia.org/~dcausse/ (file: analysis-hebrew-7.10.2.zip)
- analysis-stconvert (Traditional/Simplified Chinese characters)
This was the Chinese component I was most worried about. The maintainer (medcl) has started a company (infinilabs) and it has moved on GitHub to https://github.com/infinilabs/analysis-stconvert (though old URLs seem to still work, too).
They support Elasticsearch (currently through 8.4) and OpenSearch (currently through 2.12). They've moved releases off GitHub to their own website (https://release.infinilabs.com/analysis-stconvert/stable/), but that would be the same problem (if it's an issue at all) for both Elasticsearch and OpenSearch.
The ltr (Learning to Rank) plugin may need some tweaking in order to be adopted in a Wikimedia-hosted OpenSearch installation. analysis-hebrew's JAR can be moved to Wikimedia's GitLab installation. analysis-stconvert appears to bear some challenges for future upgrades beyond OpenSearch 2.12 and Elasticsearch 8.4. It's anticipated that there are bound to be some challenges in adoption or migration of analyzers due to changes to OpenSearch, and they will be addressed if encountered, but this is believed to be acceptable.
If there is a hard blocker for language or machine-learned ranking support, it will be surfaced so that it's clear what effect this may involve.
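The plugin comparison in this section can be sketched in a few lines of Python. This is a minimal illustration, not the tooling the team used: the `plugin_name` helper and the `ENTRIES` sample are assumptions introduced here, while the `URL,key` entry format and the OpenSearch core plugin names come from the lists quoted above.

```python
import re

# OpenSearch core plugins named in this section.
OPENSEARCH_CORE = {
    "analysis-icu", "analysis-nori", "analysis-smartcn",
    "analysis-stempel", "analysis-ukrainian", "repository-s3",
}

# A subset of the installed-plugin entries quoted above ("URL,key" format).
ENTRIES = [
    "https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-$ELASTICSEARCH_VERSION.zip,D27D666CD88E42B4",
    "https://repo1.maven.org/maven2/org/wikimedia/search/extra-analysis-khmer/$ELASTICSEARCH_VERSION-wmf12/extra-analysis-khmer-$ELASTICSEARCH_VERSION-wmf12.zip,b003a4aef71a5b1a",
    "https://people.wikimedia.org/~dcausse/analysis-hebrew-7.10.2.zip,none",
    "https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-s3/repository-s3-7.10.2.zip,none",
]


def plugin_name(entry):
    """Hypothetical helper: derive the plugin name from a 'URL,key' entry."""
    url, _key = entry.rsplit(",", 1)
    filename = url.rsplit("/", 1)[-1]
    if filename.endswith(".zip"):
        filename = filename[: -len(".zip")]
    # Cut at the first version marker: "-$ELASTICSEARCH_VERSION" or "-<digit>".
    return re.split(r"-(?:\$ELASTICSEARCH_VERSION|\d)", filename, maxsplit=1)[0]


names = [plugin_name(e) for e in ENTRIES]
in_core = sorted(n for n in names if n in OPENSEARCH_CORE)
needs_review = sorted(n for n in names if n not in OPENSEARCH_CORE)

print("Shipped with OpenSearch core:", in_core)
print("Needs a separate migration path:", needs_review)
```

Plugins that land in `needs_review` are the ones discussed individually above: Wikimedia-maintained plugins and third-party plugins such as analysis-hebrew and analysis-stconvert.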
Migration difficulties for third-party MediaWiki installations
Migrations from Elasticsearch 6 to Elasticsearch 7 were said to involve nontrivial effort for third-party MediaWiki installations (although it probably varied from case to case). It's anticipated that a migration from Elasticsearch 7 to Elasticsearch 8 would similarly entail nontrivial effort. Most likely, an Elasticsearch 7 to OpenSearch migration would be more complicated.
The mitigations for this challenge involve:
- An early email deprecation notice to pertinent mailing lists
- Deprecation notices in code
- Possibly, continued attention to Elasticsearch 7-based support until the official MediaWiki release following the first introduction of OpenSearch technology. For example, if OpenSearch technology is introduced in MediaWiki 1.44, a deprecation notice could be included in that 1.44 release, and MediaWiki 1.45 would then remove support for Elasticsearch 7. If the OpenSearch technology introduction arrived toward the tail end of the MediaWiki 1.44 release candidate period, it may make sense to instead set the hard removal of Elasticsearch 7 support for MediaWiki 1.46. Note, these versions are just examples.
This last mitigation is listed only as a possibility because it would be easier on the maintenance side to support only one version.
In that case, for example, Elasticsearch 7 support would be removed in the same release that introduces OpenSearch, such as MediaWiki 1.44. Again, this version is just an example.
In practice, it's sometimes necessary to support application layer code capable of using two backend technologies during a transition, but ideally this is avoided.
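Where application code does have to handle two backends during a transition, one common approach is to branch on what the cluster reports about itself. The sketch below is illustrative only and is not taken from CirrusSearch: `classify_backend` is a hypothetical helper, and it assumes the cluster's root endpoint (`GET /`) response has already been fetched and decoded, with OpenSearch identifying itself through a `version.distribution` field that Elasticsearch 7.10 does not set.

```python
def classify_backend(root_response):
    """Return (engine, major_version) from a decoded `GET /` response body.

    Assumption: OpenSearch reports version.distribution == "opensearch";
    anything else is treated as Elasticsearch.
    """
    version = root_response.get("version", {})
    if version.get("distribution") == "opensearch":
        engine = "opensearch"
    else:
        engine = "elasticsearch"
    major = int(version.get("number", "0").split(".")[0])
    return engine, major


# Hand-written, trimmed sample payloads; not captured from a real cluster.
es7 = {"version": {"number": "7.10.2"}}
os2 = {"version": {"distribution": "opensearch", "number": "2.12.0"}}

print(classify_backend(es7))  # ('elasticsearch', 7)
print(classify_backend(os2))  # ('opensearch', 2)
```

Keeping the check in one place limits how much calling code needs to know about the transition, which makes the dual-backend period easier to end.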
Elastica wireup
The Elastica PHP-based interface, which is used by the CirrusSearch MediaWiki plugin (via use of the Elastica MediaWiki plugin) with Elasticsearch 7 wireup, is not officially supported for OpenSearch[1][2][3]. Therefore, it may become necessary to maintain a fork of Elastica, or work on its replacement in order to prevent accumulation of technical debt associated with a migration to OpenSearch instead of using Elasticsearch 8.
Initially this was implicitly assumed to be a downside of migrating to OpenSearch. However, the effort required to migrate the Wikimedia version of Elastica to a version supporting Elasticsearch 8 (Elastica claims Elasticsearch 8 support) is also not determined.
The code for the Elastica MediaWiki plugin is somewhat simple from the perspective of a caller of the plugin. Mitigations for possible breaking changes regarding Elastica include:
- Use / extension of Elastica code (theoretically backward compatible as OpenSearch ought to have API compatibility with Elasticsearch 7 API invocation) unless a hard fork is required
- Migration away from Elastica and toward the official OpenSearch PHP client; this would likely require a couple rounds of refactors to existing Elastica use
- Some twist on these two approaches
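One way to picture the first round of such a refactor is to put a narrow interface between calling code and the client library, so that the client can later be swapped without touching callers. The sketch below uses Python rather than PHP purely for illustration; `SearchClient`, `InMemoryClient`, and `title_search` are hypothetical names and do not correspond to the Elastica, CirrusSearch, or OpenSearch PHP client APIs.

```python
from typing import Protocol


class SearchClient(Protocol):
    """Hypothetical seam: the only surface calling code depends on."""

    def search(self, index: str, query: dict) -> list: ...


class InMemoryClient:
    """Stand-in backend for the sketch; a real adapter would wrap a client library."""

    def __init__(self, docs):
        self._docs = docs  # {index: [{"title": ...}, ...]}

    def search(self, index, query):
        term = query["match"]["title"].lower()
        return [d for d in self._docs.get(index, []) if term in d["title"].lower()]


def title_search(client: SearchClient, index: str, text: str) -> list:
    """Caller code written against the seam, not against a specific client."""
    return [d["title"] for d in client.search(index, {"match": {"title": text}})]


client = InMemoryClient({"wiki": [{"title": "OpenSearch"}, {"title": "Elastica"}]})
print(title_search(client, "wiki", "search"))  # ['OpenSearch']
```

With a seam like this in place, a second round could replace the adapter behind the interface while leaving callers such as `title_search` unchanged.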
Options considered
The three options considered were:
- Do nothing. Stay on Elasticsearch 7.10 until compelled to migrate. Through successive rounds of deliberation on its roadmap areas, Search Platform has determined that a migration should happen sooner rather than later.
- Migrate to Elasticsearch 8. Based on initial analysis and some guesswork, this is presently ruled out.
- Migrate to the latest stable version of OpenSearch, presently OpenSearch v2. A migration to OpenSearch v1 is required on the path to OpenSearch v2, but the upgrade between v1 and v2 is believed to be straightforward. The main code level work is analyzer/plugin migration and probable Elastica work. There is significant work in shifting workload from Elasticsearch 7.10 to its replacement, as well as a need to communicate sufficiently with additional stakeholders for their own change management, as is typical with any migration.
Decision
Search Platform plans to migrate from Elasticsearch 7.10 to OpenSearch v2 (possibly OpenSearch v3 depending on release timing and compatibility in case of an OpenSearch v3 release).
Positive Consequences
- Reduction of risk of outdated software
- Unambiguous licensing
- Reduced risk of license key-required paid features or API challenges for ML-based and other search improvements
Negative Consequences
- More effort in migrating analyzers / plugins
- Probably more effort in Elastica PHP library transition
- Probable additional burden for third-party MediaWiki installations
Consequences
(TODO) Describe the context after you apply the decision, including effects on people and future work.