MinT (Machine in Translation) is a machine translation service based on open-source neural machine translation models. The service is hosted in the Wikimedia Foundation infrastructure, and it runs translation models that have been released by other organizations with an open-source license. An open machine translation service can be a key piece of the essential infrastructure of the ecosystem of free knowledge. This page captures the initiatives to scale the service and make this infrastructure more widely available.
About MinT
MinT is designed to provide translations from multiple machine translation models. Initially, it uses the following models:
- NLLB-200. The latest model from the No Language Left Behind project by a research team at Meta. This model supports translation across 200 languages, including many that are not supported by other vendors.
- OpusMT. The OPUS (Open Parallel Corpus) project from the University of Helsinki compiles multilingual content with a free license to train the OpusMT translation models. Anyone can easily help improve the translation quality by participating in the different projects that contribute data to OPUS. For example, when using Content Translation to create translations of Wikipedia articles, the data on published translations will be incorporated as a new resource to improve the translation quality for the next version of the model. Another quick way to contribute is to provide sentence translations with Tatoeba.
- IndicTrans2. The IndicTrans2 project provides translation models to support over 20 Indic languages. These models were developed by AI4Bharat@IIT Madras, a research group at the Indian Institute of Technology Madras.
- Softcatalà. Softcatalà is a non-profit organization with the goal to improve the use of Catalan in digital products. As part of the Softcatalà Translation project, translation models used in their translator service to translate 10 languages to and from Catalan have been released.
MinT supports over 200 languages, with more than 50 languages not supported by other services (including 27 languages for which there is no Wikipedia yet). You can read more about the initial release of MinT and check some frequently asked questions in the summary page for the service.
Technical details
The translation models have been optimized for performance using the OpenNMT CTranslate2 library in order to avoid the need for GPU acceleration. This makes it easier for organizations and individuals to build and run their own instances. For more details you can check the source code, the API spec, and a test instance.
MinT provides a platform to run multiple translation models. In order to support different initiatives, aspects such as sentence segmentation, language detection, pre/post-processing of content, and rich format support have been developed on top of the plain-text based models.
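As a rough illustration of how such layers can sit on top of a plain-text model, the sketch below segments a text, sends each sentence to a model, and rejoins the results. It is a minimal sketch under stated assumptions, not MinT's actual code: `split_sentences`, `translate_text`, and `fake_model` are hypothetical names, the segmentation rule is deliberately naive, and the "model" only upper-cases its input so the flow is visible.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_text(text: str, translate_sentence: Callable[[str], str]) -> str:
    """Segment the text, translate each sentence, and rejoin with single spaces."""
    sentences = split_sentences(text)
    translated = [translate_sentence(s).strip() for s in sentences]
    return " ".join(translated)


def fake_model(sentence: str) -> str:
    """Hypothetical stand-in for a plain-text translation model."""
    return sentence.upper()


print(translate_text("Hello world. How are you?", fake_model))
# prints: HELLO WORLD. HOW ARE YOU?
```

Passing the model as a callable keeps the surrounding steps independent of any particular translation model, which is one plausible way to let several models share the same pre- and post-processing.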
Get involved
Feel free to share any feedback in the discussion page. Planned improvements are captured in Phabricator (more info), where you can report wrong behavior or propose feature enhancements, track the progress of any task, and share your perspective on it. For completed work you can also check the status updates below.
MinT for translators
Translation is a common way to contribute in the Wikimedia ecosystem for multilingual users. Machine translation can provide a useful initial translation for users to review and improve. The Language team has developed tools that support translators in their workflows and can integrate different machine translation services to speed up the process. Once MinT was available, integrating it with these tools was a logical next step to amplify their impact. MinT is available in the following projects:
- Content Translation. Content Translation provides guidance to create a translation of a Wikipedia article into another language. Content Translation integrates several translation services to provide an initial translation.
- Localization infrastructure. The Translate extension provides the infrastructure used to translate our software and multilingual pages. Communities of translators use it on translatewiki.net, Wikimedia Meta-Wiki, MediaWiki.org and more.
MinT for Wikipedia readers
The number of topics and the amount of information a reader can learn about from Wikipedia depend on the languages they speak. Machine translation can help people to learn more about their topics of interest when the content is not available in their language.
This initiative explores how to surface the machine translation support from MinT in Wikipedia articles in a way that:
- Allows readers to learn more about their topics of interest from other languages.
- Clearly differentiates automatically generated content from community-created content.
- Encourages readers to contribute to community-created content when possible.
At the moment the Language team is working on the design and research aspects of the project to identify the best ways to surface MinT on Wikipedia and the technical explorations for the service to work in this context.
MinT more widely available
Working on the previous initiatives will help to polish and solidify the system. For now, the MinT API is only available for Wikimedia products. As the system gets ready, we'll consider a wider exposure. Providing a service that can be used by communities in innovative ways can be a very powerful tool. New initiatives to make MinT more widely available will be captured here in the future. Meanwhile, feel free to configure your own MinT instance to experiment with it.
Status updates
October 2023
- MinT is now supported in Content Translation for Fon, a Wikipedia that recently graduated from the Incubator.
- Announced the sentencex library: sentencex: Empowering NLP with Multilingual Sentence Extraction - a Python and JavaScript library to meet the needs of sentence segmentation for all the languages we support.
- Proposed model card for language identification as part of the creation of a LiftWing service to provide those capabilities for MinT and others.
- The new sentence segmentation approach has been exposed in Content and Section Translation to validate it with real content. Resolved community-reported issues such as the problems translating court cases.
- The MinT test instance now shows language names consistent with Wikipedia by using Wikipedia APIs instead of the limited browser localization capabilities.
- Launched the Language Identification service to automatically detect the language in which a given text is written. The service supports the detection of 201 languages, and anyone can access the API to use the service or read the model card for more details. The Machine Learning team completed the last checks after deploying to LiftWing and evaluating that the service can "easily withstand a high amount of traffic".
- Added basic support for rich text translation: markup that applies styling, such as words in bold, is transferred from the source text to the equivalent words in the machine translation (which otherwise lacks formatting, since translation models operate on plain text).
- Completed the process to enable MinT for languages with no Wikipedia yet. Translation models in MinT support 25 languages for which there is no Wikipedia. Speakers of those languages can test them in MinT's test instance to assess quality, which also ensures that translation tools are well equipped once wikis are created for those languages (as has been the case with the recent graduation of Fon Wikipedia out of the Incubator).
- Completed the process to enable MinT for closely related languages based on community input. For some languages where machine translation is not available, Wikipedia editors have asked to have access to machine translation in Content Translation using a related language instead of having no support at all. With this enablement, translators of Gan (gan) Wikipedia will have machine translation based on the traditional script variant of Chinese as a starting point.
- Analysis of translation activity on 55 languages for which MinT provides machine translation for the first time shows that (a) translations have doubled since MinT became available, and (b) deletion rates have not increased. Activity for these 55 wikis rose from ~500 translations/month to 1K+ translations/month after MinT was enabled. For example, a recent peak of 2.15K translations was published in August 2023, when MinT was available for those languages, a significant increase from 225 translations in August 2022, when it was not.
- Better visibility of translation quality by including a tag in translations where unedited machine translation is close to the limits. This will facilitate analysis about translation quality and limits.
- Created prototypes for upcoming research illustrating 5 concepts on how MinT can be used by Wikipedia readers and supporting the 4 languages we will conduct research in: Hindi, Chhattisgarhi, Awadhi, and Korean.
- Improved MinT to process content containing line breaks more predictably.
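The markup transfer mentioned in the rich text item above can be pictured with a small sketch: strip the bold tags before translation, then wrap the equivalent words in the translated sentence. This is a simplified illustration, not MinT's implementation. How MinT actually locates the equivalent words is not described here, so the `glossary` lookup below is a hypothetical stand-in for that step, and `strip_bold` and `reapply_bold` are made-up names.

```python
import re
from typing import Dict, List, Tuple

BOLD = re.compile(r"<b>(.*?)</b>")


def strip_bold(html: str) -> Tuple[str, List[str]]:
    """Return the plain text plus the list of phrases that were in bold."""
    phrases = BOLD.findall(html)
    plain = BOLD.sub(r"\1", html)
    return plain, phrases


def reapply_bold(translation: str, phrases: List[str], glossary: Dict[str, str]) -> str:
    """Wrap the target-language equivalent of each bold phrase in <b> tags.

    The glossary is an assumption standing in for whatever alignment
    method a real system would use.
    """
    for phrase in phrases:
        target = glossary.get(phrase)
        if target and target in translation:
            translation = translation.replace(target, f"<b>{target}</b>", 1)
    return translation


plain, phrases = strip_bold("The <b>cat</b> sleeps.")
# 'plain' is what a plain-text model would receive: "The cat sleeps."
translation = "El gato duerme."  # assumed model output for the example
print(reapply_bold(translation, phrases, {"cat": "gato"}))
# prints: El <b>gato</b> duerme.
```

When no equivalent is found, the sketch leaves the translation unstyled rather than guessing, so formatting is lost but the text stays intact.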
September 2023
- Completed initial design exploration to illustrate 5 concepts on how to surface machine-translated contents from other languages for Wikipedia articles
- Completed enablements of MinT in Content Translation for Ligurian, where the community requested further clarifications about MinT, and the last set of 14 languages that could be supported with the NLLB-200 model.
- Enabled MinT for translatable pages on test wiki
- Expanded exposure of MinT with the enablement of Content Translation mobile and desktop experiences as default in 7 Wikipedias supported by MinT (Cherokee, Tongan, Hungarian, Kazakh, Kyrgyz, Minangkabau, and Sardinian).
- Completed the validation for all languages supported by the translation models used by MinT as part of the final QA for enabling the new translation service.
- Santhosh presented at the 10th Workshop on Asian Translation, emphasizing the need for machine translation to be universal, free, and available in more languages. The message was well received by the attendees.
- Research planning started with an initial draft of the research brief for MinT on Wikipedia
- Continued technical explorations for applying machine translation beyond plain text (which is what the underlying models provide) to support the Wikipedia context: a new, improved approach to sentence segmentation (with a demo page to try) that identifies more accurately where a sentence ends in different languages and prefers not to split in case of doubt. Avoiding doubtful splits is preferable for machine translation because it keeps the context of a translation intact, for example by not misinterpreting the dot of an abbreviation as a full stop.
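The "do not split when in doubt" preference described in the segmentation item above can be sketched as follows. This is not the sentencex algorithm, only a minimal illustration of the idea: the abbreviation list is a tiny hypothetical sample for English, and a lowercase letter after the punctuation is treated as a sign of doubt.

```python
import re
from typing import List

# Illustrative sample only; a real system needs per-language data.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "vs", "e.g", "i.e"}


def segment(text: str) -> List[str]:
    """Split at sentence-final punctuation, skipping doubtful boundaries."""
    sentences: List[str] = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        next_char = text[match.end():match.end() + 1]
        is_abbreviation = match.group().startswith(".") and last_word in ABBREVIATIONS
        if is_abbreviation or next_char.islower():
            continue  # in doubt: keep the text together
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(segment("Dr. Smith arrived. He was late."))
# prints: ['Dr. Smith arrived.', 'He was late.']
```

An under-split sentence still reaches the model with its full context, whereas a wrong split hands the model two fragments, which is why the sketch errs on the side of longer segments.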
August 2023
- Successful exploration for the use of MinT to translate structured formats such as HTML, SVG and markdown.
- Completed the deprecation of Youdao, an external translation service that had been failing for a long time.
- Continued design exploration for MinT on Wikipedia with new and updated workflows based on feedback.
- Identified languages which can benefit the most from new OpusMT models
- Made MinT the default translation service for Zulu in Content Translation
July 2023
- Enabled machine translation with MinT (and communicating with communities) for 75 new languages: 62 languages where the mobile translation experience is available, and 13 languages where translation quality from other services may not be ideal based on the MT usage report data and/or community feedback.
- Validation of previous enablements: identified issues with Bhojpuri and with Latvian where MinT was not available due to mismatches with the language codes used by Wikipedias, MinT and the underlying translation models.
- Initial design explorations and prototypes on ways we could integrate MinT in Wikipedia
- Improved MinT translation post-processing to better support languages using the Arabic script by avoiding extra spaces after full stops.
- Completed the integration of the IndicTrans2 model by verifying the enablement of all their 23 supported languages.
- Initial analysis of activity for Wikipedia communities that are supported with MinT for the first time to identify potential pilot wikis for future research and as early adopters.
- Enablement of MinT on translatewiki.net for the use in localization of Wikimedia and other open projects.
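The language-code mismatches noted in the validation item above arise because a wiki, the service, and a translation model may each identify the same language with a different code. One way to handle this is an explicit mapping layer, sketched below. The table name, the function, and the specific codes are illustrative assumptions and are not taken from MinT's configuration.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from wiki language codes to model-specific codes.
# The entries are examples only and should be verified against each model.
WIKI_TO_MODEL_CODE: Dict[str, Dict[str, str]] = {
    "nllb": {"bh": "bho_Deva", "lv": "lvs_Latn"},
}


def resolve_model_code(wiki_code: str, model: str, supported: Set[str]) -> Optional[str]:
    """Translate a wiki code into the model's code, or None if unsupported."""
    model_code = WIKI_TO_MODEL_CODE.get(model, {}).get(wiki_code, wiki_code)
    return model_code if model_code in supported else None


supported_by_model = {"bho_Deva", "lvs_Latn"}
print(resolve_model_code("bh", "nllb", supported_by_model))
# prints: bho_Deva
```

Returning `None` for an unresolved code makes a missing mapping visible at lookup time, rather than letting a language silently appear as unsupported.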