Wikimedia Language and Product Localization/Newsletter/2024/January/dtp
Welcome to the January 2024 edition of the Language and internationalization newsletter by the Wikimedia Foundation Language team!
This newsletter provides quarterly updates on new feature development, improvements to language-related technical projects and support work, community meetings, and ideas for getting involved in the projects.
Key highlights
Fon Wikipedia officially launched after five years of development in the Wikimedia Incubator
Fon Wikipedia, born at Wikimedia Hackathon 2018 in Barcelona, has officially launched after graduating from the Incubator! Fon is spoken by millions in Benin and Togo and is the mother tongue of many. It is also widely used in Benin as a national language. Creating the new Fon Wikipedia took five years. Because many people could not write in Fon, and native languages in Africa get less attention than others, building a community to support the project was a tough challenge for the members who started it.[1] Also, discover more about the four new Wikimedia language projects that were approved recently (Wikipedia Dagaare, Wikipedia Moroccan Amazigh, Wikipedia Toba Batak, and Wikiquote Banjar).
Introducing Sentencex, a tool for enhanced Natural Language Processing (NLP) and multilingual sentence extraction
The Language team has just launched a new tool called Sentencex, now available in both Python and JavaScript. Sentence segmentation, an essential part of natural language processing, involves breaking a text down into individual sentences. This process has various uses and helps improve language functionality and speed, especially in Wikimedia's new machine translation system (MinT) and the section translation project.[2]
You can find the tool on GitHub and see it in action.
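To make the idea of sentence segmentation concrete, here is a deliberately naive sketch in Python. It is not Sentencex and does not use its API; the helper name `naive_segment` and the splitting rule are assumptions made purely for illustration. It only splits after `.`, `!` or `?` when followed by whitespace.

```python
import re

# Hypothetical helper for illustration only; this is NOT the Sentencex API.
# A boundary is any whitespace run that directly follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_segment(text: str) -> list[str]:
    """Split text into sentences using a simple punctuation rule."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


sentences = naive_segment("Fon Wikipedia has launched. Is MinT available? Yes!")
# sentences == ["Fon Wikipedia has launched.", "Is MinT available?", "Yes!"]

# The same rule wrongly breaks after an abbreviation:
broken = naive_segment("Dr. Smith arrived.")
# broken == ["Dr.", "Smith arrived."]
```

The second call shows why a simple rule like this falls short: abbreviations, and punctuation conventions that differ between languages, are the kind of cases a dedicated multilingual segmenter needs to handle.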
MinT translation service available to 55 new Wikipedias, nearly doubles published translations, ranks second in usage
The new machine translation service, MinT, now offers machine translation for the first time to 55 Wikipedias and has had a positive impact on Wikimedia language communities. This extensive language support has nearly doubled published translations, and articles created using MinT have a low deletion rate (1.72%). In just a few months, MinT has come to be used in 8% of the translations published with Content Translation, making it the second most used translation service on Wikipedia after Google Translate.[3]
Open language identification service now available for 200+ languages
The Language team created an open language identification service that automatically detects the language in which a given text is written, simplifying users' interaction with Wikimedia platforms. The service supports the detection of 201 languages, and anyone can access the API to use it. Final checks of the service, including an evaluation of its ability to withstand high traffic, are currently underway.[4]
Wikisource now recognizes handwritten texts with Transkribus
Handwritten text recognition is now active on Wikisource through the Transkribus OCR Engine. Transkribus, an AI-powered platform, simplifies the handling of handwritten or printed manuscripts by offering various models tailored to different writing scripts, historical periods, and other factors. The Transkribus engine is now available as an option alongside Google and Tesseract, and it is currently operational on the Wikisources listed on this page.[5]
Unified section translation dashboard for desktop and mobile users
The Language team is actively working towards the adoption of a unified section translation dashboard for both desktop and mobile users. Originally designed for mobile in Content Translation, it is now being refined to serve as a unified dashboard across platforms, providing an improved translation environment. The dashboard is currently in beta, and you can test it on Test Wikipedia or any Section Translation-enabled wiki using the URL parameter "unified-dashboard=true" (e.g., ig.wikipedia.org/wiki/Special:ContentTranslation?unified-dashboard=true).
This unified dashboard offers a seamless cross-platform translation experience. Users can start translating on their desktop and continue on a mobile device, or vice versa. It also supports section translations on the desktop, giving users flexibility across devices.
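As a small illustration of the URL parameter mentioned above, the following Python sketch builds the beta dashboard link for a given wiki. The function name `unified_dashboard_url` is hypothetical, and the `https://` scheme is an assumption, since the example in the text omits it; the path and the parameter are taken from that example.

```python
from urllib.parse import urlencode


def unified_dashboard_url(wiki_host: str) -> str:
    """Build the Special:ContentTranslation URL with the beta flag enabled.

    Hypothetical helper; the https scheme is assumed, while the path and
    the parameter name come from the newsletter's example.
    """
    query = urlencode({"unified-dashboard": "true"})
    return f"https://{wiki_host}/wiki/Special:ContentTranslation?{query}"


url = unified_dashboard_url("ig.wikipedia.org")
print(url)
```

Passing the host of another Section Translation-enabled wiki should produce the corresponding link for that wiki.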
Community meetings and events
- The upcoming language community meeting is scheduled for Wednesday, February 21st, 12:00 to 13:00 UTC. If you wish to participate, sign up using the provided link. Want to share a technical update about your project? Feel free to add it to the Technical updates section in the agenda document.
- In case you missed the first language community meeting in November 2023, you can catch up by watching the video recording and reading the notes.
Get involved
- If you are looking for technical tasks, take a look at the easy tasks that haven't been assigned yet in various language project repositories on Wikimedia Phabricator.
- If you are looking for tools to edit and translate articles and interface messages, you can use Content Translation and the Special:Translate tool on translatewiki.net. These tools make it easier to work with content in different languages.
- Report feedback on talk pages of language tools.
Stay tuned for the next release! You can subscribe to this newsletter.
References
- [1] https://diff.wikimedia.org/2023/10/04/welcome-to-the-fon-wikipedia/
- [2] https://diff.wikimedia.org/2023/10/23/sentencex-empowering-nlp-with-multilingual-sentence-extraction/
- [3] https://diff.wikimedia.org/2023/11/20/unlocking-the-worlds-languages-in-wikipedia-a-look-into-mints-impact-so-far/
- [4] https://diff.wikimedia.org/2023/10/24/open-language-identification-api-for-200-languages/
- [5] https://diff.wikimedia.org/2023/07/13/enabling-handwritten-text-recognition-on-wikisource-using-transkribus-ocr-engine/