Talk:MinT


Welcome to the MinT page


You can use this page to start a discussion with others about how to improve MinT. Thank you! UOzurumba (WMF) (talk) 21:16, 25 September 2023 (UTC)Reply


IndicTrans2. The IndicTrans2 project leads to -> https://ai4bharat.iitm.ac.in/indic-trans2 -> Oops! That page can’t be found.
Can someone realign please ? Thanks. -- Christian 🇫🇷 FR (talk) 16:42, 28 September 2023 (UTC)Reply

Hello Christian 🇫🇷 FR,
I checked the link and it seems to be working fine now. Thank you! UOzurumba (WMF) (talk) 19:54, 4 October 2023 (UTC)Reply
ok now too   Done. We leave unchanged. Thanks for ACK. Christian 🇫🇷 FR (talk) 06:25, 5 October 2023 (UTC)Reply

MinT should match with translation database we have contributed, but does not.


Please de-deploy MinT for en-ja translation, or give me a "stop" button so I can work without it. I need an option to turn MinT off; where can I do that? Are we sure the problem stems from MinT being a combination/collaboration of two systems? Are there any language pairs for which it outputs acceptable translations? When are we going to import the translation database from the previous system? That thesaurus is very precious, as the circle of translators has spent so many hours building it.

Again, I am talking about the en-ja language pair, and it is not practical to keep using MinT for it, whether the subject is technical or not (details below). CX2 has been nulled for ja users; for vocabulary, however, it worked much better.

At the moment, if I do not turn off MinT for en-ja translation of Tech News:

  • I have to open past issues and c&p the correct expressions;
  • Tech News has so many set phrases/expressions endemic to it, like recurring updates and so forth;
  • Wiki markup is not only neglected but replaced with wrong characters: bold ''' becomes plain quotation marks ("). Why is such a primitive error present?
  • I do not feel confident, as so many sentences need to be manually c&p from past issues, which does not sound in line with our attribution policies to me as a Wikipedian.

If we need to invest in and train the new MT system and its dictionary, is that paid for out of translators' pockets? Do we use MinT with bitterness on our tongues until we see it become usable?

I appreciate how an MT system takes care of matching against the translation database, which is exactly why a low-level system should be turned off for certain language pairs, AFAIK. In my personal experience, I *need* to disregard MinT's suggestions approx. 85% of the time, for these reasons:

  • 40% because it does not parse grammar correctly and inserts symbols I need to delete manually;
  • 35% because its dictionary does not match Wikimedia-specific terminology, on which translators had trained the previous system;
  • 20% because it replaces wiki markup wrongly; as above, for bold letters, ''' needs to stay as is, but MinT replaces it with plain quotation marks (");
  • 5% because I can't trust its dictionary: for the country name Belarus, MinT outputs Belgium. /: What kind of bug can induce such a primitive error?

MinT is below my expectations as an en-ja translator. Too bad I will not enjoy the MT assistance any more, while the old system pampered its users, me included, by saving almost 40% of working time.

FYI, my usecase:

  1. Given the design of Tech News, translating from scratch is wasteful: recurring information should keep the same sentence format, which assures readers of the /ja pages that translators understand what we are doing.
  2. On ESEAP issues, the original English text is often itself a translation from the poster's native language, which means much guesswork is involved in supplying a secondary translation; looking into Wikidata often helps me match strange terminology to organization names or wiki teams.

Crossing my fingers that other language pairs are not affected this badly. Cheers, --Omtecho Omotecho (talk) 06:17, 30 September 2023 (UTC)Reply

Thanks for the feedback, @Omotecho.
MinT is a new initiative still in active development. It is not replacing any previous system: the suggestions from Translation Memory or other services like Apertium are still shown when they are available. Translation memory suggestions (previous translations by editors of similar messages) are given priority and shown above the machine translation ones (in this example MinT suggestions are shown at the bottom of the list).
MinT uses different machine learning models to produce the translations. I'll provide more detail on some of the types of issues you are experiencing:
  • Translation models support plain-text translation, and we are building support for more complex formats such as HTML and Wikitext on top of them. For example, improvements to support Wikitext are captured in this ticket. The issues with Wikitext can result in both (a) markup not showing correctly in the result and (b) contents being wrongly translated because markup gets in the way (e.g., resulting in a sentence being cut in half and the halves translated independently, which leads to wrong translations). As Wikitext support is improved, these issues should reduce significantly.
  • For machine learning models the quality of the translation depends on the amount and quality of the training data. By providing more examples of good translations, the models can be improved. Currently, translating Wikipedia articles with Content Translation or contributing to Tatoeba are two easy ways to generate more quality data to improve the models. We also plan to integrate localization data from the Translate extension (more details in this ticket). In addition, contributing more Wikipedia-specific data will result in translations that align better with the community expectations.
As I mentioned, MinT is in active development and it has room for improvement, but for polishing a system that supports over 200 languages it is very useful to expose it to the communities in ways that they can help make it better.
Thanks! Pginer-WMF (talk) 08:37, 5 October 2023 (UTC)Reply
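As an illustration of the markup issue discussed above (bold ''' turning into plain quotation marks), here is a minimal sketch of one common workaround: mask the markup with placeholders before handing the text to a plain-text translator, then restore it afterwards. This is not how MinT is implemented; the function names, the placeholder characters, and the restriction to bold/italic markup are all assumptions made for the example, and it only works if the translator passes the placeholders through unchanged.

```python
import re

# Hypothetical placeholder delimiters; assumed to survive translation untouched.
OPEN, CLOSE = "\u27e6", "\u27e7"

# Illustrative subset of wiki markup: bold (''') and italic ('').
MARKUP = re.compile(r"'{2,3}")
PLACEHOLDER = re.compile(OPEN + r"(\d+)" + CLOSE)


def mask_markup(text):
    """Replace each markup token with a numbered placeholder."""
    tokens = []

    def repl(match):
        tokens.append(match.group(0))
        return f"{OPEN}{len(tokens) - 1}{CLOSE}"

    return MARKUP.sub(repl, text), tokens


def unmask_markup(text, tokens):
    """Put the original markup tokens back in place of the placeholders."""
    return PLACEHOLDER.sub(lambda m: tokens[int(m.group(1))], text)


if __name__ == "__main__":
    source = "'''Tech News''' is ''weekly''"
    masked, tokens = mask_markup(source)
    # str.upper stands in for a real plain-text translation call.
    print(unmask_markup(masked.upper(), tokens))
```

The trade-off is that the translator still sees the placeholders inside the sentence, so a sentence can still be split or reordered around them; masking only protects the markup characters themselves.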
@Pginer-WMF, hello. Sharing Free Knowledge is the initiative we both share, which is why I am disturbed by MinT. As I have a background as a dictionary/translation database editor at a private MT developer more than 20 years ago, and have been reading papers on MT development ever since, I can't agree with you.
For content translation, mind you I have tried to support and fill the samples, before MinT came into our view. However, ja speaking community did not find it necessary.[1] And cleaning the mess careless users had brought in with CX2 sits heavy on the same community.
When an MT engine is unsuited in concept, we can't train it or make it usable, and MinT falls into that category. As you are aware, a ticket has been filed as T348361 on this matter; on Meta, Lemonaka and I have filed an RfC to stop MinT for en-ja translation.
I wish you or anybody from your team would join discussion on AAMT aka Asia Pacific Association for Machine Translation as specialist with reliable data. I believe WMF tech team is much more experienced dealing with Good Faith users or those eager to contribute to the largest digital Encyclopedia and scientific entry on the catalogue of Species, and many of us are not experts in all fields, but who want to share Human Knowledge. Which makes WMF tech teams very unique in the field of MT AFAIK, compared to those major PC manufacturers as well as software giants who target commercial users.
FYI, the commercial MT systems have not inflated their market in past 20 yrs globally in regards to en/ja language pair, compared to other pairs in the market. Some claim theirs as the best, but in very narrow field of topics they specialize. A number of users off-wiki support particular application, but that is no proof that any one of those MT app is superior to other MT apps.
Or anyway, users will c&p from other web MTs and produce fake translations like this one.[2] Then, looking back at MinT, why do we keep an inferior system which is not on par even with that low quality? What do we gain as we limit the discussion to the en-ja language pair?
Kindly, --Omotecho Omotecho (talk) 16:07, 7 October 2023 (UTC)Reply

translate.wmcloud.org inaccessible


The test instance mentioned on this page seems to be inaccessible. It returns a generic Wikimedia Cloud Services "cannot be reached" error page. Chlod (talk) 04:03, 12 October 2023 (UTC)Reply

Maybe the connection was busy? It works now; I tried it with the en-ja language pair. Omotecho (talk) 14:56, 12 October 2023 (UTC)
Looks like it works now. Must have been a hiccup. Thanks for checking! Chlod (talk) 02:30, 21 October 2023 (UTC)Reply

Excerpts on NLLB-200 model card

NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general domain text data and is not intended to be used with domain specific texts, such as medical domain or legal domain. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, therefore translating longer sequences might result in quality degradation.

—NLLB Team "No Language Left Behind : Scaling Human-Centered Machine Translation" page 183

Meanwhile, MinT is clearly using the NLLB-200 model:

Translated in 2.07 seconds by nllb200-600M model

—Footnote notice on MinT, after the translation is complete.


Rtnf (talk) 14:29, 24 October 2023 (UTC)Reply
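To make the 512-token limit quoted above concrete, here is a minimal sketch of splitting a long text into sentence groups that each stay under a token budget before translation. It is not taken from MinT or NLLB-200: the function name is hypothetical, the sentence splitter is a naive regex, and whitespace-separated words are used as a crude stand-in for the model's real subword tokenizer, which would normally count more tokens per sentence.

```python
import re


def chunk_sentences(text, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Group sentences into chunks whose token count stays within max_tokens.

    count_tokens is a crude whitespace proxy by default (an assumption);
    pass the real tokenizer's counting function for accurate budgets.
    A single sentence longer than max_tokens is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, used = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight."
    for chunk in chunk_sentences(sample, max_tokens=6):
        print(chunk)
```

Chunking keeps each request inside the trained input length, but each chunk is then translated without the context of its neighbours, which is consistent with the model card's caution about document translation.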

Is *new* translation memory still being created?


With MinT, are new translation "memories" being created? That is, if I am translating a longer project (in my case, a whole course, made up of multiple pages and videos [as subtitles]), would MinT help encourage consistency in terminology based on the way certain terms were translated earlier? Asaf (WMF) (talk) 16:54, 14 November 2023 (UTC)Reply

Hello Asaf (WMF),
Currently, there is no translation memory. However, this feature request is known and captured in this ticket: https://phabricator.wikimedia.org/T96165. UOzurumba (WMF) (talk) 04:51, 20 November 2023 (UTC)Reply

Needs improvement for Kanuri language


I tried MinT to translate some articles into Kanuri, but most of the translations are not correct! I hope it will be improved for a better experience. MohammedBama123 (talk) 12:31, 21 November 2023 (UTC)

Yes, the Kanuri language needs improvement; there are a lot of errors. Umargana1 (talk) 20:55, 10 February 2024 (UTC)
Thank you, @MohammedBama123 and @Umargana1 , for your feedback, and I apologize for the late reply. So, would you say that Machine translation model in the MinT is not a good aid at all for translation? Maybe you can give an idea of how bad it is on a scale of 1 to 10. UOzurumba (WMF) (talk) 17:17, 23 May 2024 (UTC)Reply
Hi @UOzurumba (WMF), I can rate it 5/10 because it just keeps repeating words that it didn't understand. MohammedBama123 (talk) 17:52, 23 May 2024 (UTC)
Thank you, @MohammedBama123, for your reply. I have noted the repetition of words; with usage, the quality of the machine translation will improve.
UOzurumba (WMF) (talk) 15:40, 26 May 2024 (UTC)Reply
Alright thanks for the update. MohammedBama123 (talk) 16:23, 27 May 2024 (UTC)Reply

Tamil Wikipedia Content Translation


I have checked Tamil Wikipedia Content Translation and I took the sample from en.wiki page, Supreme Court of India.

The source was:

The Supreme Court of India is the supreme judicial authority and the highest court of the Republic of India. It is the final court of appeal for all civil and criminal cases in India. It also has the power of judicial review. The Supreme Court, which consists of the Chief Justice of India and a maximum of fellow 33 judges, has extensive powers in the form of original, appellate and advisory jurisdictions.

Result from IndicTrans2 machine translation model:

இந்திய உச்ச நீதிமன்றம் என்பது இந்திய குடியரசின் உச்ச நீதித்துறை அதிகாரம் மற்றும் மிக உயர்ந்த நீதிமன்றமாகும். இது இந்தியாவில் உள்ள அனைத்து சிவில் மற்றும் கிரிமினல் வழக்குகளுக்கான இறுதி மேல்முறையீட்டு நீதிமன்றமாகும். நீதித்துறை மறுஆய்வு செய்யும் அதிகாரமும் இதற்கு உள்ளது. இந்திய தலைமை நீதிபதி மற்றும் அதிகபட்சம் 33 சக நீதிபதிகளைக் கொண்ட உச்ச நீதிமன்றம், அசல், மேல்முறையீட்டு மற்றும் ஆலோசனை அதிகார வரம்புகள் வடிவில் விரிவான அதிகாரங்களைக் கொண்டுள்ளது.

Result from Google translation:

இந்திய உச்ச நீதிமன்றம் என்பது இந்தியக் குடியரசின் உச்ச நீதிமன்ற அதிகாரம் மற்றும் உச்ச நீதிமன்றமாகும். இது இந்தியாவில் உள்ள அனைத்து சிவில் மற்றும் கிரிமினல் வழக்குகளுக்கான இறுதி மேல்முறையீட்டு நீதிமன்றமாகும். நீதித்துறை மறுஆய்வு செய்யும் அதிகாரமும் இதற்கு உண்டு. இந்தியாவின் தலைமை நீதிபதி மற்றும் அதிகபட்சமாக சக 33 நீதிபதிகளைக் கொண்ட உச்ச நீதிமன்றம், அசல், மேல்முறையீட்டு மற்றும் ஆலோசனை அதிகார வரம்புகள் வடிவில் விரிவான அதிகாரங்களைக் கொண்டுள்ளது.

Unfortunately, both look artificial rather than natural; both also use transliterations instead of proper Tamil words. AntanO (talk) 07:24, 18 January 2024 (UTC)

Add Dobrujan Tatar


Hi there,

I have a request to add Dobrujan Tatar. It is seen as a "form" of Crimean Tatar, similar to the relationship between Tajik and Persian. For the alphabet and grammar this book can be helpful, and many translations can be found in Dobrujan Tatar — Romanian dictionaries, Dobrujan Tatar — Latin dictionary (ornithology) and Dobrujan Tatar — Latin dictionary (botanic). Some examples of translations can be found in translated books:

Zolgoyo (talk) 15:07, 2 February 2024 (UTC)Reply

Feedback from the Igbo Wikipedians about MinT for Wiki Readers


Thank you for testing the MinT for Wiki Readers feature! Please leave your feedback below. UOzurumba (WMF) (talk) 06:41, 10 June 2024 (UTC)

I explored the tool today and I can attest that I like what I saw... Knowing the rigorous process of copying and pasting for translation compared to the automatic translation of this tool, I commend its user friendly feature. Iwuala Lucy (talk) 19:22, 11 June 2024 (UTC)Reply
Thank you @Iwuala Lucy, your feedback is noted. UOzurumba (WMF) (talk) 19:05, 13 June 2024 (UTC)Reply
Having tried the tool and searched some topics, I find it superb: it covers basic translation needs reliably, which makes it a useful tool. However, I noticed that it translates only the content, not the topic title. It would also be good to add a feature so that, whatever language the topic is typed in, its translation is shown in parentheses beside the original word when translating to another language. Above all, thanks for the solution; it is reliable. Thanks for the commitment. Dagentle (talk) 12:42, 12 June 2024 (UTC)
Thank you, @Dagentle, for your feedback. To be sure I understood it clearly: you are saying that we can improve it by ensuring the feature also translates the content title. Also, we can differentiate the translated title from the source content title by putting it in parentheses beside the original. Please let me know if the above is what you mean.
Best regards! UOzurumba (WMF) (talk) 19:21, 13 June 2024 (UTC)Reply
@UOzurumba (WMF), exactly what I am talking about. Thanks for your understanding and commitment to making this work. Dagentle (talk) 09:46, 9 August 2024 (UTC)
The MinT Tool is a really great tool. Its user-friendly nature makes it very easy to navigate to any language of choice, and that makes it even more interesting. I haven't noticed anything wrong with the tool yet; when I spot something, I will get back to you. For now, I love the tool and I commend the developers. Cheers. Nwonwu Uchechukwu P (talk) 07:52, 14 June 2024 (UTC)
The tool is great, I love to read more with it Stanley kadurumba (talk) 09:32, 14 June 2024 (UTC)Reply

Feedback from Wikipedians about MinT for Wiki Readers


Thank you for testing the MinT for Wiki Readers feature! Please leave your feedback below. UOzurumba (WMF) (talk) 19:24, 13 June 2024 (UTC)

The MinT Tool is a really great tool. Its user-friendly nature makes it very easy to access any language of choice, and that makes it even more important. — Preceding unsigned comment added by Ekenedilichukwupraise (talkcontribs)

Lovely development. But there should be an EDIT option right there on the page so that one can immediately correct any error(s) noticed. — Preceding unsigned comment added by Goodymeraj (talkcontribs)

– Thanks for your feedback! Currently, machine translations are provided by machine learning models which cannot be updated immediately. At the moment, the alternative we provide is to write a human-created translation of the contents with Content Translation. This does not cover the use case of making a small, quick fix when finding an issue in machine-translated content. We have plans to support this with community-provided translations (more details in this ticket). In this way, users can make a correction to the specific translation of a sentence, and that correction will be incorporated into a translation memory system, making it immediately available for future machine translation requests. Creating this system is exciting since it can help to improve machine translation quality, but it is not a trivial effort, so it will still take some time. Please feel free to subscribe to the linked ticket and share any comments or suggestions. Thanks! --Pginer-WMF (talk) 13:14, 17 June 2024 (UTC)

Feedback


Thank you for contacting our editor on Fulfulde Wikipedia. I am writing this on behalf of {{User:Adamu ab|Adamu ab}}. I used the tool today and I can attest that I like the new development. Compared with the hard process of copying and pasting text into a machine translator, the automatic translation of this tool is a welcome development.

Fulani215 (talk) 13:58, 19 June 2024 (UTC)Reply

Thank you @Fulani215, for your feedback. UOzurumba (WMF) (talk) 19:43, 19 June 2024 (UTC)Reply

Feedback from Asturian Wikipedia


Hi all. I am very impressed by this development. I'm willing to see it enabled by default on the Asturian Wikipedia. Thanks for this exciting feature! YoaR (talk) 14:51, 1 July 2024 (UTC)Reply

Thank you, @YoaR, for your feedback. We will soon enable the MinT for Wiki Readers in Asturian Wikipedia. UOzurumba (WMF) (talk) 21:16, 2 July 2024 (UTC)Reply

Feedback from Turkish Wikipedia


Full support by me. I wanna use this tool on the Turkish Wikipedia. Good job! Lustrouss (talk) 16:19, 3 July 2024 (UTC)Reply

Feedback and required improvement for the Kadazandusun (dtp) Language


I've tested the tools you provided, and I appreciate that they support this minority language. However, they require significant improvement since many of the terms are not in the standard Kadazandusun, particularly the Bundu Liwan dialect. If you need any assistance in refining these tools, I'm happy to help! Thanks. Jjurieee (talk) 09:57, 6 September 2024 (UTC)Reply

I agree with your statement, and I also like to express my interest in contributing to the development and refinement of translation functions for the Kadazandusun language in the system. It would be a meaningful opportunity to help enhance the accuracy and accessibility of the language for a broader audience. Blusjai (talk) 04:19, 10 September 2024 (UTC)Reply
I'm with you on that. I am really looking forward to translating articles from other languages into dtp. I once tried to translate a Wikipedia article, but the lack of dtp translation tools and their inaccuracy made it difficult. Upgrading these tools might help us Kadazandusun people contribute more to Wikipedia. Nelynnnnn (talk) 05:09, 10 September 2024 (UTC)
I appreciate the development of these translation tools to assist the Kadazandusun language, which is crucial for promoting our linguistic heritage. However, I've noticed that some translations are not entirely accurate, particularly with terms that may not align with the Bundu Liwan dialect. I believe that with further refinement, these tools could become even more effective, helping Kadazandusun speakers contribute more accurately to Wikipedia and other Wikimedia platforms. I'm more than willing to assist in improving these translations to better serve our community. Rruunnaa (talk) 13:56, 10 September 2024 (UTC)Reply
The statement above highlights the importance of supporting minority languages like Kadazandusun, particularly the Bundu Liwan dialect, in technology tools. It acknowledges the current effort to provide such tools but points out the need for further refinement to ensure accuracy and inclusivity. Offering assistance to improve these tools reflects a proactive approach to preserving and promoting linguistic diversity in modern digital platforms 2405:3800:901:A396:30B0:4620:2C01:B592 03:12, 18 September 2024 (UTC)Reply
Hi all! Thanks for trying out the language model. After discussing with the team, I learned that the research and development of machine translation models for a specific language requires both linguistic and technical skills. There are many research papers and frameworks available on this topic, but providing guidance on how to proceed is beyond the scope of our team. However, we suggest that you consider reaching out to this project team developing a natural language toolkit library for bahasa Malaysia. The lead developer's email address is listed on their GitHub profile. Additionally, some universities in Malaysia are working on similar projects, and according to our senior engineer, there are research papers you could explore. User:SSethi_(WMF) 22:41, 18 September 2024 (UTC)Reply

There are hesitations in the Turkish Wikipedia


Due to its linguistic structure, Turkish is a language where special attention must be paid to suffixes during translation. In machine translation, this issue has not been resolved for a long time, leading to frequent encounters with nonsensical sentences. Although some AI-powered translation systems like DeepL handle this better, they also have significant shortcomings. For example, the translations I reviewed on Turkish Wikipedia are truly unsuccessful. Four out of every five sentences are meaningless. The remaining ones have flaws. Until a better translation engine is used, employing this feature on Turkish Wikipedia will create a rather poor impression in the eyes of readers. My personal opinions on this matter may be harsh, but all I want is a higher-quality encyclopedia. Therefore, before activating this feature, please hold a community vote on Turkish Wikipedia. Mahfuzat (talk) 17:43, 27 September 2024 (UTC)Reply

Proposal: Machine Translated Wikipedia using MinT


Great initiative! Good to see some innovation, improved independence, and harnessing of AI tech progress here. Please see meta:Community Wishlist/Wishes/Wikipedia Machine Translation Project. I think MinT could be used for that. Questions or feedback are welcome. Prototyperspective (talk) 21:30, 12 October 2024 (UTC)Reply

I have a question for contributors to MinT or people who know much about it:
  • Is it or will it be possible to specify 'low certainty of correct translation' for phrases & words?
It's one part of the proposed MTWP project site. 1. If a machine-translated article is adjusted by an MTWP contributor, other articles with very similar phrases (e.g. similar context & same word) would get flagged for review, if not automatically adjusted as well. Low certainty of correct translation could be used here: where the context is sufficiently similar, these may get auto-adjusted. 2. If a word is ambiguous and multiple meanings make sense to the model in the given context, it would set 'low certainty of correct translation' on the sentence (a type of flagging).
Secondly:
  • Please consider whether the proposed MTWP could be built as a successor project to "MinT for wiki readers" at some point (e.g. once translations reached a certain well-readable usually correct baseline quality): there are many benefits of this compared to just enabling readers to make use of auto-translation such as making these pages better findable, read more, linkable, correctable/adjustable, include translated media (eg redubbed videos or translated charts), & more. If the text of the proposal is too long you could just take a glance over it and I guess the image on the right summarizes it somewhat; more involvement would be needed. I don't think there's much that could be more useful to readers and increase Wikipedia readership/reading more than that (just widely available AI-based spoken Wikipedia audios with a new audioplayer may not be far behind).
Prototyperspective (talk) 16:52, 11 November 2024 (UTC)Reply
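One way the 'low certainty of correct translation' flag asked about above could be approximated is from the per-token log-probabilities that many neural translation models can report for their own output. The sketch below is purely illustrative: nothing on this page says MinT exposes such scores, and the function name, input shape, and 0.5 threshold are assumptions. It flags a sentence when the geometric mean of its token probabilities falls below the threshold.

```python
import math


def flag_low_certainty(sentences, threshold=0.5):
    """Return (text, confidence) for sentences whose confidence is below threshold.

    sentences: iterable of (translated_text, token_logprobs) pairs, where
    token_logprobs are natural-log probabilities of the output tokens
    (a hypothetical input; the real source of these scores is not specified).
    Confidence is the geometric mean of the token probabilities.
    """
    flagged = []
    for text, logprobs in sentences:
        if not logprobs:
            continue  # nothing to score
        confidence = math.exp(sum(logprobs) / len(logprobs))
        if confidence < threshold:
            flagged.append((text, confidence))
    return flagged


if __name__ == "__main__":
    scored = [
        ("confident sentence", [math.log(0.9), math.log(0.9)]),
        ("uncertain sentence", [math.log(0.9), math.log(0.1)]),
    ]
    for text, confidence in flag_low_certainty(scored):
        print(f"{text}: {confidence:.2f}")
```

Model probabilities are not calibrated measures of correctness, so a score like this would at best prioritise sentences for human review rather than justify automatic adjustment.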