Topic on Talk:Wikimedia Developer Summit/2017/Topic ideas

WikiDev17 topic: Multilingualism

Summary by RobLa-WMF

Positions so far:

RobLa-WMF (talkcontribs)
This is the text of the "Multilingualism" section of WikiDev17/Topic ideas as of this writing

How can we make our websites better support languages other than English (and character sets other than Latin)?

Doubling down on Machine Translation

  • Annotation service to record fine-grained translation correspondences between wikis over time (not just at the time of first translation)
  • Suggestion service to suggest new edits to wiki A when translated text on wiki B is modified (or vice versa)
  • Refactoring existing language converter pairs as (sometimes trivial) translation engines, e.g. Cyrillic-to-Latin
  • Building a translation engine in house, training it with translated wiki pages, improving it over time, etc
  • Tightly integrating the translation UX for everyone. More: one community wearing babel fishes / Less: scattered villagers after the Tower of Babel fell.
  • Improving harassment/vandalism/civility/inclusiveness/diversity mechanisms to handle these larger cross-cultural communities.
  • i18n of global pages, global templates, etc. May need mechanisms to allow translation of comments, for example.
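
To illustrate the "language converter pair as a trivial translation engine" idea from the list above, here is a minimal sketch, not MediaWiki's actual LanguageConverter. It assumes a hypothetical `translate(text, source, target)` interface and handles only Serbian Cyrillic to Latin; the variant codes `sr-ec` and `sr-el` are used as labels only.

```python
# Hypothetical sketch: a script converter exposed through a translation-style
# interface. The interface and function name are assumptions for illustration.

# Serbian Cyrillic to Latin, lowercase letters only; case is handled below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def translate(text, source="sr-ec", target="sr-el"):
    """Convert Serbian Cyrillic text to Latin script, character by character."""
    if (source, target) != ("sr-ec", "sr-el"):
        raise ValueError(f"unsupported pair: {source} -> {target}")
    out = []
    for ch in text:
        lower = ch.lower()
        latin = CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)  # not a mapped letter: pass through unchanged
        elif ch != lower:
            # Uppercase input: capitalize the first Latin letter only, so that
            # "Њ" becomes "Nj". All-caps words are a known limitation here.
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(translate("Београд"))
```

A real converter also has to deal with mixed-script text, exceptions, and the reverse direction, which is not a simple character map.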

Fora: translators-l, translatewiki.net

Strainu (talkcontribs)

I would like to remind you of my suggestion to think broadly here: don't limit the pool of potential participants only to people actively working on i18n in the software, but search for people reporting bugs and writing scripts as well.

Qgil-WMF (talkcontribs)

Yes, I totally agree. If you think someone should be at the Summit discussing these topics, please ask them to request an invitation / travel sponsorship. We are also trying to improve remote participation, which is a factor for any topic proposed but evidently useful for this one.

Qgil-WMF (talkcontribs)

@Runab_WMF there is clear interest in good coverage of i18n topics at the Summit. In addition to all the discussions around this topic, I can see how the translation / multilingualism aspect may plug well into two Wikimedia processes that surround the Summit: the Community Wishlist Survey (which we want to see with an increased participation of editors of non-English projects -- see Phab:T144074) and the Wikimedia Strategy movement discussion, which will start in January as well and will have an obvious focus on our diversity of readers, editors, and communities.

Now, in order to fulfill these expectations I think we need several things:

  • confirmation that the Language team will be well represented at the Summit
  • participation of volunteers and other stakeholders involved in translation and other aspects of i18n
  • ideally, a main topic pointing in some direction, in order to focus the attention of participants and proposals (for instance, is machine translation the thing to focus on? That would be better than "i18n".)
Runab WMF (talkcontribs)

@Qgil-WMF - to respond to the 3 points you make here:

  • The Language team has plans to attend the DevSummit (like all the past years), subject to budgetary (and associated) logistics being completed
  • A major disadvantage is that the conference location may not have many local MediaWiki i18n participants. However, a combination of the Language team, other WMF developers and other participants involved with i18n and l10n projects with overlapping interest may be able to move forward on some focused topics.
  • I would vote against a mammoth topic like Machine translation and particularly like 'building an in-house translation engine'. It would be a stretch on time and resources. It might actually be better to reach out to people we already collaborate with (like the Apertium project) and explore options that can be of shared interest, but I don't think the DevSummit would be a good place for that.

The Internationalisation wishlist is somewhat dated and possibly needs another review. Language support has so many different aspects, for so many user groups, even within MediaWiki, that I am somewhat unsure whether we can zero in on a main topic. Another option could be to have perhaps 2 running themes: a generic theme, and another that addresses a specific area of i18n (e.g. RTL), and then find individual main topics in these 2 groups. Any thoughts on this? Thanks.

Qgil-WMF (talkcontribs)

@Runab WMF deciding main topics is always somewhat unfair, somewhat inaccurate. Still, they can be useful. The idea is that the multilingualism main topic you define will help us reach out to the related contributors and ensure that they can participate in the Summit.

See the list of main topics proposed so far. I hope it helps define the main topic for this area.

In addition to the main topics, anyone will be able to propose more specific topics, at the very least as an Unconference session. By defining a main topic you are not ruling out possibilities to have other conversations at the Summit.

Runab WMF (talkcontribs)

I am inclined to base the language topic alongside the main topic of 'editorial collaboration', primarily because of the options to connect into this. To clarify, I am not trying to shoehorn an open topic into what the Language team is currently working on. However, given that for the past 2 years the primary focus has been on an editing tool, there have been several conversations (direct or incidental) during this entire period on things that could have been in a different state for the larger good. Templates are one such example. I am pinging @Amire80 about this. He had some thoughts on this topic from the last DevSummit.

Qgil-WMF (talkcontribs)

@Runab WMF, if I understand your reply correctly, there would be no main topic about multilingualism. You would submit multilingualism-related proposals under whichever main topics they correspond to, and of course you are free to propose specific topics in the Unconference context. Correct?

Qgil-WMF (talkcontribs)

Yes, correct. No main topic for multilingualism this time. Multilingualism-related proposals are welcome either in relation to main topics or in the Unconference.

Runab WMF (talkcontribs)

Is there a way we can mention this clearly on the main page for the topics, so that people are aware that they can submit language-related topics both for the summit and the unconference? Otherwise people may be looking for a main topic for language support and won't know for sure where to put them. Thanks.

Qgil-WMF (talkcontribs)

I think we should not go beyond main topics on the main page, otherwise we might get similar feedback about other areas in a similar situation. The best the Language team and anybody interested in multilingualism can do is to promote the Summit and encourage the submission of proposals around this topic directly. You work on a regular basis with developers, translators and multilingual users. Reach out to them directly and tell them about the Summit, otherwise they will not even be aware of this event and its main page.

Cscott (talkcontribs)

I will note that historically the summit has not had good attendance from those with experience or expertise in i18n topics. I think we'd need to make a big push w/ explicit invitees, scholarships, etc in order to overcome this.

Qgil-WMF (talkcontribs)

We want to change that history. In addition to putting more emphasis on flying more people working in this area to San Francisco, I wonder if we could reach out to people already in the SF Bay Area working on i18n (but not for Wikimedia) to join us. So many problems are so common, and the Bay Area has a huge pool of projects and experts (and some of them know Wikimedia very well).

Runab WMF (talkcontribs)

@Qgil-WMF This is actually a great idea but my worry is that it depends on whether this particular group of people will be available on those dates and later. Secondly, some from this group may not be active participants in Open Source projects or may not be in a position to share information about their work. The Language team used to host something called a Language Summit earlier, primarily for Indic languages, with a similar group. While it was supremely useful to get things rolling or even completed (given discussions started early), following up from where we left it off at the end of the summit was a challenge as people moved on to their own projects/work. Thanks.

Qgil-WMF (talkcontribs)

Also, @Nemo bis has pointed to Internationalisation wishlist 2014, but that is a long flat list, and I don't know how up to date it is now. As someone interested in multilingual wikis, I have tried to go through that list to find a main topic/theme, but I could not. Ping @Nikerabbit (I have already reached out to @Runab WMF).

Amire80 (talkcontribs)

My opinion is my own, but it's close to what @Runab WMF said: Internationalization is great, and machine translation in particular is great. Moar Free-Software machine translation is even greater, and Wikimedia should become an important player in this field some time in the foreseeable future.

And yet, it's a tad early to focus on this in 2017. Given the current resources, Wikimedia simply cannot go beyond the following two things:

  1. Supporting existing Free Software machine translation projects. This is already being done—see Meta:Grants:IEG/Pan-Scandinavian Machine-assisted Content Translation for a successful example. I hope to see more of this, but it's not really relevant for the dev summit.
  2. Talking about far-fetched dreams of how it should be some day. This was already done in the last dev summit. It was pretty, but not much has changed since then in terms of machine translation. It's quite possible (really!) that stuff will change next year and it will become a much more relevant topic for 2018.

Till then, however, there are a bunch of much more urgent things to clear out. Here are some examples:

  1. Support for cross-wiki templates. @Legoktm made an excellent presentation about his proposal for the technical implementation of global templates at dev summit 2016. There was nothing in his presentation about the user side, however: how will the templates actually be internationalized and localized? This is an immediately important topic given the current developments in ContentTranslation, in which the template support is being completely overhauled.
  2. Cleaning up the hairy mess around "standard" language codes and site name lookup. @Duesentrieb's SiteIdMapper project is one tip of this iceberg, and there are many more. Getting this rolling will unblock or outright fix a lot of other bugs in Wikidata, ContentTranslation, searching, etc.
  3. What next for interlanguage links? Their design has significantly changed on desktop and on mobile in 2016, and this is already making a positive impact, but more could be done to improve their design and make them even more accessible, and to improve their technical handling (e.g., a proper JS API for handling them on desktop instead of hacky jQuery selectors).
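
As a concrete illustration of the language code problem in item 2, here is a minimal sketch of normalizing a few Wikimedia-specific codes to standard language tags. The table is a small illustrative subset, and the function name `to_bcp47` is an assumption for this example; it is not the SiteIdMapper project or any existing MediaWiki API.

```python
# Illustrative subset only: a handful of Wikimedia codes that differ from the
# standard tag for the same language. A real lookup would need a maintained,
# complete table and would also have to map site names, not just codes.
NONSTANDARD_TO_STANDARD = {
    "als": "gsw",
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "fiu-vro": "vro",
    "roa-rup": "rup",
    "simple": "en",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def to_bcp47(code):
    """Return the standard tag for a Wikimedia code, or the code unchanged."""
    code = code.strip().lower()
    return NONSTANDARD_TO_STANDARD.get(code, code)


for wiki_code in ("zh-yue", "be-x-old", "fr"):
    print(wiki_code, "->", to_bcp47(wiki_code))
```

Even this small table shows why a single shared lookup matters: every tool that keeps its own copy of such a mapping can drift out of sync with the others.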

These internationalization topics immediately come to my mind for January 2017. They may be less attractive than machine translation, but they are much more immediately relevant. Resolving them will bring future machine translation projects within our reach. Two prerequisites for successful machine translation projects are engaging people who know different languages and getting a lot of parallel texts actually written. Content Translation and Wikidata are projects that make this much easier, and the three points above will make them run more smoothly.

Cscott (talkcontribs)

Amir and I talked about this at the editing offsite. I think we identified several near-term projects that would lay the groundwork for future machine translation:

  • Exporting interlanguage links in "apertium dictionary" format
  • Starting to collect part-of-speech information in wikidata for articles (thus, interlanguage links)
  • Exporting CX translation pairs in "moses training data" format.

Amir -- was there anything else I've forgotten? I think those three projects fit in well with the "far-fetched dreams" of what WMF *might* do with Machine Translation, and lay important groundwork.
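
For readers unfamiliar with the third item: Moses training data is a pair of plain-text files, one sentence per line, where line N of the source file corresponds to line N of the target file. The sketch below shows the shape of such an export. The function name and the input (a list of source/target string pairs) are assumptions for illustration, not the actual format of CX data.

```python
# Hypothetical sketch: turn (source, target) text pairs into the two
# line-aligned text bodies that a Moses-style parallel corpus consists of.

def to_moses_parallel(pairs):
    """Return (source_text, target_text) with one segment per line each."""
    src_lines, tgt_lines = [], []
    for source, target in pairs:
        # Each segment must occupy exactly one line, so collapse all
        # whitespace (including newlines) to single spaces.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # drop pairs with an empty side to keep lines aligned
        src_lines.append(source)
        tgt_lines.append(target)
    src_text = "".join(line + "\n" for line in src_lines)
    tgt_text = "".join(line + "\n" for line in tgt_lines)
    return src_text, tgt_text


english, german = to_moses_parallel([
    ("Hello  world.", "Hallo Welt."),
    ("", "orphaned target"),
])
print(english, german, sep="")
```

The two strings would be written to files sharing a stem and ending in the language codes. Note that this only reshapes data that is already paired; it does no sentence alignment.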

Nikerabbit (talkcontribs)

I don't think these are the projects that will help us to get to the glorious future.

To build a good machine translation system you need good data. I think interlanguage links are quite low quality in that regard. For collecting part-of-speech information, there is work going on towards Wikidata for Wiktionary that would make this task moot.

For exporting things in "moses training data format", I think that anyone familiar with Moses can convert our dumps to the correct format more easily than we can. In any case our data is not well aligned, so heavy post-processing is needed. Opus is one possible platform that could do that processing and distribution.

Improving our translation memory engine, which is not necessarily more complicated than the proposed projects, so that the engine could be integrated to ContentTranslation, would provide benefits to languages where machine translation does not yet exist, in addition to improving all our existing translation processes where translation memory is already in use.
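
For context, a translation memory stores previously translated segments and suggests them when a similar source string comes up again. The following is a minimal sketch of that idea using fuzzy string matching from the Python standard library. It is not how the Translate extension's translation memory is implemented; the class name, threshold, and scoring are assumptions for illustration.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory translation memory with fuzzy lookup (illustrative only)."""

    def __init__(self):
        self._units = []  # list of (source, target) pairs

    def add(self, source, target):
        self._units.append((source, target))

    def suggest(self, text, threshold=0.7, limit=3):
        """Return up to `limit` (score, source, target) tuples, best first."""
        scored = []
        for source, target in self._units:
            # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
            score = SequenceMatcher(None, text.lower(), source.lower()).ratio()
            if score >= threshold:
                scored.append((score, source, target))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


tm = TranslationMemory()
tm.add("Save page", "Seite speichern")
tm.add("Delete page", "Seite löschen")
for score, source, target in tm.suggest("Save the page"):
    print(f"{score:.2f} {source!r} -> {target!r}")
```

A linear scan like this does not scale to a large memory, which is why a production engine needs an index; the sketch only shows the lookup behaviour a translator sees.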

Legoktm (talkcontribs)

I personally would like to see anonymous users being able to set an interface language (primarily for multilingual wikis), which we can probably now do with varnish+xkey.

And I think making Translate+VE or even Translate+wikitext work nicely would be awesome, and fix a lot of technical debt as well. It would also maybe be a decent summit topic because it would require support from different teams (Language, VE, Parsing, etc.). And every person who edits mw.o would love it. (But that would require Translate work getting resources, and I don't think that's been the case lately...).

And last idea for tonight would be librarizing MW's awesome i18n code. The mission of the Wikimedia movement requires all knowledge to be accessible to all people, which requires i18n. And creating a solid set of PHP and/or JS libraries that are battle tested and work well for such a large project would be pretty valuable in service of that goal. This would require quite a bit of planning, untangling, and refactoring to become a reality.

SSastry (WMF) (talkcontribs)

I am going to +1 @Amire80's observations about immediacy + @Legoktm's proposal about Translate.

Specifically, Niklas and I discussed Translate + VE sometime in 2014/2015, but at that time, as far as I remember, it was meant to be VE-only support, and we weren't sure how palatable that would be. If that is not going to be an issue, that is one way to go. However, it is possible that we can come up with good solutions for wikitext, but yes, that requires looking at existing Translate use cases (which I am somewhat unfamiliar with) and working through the details. We might be able to do some of this work at the Editing offsite and come up with proposals for further discussion at the Dev Summit.

Nikerabbit (talkcontribs)

I can be brief since many of my thoughts have already been said above.

My wishlist from 2014 is of course a little outdated, but I would consider it a good starting point when we think about language support from a wide angle.

From my perspective there are three things which I would consider high priority, but it's easy to classify them as technical debt which makes it harder to argue for them:

  1. Interface language selection for anonymous users. This has been blocked for a long long time.
  2. Alternative wiki page translation mechanism for Translate that works with VE and Parsoid and does not make the wikitext hard to edit.
  3. Improvements to translation memory. Currently it requires a special plug-in and does not scale in any direction. It handles paragraphs poorly, it doesn't always find results and it doesn't even support multi-dc configuration for increased availability.

I would be happy to propose (1) for discussion to reach consensus and find the ways to achieve it. For (2) I gather there is already consensus and some ideas how to start, but scheduling and resourcing such multi-team work is hard.

I am against taking a deep dive in building machine translation software at the foundation, when we cannot even take care of the simpler task of a translation memory. At most we could discuss in what other ways and places we could use the machine translation services currently used in CX and Translate.

Finally, librarization of i18n code is still a good thing to do, but I think we missed the train where we could attract other users of those libraries. PHP and jQuery already have multiple i18n libraries: jquery.i18n, which is pretty much a librarized version of MediaWiki code (but we never got MediaWiki ported to use jquery.i18n), has seen, in my opinion, quite limited use.

Oops, that wasn't very brief after all :)
