Open Source Language Summit - November 2013

Schedule: http://open-source-language-summit-2013.shdlr.com/grid
Twitter: Hashtag #languagesummitpune
IRC: #mediawiki-i18n on FreeNode

Day 1

Session: Input Methods on VisualEditor (includes jQuery.ime integration)

Notes [1]

David Chan led the session and started with round-table introductions from everyone
Santhosh introduced jQuery.IME and explained what it is for and why it was built
David outlined how bug filing helps (the importance of very specific version numbers, the exact keystrokes that trigger the IME, and the expected and observed behaviours) and the problems facing comprehensive IME support
David demonstrated the EventLogger system capturing IME input event streams, giving a detailed run-through of several IMEs and the events that they can create
David showed the draft automated IME testing framework he has built for VisualEditor and explained his intention to build a library of as many languages, IMEs and OSes as possible to test them.
Santhosh discussed how jQuery.IME can help simplify the needs in VisualEditor because it doesn't operate in a different way in each script/browser/OS
Santhosh demonstrated problems like multiple different conflicting numbers (e.g. cursor positions vs. key strokes vs. Unicode code points vs. backspace positions)
Santhosh returned to the reasons why IME difficulties are an issue for VisualEditor, due to the need to do non-native programmatic management of the contentEditable surface to support generated content blocks like images or templates
Pau asked about the relative value of on-screen keyboards, predictive typing, spell-checking, handwriting recognition etc.
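
The mismatch between the different counts that Santhosh demonstrated can be reproduced in a few lines of Python. This is only an illustrative sketch: the sample syllable and the helper name `describe` are assumptions chosen for this example and were not shown at the session.

```python
import unicodedata


def describe(text: str) -> dict:
    """Return several different 'lengths' of the same piece of text."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
        "names": [unicodedata.name(ch) for ch in text],
    }


# Devanagari KA + VIRAMA + SSA + VOWEL SIGN I, normally displayed as one syllable
sample = "\u0915\u094d\u0937\u093f"
info = describe(sample)
print(info["code_points"], info["utf16_units"], info["utf8_bytes"])
for name in info["names"]:
    print(name)
```

A reader perceives a single syllable here, while the text holds several code points. The number of keystrokes needed to type it, and how much a single backspace removes, depends on the input method and platform, which is why these numbers rarely agree.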

General discussion about possibilities and requirements from Indic scripts

Particular requests for VisualEditor
Support for native IMEs – especially for users with Windows as their OS
In-built IME in VE (e.g. expectations of auto-convert on space/save)
Auto-completion based on dictionaries

Volunteer language experts for Indic languages

Samyak Bhuta, for Gujarati, samyak.bhuta @ gmail dot com
Vijay, for Marathi, Hindi, Sanskrit, Nepali, Ahirani, mahitgar at yahoo dot co dot in

Retrospective: (David)

Good mix of participants (technical and non-technical Wikipedians, OSS contributors)
Brainstormed about handling complexities of input tools for Indic languages, trapping keystrokes, event ordering, DOM model, event logger tool
Log submission now available, please contribute! URL: http://tinyurl.com/imelogform
URL: https://bit.ly/ve-eventlogger, https://bit.ly/ve-imefeedback
- Submissions for Indic language IMEs are especially welcome
OSKs vs Latin keyboards advantages/disadvantages
Learnt a lot about Indic languages - bilingual usage, code switching, switching across languages, issues around IME usage
Santhosh - identifying the problem definitions, patches in progress
Abhijit - highlighted cross-browser, cross-platform differences; working with original core developers
David - in developing for ibus - why are event sequences different? may not be possible for languages (HPN)
OSK - T9 input optimized for mobile usage - standardized (Hari)

Session: Cross project coverage for basic language support components

Notes:

Showcasing the Language Coverage Dashboard

Desktop Support Requirements: (Pravin Satpute talking about the Fedora world)

Character Encoding
Fonts
Shaping Engines
Input Methods
OS Level Support
Locale Definition (CLDR)
Minimum Criteria for Language Support
If an ISO code does not exist, the language cannot be used on the desktop

Desktop Enhancements

Plan to check the language coverage in WMF projects for standardized ISO-recognised languages and assess coverage for Desktop language support

Retrospective (Runa)

Overview of what the GSoC team developed, features developed and demo'ed, plans for future visualizations and features
Fedora desktop support features as use case for LCMD (ISO-less languages are not handled for desktop)
Extending for Fedora desktop
Suggestion from Hari Nadig - data from LCMD can be used through a MediaWiki extension for Indic language Wiki projects to show some stats for Indic language projects (which is being developed)
Will evaluate for Fedora desktop and implement (next step)
Is there an option to contribute instead of forking CLDR?
CLDR - contributing to it instead of forking - experts should review

Session Name: FUEL Sessions

During today's Language Summit (18th November, 2013), we discussed the existing FUEL-colors module. It was observed that the current one is not very definitive, and we came up with the following points:

we will follow the list of colors given in http://www.w3.org/TR/css3-color/
we will be creating two modules, fuel-colors-basic, fuel-colors-extended.
fuel-colors-basic: http://www.w3.org/TR/css3-color/#html4
fuel-colors-extended: http://www.w3.org/TR/css3-color/#svg-color
This is just a proposal. If you have any issues or suggestions, let us discuss them here.
We will be closing this discussion probably by 30th of November and of course we can extend this date, if the discussion is prolonged.
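
As a rough illustration of the proposed split, the basic module could start from the 16 HTML4 keywords listed in the CSS3 Color specification. The names `FUEL_COLORS_BASIC` and `untranslated` and the sample translation table are hypothetical; only the keyword names and hex values come from the W3C list.

```python
# The 16 basic color keywords (HTML4 set) from the CSS3 Color specification.
FUEL_COLORS_BASIC = {
    "black": "#000000", "silver": "#C0C0C0", "gray": "#808080", "white": "#FFFFFF",
    "maroon": "#800000", "red": "#FF0000", "purple": "#800080", "fuchsia": "#FF00FF",
    "green": "#008000", "lime": "#00FF00", "olive": "#808000", "yellow": "#FFFF00",
    "navy": "#000080", "blue": "#0000FF", "teal": "#008080", "aqua": "#00FFFF",
}


def untranslated(translations: dict) -> list:
    """Return the basic color keywords that have no localized name yet."""
    return sorted(name for name in FUEL_COLORS_BASIC if not translations.get(name))


# Hypothetical, partially filled translation table for one locale (Hindi).
example = {"red": "लाल", "white": "सफ़ेद"}
todo = untranslated(example)
print(len(FUEL_COLORS_BASIC), len(todo))
```

Keeping the keyword and hex value fixed and localizing only the name would fit the idea of reusing an existing list instead of inventing a new one.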

FUEL - Translation Quality Assessment Matrix:

Translation Quality Assessment Matrix (TQAM) is a first matrix to assess translation quality that is available under an open license.
The participants broadly agreed that it is helpful for a translation team, community, translator or editor.
Received some suggestions related to the UI of the TQAM

Retrospective: (Siebrand) (on all 3 sessions on FUEL through the day)

What is FUEL - Siebrand provided a short blurb on what FUEL is
Objective to make localizations more consistent
There are 3 collections currently and 3 in progress (color, number, date/time)
Colors discussion:

250 colors taken from Wikipedia categories were reviewed, (xkcd ref: 15 instead of 14 standards - yet another standard?)
Instead of trying to create a new collection, reusing is better - looked at W3C CSS standard - 131 colors
Cultural bias in defining collection colors? Do we need to remove this cultural bias?
Or have the standard changed?
Name of the color should be localized not re-invented

DTTM discussion:

Interesting discussion, CLDR has a few flaws - have to pay money to vote on what goes into the collection
Paid members choose from contributions
FUEL strategy is to create a new standard; inconclusive discussions
Few options on the table:

Fork CLDR
Work with CLDR and find ways to collaborate
Create a competing standard

Will be discussed on mailing list - progress - as to what to do next and how; Siebrand will send this email

Number discussion:

List was created with 1-100, ordinals
Part of this collection was out of scope for FUEL
Translators localizing numbers may not be useful
Ordinals could be used as adjectives, so it is not as easy as it looks
Rajesh will send an email to the FUEL mailing list and then decide how to fulfill those functional requirements

Session Name: Keyboard layout Images for documentation of input methods

Notes

Latest inscript2 keymap images are captured and saved at [2]

Languages that Need Help Documents: [3]
Image Generation script: Python script that takes keymap filename input and shows mappings in UI.

This works only for 1:1 mapping keymaps
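
One way to read the 1:1 restriction is that each entry maps a single key press to its output, so entries that need a key sequence cannot be drawn on a one-key-per-cell layout image. The sketch below separates the two kinds. The function name, the dictionary format and the sample entries are all hypothetical and are not taken from the actual script.

```python
def split_keymap(keymap: dict) -> tuple:
    """Separate single-key entries from entries that need a key sequence."""
    one_to_one = {keys: out for keys, out in keymap.items() if len(keys) == 1}
    sequences = {keys: out for keys, out in keymap.items() if len(keys) > 1}
    return one_to_one, sequences


# Made-up sample keymap: two single-key entries and one two-key sequence.
sample_keymap = {"k": "क", "K": "ख", "kh": "ख"}
drawable, skipped = split_keymap(sample_keymap)
print(sorted(drawable), sorted(skipped))
```

Only the `drawable` entries could be placed directly on a keyboard image; the `skipped` ones would need separate documentation.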

Retrospective notes: (Parag)

Documentation done but to be uploaded
TWN - feedback can be provided through Ask a Question?
WMF - wikis - there is a huge problem directing these user questions, no centralized system to process comments from users (Siebrand)
We do not have a specific feedback method at the moment other than the talk page. (Pau)

Session Name: Leveraging content translation platforms for Indic languages

Notes

Microsoft Research
Translation platform demo
Discussion on various content translation components in MSFT and Google
Web based data is key to training MT engines

Session Name: Updating Lohit2 fonts to conform with the new Open Type spec for Indic scripts

Notes

Presentation on the idea behind Lohit2 (http://pravin-s.blogspot.in/2013/08/project-creating-standard-and-reusable.html)
In-depth discussion of the Adobe glyph naming guidelines and problems
Demonstration of Kannada work done by Aravinda (https://aravindavk.in/blog/improving-kannada-fonts/)
Sneha presented the process followed for Lohit2 Devanagari, Gujarati
Santhosh presented the GSoC automated testing project.

Retrospective (Sneha)

Session on Lohit 2 improvements
Adobe glyph list - clarifying doubts
Aravinda - talked about Kannada block - script specializations
Sneha - Walkthrough of development process for Lohit
Santhosh - walked through automated testing process

Session Name: Packaging fonts

Notes

Fonts available in Debian
Fonts available in Fedora
Package as many fonts as possible in Debian, Fedora and other distributions so that they are not loaded as 'webfonts' (61 fonts in the repository) when a user is accessing Wikipedia pages.
Debian/Fedora

Compare fonts in ULS, Debian and Fedora (see links above).
Package missing fonts for Debian/Fedora.
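
The comparison step amounts to a set difference between the ULS font list and what a distribution already packages. A minimal sketch follows, assuming both lists are available as plain font names. The function names and sample lists are hypothetical, and real package names usually need a more careful mapping than this simple normalisation.

```python
def normalise(name: str) -> str:
    """Lower-case a font name and drop spaces and hyphens for loose matching."""
    return name.lower().replace(" ", "").replace("-", "")


def missing_fonts(uls_fonts: list, distro_fonts: list) -> list:
    """Return the ULS fonts that have no match in the distribution list."""
    packaged = {normalise(name) for name in distro_fonts}
    return sorted(name for name in uls_fonts if normalise(name) not in packaged)


# Hypothetical sample data, not the real repository contents.
uls = ["Lohit Devanagari", "Tharlon", "Phetsarath"]
distro = ["lohit-devanagari"]
to_package = missing_fonts(uls, distro)
print(to_package)
```

Each name in the result would be a candidate for a packaging request in the distribution's bug tracker.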

ULS

Write automated 'New upstream' check for ULS.
Update to new upstreams: https://gerrit.wikimedia.org/r/#/c/96008/

Fedora bugs filed:

https://bugzilla.redhat.com/show_bug.cgi?id=1031587 (tharlon-fonts)
https://bugzilla.redhat.com/show_bug.cgi?id=1031588 (phetsarath-fonts)
https://bugzilla.redhat.com/show_bug.cgi?id=1031603 (tuladha-jejeg-fonts)
https://bugzilla.redhat.com/show_bug.cgi?id=1031569 (cdac-sakal-marathi-fonts)

Debian bugs filed:

http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=languagesummitpune2013;users=debian-in-workers@lists.alioth.debian.org

Retrospective (Kartik)

Packaging consistencies across Fedora, Debian, Wikimedia
61 fonts in ULS repo - checked in Fedora or Debian - if missing adding these fonts to Fedora and Debian - Vasudev
Aksharyogini and Sakal Marathi and Meera Tamil are getting added
Defining mechanisms to maintain fonts: how can we automate the process? (Kartik will work on this)
Fedora and Debian have mechanisms to check automatically

Session: Identify and document the sources of free licensed bilingual dictionaries

Notes

Mediawiki Page
Are there any free licensed bilingual dictionaries?

freedict: follows a client/server model using the 'dictd' protocol.
freedict is only available for Hindi (among Indic languages).
Artha: http://artha.sourceforge.net/

APIs

No 'well defined' Wiktionary API: will take many months to have it with wikidata.
Write or use API where it can be available.
GujaratiLexicon.com API: Kartik/Samyak to work.

Retrospective

Created a useful document on mediawiki.org

Session: Q & A with Behdad Esfahbod: State of the Union: Harfbuzz - Font rendering for Chrome, Android

Santhosh: How to do testing better for Indic scripts
Windows - testing is key - Behdad tests all bug fixes on Windows
OpenType support on iOS
Apple has full OpenType support now (Google is more competitive with Microsoft than Apple is, since it wants feature compatibility)
Firefox has been shipping with Harfbuzz on every platform
Jonathan Kew is working on testing infrastructure for Windows
Webfonts - my vision for the web - a font file should run on every web browser
Mobile web fonts - Noto was designed to support this use case
OpenType spec for Google Noto fonts

Day 2

Session Name: Indic Font Specification

Notes Github repo

Status

Slow progress so far due to being an open call for participation
Script ownership needed

Process:

Add documentation to GitHub stating that you don't need to know LaTeX to contribute; doc formats work, but wikis don't work for illustrations
Regular process sync-up needed
A new mailing list will help

Participants
Kannada

Aravinda VK
Hari Nadig

Malayalam

Santhosh

Devanagari

Alolita
Kartik
Ravi Pande

Gujarati

Kartik

Bengali

Runa
Alolita

Gurmukhi

TBD

Tamil

TBD

Telugu

TBD

In the 'General Section' add:

Jargon - Conjunct formation rules
Controversies on ligature usage in each language
Sans vs sans-serif (equal stem width vs sans-serif)
Alternate styles are being examined
Italics are alien to Indic scripts

Other sections needed:

Glossaries of terms from each language (e.g. virama, halant, pulli...)

Next steps:

Collaborative efforts with IITB, NID etc

Devanagari - TDIL recommendations [4]
Reference docs:

http://www.w3.org/TR/jlreq/

Can all who contributed/prospective contributors be invited to join the github repo?

Retrospectives

Session Name: Autonym Font

Notes

Github: https://github.com/santhoshtr/AutonymFont
Santhosh walked through the Autonym font, which is used for displaying language names
460 characters currently
Andrew Cunningham - contributed patches
List of languages is from CLDR
Wikipedia supports 287 languages; 300+ for other wiki projects
Use case does not need punctuation (fallback to system font)
Scaled for consistent height and width
Tests need to be completed for autonym font
Legacy systems cannot handle hinting (e.g. Windows XP)
Full parentheses - different code point
Source code available on github
Issues open on autonym font
Cross-browser, cross-platform testing
Maintaining size optimizations by not defining rendering rules
Varying stem width - serif
Uniform stem width - sans-serif
Monospace - doesn't exist in Indic languages

Session Name: Content Translation UI prototype testing session

Notetaker: Jared Zimmerman

Notes

It's a tedious process to translate articles from English to Gujarati. I use Google Translate as a means of getting the gist, but have to translate manually
I open two windows and manually translate between the two
I use Google Translator Toolkit; they have a link to pull in the Wikipedia article automatically, with a split-screen interface and translation suggestions. It's better than nothing, but could be better. One thing that is nice about it is that you can collaborate with other users at the same time. (Pau) Do you use that?(/Pau) Yes, I have.
"For new (non-English-speaking) users of Wikipedia, translating should be one of the easiest tasks you can do"
(?) if linked article does not exist, would it be better to red link vs linking to original language article or wikidata item(?)

Testing Sessions

P1 - Entry-point discovery

translation entry-point isn't obvious. (I (P1, Nayan) was searching the top right corner for a "translate" button/link.)
would copy the name of the article to google the title to find a version in his language, and if it didn't exist, start translating it
(?) more obvious translation entry-point
"honestly I've never noticed the language list"

Amir: It's anecdotal, but lots of people say this

If I intend to translate it into my language I'll first go search for that version in my language before trying to create it
The call to action was clear; the user's inclination was to translate from scratch rather than use English as a starting point: "It's easier to translate from scratch, since the language is so different, it reads so differently in Kannada"

Amir: The prototype is built only for Dutch (Nederlands), so it's less discoverable for people who don't speak Dutch.
Amir: The box that opens when the red interlanguage link is clicked mixes an English message with the Dutch autonym ("This page is not translated to Nederlands"). We may consider doing this whole box in the target language. Jared: or both languages?

P2 - Translation Dashboard

understands the buckets of in progress, completed, etc
(?) why show the same language as both an origin and a target language
unsure why he would change the title of the article on the creation screen
Hard for user to understand exactly what's going on with translation variations (because prototype is in Dutch)
user understands general principle that there are options that the system is showing him
Why do some words have translation and some have "word information" I want both types of information for all the things that I would click on
when interacting with interwiki links, unclear what "Paste source link" is
(?) perhaps expand interwiki link action rather than hiding them in a dropdown

P1: When you select words on the translated text it doesn't highlight the text in the original text?

P3 - Translation workflow

(?) "Add translation" should be in destination language? or both

Amir, technical comment: The prototype is probably made for Chrome. The "Add translation" button is displayed incorrectly in Firefox. However, Chrome has issues with rendering Bengali correctly.

is this an automatic translation? where is it from
What is the source of the automatic translation
Pau enables input methods
user adjusts auto-translated text
Pau: Can you tell what part you've manually translated?
Once user has scrolled down it wasn't immediately obvious what percentage of the article that she'd translated/manually translated
Subtle progress bar was not noticed, and the color difference between auto and manual was not noticed
(?) showing the yellow warning box might be a little annoying for users who want to start with all auto-translated text and then rewrite from that point
(?) maybe show little flyouts when the translation percentage changes with the number
P1 : I'll probably only translate articles that I have some familiarity with
P1 : I saw the progress bar but it's not that noticeable, perhaps have it be the full screen width

Groups Brainstorm

Would want to see side by side translation from different services

show all available services in right sidebar, including the one that is already displayed as the proposed translation (with the default highlighted)

replace digits (numbers)

seems like this might be an issue with the translation provider?
user will likely have to translate these manually for now (since they aren't automatically translated by service)

Amir: the left and right column don't synchronize properly(?)
for interwiki links in target language (redlink/remove/source lang/wikidata) consensus seems to be between remove or red

special syntax for redlinks to wikidata (greylinks?)

Use untranslated links from articles that I've translated as suggested articles that I can translate, since they are already likely to be in my interest area.
Collaboration : we created a "translation drive" with the articles that we were currently translating with Google Translator Toolkit; people would claim individual articles to contribute to.

there were few (no) instances of people actually collaborating on a single article together.
"I don't know if my translation is good enough to be published"

Will this interface be used for translatewiki.net or vice versa? (No, not right now this is optimized for long form content with links, not short interface strings like translatewiki.net)
Provide corrected information back to translation services as a means of convincing them to provide translations to us

Retrospective

Session Name: Onscreen Keyboards

Notetaker: Amir

Notes:

Pau and Praveen are having a 6-8-5 session, sketching ideas for keyboards, auto-completion, spelling etc.
Nayan showed a sketch for a transliteration typing tool. Hari Prasad Nadig showed an existing implementation of a similar idea on Mac.
Amir: a context menu to share additions to the spelling checker to a network dictionary (Wiktionary, Wikidata, OmegaWiki, something else, whatever). Crowdsource dictionary building. Comments:

How to know which language? (A: by the lang attribute)
How to check if it's correct? (A: A maintainer is needed.)
Suggestion: Automatically add to the local dictionary.

Retrospective

Session Name: FUEL Sessions - Demo: Translation Quality Assessment Matrix

Notes

Retrospective

Session Name: Rendering of fonts on mobile apps

Notetaker: Amir

Notes:

There are many more mobile browsers than desktop browsers.
CSS creates problems.
Japanese may become vertical without a reason when it's supposed to be horizontal.
Webfonts:

Works fairly well on desktop.
Has an initial implementation for mobile.
Uses the HTML lang attribute.

Webfonts may be very heavy on the bandwidth.
Identifying fonts on the client:

Render a name and measure the result size: if it's tofu, it will have the same expected size. If it's different, then it works.
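
The size comparison described above can be sketched as follows. The `measure` callback stands in for whatever the client uses to measure rendered text and is an assumption, as are the function names, the font names and the fake measurements. The sketch only shows the comparison logic, not a real rendering API.

```python
from typing import Callable, Tuple

Size = Tuple[int, int]


def font_renders_text(measure: Callable[[str, str], Size],
                      text: str, font: str, tofu_font: str) -> bool:
    """True when `text` measured in `font` differs from its known tofu size."""
    return measure(text, font) != measure(text, tofu_font)


def fake_measure(text: str, font: str) -> Size:
    """Stand-in measurer: only 'GoodFont' has glyphs, others draw 10x16 boxes."""
    if font == "GoodFont":
        return (7 * len(text), 18)
    return (10 * len(text), 16)


name = "ગુજરાતી"
print(font_renders_text(fake_measure, name, "GoodFont", "NoGlyphFont"))
print(font_renders_text(fake_measure, name, "OtherFont", "NoGlyphFont"))
```

A font whose real glyphs happen to measure exactly like the tofu boxes would be misreported as missing, so this kind of check is a heuristic and not a guarantee.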

Retrospective

Session Name: "Lohit-ising" Open type fonts

Notetaker: Kartik

Notes:

Standardise the Indic fonts according to Lohit2.
Example fonts: Samyak, Sakal Marathi.
Follow AGL, Unicode specification.
Standardise the glyph names. Discussions: Ravi, Pravin, Aravinda.
Query to lohit-dev mailing list where glyph names differ.
Pravin/Sneha: No awareness around AGL.
Kartik: Kalapi has unused/extra glyphs which can be standardised according to Lohit2 Gujarati glyphs.
Standards v/s Typography discussion.
TDIL Standard Devanagari Script Behaviour document can be used as reference.
Ravi: AGL is a specification; it is not a standard yet.
Recap of steps by Pravin.
Testing is ongoing for Beta Lohit fonts.
Gujarati Lohit 2 is in Alpha stage, but can be used as an example.

To-do

Pravin to blog about the 'steps' for Lohit-ising the OT fonts in detail, although Sneha/Pravin's blog post contains the background and needed information.

References

Session Name: ibus-typing-booster - predictive text typing system

Notetaker: Pravin Satpute

Notes

Anish and Pravin presented the idea behind ibus-typing-booster, its features, and how it will be helpful over time.
During testing we got around 4 bugs from the audience.

Retrospectives

Session Name: Fedora SIG - UI Source Message Contextualization

Notetaker: Nilamdyuti Goswami

Notes:

The session was about the need for contextualization in the source strings of applications, to ensure that translations are correct and convey the message correctly.
It was presented by Shankar Prasad, who showed situations where context is genuinely needed in the source strings and also how context is added to the source code.
Siebrand gave some useful advice on how to write context.
The group formed for source string contextualization is named SSCG (Source String Contextualizing Group)
Fedora SIG Page

Wikimedia Language engineering/Pune LanguageSummit November 2013/Event Notes
