GSoC Project Idea: Pronunciation Recording Extension

Identity

Name: Rahul Maliakkal
Email: rahul14m93 gmail.com
Project title: Pronunciation Recording Extension
Location: Ahmedabad, Gujarat, India.

Contact/Working info

Timezone: IST (UTC +5:30)
Typical working hours: Very flexible. I can work at any time between 14:30–21:30 UTC (20:00–03:00 IST) and can put in 5 extra hours on weekends.
IRC or IM networks/handle(s): Rahul21 (Freenode)
Time constraints: I want to be clear up front that I have a few time constraints to work around. My 6th semester examinations run from 22nd May to 31st May, so I will be a little less active during that time. I have prepared my schedule so that the main part of this project will be complete before September 8th.

I live in Ahmedabad, a metropolitan city with a 24/7 power supply and a reliable, uninterrupted Internet connection, so working online will not be hampered.

Introduction

Tracked in Phabricator
Task T48610

There is a thread on the mailing list requesting this feature.
In Wiktionary, many words have pronunciation audio files (.ogg) attached to them; these audio files help the user pronounce a word in a specific language. Words are pronounced differently in different dialects, which may be regional or ethnic. Example: the word "garage" is pronounced differently around the world [Garage].
Differing pronunciations may be described using phonetic representations (such as IPA), but are more readily understood when native speakers record the word as they speak it naturally.
The Wiktionary page of the word behaviour has a pronunciation attached to it.

Heteronyms such as 'minute' are two or more words which are spelt the same but have distinctly different meanings, and the difference is only made clear in the spoken language: 'minute' meaning very small (IPA: /maɪˈn(j)ut/) is distinct from 'minute' meaning a sixty-second measurement of time (IPA: /ˈmɪnɪt/).
But there are several words that do not have audio files attached to them. From a rough survey, I found that words used extensively in a particular discipline (e.g. medicine, mathematics) often lack audio files. Examples: aggravate, compendium

Ablepsia
Quadrilateral

Present Scenario

Record a word.
The pronunciation file is edited on the contributor's computer using audio software (Audacity, etc.) to remove excess silence and noise at the beginning and end, and often requires format-shifting to Ogg Vorbis (.ogg).
It is a manual and cumbersome process.
This often results in poor-quality files, as users may store the recording in one lossy encoding and then format-shift to another lossy format.
The entire process of uploading a pronunciation file is given here.
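
The manual trimming step described above can be sketched roughly as follows. This is only an illustration in Python, assuming the recording is available as a list of mono samples in the range -1.0 to 1.0; the function name trim_silence and the threshold value are hypothetical, and a real tool would likely use a more robust noise measure than a fixed amplitude cutoff.

```python
from typing import List


def trim_silence(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Drop leading and trailing samples quieter than `threshold`.

    `threshold` is an assumed amplitude cutoff, not a value from the proposal.
    Quiet samples in the middle of the recording are kept.
    """
    start = 0
    end = len(samples)
    # Skip quiet samples at the beginning.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Skip quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

Doing this automatically in the recorder would remove the need for the contributor to open a separate audio editor just to cut blank space.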

Mentors

Michael Dale is my primary mentor and Matt Flaschen is my co-mentor. Both of them have helped a lot in making the basic idea clear to me.

Deliverables

Required Deliverables

Since I plan to build my extension on top of the TimedMediaHandler (TMH) extension, I will first add .wav support to TMH.

Tracked in Phabricator
Task T34135

Record 5-second audio pronunciations via HTML5. I plan to use the getUserMedia API to access the microphone and the Web Audio API to access the raw audio data.
Fix some browser compatibility issues.
Implement the UI wizard flow to record pronunciations.
Store the recordings in .wav format on Commons.
Alternative Approaches:
- Transcode at the Wikimedia Labs Server.
  - This requires an instance that runs TimedMediaHandler and can have ffmpeg and ffmpeg2theora installed. A scheduling queue will be very important.
- Alternatively, use the Firefogg API to convert .wav to .ogg in Firefox.
Embed .ogg versions of the recordings in the respective Wiktionary pages using Template:audio.
Customize the style of the UI via CSS classes for the various skins available (e.g. Vector, MonoBook).
Create documentation, including a manual for future developers and pages describing the APIs used in detail.
Improve the documentation at a later stage in the project.
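
To make the "raw data to .wav" step concrete, here is a minimal sketch in Python using only the standard library. It assumes the raw data is a sequence of mono float samples in the range -1.0 to 1.0 and that the file is 16-bit PCM; the name encode_wav, the default sample rate and the choice of mono are illustrative assumptions, and the extension itself would do the equivalent in browser-side JavaScript.

```python
import io
import struct
import wave
from typing import Sequence

MAX_SECONDS = 5  # recording limit stated in the proposal


def encode_wav(samples: Sequence[float], sample_rate: int = 44100) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Enforce the 5 second cap by dropping anything recorded beyond it.
    capped = samples[: MAX_SECONDS * sample_rate]
    # Clamp each sample to [-1, 1] and scale it to a signed 16-bit integer.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in capped]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono (assumption)
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return buffer.getvalue()
```

Because .wav is uncompressed, storing it first avoids the lossy-to-lossy conversion problem; the .ogg version can then be produced once, from the original samples.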

If time permits

Expand the idea to record spoken articles.
Implement a rating extension to evaluate the quality of the words recorded.

Improvements beyond the scope of GSOC

Expand the idea for mass uploading of words.
Implement a noise-reduction audio filter to make the pronunciations clearer.

Simple workflow

The workflow basically consists of 3 steps:

A Record Pronunciation link is displayed on the Wiktionary page of a word that does not have a pronunciation file attached to it.
When the user clicks on the Record Pronunciation link a dialog box pops up. The dialog box basically consists of 4 parts:
1. The Recording Toolbar: A user-friendly toolbar that helps the user record pronunciations. It has "Record", "Stop", "Play" and "Reset" buttons, each of which is fairly self-explanatory. The toolbar is not shown in the snapshot; the words "Recording Toolbar" there will be replaced by a working toolbar. The user will get a maximum of 5 seconds in which to record the pronunciation.
2. IPA: This section consists of the IPA of the word that the user wants to record. It will assist the user in pronouncing the word correctly.
3. Choosing a License: Uploading a file to Wikimedia Commons requires a license. If the file the user wishes to upload is his/her own work, he/she can choose from a variety of licenses. When the user selects "This file is my work", the radio buttons for the 3 licenses are automatically enabled and the radio button for "This file is not my work" is disabled, and vice versa.
4. Upload Button: On clicking this button, the file is uploaded to Wikimedia Commons with a specific file name like en-minute.ogg. For a different etymology of the same word the file name will be en-minute-1.ogg and for a different language the file name will be fr-minute-1.ogg. The second time a pronunciation is recorded for the word "minute" it will be saved as en-minute-2.ogg.
The Success and Thank You Note: After the user clicks the "Upload" button, if the file is successfully uploaded to Commons, a dialog box confirming that the upload was successful is displayed. This dialog box also contains a small Thank You note, and when the user clicks the "Finish" button, the Wiktionary page automatically refreshes and the .ogg file is embedded into the page.
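
The file naming used by the Upload button could be sketched along these lines. This is a simplified reading of the scheme, assuming the numeric suffix is just the first unused counter for a given language and word; the function name next_file_name and the idea of passing in the set of existing file names are hypothetical, and the exact rule for etymologies and other languages would still need to be settled.

```python
from typing import Set


def next_file_name(lang: str, word: str, existing: Set[str], ext: str = "ogg") -> str:
    """Return the first unused name of the form lang-word.ext or lang-word-N.ext."""
    base = f"{lang}-{word}"
    candidate = f"{base}.{ext}"
    counter = 0
    # Keep incrementing the suffix until the name is not already taken.
    while candidate in existing:
        counter += 1
        candidate = f"{base}-{counter}.{ext}"
    return candidate
```

For example, with en-minute.ogg already uploaded, the next English recording of "minute" would be named en-minute-1.ogg, and the one after that en-minute-2.ogg.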

The workflow that I described is illustrated through a UI mockup. The word aggravate is taken as an example, since it does not have a pronunciation file attached to it.

Present Wiktionary page of the word aggravate
The "Record Pronunciation" link with the mic icon
Recording toolbar
A success note that appears on successfully recording a pronunciation
After clicking the "finish" button

Images 2, 3, 4, and 5 in the gallery above are mockups representing functionality that does not exist yet (such as the Record Pronunciation link).
I talked with Vibha Bamba from the WMF Design team about the UI layout, and she gave me useful suggestions that I will implement when I work on the UI during the course of the project. Her suggestions can be seen in this thread. Using the new mediawiki.ui module, we can have standardized, colored buttons in the appropriate skins.

Project Schedule aka Timeline

Before May 27th - Familiarize myself with the MediaWiki codebase and learn about the localisation/i18n framework.
May 27th to June 17th - Research my implementation idea thoroughly, get familiar with HTML5, gather resources for the coding period, and create a Wikimedia Labs instance account.
June 17th to June 23rd (Week 1) - Add .wav support to the TMH extension and have it reviewed through Gerrit.
June 24th to July 7th (Week 2,3) - Implement the Record API using HTML5. This is a bit tricky and will be time-consuming.
July 8th to July 14th (Week 4) - Finish the Record API and solve browser compatibility issues.
July 15th to July 28th (Week 5,6) - Work on the UI by implementing the WMF design team's suggestions.

So before the mid-term evaluation I will have a fully functional audio recorder.

July 29th to August 11th (Week 7,8) - Work on the backend and store the recordings on Commons.
August 12th to August 18th (Week 9) - Customize the style of the UI via CSS classes for the various skins available (e.g. Vector, MonoBook).
August 19th to September 8th (Week 10,11,12) - Use Template:Audio to embed the pronunciations into their respective Wiktionary pages. Test, fix bugs, improve the documentation and the UI, and clean up the code.
September 9th to September 22nd (Week 13,14) - Pre-deployment code review, plus a buffer period in case I fall behind schedule. I will also improve the documentation.

I will submit each significant feature to Gerrit for code review when it is completed.

Browser Compatibility

I plan to record the pronunciation using WebRTC and the Web Audio API, both available through HTML5, so browser compatibility is only a minor issue at the moment.
I have had conversations with developers from Firefox and Chrome, who told me that WebRTC adoption is moving quickly. Since their release cycles are fast and this tool will take about 6 months to be fully deployed, I do not expect browser compatibility issues by then.

Google Chrome

Right now, Chrome m27 (Beta channel) and Chrome m28 (Dev channel) support audio recording through the microphone.
Google Chrome Canary has supported audio recording through the microphone since m23, once the "Web Audio Input" flag is enabled via "chrome://flags".
Chrome Development Calendar

Firefox

Firefox v20 has a bug with audio recording, which is expected to be fixed soon. Link to the Bug Report
Firefox Release Dates

Internet Explorer

WebRTC support for Internet Explorer users has been tested through Chrome Frame in non-Metro mode.

Benefits

Audio recording will be fully supported by all browsers at some point, and the Wikimedia Foundation will have a tool to record pronunciations by then.
The primary benefit is laying the groundwork for adding contributor-created audio to MediaWiki sites from any current browser.
The secondary benefit is supporting the Wiktionary project's goal of being an educational resource, especially for learners of a second language.
Further benefits include creating a framework on which language learning can be based, collecting a corpus of tagged linguistic recordings suitable for research, and tools for creating oral histories.

Participation

The mantra that I follow is simple: "The more you ask, the more you learn".
I have frequent conversations with my mentor and co-mentor via IRC or email. While I'm using my laptop, I am always logged on to IRC and can be easily reached at #mediawiki, #wikimedia-dev, or by private message. My IRC handle is Rahul21. I frequently submit patches to Gerrit and I also plan to blog about my experiences with MediaWiki, especially this project, probably weekly.
I am also available for synchronous feedback via Skype and Google Hangouts.
The important thing I have learned about interacting via IRC or email is to have patience. While I am waiting for a reply from my mentor or the community, I generally sit back and analyze the problem, or start working on a different task until I get the reply I need.

My Previous Open Source Experiences

MediaWiki is the first open source organization in which I have participated actively.
I have worked on 14 bugs, and 9 of my patches have been successfully merged into the codebase. The best thing about bug fixing is that each bug brings out a new facet of the MediaWiki codebase, which has helped me a lot in learning its directory structure. Among the bugs I have hunted down, my favorite ones are:

bug 46514 - Merge the 'editwarning' component of Vector extension into core Patch
bug 34798 - Diffs on recent changes feed should have the same formatting of the on wiki diffs Patch

A complete list of my bug fixes and submitted patches: Patches Submitted

About You

I am a 3rd year Computer Engineering student at U.V Patel College Of Engineering. I love soccer and like to travel a lot.
“Never give up on what you really want to do. The person with big dreams is more powerful than the one with all the facts.”
I program in my free time and have a fairly good understanding of C, PHP, HTML and CSS, and can find my way around Java and Python.
Like every common Wiktionary user, I was surprised and disappointed when I looked for the pronunciation of certain words and did not find them. That is the simple motivation behind my project proposal. A few months ago I never thought I would be writing this proposal, but now I have the confidence to do it.

Acknowledgments

Discussion

Feedback, suggestions and discussions were held mainly on the Discussion Page of my proposal.
Inputs were also given in the comment section of the Bug Report.

Firstly, I would like to thank Michael Dale and Matt Flaschen for their constant monitoring. The UI mockup that I have prepared is the result of several suggestions that I received on the Discussion Page and of the valuable input given by my friends and college professors who are regular users of Wiktionary. A suggestion from a friend that I found particularly useful was to add the IPA of the word to the dialog box. After rigorous scrutiny, I picked the suggestions that would be user-friendly and efficient. Special thanks to Amgine, Sumana and Quim.

Extension:PronunciationRecording/GSoC 2013/Proposal
