Chemical Markup support for Wikimedia Commons


Public URL
https://www.mediawiki.org/wiki/Chemical_Markup_support_for_Wikimedia_Commons
Issue tracker
Maniphest (a "Phabricator" application)
Issue tracker for Extension
MediaWiki Bugzilla
Task board
Project Board (a "Phabricator" application)
Repo
Gerrit WM
Bugzilla report
bugzilla:16491
Announcement
wikitech-l, commons-l

Name and contact information

Name
Rainer Rillke
Email
<lastname>@wikipedia.de
IRC or IM networks/handle(s)
rillke
Web Page / Blog / Microblog / Portfolio
https://commons.wikimedia.org/wiki/User:Rillke
Resume (optional)
http://osrc.dfm.io/rillke
Location
Germany
Typical working hours
08:00 - 20:00 UTC; Tue, Wed, Thur: 08:00 - 16:00 UTC

Synopsis


Wikipedia articles covering chemical reactions or chemical compounds are often illustrated with SVG graphics showing chemical equations or compounds. However, SVG is a graphics format: it stores shapes, not atoms and bonds. It is therefore not easy to re-mix these files, and one has to draw the whole compound again (or pull it from a database). A common scenario: "Quack" starts an article about a compound and "Cheming" wants to add how to synthesize that compound. "Cheming" has to redraw the whole compound.

Goals

Server-side support

Allow uploading of MDL molfiles and implement rendering for them. The format has a published specification, is human-readable and is commonly used.

“The molfile is sufficiently common that most, if not all, cheminformatics software systems/applications are able to read the format, though not always to the same degree. It is also supported by some computational software such as Mathematica.” -en:Chemical table file

Client-side JavaScript editors for creating molfiles in web browsers are available, as are molfile-to-image converters for the server side.
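
To illustrate why the fixed-column layout makes molfiles easy to check on upload, here is a minimal Python sketch that reads the counts line, atom block and bond block of a V2000 molfile. The function name `parse_molfile`, the returned dictionary layout and the sample data are hypothetical; the sketch ignores property lines, charges, stereo flags and V3000, so it is not a full validator.

```python
def parse_molfile(text):
    """Parse header, atom block and bond block of a V2000 molfile.

    Illustrative only: ignores "M  ..." property lines, charges,
    stereo flags and the V3000 format.
    """
    lines = text.splitlines()
    if len(lines) < 4:
        raise ValueError("molfile needs 3 header lines plus a counts line")
    counts = lines[3]
    if "V2000" not in counts:
        raise ValueError("only V2000 molfiles are handled in this sketch")
    # Counts line: atom count in columns 1-3, bond count in columns 4-6.
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atom_lines) != n_atoms or len(bond_lines) != n_bonds:
        raise ValueError("file is shorter than the counts line announces")
    atoms = []
    for line in atom_lines:
        # x, y, z take 10 columns each; the element symbol is in columns 32-34.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in bond_lines:
        # First atom, second atom (1-based) and bond type, 3 columns each.
        a, b, order = (int(line[i:i + 3]) for i in (0, 3, 6))
        if not (1 <= a <= n_atoms and 1 <= b <= n_atoms):
            raise ValueError("bond refers to a non-existent atom")
        bonds.append((a, b, order))
    return {"name": lines[0].strip(), "atoms": atoms, "bonds": bonds}


# Hypothetical two-atom sample, built with format strings so that the
# fixed column widths are guaranteed to be right.
SAMPLE = "\n".join([
    "Methanol (heavy atoms only)",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "%10.4f%10.4f%10.4f %-3s 0  0" % (0.0, 0.0, 0.0, "C"),
    "%10.4f%10.4f%10.4f %-3s 0  0" % (1.4, 0.0, 0.0, "O"),
    "%3d%3d%3d" % (1, 2, 1),
    "M  END",
])

if __name__ == "__main__":
    print(parse_molfile(SAMPLE))
```

A check of this kind could reject truncated or inconsistent uploads before any renderer is invoked.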

Client side molecule editor

Provide a JavaScript molecule editor so that editors do not have to install software and then jump through hoops, choosing the correct format(ting) and uploading the files. File upload can be accomplished via AJAX.

Possible mentors
Gilles Dubuc, Brian Wolff, Bryan Davis

Deliverables


Please describe the details and the timeline of the work you plan to accomplish on the project you are most interested in (discuss these first with the mentor of the project):

Below is a list of existing client-side and server-side solutions.

Server-side or fully integrated solutions

approach / third party dependency pros cons
Client-side SVG creation; embedding molfile into generated SVG

fewer security issues anticipated
rapid deployment possible
molfile editor would be naturally included
GSoC student has good knowledge of JavaScript and HTML, likes working with them, and has already written an interface for Ketcher (so it could be replaced by any other editor implementing the required APIs)

creation of some kind of personal format spec.

only users with UAs supporting SVG creation would be supported

AGPL as license [dubious]:
The only (nice) SVG-creating molfile editor I found is Ketcher by GGA; there is another one, but it is compiled from Java with Google Web Toolkit and I don't even want to look at the output

Users could trick the system and violate integrity: since the user renders the file, the molfile could differ from the SVG

Server-side molfile rendering: indigo-depict

http://sourceforge.net/projects/sdf2svg/

pure PHP, easy to review, rewrite and deploy
public domain
SVG must be subsequently processed by rsvg for thumbnails

PHP is not the fastest approach
SDF or Molfile must be validated upon upload
PHP must be security-reviewed

Server-side molfile rendering: indigo by GGA

https://github.com/Rillke/indigo
Ubuntu starting from 12.10 includes python-indigo in its package list
In case you use another Linux distribution: Linux64 Binaries

maintenance by a notable company

precompiled binaries available (good for testing)
should be fast because it's native code

GPL v.3

requires installing binaries or phpize or something like that

requires C/C++ security review

Server-side molfile rendering: ChemAzTech

http://sourceforge.net/projects/chemaztech/

incorporates a lot of features and is already translated to French
GPL v.2
Python required for converting to images

Server-side molfile rendering: OpenBabel

https://github.com/Rillke/openbabel

a wide variety of formats supported; C/C++, native code: fast processing with almost zero impact on servers expected (since chemical markup is not too commonly used in WMF projects); Ubuntu package for Ubuntu 12.04 available
GPL v.2
a large framework, similar to indigo

The client side

molecule editor pros cons
Ketcher by GGA

draws on SVG that could be sent to and stored on the server

AGPL [dubious]

advertising must be removed

ChemDoodle Web Components

GPL v3.0

nice and fast

code without any helpful comments; looks as if concatenated from multiple files; still readable, but far from a pleasure

draws on canvas

advertising must be removed

JSME

draws on SVG that could be sent to and stored on the server

BSD license (the compatible one)

Windows 3.1 look

SVG produced doesn't look good

JS compiled from GWT/Java; the code gibberish produced is almost impossible to read

kemia

draws on SVG that could be sent to and stored on the server (in theory)

Apache version 2 license (which should work as MW is GPL.v2+, meaning GPL.v3 as an option) but probably not preferred

does not look completely ready yet

Although this has not been discussed with the mentors yet, I believe the most viable option for reaching the goal of a working prototype (or, better, advancing into production) is indigo-depict + ChemDoodle Web Components.

Project Schedule

Task | Timeline | Remarks | Status
Set up environment (vagrant, gerrit, git), /microtasks | 04/03/14 - 28/03/14 | Took a look at vagrant; the other tools were installed before. | Done
Create GitHub repository, legal check, etc. | 28/03/14 - 07/04/14 | Code will be hosted at Wikimedia Git. The repo will be named after Extension:MolHandler, thus mediawiki/extensions/MolHandler. But I will run a -dev repository at GitHub, allowing me to push changes quickly, create as many branches as I like, test different options and still show that I am not idle. | Done
Aim for a working proof-of-concept | 08/05/14 - 20/06/14 | Get the whole pipeline running, even if it only works on a local install, the code isn't clean or tested, etc.; something that shows that all the moving parts can work together from upload to file page with a generated thumbnail. | In progress
Mid Term Evaluation | 20/06/14 - 23/06/14 | |
Prettify, prepare for production | 23/06/14 - 15/07/14 | Make it clean, give it test coverage, and write everything necessary for production deployment (presumably things like puppet scripts to deploy the server-side parts, production config changes, etc.). |
Write documentation, deploy changes to labs, let people test there, then WMF-side deployment | 15/07/14 - 10/08/14 | |
Final Report Submission | 20/08/14 | |

Workflow

 
Workflow + rough UI mockup
  1. Client is on the page of a .mol file (existing or non-existing file)
  2. Client loads molfile editor. Editor allows import/export of molfile, export of SMILES and export of SVG (server-created SVG).
  3. User edits file and saves
  4. FormData is used for file upload
  5. Molfile is stored; open question: do MDL molfiles contain notable metadata that has to be extracted or converted?
  6. SVG is created from .mol file through indigo-depict and stored; open question: under which file name, and where?
  7. SVG is thumbnailed through rsvg (building on existing SVG support/approach) creating PNG thumbs
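
Steps 6 and 7 can be sketched as two external commands run one after the other. This is only a Python illustration of the idea, not the extension's code (which would be PHP inside MediaWiki). The assumptions are: `indigo-depict` is invoked as `indigo-depict <input> <output.svg>`, `rsvg-convert` accepts `-w`, `-f` and `-o`, and the output file names are made up for the example, since naming and storage location are still open. Check both command lines against the installed versions.

```python
import subprocess
from pathlib import Path


def build_commands(mol_path, width_px, out_dir):
    """Return the two command lines for .mol -> SVG -> PNG thumbnail.

    The output file names are hypothetical placeholders.
    """
    mol = Path(mol_path)
    svg = Path(out_dir) / (mol.name + ".svg")
    png = Path(out_dir) / ("%dpx-%s.png" % (width_px, mol.name))
    # Step 6: render the molfile to SVG (assumed indigo-depict usage).
    depict = ["indigo-depict", str(mol), str(svg)]
    # Step 7: rasterize the SVG to a PNG thumbnail of the requested width.
    thumb = ["rsvg-convert", "-w", str(width_px), "-f", "png",
             "-o", str(png), str(svg)]
    return [depict, thumb]


def render(mol_path, width_px, out_dir):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(mol_path, width_px, out_dir):
        # Argument lists (no shell) avoid quoting problems with file names;
        # the timeout guards against a renderer hanging on a hostile file.
        subprocess.run(cmd, check=True, timeout=30)


if __name__ == "__main__":
    for cmd in build_commands("Benzene.mol", 150, "thumbs"):
        print(" ".join(cmd))
```

Keeping the SVG as an intermediate file means the existing SVG thumbnailing path can be reused unchanged for every requested width.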

Non-obvious challenges

  • Either the molfile editor gets a full security audit (we might even consider prettifying the source code and adding comments to it [creating something maintainable], although that is not nice because it is an upstream library), or it is included through an <iframe> loaded from a different domain
  • Internationalization of the molecule editor
  • Option for turning on/off atom coloring on a per-site and per-inclusion basis: [[File:Benzene.mol|150px|atomcolors=off]]
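
How a per-inclusion option such as `atomcolors=off` could be combined with a per-site default is sketched below. This is a hypothetical Python illustration only: in MediaWiki the parameters would be handled in PHP by the extension's media handler, and the function name `parse_file_link` and the `site_default` argument are made up for this example.

```python
def parse_file_link(wikitext, site_default=True):
    """Extract file name, width and atom-coloring flag from a file link.

    site_default stands in for a per-site configuration value; an
    explicit atomcolors=on/off in the link overrides it.
    """
    inner = wikitext.strip()
    if not (inner.startswith("[[File:") and inner.endswith("]]")):
        raise ValueError("not a file link")
    name, *options = inner[len("[[File:"):-2].split("|")
    params = {"width": None, "atomcolors": site_default}
    for opt in options:
        opt = opt.strip()
        if opt.endswith("px") and opt[:-2].isdigit():
            # Width in pixels, e.g. "150px".
            params["width"] = int(opt[:-2])
        elif opt.startswith("atomcolors="):
            value = opt.split("=", 1)[1].strip().lower()
            if value not in ("on", "off"):
                raise ValueError("atomcolors must be 'on' or 'off'")
            params["atomcolors"] = (value == "on")
    return name.strip(), params


if __name__ == "__main__":
    print(parse_file_link("[[File:Benzene.mol|150px|atomcolors=off]]"))
```

Because the flag changes the rendered image, it would also have to be part of whatever key identifies a cached thumbnail.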

sdf2svg

  • Aromatic bonds not shown
  • Some editors write $RXN into molfiles... sdf2svg should be able to read this
  • Padding often too small, cutting off atom labels
  • --> We went with indigo-depict.
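
Whatever renderer is used, the `$RXN` case noted above suggests checking what kind of file was actually uploaded before rendering. A minimal Python sketch, assuming that reaction files start with a `$RXN` line and that a V2000/V3000 molfile carries its version tag at the end of the fourth line; the function name `classify_ctfile` is hypothetical:

```python
def classify_ctfile(text):
    """Roughly classify an upload as 'rxn', 'mol' or 'unknown'."""
    lines = text.splitlines()
    first = lines[0].strip() if lines else ""
    if first.startswith("$RXN"):
        # Reaction file, even if it was saved with a .mol extension.
        return "rxn"
    if len(lines) >= 4 and lines[3].rstrip().endswith(("V2000", "V3000")):
        # Three header lines followed by a counts line with a version tag.
        return "mol"
    return "unknown"


if __name__ == "__main__":
    print(classify_ctfile("$RXN\n\n  sketch\n\n  1  1\n"))
```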

Participation

  • Style: MediaWiki extension, similar to Extension:TimedMediaHandler or Extension:PagedTiffHandler
  • Progress and experiences will be logged at /office desk (including future visions, what's missing, etc.) and, in a narrower frame, at /microtasks (commits, code review).
  • Code will be hosted at Wikimedia Git. Git/New repositories/Requests. The repo will be named after Extension:MolHandler, thus mediawiki/extensions/MolHandler. But I will run a -dev repository at GitHub, allowing me to push changes quickly, create as many branches as I like, test different options and still show that I am not idle.
  • Every time I commit something to the mw-repo, it will have to be reviewed, so I learn how to do it correctly. However, do not expect me to commit something to that repo every day, but at least once per week.
  • MediaWiki has great help resources for self-study (this wiki, the doxygen-generated documentation and finally the source code, which also looks sane), but for "best practices" I will certainly need the help of my mentors. Expect me to ask a lot of "What is the best approach for … " questions, especially regarding the PHP part. This is also the reason I wish for two mentors knowledgeable about file handling on the server side. Depending on what turns out to be more efficient, I'll bug them with e-mails or on IRC.
  • I'll occasionally notify project chemistry at Wikimedia Commons and gather feedback there, so that this does not end up as vapourware because it was not accepted.

About you

Education completed or in progress
In progress: something closely related to the enhancements the extension will bring. But well, I am German. I am careful when it comes to sharing all kinds of data with the whole world. In other words, I would appreciate it if you did not force me to publish anything specific.
How did you hear about this program?

I read a post on a mailing list raising the point that there was not enough diversity of origin amongst the applicants. I intended to change that with my participation.

Will you have any other time commitments, such as school work, another job, planned vacation, etc., during the duration of the program?

Some of my time will go into the Pronunciation Recording Gadget. But this has a wider schedule and I'll have plenty of time this spring/summer. Otherwise there are no specific plans for activities like internships or vacation, yet.

We advise all candidates eligible to Google Summer of Code and FOSS Outreach Program for Women to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?

Outreach Program for Women? I haven't looked into the details, but *I think* this doesn't apply to me.


Past experience

Please describe your experience with any other FOSS projects as a user and as a contributor
I could do this by providing links to Gerrit, GitHub and Special:CentralAuth/Rillke and telling you to look through the rights logs and contribs as well as the user pages, but here is a brief summary of my experience at Wikimedia: In 2010, I registered at Wikipedia. In 2011, I became a Wikimedia Commons addict and learned a lot about JavaScript, and became an administrator in November 2011 (so I was able to maintain my scripts). I started reporting bugs at Bugzilla in 2012 and using Toolserver and Gerrit in 2013. In 2014, I created some tools at Toollabs (learning PHP) that are still up and running, most notably the database query services OctoData and sha1lookup for the old_image table, which is not exposed through the regular mw-API. I have created documentation using JSDuck, but the software it is for isn't in use yet ...
Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them (include links)
So far, I've mainly contributed to Wikimedia projects, for example UploadWizard, but I also wrote a bunch of user scripts at Wikimedia Commons: FileAnalyzer, VisualFileChange, GalleryTool, a script using the chunked upload protocol and Title checker, and I maintain a lot more (more than I am able to maintain, or let's say more than is fun to maintain, given all the recent JavaScript deprecations). Furthermore, I worked with molfiles and a molfile editor in the past. Proof can be provided upon request, discreetly. Not to forget the daily media-related work at Wikimedia Commons.
What project(s) are you interested in (these can be in the same or different organizations)?

I prefer projects where I can see the light at the end of the tunnel, and where past experience has proven they're successful, hence my late registration at Wikipedia. Thematically, I like projects around chemistry, media files and uploading that involve communities and feedback cycles. I believe that asking the users who are the target of the software about their needs, by using specific questions and coming up with different suggestions, is a crucial part of software development. Head over to Meta if you want to see these points proven.

Any other info


See also
