User:Bawolff/GSoC2010

Identity

Name: Brian Wolff
Email: special:emailuser/bawolff or bawolff+gsoc gmail dot com
Project title: Improve metadata support for uploaded media in MediaWiki by displaying embedded IPTC and XMP metadata

Contact/working info

Timezone: MST (UTC -7)
IRC or IM networks/handle(s): bawolff (on #wikinews, #wikinews-en and #mediawiki on freenode)

Project summary

Currently, MediaWiki only supports displaying JPEG Exif metadata on image pages. Some other metadata can also be returned through the API using prop=imageinfo&iiprop=metadata [1]; however, this is not very useful to the average viewer, since it is not displayed anywhere. I propose as my project (should I be accepted) to improve MediaWiki's support for the metadata of uploaded files. This would include extracting metadata from more media formats and displaying that metadata on image pages where it is useful. Considering Wikimedia's general stance on copyright and freeness, I think being able to view file metadata, especially copyright-related metadata, would be a benefit, particularly for projects like Commons.
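As a rough illustration of the API route mentioned above, the following Python sketch only builds the query URL; it does not send a request. The Commons endpoint and the file title are assumptions chosen for the example, and any MediaWiki api.php endpoint could be substituted.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; any wiki's api.php could be used instead.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def metadata_query_url(file_title, endpoint=API_ENDPOINT):
    """Build a URL asking the API for the stored metadata of one file."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "metadata",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical file title, used only to show the shape of the URL.
url = metadata_query_url("File:Example.jpg")
print(url)
```

The response is machine-readable data, which is exactly why it does not help a reader who is only looking at the image page.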

About you

I am currently a first-year computer science student. I first became involved with the Wikimedia world as a contributor to Wikinews several years ago. I also have commit access to MediaWiki and have made some patches, but nothing major as of yet. I would like to participate in Google Summer of Code because I think it would be a great way to become involved in MediaWiki development, as well as an excellent learning experience.

Deliverables

Required deliverables

Have some method to output more complex metadata displays than the current table.
Improve oggHandler to display the Comment metadata on ogg files (currently collected but not displayed)
Output currently collected metadata for pdf files
Display IPTC for jpeg image
Parse and display (common) XMP metadata for:
- png
- jpeg
- tiff
Parse metadata for svg (There appears to be some work already done towards this end on bugzilla:12649, but it required a dependency)
Display metadata for svg

If time permits

Support XMP metadata in other formats (pdf and djvu perhaps)
On wiki method of editing metadata in jpeg and png files (bugzilla:3361)
Perhaps do something with extracting album art from ogg audio files. (not sure if that would actually be useful or not)

Project schedule

Per the fixme in MediaHandler, make formatMetadata() more flexible, while retaining backwards compatibility.
Since ogg comment data is already parsed, implement formatMetadata() to display the relevant metadata.
Do the same for PDFs (although the currently parsed PDF metadata is rather uninteresting).
Add IPTC data to the metadata collected from jpegs.
Address the concern raised in bugzilla:13172#c8 that introducing a new version of the metadata could cause heavy server load when all the old cached metadata is purged (how to do this is still an open question).
For formats that have multiple metadata embedding methods, determine a way to present the metadata so as to distinguish between the sources (for example, what to display if a JPEG file has an Exif description and an XMP description, and the two descriptions say different things).
Extract the XMP info from jpeg
Parse extracted XMP metadata
Add this xmp to the metadata displayed for jpegs
Extract XMP metadata from PNGs
Make png files output the collected metadata
Extract XMP from tiffs
Make tiff image page output metadata
Extract and parse metadata from svg files (from what I understand, this is mostly dublin core rdf)
Display the svg metadata
If time still remains, try and implement a method of on-wiki metadata editing.
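To make the "extract, then parse" steps for JPEG XMP more concrete, here is a minimal Python sketch under stated assumptions. It relies on the fact that XMP is stored in an APP1 segment whose payload begins with the identifier "http://ns.adobe.com/xap/1.0/" followed by a null byte. The function names, the sample packet and the choice to read only dc:description are illustrative; extended XMP split across several segments is not handled, and this is not MediaWiki code.

```python
import struct
import xml.etree.ElementTree as ET

XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def extract_xmp_packet(jpeg_bytes):
    """Return the raw XMP packet from a JPEG, or None if there is none."""
    if jpeg_bytes[:2] != b"\xff\xd8":  # every JPEG starts with SOI
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            return None  # malformed segment structure, give up
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):  # start of scan / end of image
            return None
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(XMP_HEADER):
            return payload[len(XMP_HEADER):]
        pos += 2 + length
    return None


def dc_descriptions(xmp_packet):
    """Collect the text of every dc:description alternative in the packet."""
    root = ET.fromstring(xmp_packet)
    return [li.text for li in root.iterfind(".//dc:description//rdf:li", NS)]


def app1_segment(payload):
    """Helper for the demo only: wrap a payload in an APP1 segment."""
    return b"\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload


# Synthetic file: SOI, a dummy Exif APP1, an XMP APP1, EOI (no image data).
sample_xmp = (
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<dc:description><rdf:Alt>'
    b'<rdf:li xml:lang="x-default">A picture of a butterfly</rdf:li>'
    b'</rdf:Alt></dc:description>'
    b'</rdf:Description></rdf:RDF></x:xmpmeta>'
)
sample_jpeg = (b"\xff\xd8" + app1_segment(b"Exif\x00\x00")
               + app1_segment(XMP_HEADER + sample_xmp) + b"\xff\xd9")

packet = extract_xmp_packet(sample_jpeg)
print(dc_descriptions(packet))
```

Because Exif also lives in an APP1 segment, the identifier check is what tells the two apart. The same parsing step could in principle be reused once the packet has been located in a PNG or TIFF file.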

Participation

I plan to hang around #mediawiki and wikitech-l, as those seem to be ideal places to ask for help, and learn from others. I expect to talk with my mentor quite regularly, and ask for advice when needed, as well as receive comments on code review. I expect to submit code as a branch in svn.

Past open source experience

I've made some minor patches to MediaWiki, and I also maintain several JavaScript tools on Wikinews; beyond that, however, I do not have much experience.

Any other info

To address the inflexibility of MediaHandler::formatMetadata() (per comments in the source code[2]), one possible rearrangement would be to move the image page code that turns the array returned by formatMetadata() into a table, making it a static method of MediaHandler. formatMetadata() would then normally return an HTML string, usually built by that MediaHandler method. MediaHandler subclasses that need more complicated formatting can return the appropriate HTML themselves. To retain backwards compatibility, if the image page receives an array from formatMetadata(), it passes the array to the static method of MediaHandler. This is just an initial thought, though.
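The backwards-compatibility idea can be sketched roughly as follows. This is a language-neutral illustration in Python rather than MediaWiki's PHP; the class and method names mirror the proposal but are hypothetical, and the legacy array is simplified to a list of (label, value) pairs, which is an assumption about its shape.

```python
from html import escape


class MediaHandler:
    @staticmethod
    def metadata_table(rows):
        """Shared table builder (the code that would move out of the image page)."""
        body = "".join(
            f"<tr><th>{escape(label)}</th><td>{escape(value)}</td></tr>"
            for label, value in rows
        )
        return f'<table class="metadata">{body}</table>'

    def format_metadata(self, file):
        """Return an HTML string, a legacy row list, or None for no metadata."""
        return None


class LegacyHandler(MediaHandler):
    """Old-style handler (hypothetical) that still returns rows."""

    def format_metadata(self, file):
        return [("Some field", "some value"), ("foo", "bar")]


class CustomHandler(MediaHandler):
    """New-style handler (hypothetical) that builds its own HTML."""

    def format_metadata(self, file):
        return '<div class="streams">custom layout</div>'


def render_metadata(handler, file):
    """What the image page would do: accept either return type."""
    result = handler.format_metadata(file)
    if result is None:
        return ""
    if isinstance(result, str):
        return result  # the handler already produced HTML
    return MediaHandler.metadata_table(result)  # legacy rows: build the table


print(render_metadata(LegacyHandler(), None))
print(render_metadata(CustomHandler(), None))
```

Checking the return type in one place means existing handlers keep working unchanged while new ones opt in to custom output.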

As an initial thought, perhaps the metadata table should look like the current table, but with different sections for the different types of metadata.

Metadata
Exif
Some field	some value
foo	bar
XMP
Description	A picture of a butterfly
Description	Somebody

On the other hand, that approach somewhat over-emphasizes the different technical formats, which the average user doesn't care about. For ogg files, the metadata might be split up by stream ("Audio stream <id>", "Video stream <id>", etc.). For example, the metadata for File:Wikinews11Apr2005 Demo (high quality).ogg (api output) might look like:

Metadata
Video Stream (1413894367)
Type	Theora
Vendor	Xiph.Org libTheora I 20060526 3 2 0
Length	1 minute
Title	Wikinews video broadcast demo <n:en:Wikinews:Broadcast>
Artist	David Vasquez <n:en:user:David Vasquez>
Location	San Jose, California
Organization	Wikinews, the free news source <http://en.wikinews.org/>
Copyright	Public Domain except for the logo is copyright the Wikimedia foundation. See commons:Template:CopyrightByWikimedia
Encoder	ffmpeg2theora 0.18
Audio Stream (633291139)
Type	Vorbis
Vendor	Xiph.Org libVorbis I 20070622
Length	1 minute
Channels	2
Sample rate	48000
Encoder	ffmpeg2theora 0.18

(With some of the more technical fields hidden by default using the js toggle)
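One way to think about the per-stream layout is as a list of sections, each with fields shown by default and fields collapsed behind the toggle. The Python sketch below is only an illustration of that data shape: the input structure, the function name and the choice of which fields count as "technical" are all assumptions, and the sample values are taken from the audio stream in the example above.

```python
# Assumption: which fields are "technical" is a design decision, not settled here.
TECHNICAL_FIELDS = {"Vendor", "Encoder", "Channels", "Sample rate"}


def build_sections(streams):
    """Turn per-stream field lists into sections with visible/collapsed rows."""
    sections = []
    for stream in streams:
        heading = f'{stream["kind"]} Stream ({stream["serial"]})'
        visible = [(k, v) for k, v in stream["fields"]
                   if k not in TECHNICAL_FIELDS]
        collapsed = [(k, v) for k, v in stream["fields"]
                     if k in TECHNICAL_FIELDS]
        sections.append(
            {"heading": heading, "visible": visible, "collapsed": collapsed}
        )
    return sections


# Hypothetical input structure holding the audio stream from the example.
streams = [
    {
        "kind": "Audio",
        "serial": 633291139,
        "fields": [
            ("Type", "Vorbis"),
            ("Vendor", "Xiph.Org libVorbis I 20070622"),
            ("Length", "1 minute"),
            ("Channels", "2"),
            ("Sample rate", "48000"),
            ("Encoder", "ffmpeg2theora 0.18"),
        ],
    }
]

for section in build_sections(streams):
    print(section["heading"])
    for label, value in section["visible"]:
        print(f"  {label}: {value}")
    print(f'  ({len(section["collapsed"])} technical fields hidden)')
```

Keeping the visible/collapsed split in the data, rather than in the display code, would let each handler decide what is worth showing by default.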

If I do have time, I imagine a metadata editor implemented as a special page (perhaps an extension), where users can modify or add fields, would be quite awesome. On save, it would be roughly equivalent to uploading a new file (like how reverting an image is like uploading a new file), but with a different action in the upload log. If time permits, I would definitely like to work towards a metadata editor. Currently, when people download a file, most of the information on the image description page is lost. If people then find the file on their hard drive six months later, they have no idea where it came from. Imagine if all the information contained in commons:template:description was also in the metadata. Then, when people open the file six months later, their image editing program could tell them the source of the image, its URL at Commons, its description, etc.