Help:Extension:GWToolset
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
GWToolset (or GLAMWikiToolset) is a Special Page extension. The main goal of the extension is to allow GLAMs to mass upload content (pictures, videos, and sounds) to Wikimedia Commons based on respective metadata (XML); the intent is to allow for a wide variety of XML schemas. The extension presents the user with several steps, represented by HTML forms, in order to set up a batch upload process that uploads content and metadata to the wiki and creates an individual media file page for each item uploaded.
The project was co-funded by Europeana and a few Wikimedia chapters[1].
Further information can be found on the project page. Your feedback and questions are welcome; feel free to contact us.
Introduction
You’re probably reading this because you are considering or planning to make a large amount of content available for reuse by publishing it on Wikimedia Commons. This manual will guide you through the necessary steps.
Process overview
The image below is a process diagram that gives an overview of the steps to use the toolset. This manual is structured according to this process diagram.
Phase 1: Preparation
Username & user rights
To use the toolset you need to:
- Be a registered user.
- Be granted access rights to the toolset.
Registering a user name
You can skip this step if you already are a user of Wikimedia Commons, Wikipedia or any other Wikimedia project. Please follow these steps if you aren't a user or if you want to set up a specific account for content uploads:
- Read these guidelines for choosing a username.
- Go to the signup page to register.
Introduce yourself on your user page
After signing up you can log in. You'll see your username in red on the top of the wiki page.
Asking for user rights
We recommend that you do all testing on the Commons Beta server and use the tool on the Production server only once you feel that it is giving you the results you want. Because these are two separate environments, you will need to have a user account on each and request access on each. The best way to do this:
- Commons Beta server - contact a developer or bureaucrat on beta to request the rights for the GWToolset user group on beta. You can ask in the #wikimedia-commons IRC channel, on the glam mailing list, or contact them from these lists:
- Commons Production server - Once you have a successful example upload to demonstrate from the Beta server, go over to "real" Commons and leave a message on the Commons Bureaucrats notice board to request rights for the GWToolset. Please introduce yourself and explain the reason for your request.
GWToolset rights are granted for one year at a time on the Wikimedia Commons production server, and expire automatically. Users are notified in advance of expiry and can request extensions or restoration at the Commons Bureaucrats notice board if they have ongoing plans to use the toolset.
Domain Whitelist
If your media file domain is not yet whitelisted (look for "wgCopyUploadsDomains"), please request that your media file domain be added to the Wikimedia Commons domain whitelist. The domain whitelist is a list of domains Wikimedia Commons checks against before fetching media files. If your media file domain is not on that list, Wikimedia Commons will not download media files from that domain. The best example to submit in your request is an actual link to a media file.
Please note that requests will take several business days. If you are planning some sort of event or training program, it is recommended that you make requests one week in advance to be on the safe side. If your request must be fulfilled by some date in order to be ready for a planned event, please include the date and time you need it by in the request title, and we will try to make sure the request is fulfilled before then. Sometimes people will ask for clarification about whitelist requests, so be sure to respond to any questions. Although not required, stating who you are and what you plan to upload can make the process go more smoothly.
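Before filing a request, it can help to check locally whether the host of a media file URL already matches an entry on the list. The following is a minimal sketch only: the entries in `ALLOWED_DOMAINS` are hypothetical placeholders, the helper name `host_is_allowed` is made up for illustration, and the assumption that a leading `*.` matches any subdomain should be checked against the current wiki configuration.

```python
from urllib.parse import urlparse

# Hypothetical sample entries; the real list is the wgCopyUploadsDomains
# setting in the wiki configuration.
ALLOWED_DOMAINS = ["*.example.org", "media.example.com"]


def host_is_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host matches an entry in the allowed list."""
    host = (urlparse(url).hostname or "").lower()
    for pattern in allowed:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # Assumption: "*.example.org" matches any subdomain of example.org.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


print(host_is_allowed("https://images.example.org/photos/1.jpg"))
print(host_is_allowed("https://files.example.net/photos/1.jpg"))
```

A host that does not match is a candidate for a whitelist request, ideally accompanied by an actual media file link from that host.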
Content selection
There are several variables to take into account when selecting content. First of all, are there restrictions on the content (file formats, copyright restrictions, organisational restrictions, etc.) that determine whether a work can be published on Commons? These variables also determine whether a content upload can be done in one batch or whether it is better to split the content into separate batches.
Another factor is your content sharing strategy. How and when are you going to publish your content? In large batches? Small themed batches?
Content types
Every type of content needs a different metadata template. It is not possible to upload photos and sound files in one batch; these need to be separated into a batch of photos and a batch of sound files.
License Types
It is not possible to upload content with different licenses in one batch. For example, if you want to upload files available under a CC BY license and files under a CC BY-SA license, you have to separate the uploads into one batch per license.
Example: Recent photographs of the collection of the University Museum in Utrecht.
The University Museum Utrecht commissioned a local photographer to take pictures of instruments, stuffed animals, skulls, etc. They sent a permission notice to OTRS and received a ticket number that has to be mentioned on the pages of every photo that was taken by this photographer. To be able to do so they had to upload these photos in a separate batch from their other content (pictures that were considered public domain).
Permissions
Content that was created after 1923 probably needs a notice that you have permission from the creator to release these files under one of the accepted licenses for Wikimedia Commons. It is not possible to upload files of different creators in one batch because you need an OTRS ticket number for every creator.
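The batching rules for content type, license and creator can be applied mechanically before any XML is produced. The sketch below groups records so that each batch shares one content type, one license and one creator. The record layout (`type`, `license`, `creator` keys) and the function name are assumptions made for illustration; they are not part of the toolset.

```python
from collections import defaultdict


def split_into_batches(records):
    """Group records so each batch has one content type, license and creator."""
    batches = defaultdict(list)
    for record in records:
        key = (record["type"], record["license"], record["creator"])
        batches[key].append(record)
    return dict(batches)


# Hypothetical records exported from a collection management system.
records = [
    {"id": 1, "type": "photo", "license": "CC BY", "creator": "A"},
    {"id": 2, "type": "photo", "license": "CC BY-SA", "creator": "A"},
    {"id": 3, "type": "sound", "license": "CC BY", "creator": "B"},
    {"id": 4, "type": "photo", "license": "CC BY", "creator": "A"},
]

batches = split_into_batches(records)
for key, items in batches.items():
    print(key, [record["id"] for record in items])
```

Each resulting group would then become its own XML file and its own upload.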
Content sharing strategies
There have been several large content donations already. All of these were mass donations: one single event where all the content was uploaded to Wikimedia Commons. This is not the only way to do a donation. This chapter discusses different strategies for content donations.
One time mass sharing
This is the classic way of sharing content: a large scale upload of the content that can be selected with the available sources.
Advantages:
Theme based
Some GLAMs are currently considering theme based uploads. A theme can be an exhibition. This means that selecting the content that will be uploaded to Commons can become part of the process of preparing an exhibition.
Advantages:
- Ongoing process of uploads, every new upload gains interest
- Lessons learned from past upload can be
Advantages
Low hanging fruit
Start small, big end possible
Technical compatibility analyses
The Toolset has been developed to fit the most common way GLAMs have organised their content. This means that the Toolset is easy to work with for most organisations, but that some will have to take extra measures before they can use it. The diagram in this paragraph can be used to determine how compatible the Toolset is with your organisation. Every question in the diagram is explained underneath.
Are the media files online available?
Only files accessible from the internet can be uploaded using GWToolset. If you have a very large amount of images (hundreds of GBs or more), it is possible to arrange for the files to be uploaded by mailing a hard disk. The procedures for processing metadata on such files are very different from those for GWToolset. For more information on this option, please see commons:Help:Server-side_upload#What to do if files represent hundred of GB to several TB?.
Can the media files be put online?
If they can be, then you need to do this to use GWToolset.
Is the metadata online available?
The metadata does not need to be online. The metadata just needs to be converted to a single XML file in a "flat" format.
Can the metadata be exported?
The metadata needs to be converted to a flat XML format.
Can the metadata be exported to XML?
The metadata needs to be converted to a flat XML format. If you have trouble converting to XML, there are volunteers who can probably help you. Contact the glam mailing list.
Are the mediafiles and metadata both publicly available?
Only the media files need to be publicly available.
Are credentials available to gain access to the mediafiles and metadata?
The media files cannot be behind a password. They must be directly accessible from a URL. The metadata doesn't need to be publicly accessible.
Can these credentials be used to access the mediafiles and metadata?
The media files cannot be behind a password. They must be directly accessible from a URL. The metadata doesn't need to be publicly accessible.
Is there an API available?
APIs can be useful for generating the metadata file, but are not required.
Does the API respond in XML?
APIs can be useful for generating the metadata file, but are not required.
Is the XML in flat format?
There are several standards that are currently used by organisations to organise their metadata, for example OAI-PMH, EDM, MARC and Lido. The GLAMwiki Toolset accepts all forms of metadata as long as the data complies with the following requirements.
What is flat format?
The metadata of individual objects has to be on the same level of hierarchy in the XML file; that is what 'flat' refers to. Metadata at a deeper level, further down the hierarchy, is not recognised by the Toolset.
'Flat' XML | Non-'Flat' XML |
---|---|
An example of a flat XML file | An example of an XML file with a deeper hierarchy |
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
</metadata> |
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
</metadata> |
The metadata in the field author, subject and rights will be recognised. | The metadata in the deeper levels will not be recognised. |
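To make the distinction concrete, the sketch below checks a file for fields that contain child elements, which is what makes a file non-flat in the sense described here. The sample documents, the `record` wrapper element and the function name `nested_fields` are illustrative assumptions, not taken from the toolset; only the Dublin Core namespace comes from the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat file: every metadata field is a direct child of a record.
FLAT = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Example title</dc:title>
    <dc:creator>Example creator</dc:creator>
  </record>
</metadata>"""

# Hypothetical non-flat file: the creator name sits one level deeper.
NESTED = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Example title</dc:title>
    <dc:creator>
      <name>Example creator</name>
    </dc:creator>
  </record>
</metadata>"""


def nested_fields(xml_text):
    """Return the tags of fields that contain child elements."""
    root = ET.fromstring(xml_text)
    offenders = []
    for record in root:         # each child of the root is one object
        for field in record:    # each child of a record is a metadata field
            if len(field) > 0:  # a field with child elements is too deep
                offenders.append(field.tag)
    return offenders


print(nested_fields(FLAT))
print(nested_fields(NESTED))
```

An empty result suggests the file is flat under these assumptions; any tag reported points at metadata that would not be recognised.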
The use of attributes
Attributes of declarations are also not recognised with one exception: the language attribute. This attribute can be used to recognise the descriptions of objects in different languages.
For example,
- <dc:description lang="en">This is a description</dc:description>
is recognised as a description in English.
- <dc:source photoid="351131">www.example.org</dc:source>
Will be seen by the Toolset as
- <dc:source>www.example.org</dc:source>
The PhotoID in this example will not be read.
Storing information in attributes can therefore cause loss of information.
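A quick way to find out whether a metadata file would lose information this way is to list every attribute other than `lang`. This is a rough sketch using Python's standard library; the function name `ignored_attributes` and the `record` wrapper element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description lang="en">This is a description</dc:description>
  <dc:source photoid="351131">www.example.org</dc:source>
</record>"""


def ignored_attributes(xml_text, kept=("lang",)):
    """List (tag, attribute, value) for attributes that would not be read."""
    root = ET.fromstring(xml_text)
    lost = []
    for element in root.iter():
        for name, value in element.attrib.items():
            if name not in kept:
                lost.append((element.tag, name, value))
    return lost


for tag, name, value in ignored_attributes(SAMPLE):
    print(f"{tag}: attribute {name}={value!r} would be dropped")
```

Any value reported here would need to be moved into its own element before the upload if it has to appear on Commons.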
Multiple descriptions in one metadata field
Some metadata fields are mentioned more than once, for example <dc:subject>.
Currently there is no option to include these individually, but the data in these fields will be merged, separated with a pipe symbol (|).
In some cases an object has several descriptions, like "vehicle", "flamethrower" and "combat vehicle".
All of these descriptions will be added to the object when they are included in the XML with the <dc:subject> field.
It is advised to separate metadata fields as much as possible; this way they will be shown on Commons in the right way.
Wrong | Right |
---|---|
|
|
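The merging behaviour described in this section can be previewed with a small script. This is a sketch of the idea only: the exact separator formatting used by the toolset (for example, whether spaces surround the pipe) is an assumption, and `merged_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# One record with a repeated <dc:subject> field, as in the text above.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example vehicle</dc:title>
  <dc:subject>vehicle</dc:subject>
  <dc:subject>flamethrower</dc:subject>
  <dc:subject>combat vehicle</dc:subject>
</record>"""


def merged_fields(record_xml):
    """Join the values of repeated fields with a pipe symbol."""
    record = ET.fromstring(record_xml)
    values = {}
    for field in record:
        values.setdefault(field.tag, []).append((field.text or "").strip())
    return {tag: "|".join(parts) for tag, parts in values.items()}


print(merged_fields(RECORD)[DC + "subject"])
```

Keeping each description in its own element, rather than packing several into one, is what makes this kind of merge possible.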
Can the XML be transformed in flat format?
Do you need help to convert your XML to a 'flat' XML file? Then consider these options:
- Hire a specialist to write a script to convert your XML file
- Use XSLT: http://www.w3.org/Style/XSL/
- Choose a standard that publishes the XML as a 'flat' file, like OAI-PMH and – to a certain extent – the Europeana API
- Look into Open Refine
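As an alternative to XSLT for simple cases, a short script can promote deeper elements to the record level. The sketch below uses Python's standard library rather than any of the tools listed above; the element names and the function name `flatten` are illustrative assumptions, and the approach deliberately discards the container elements, so it only suits data where the leaf tag names are meaningful on their own.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested export: the creator details sit one level too deep.
NESTED = """<metadata>
  <record>
    <title>Example title</title>
    <creator>
      <name>Example creator</name>
      <year>1900</year>
    </creator>
  </record>
</metadata>"""


def flatten(xml_text):
    """Return a new tree where every leaf is a direct child of its record."""
    root = ET.fromstring(xml_text)
    flat_root = ET.Element(root.tag)
    for record in root:
        flat_record = ET.SubElement(flat_root, record.tag)
        for element in record.iter():
            if element is record or len(element) > 0:
                continue  # skip the record itself and container elements
            leaf = ET.SubElement(flat_record, element.tag)
            leaf.text = (element.text or "").strip()
    return flat_root


print(ET.tostring(flatten(NESTED), encoding="unicode"))
```

If two containers hold leaves with the same tag name, the leaves would need to be renamed during flattening to keep them distinguishable.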
Selecting Metadata
What metadata can Wikimedia handle?
About Dublin Core and Europeana
Custom fields
Templates
Metadata templates
Wikimedia Commons uses templates to map metadata. The amount of metadata that will be displayed on Commons is therefore limited to the fields that are present in the metadata template that is chosen for the upload.
There are several templates available. Some of the templates that are available are:
- Art_Photo: https://commons.wikimedia.org/wiki/Template:Art_Photo
- Artwork: https://commons.wikimedia.org/wiki/Template:Artwork
- Book: https://commons.wikimedia.org/wiki/Template:Book
- Musical work: https://commons.wikimedia.org/wiki/Template:Musical_work
- Map: https://commons.wikimedia.org/wiki/Template:Map
- Photograph: https://commons.wikimedia.org/wiki/Template:Photograph
- Specimen: https://commons.wikimedia.org/wiki/Template:Specimen
There is currently no template available for video content. It's not possible (yet) to use a template you created yourself.
The type of work that you want to upload determines the template you ought to use. This also means that it is not possible to upload multiple types of content that require different templates. E.g.: if you want to upload photos and sound files you should separate these uploads and XML files in an upload (and XML file) of the photos and an upload (and XML file) of the sound files. It is not possible to upload both file types in one batch.
License template and other metadata sub-templates
Some metadata fields also use templates. An example is the metadata field for the license of a mediafile. A Creative Commons license will be recognised by the Toolset, which results in the display of the corresponding license banner. It is possible to create your own template. This is useful when you've cleared permission to use the content and received an OTRS ticket to include with the files. See this example of an OTRS ticket in a license template. If the text in the license field does not refer to a template, this information will be shown as plain text.
Note: the Wikimedia Commons community is very strict when it comes to permission for file usage. Content will most likely be deleted when there is any doubt about copyright infringement or other restrictions that do not permit the use of the file on the Wikimedia platforms. This is why a good license template is an absolute must.
Institution Template
An institution template is used to show which institution provided and/or uploaded the file to Commons. The template makes it possible to add more information about your institution than only its name. An example of an institution template is this template of the Amsterdam Museum. Useful information to include in this template is:
- The logo of your organisation
- A photo of the building of your organisation
- The location (City, country, etc)
- The coordinates
- The URL to your website
This template is not required, but highly recommended to include with your uploads.
An institution template will be recognised by the Toolset. The template mentioned above will be included by the Toolset if the source tag in the XML file has the same name as the template, in this case: <dc:source>Amsterdam Museum</dc:source>.
Source template
Categories
Categories are special pages to group related pages and media. It is essential that every file can be found by browsing the category structure. To allow this, each file must be put into a category directly. Each category should itself be in more general categories, forming a hierarchical structure. The category structure is the primary way to organize and find files on Commons. With the GLAMwiki Toolset you can add your content to existing or new categories.
Categories can be in multiple languages. Make sure that, next to your own language, you also search for and add English categories to your content.
Check available categories
Please see this quick guide to learn how you can search for existing categories.
Create categories
If you need to make new categories, please read the policy on categories on Commons.
Validating your xml
You can validate your xml file by using the form at http://www.w3schools.com/xml/xml_validator.asp.
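A local check can catch the most common problems before a file is submitted anywhere. The sketch below only tests whether the file is well-formed XML (it parses without errors); it does not validate against a schema and is not a substitute for a test upload. The function name `check_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return None if the XML parses, else a message with the error position."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as error:
        line, column = error.position
        return f"line {line}, column {column}: {error}"
    return None


print(check_well_formed(b"<metadata><record>ok</record></metadata>"))
print(check_well_formed(b"<metadata><record>A & B</record></metadata>"))
```

The reported line and column usually point directly at the offending character, such as a floating ampersand.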
Common xml problems
Ampersand and less than ( & < > )
Use of "&" within fields in your xml file can cause unpredictable results. These may be interpreted (correctly) as XML encodings of characters; for example, "&amp;" will display as "&" on a Commons image page. Floating ampersands in your text like " & ", or text that looks like an HTML encoding but may be an abbreviation in English, like "&c." for etcetera, are likely to cause the GWT to fail at that record. It is worth searching out and replacing these with "and" or similar, depending on the templates these are going to be used within.
xml relies on < and > to wrap fields. If you are using these in your text you should convert them to "&lt;" and "&gt;" or standard brackets to avoid your xml being misinterpreted.
Please note that, since it is an XML file and not an HTML file, HTML named entity references such as &eacute; for é will not work.
You must either use the normal UTF-8 symbol directly, or a numeric entity reference such as &#233; or &#xE9;.
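When the XML is generated by a script, escaping can be handled in one place. The sketch below first converts HTML named entities into real characters and then escapes the characters that are special in XML, using Python's standard library. The helper name `clean_field` is hypothetical, and the approach assumes that any HTML entities in the source text are intentional.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def clean_field(value):
    """Turn HTML entities into characters, then escape &, < and > for XML."""
    return escape(html.unescape(value))


print(clean_field("Tom & Jerry <1950>"))
print(clean_field("caf&eacute;"))

# The cleaned text can be embedded in an element and parsed back unchanged.
element = ET.fromstring("<title>" + clean_field("A & B") + "</title>")
print(element.text)
```

Libraries that build XML trees, such as `xml.etree.ElementTree`, perform this escaping automatically when serialising, so manual escaping is mainly relevant when the file is assembled from strings.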
Double-dash ( -- )
The use of double dashes may be unpredictable as these can be interpreted as part of xml comment fields. These are unlikely to be an issue in most cases, but worth changing to single dashes in title fields.
Equals, pipe symbol, question mark, forward slash ( = | ? / )
There are a number of characters that are either not allowed in Commons file names or may (or may not) give problems when used in some templates. For example, to use an equals sign in some templates, you would have to wrap them in double curly brackets, i.e. "=" becomes "{{=}}". It is worth testing out an example in a sandbox if you are going to have to use these in url references, or checking for these if your upload unexpectedly halts.
Bad characters
The xml file read by the GWT is expected to be in UTF-8 character standard format. Most text editors can handle these, but if you are exporting and importing your metadata these may get oddly converted along the way and show in your uploads as invisible or strangely displayed characters. Standard free editors like the open source JEdit or Google spreadsheets have been used to create useable xml files. Ensure that your process for exporting and editing your metadata provides valid UTF-8 or the simpler ASCII standard output on a small sample, before running your whole batch.
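Checking a small sample for encoding problems can be automated. The following sketch reads raw bytes, reports where UTF-8 decoding fails, and flags control characters and the Unicode replacement character, which often indicate a conversion that went wrong earlier in the export. The function name `find_encoding_problems` and the choice of which characters to flag are assumptions for illustration.

```python
def find_encoding_problems(data):
    """Return a list of human-readable problems found in raw file bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as error:
        return [f"invalid UTF-8 at byte offset {error.start}"]
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for char in line:
            # Control characters other than tab and carriage return
            # are not expected in metadata text.
            if ord(char) < 32 and char not in "\t\r":
                problems.append(
                    f"line {line_no}: control character U+{ord(char):04X}"
                )
            elif char == "\ufffd":
                problems.append(f"line {line_no}: replacement character U+FFFD")
    return problems


print(find_encoding_problems("café".encode("utf-8")))
print(find_encoding_problems("café".encode("latin-1")))
```

To check a file on disk, the bytes can be read with `open(path, "rb").read()` and passed to the function.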
Phase 2: Do a test upload
Screencast
The following screencast gives you a quick overview of how to use the extension. You can follow along by going to Special:GWToolset and following the wizard instructions. Note: you will need to be a member of the “gwtoolset” group in order to use the extension. Contact a Wikimedia Commons bureaucrat to be added to the group.
Screenshots
Metadata Detection
Metadata Mapping
Metadata Categories
Batch Preview
editPhase 3: Revision of the test upload
Phase 4: Upload to Wikimedia Commons
Tracking the batch upload
The wiki page, Special:Log, can be used to track activity within a wiki.
Some processes have their own page that tracks their specific events; GWToolset is one of them.
You can find the Special:Log pages for GWToolset at the following URLs, which should help you track down the progress or any issues with your batch upload.
commons production
https://commons.wikimedia.org/w/index.php?title=Special:Log&type=gwtoolset
commons beta
https://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log&type=gwtoolset