I think I'm overlooking something. Say I define a general ToS somewhere in my MediaWiki that includes a sentence like "All the content of this wiki is under the CC-BY-SA license (or whatever), apart from pictures. Each picture has its own license, chosen by the uploader of the file." The uploader of a screenshot now chooses e.g. the GPL. How can I share the whole wiki site (text and pictures) without any license problems?
Talk:Files and licenses concept
We need to stabilize the database schema as soon as possible. The most annoying requirement for this purpose is #2 (a list of files by copyright holder should be obtainable), but in my opinion it is too important to be dropped.
Options:
- Add a fileprops entry for every revision and join fp_rev_id against page_latest. Requires an index on page_latest, but no schema changes to revision.
- Add a boolean fp_latest to fileprops and add rev_fileprops_id to revision, which is a reference to fp_id. fp_latest needs to be included in all indices. If fp_id is taken equal to the revision that changed the fileprops entry, there is no need for an extra index to revision.
- Add a boolean fp_latest to fileprops and add a fileprops entry for every revision. This requires no schema change to existing tables. fp_latest needs to be included in all indices.
- Add img_fileprops_id to image, which is a reference to fp_id, which is the revision id of the revision that changed the fileprops entry. img_fileprops_id needs to be indexed.
I need a bit of thinking before I state my preference, but those 4 are the options available that I see.
Ah I see that I missed the comments on Talk:License_integration_MediaWiki#Fileprops_and_ids.
I have thought about this, and I think #4 is the cleanest way to implement this feature. With img_fileprops_id, a direct link between file props and a file is created, which is in my opinion the most natural thing to do. The disadvantage is that while the current version links to a file, the previous versions link only to a revision and thus to a page, which is a bit inconsistent. (We could add an fp_img_name column to solve the inconsistency, but I'm not sure I like that.)
#3 is the easiest to implement from an operational perspective. As stated before however, it feels a bit like a hack to have a boolean fp_latest, but perhaps I'm wrong?
I do not have very strong opinions about this, so I'm fine with any of them.
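To make option #3 a bit more concrete, here is a minimal sketch using Python's sqlite3 module. It is an illustration only, not proposed core code: the column fp_rev_id is borrowed from option #1, fp_value_text from the draft schema further down, and the helper names and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_file_props (
    fp_rev_id     INTEGER NOT NULL,  -- assumed: revision this entry belongs to
    fp_latest     INTEGER NOT NULL,  -- 1 for the current entry of a file, else 0
    fp_key        TEXT NOT NULL,
    fp_value_text TEXT
);
-- Option #3 says fp_latest has to be part of every index.
CREATE INDEX fp_latest_key_value
    ON mw_file_props (fp_latest, fp_key, fp_value_text);
""")

def save_props(conn, prev_rev_id, new_rev_id, props):
    """Hypothetical helper: store props for new_rev_id, retire prev_rev_id."""
    with conn:  # one transaction: the flag flip and the insert go together
        if prev_rev_id is not None:
            conn.execute(
                "UPDATE mw_file_props SET fp_latest = 0 WHERE fp_rev_id = ?",
                (prev_rev_id,),
            )
        conn.executemany(
            "INSERT INTO mw_file_props"
            " (fp_rev_id, fp_latest, fp_key, fp_value_text)"
            " VALUES (?, 1, ?, ?)",
            [(new_rev_id, key, value) for key, value in props],
        )

def current_revisions_by_author(conn, author):
    """Revision ids of current file props entries that name this author."""
    rows = conn.execute(
        "SELECT fp_rev_id FROM mw_file_props"
        " WHERE fp_latest = 1 AND fp_key = 'author' AND fp_value_text = ?"
        " ORDER BY fp_rev_id",
        (author,),
    )
    return [row[0] for row in rows]

# Sample data (invented): one file edited twice, another file edited once.
save_props(conn, None, 100, [("author", "John Doe"), ("license", "CC-BY-SA-3.0")])
save_props(conn, 100, 101, [("author", "John Doe"), ("license", "GFDL")])
save_props(conn, None, 200, [("author", "Jane Roe")])
```

The "list by copyright holder" query never touches page or revision here, which is the operational appeal of #3; the price is the extra UPDATE on every save.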
I have the following points on my personal "requirement list" for the file_props identifier:
- There should be a link between mw_revision and mw_file_props such that if I edit a page and click "Permanent link", that page will keep showing the same file properties and wikitext. In other words, a direct link between a revision and the text.old_id and file_props that belong to it.
- We shouldn't duplicate mw_text or mw_file_props rows if nothing has changed.
I think #2 is a good choice. That will also allow queries that search only through current files and the latest version of their file properties (a JOIN between the current revision and the file_props set that belongs to it).
I agree that duplicating rows if nothing changed is not a good idea.
You asked me what I had against fp_latest, but other than "I don't like it" I have no arguments against it, so let's not take my opinion with regards to that into account.
A possible other option:
#5: Add rev_fileprops_id to mw_revision, which is a reference to fp_id. Since fp_id is equal to the id of the revision that changed the fileprops entry, there is no need for an extra index on mw_revision.
To get a list of all files that are currently by a certain author or with a certain license we'd join page_latest with revision and file_props. The advantage is that we don't need an extra "fp_latest", downside is that such a join may not be very fast depending on where the indexes are (We need mw_page to get all files and the current revision id, then from revision the current fp_id).
I think in terms of efficiency #5 is probably not going to make it, and thus a quick and fast fp_latest in mw_file_props would help. However, let's not forget that there's no direct link from a set of file_props to a page. One would still need a join to page for the titles etc. So perhaps #5 isn't bad after all.
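For comparison, the join that #5 needs can be sketched like this (Python with sqlite3, purely illustrative). The table layouts are trimmed to the columns the join uses, fp_value_text is taken from the draft schema below, and the pages and authors are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_page (
    page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER
);
CREATE TABLE mw_revision (
    rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_fileprops_id INTEGER
);
CREATE TABLE mw_file_props (
    fp_id INTEGER NOT NULL,  -- id of the revision that changed the props
    fp_key TEXT NOT NULL,
    fp_value_text TEXT
);
CREATE INDEX fp_id_idx ON mw_file_props (fp_id);

INSERT INTO mw_page VALUES (1, 'File:Example.png', 11), (2, 'File:Other.png', 21);
-- Revision 11 only changed wikitext, so it keeps pointing at props set 10.
INSERT INTO mw_revision VALUES (10, 1, 10), (11, 1, 10), (20, 2, 20), (21, 2, 21);
INSERT INTO mw_file_props VALUES
    (10, 'author', 'John Doe'),
    (20, 'author', 'John Doe'),
    (21, 'author', 'Jane Roe');
""")

def current_files_by_author(conn, author):
    """page -> current revision -> props set, without any fp_latest flag."""
    rows = conn.execute(
        "SELECT p.page_title"
        " FROM mw_page AS p"
        " JOIN mw_revision AS r ON r.rev_id = p.page_latest"
        " JOIN mw_file_props AS fp ON fp.fp_id = r.rev_fileprops_id"
        " WHERE fp.fp_key = 'author' AND fp.fp_value_text = ?"
        " ORDER BY p.page_title",
        (author,),
    )
    return [row[0] for row in rows]
```

File:Other.png was once credited to John Doe (props set 20), but its current revision points at set 21, so only File:Example.png is listed for him. Whether this three-way join is fast enough on a large wiki is exactly the open question above; the sketch says nothing about that.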
Currently the concept is to store the following per set in mw_file_props:
mw_file_props contains:
- fp_id INT (mw_revision.rev_fileprops_id is a key to this column)
- fp_key VARBINARY(255)
- fp_value_int INT
- fp_value_text VARBINARY(255)
EXAMPLE
fp_id | fp_key | fp_value_int | fp_value_text |
---|---|---|---|
1 | author | 50 (mw_user.user_id of User:Krinkle) | NULL (if empty, the username is used) |
1 | author | 43 (mw_user.user_id of User:Catrope) | Roan (wiki user who wants a display name different from the username) |
1 | author | NULL | John Doe |
1 | license | 2 (mw_license.lic_id of CC-BY-SA-3.0) | |
1 | license | 5 (mw_license.lic_id of GFDL) | |
This file has three authors: Krinkle, Catrope (attributed as Roan) and John Doe (not a wiki user). It is dual licensed.
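As a rough illustration of how such a set could be read back, here is a sketch in Python with sqlite3 that loads the example rows and resolves them to display values. The helper load_props, the column lic_name and the trimmed mw_user and mw_license tables are assumptions made for the example, not part of the draft.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_user (user_id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE mw_license (lic_id INTEGER PRIMARY KEY, lic_name TEXT);  -- lic_name is assumed
CREATE TABLE mw_file_props (
    fp_id INTEGER NOT NULL,
    fp_key TEXT NOT NULL,
    fp_value_int INTEGER,
    fp_value_text TEXT
);

INSERT INTO mw_user VALUES (50, 'Krinkle'), (43, 'Catrope');
INSERT INTO mw_license VALUES (2, 'CC-BY-SA-3.0'), (5, 'GFDL');
INSERT INTO mw_file_props VALUES
    (1, 'author', 50, NULL),
    (1, 'author', 43, 'Roan'),
    (1, 'author', NULL, 'John Doe'),
    (1, 'license', 2, NULL),
    (1, 'license', 5, NULL);
""")

def load_props(conn, fp_id):
    """Return {'author': [...], 'license': [...]} for one props set."""
    rows = conn.execute(
        "SELECT fp.fp_key,"
        "       CASE fp.fp_key"
        # An explicit display name wins; otherwise use the username.
        "           WHEN 'author' THEN COALESCE(fp.fp_value_text, u.user_name)"
        "           WHEN 'license' THEN l.lic_name"
        "       END"
        " FROM mw_file_props AS fp"
        " LEFT JOIN mw_user AS u"
        "   ON fp.fp_key = 'author' AND u.user_id = fp.fp_value_int"
        " LEFT JOIN mw_license AS l"
        "   ON fp.fp_key = 'license' AND l.lic_id = fp.fp_value_int"
        " WHERE fp.fp_id = ?"
        " ORDER BY fp.rowid",
        (fp_id,),
    )
    props = {}
    for key, value in rows:
        props.setdefault(key, []).append(value)
    return props
```

Note that every author lookup has to consider both fp_value_int and fp_value_text, which is part of why the mixed int/text columns feel awkward.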
What we totally forgot in this structure is attribution and url. So instead of adding a useless extra column like fp_string2 and fp_string3, it's best to move this away to a mw_author table.
mw_file_props contains:
- fp_id INT (mw_revision.rev_fileprops_id is a key to this column)
- fp_key VARBINARY(255)
- fp_value_int INT
mw_author contains:
- author_id PRIMARY AUTO INCREMENT
- author_user INT (mw_user.user_id is a key to this column; it's NULL if the author is not a wiki user) [1]
- author_name VARBINARY(255) (if the author is a wiki user this is NULL and we get the username from the user table, which saves space and saves changes when a username would change) [1]
- author_attribution VARBINARY(255) (if the author is a wiki user this may be NULL, in which case attribution falls back to 'user_real_name' or 'user_name'; it may be filled in if users want a custom attribution, e.g. "Team X / Krinkle". Either way this is the field that licenses will cite when requiring attribution, e.g. a photographer could hand the rights to a museum: the author is John Doe, the attribution is Museum X) [1]
- author_url VARBINARY(255) [2]
- UNIQUE(author_name, author_attribution, author_url)
[1]: The front end requires an author to be given; attribution and URL are optional. When saving, MediaWiki checks whether the author is a wiki username. If so, it stores that user's id as author_user and NULL as author_name; otherwise it saves the string as author_name and NULL as author_user.
[2]: The URL is optional, but if the author is a wiki user and no value is given here, it falls back to the user page.
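The save rule in [1] could look roughly like this. It is a plain Python sketch: build_author_row is a hypothetical helper, and the known_users dict stands in for a lookup against mw_user.

```python
def build_author_row(author, known_users, attribution=None, url=None):
    """Turn form input into the column values for one mw_author row.

    author       -- string typed into the (required) author field
    known_users  -- dict of username -> user_id, a stand-in for mw_user
    attribution  -- optional custom attribution
    url          -- optional URL
    """
    if not author:
        raise ValueError("an author is required")
    user_id = known_users.get(author)
    return {
        # Wiki user: keep only the id, so a later rename needs no update here.
        "author_user": user_id,
        # Non-user: keep the literal string instead.
        "author_name": None if user_id is not None else author,
        "author_attribution": attribution,
        "author_url": url,
    }

# Sample input matching the example rows in this thread.
users = {"Krinkle": 50, "Catrope": 43}
krinkle = build_author_row("Krinkle", users)
museum = build_author_row(
    "John Doe", users, attribution="Foobar Museum", url="http://museumfoobar.org/"
)
```

Exactly one of author_user and author_name ends up non-NULL, which is what the UNIQUE key on (author_name, author_attribution, author_url) seems to rely on for non-users.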
EXAMPLE
author_id | author_user | author_name | author_attribution | author_url |
---|---|---|---|---|
12 | 50 (mw_user.user_id of User:Krinkle) | NULL | NULL | NULL |
14 | 43 (mw_user.user_id of User:Catrope) | NULL | Roan (wiki user who wants an attribution different from the username) | NULL |
16 | NULL | John Doe | Foobar Museum | http://museumfoobar.org/ |

fp_id | fp_key | fp_value_int |
---|---|---|
1 | author | 12 (reference to row in mw_author) |
1 | author | 14 (reference to row in mw_author) |
1 | author | 16 (reference to row in mw_author) |
1 | license | 2 (mw_license.lic_id of CC-BY-SA-3.0) |
1 | license | 5 (mw_license.lic_id of GFDL) |
This file has three authors: Krinkle, Catrope (attributed as Roan) and John Doe (not a wiki user; hired by a museum, to which he hands over the rights). It is dual licensed.
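Here is a sketch of how the attribution fallback and a "files by author" lookup might work against this layout, again in Python with sqlite3. The fallback for non-users (author_name when no attribution is set) and the helper names are my assumptions; the tables are trimmed to the columns used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_user (
    user_id INTEGER PRIMARY KEY, user_name TEXT, user_real_name TEXT DEFAULT ''
);
CREATE TABLE mw_author (
    author_id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_user INTEGER,
    author_name TEXT,
    author_attribution TEXT,
    author_url TEXT,
    UNIQUE (author_name, author_attribution, author_url)
);
CREATE TABLE mw_file_props (
    fp_id INTEGER NOT NULL, fp_key TEXT NOT NULL, fp_value_int INTEGER
);

INSERT INTO mw_user (user_id, user_name) VALUES (50, 'Krinkle'), (43, 'Catrope');
INSERT INTO mw_author VALUES
    (12, 50, NULL, NULL, NULL),
    (14, 43, NULL, 'Roan', NULL),
    (16, NULL, 'John Doe', 'Foobar Museum', 'http://museumfoobar.org/');
INSERT INTO mw_file_props VALUES
    (1, 'author', 12), (1, 'author', 14), (1, 'author', 16),
    (1, 'license', 2), (1, 'license', 5);
""")

def attribution_lines(conn, fp_id):
    """Attribution per author: custom value, else real name, else username."""
    rows = conn.execute(
        "SELECT COALESCE(a.author_attribution,"
        "                NULLIF(u.user_real_name, ''),"
        "                u.user_name,"
        "                a.author_name)"  # assumed fallback for non-users
        " FROM mw_file_props AS fp"
        " JOIN mw_author AS a ON a.author_id = fp.fp_value_int"
        " LEFT JOIN mw_user AS u ON u.user_id = a.author_user"
        " WHERE fp.fp_id = ? AND fp.fp_key = 'author'"
        " ORDER BY a.author_id",
        (fp_id,),
    )
    return [row[0] for row in rows]

def props_sets_by_author(conn, author_id):
    """Integer-only lookup: which props sets reference this author row?"""
    rows = conn.execute(
        "SELECT DISTINCT fp_id FROM mw_file_props"
        " WHERE fp_key = 'author' AND fp_value_int = ?"
        " ORDER BY fp_id",
        (author_id,),
    )
    return [row[0] for row in rows]
```

The second query is the efficiency argument in a nutshell: listing by author or license only compares integers in mw_file_props.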
Roan didn't see anything bad in it, and the idea came from Bryan. If by tomorrow we don't see anything wrong with this I'd say... let's write it!
Ok with me, but it does mean creating an AuthorManager and the way to link user input on the file props edit form with the database becomes even less trivial. I don't see another option though, so let's do it this way.
An AuthorManager would imho be a feature request for later. For now the author data is stored in the mw_author table so that we don't duplicate information and have an efficient way to get lists by one or more authors or licenses (since file_props only stores integer references to mw_license or mw_author).
This also enables a future ability to edit an author once and have the change reflected on all its files. However, I don't think an AuthorManager is required for the first working proof of concept (there's no "user manager" either, although there is RenameUser). This AuthorManager would probably be an admin-only function (i.e. to merge or update authors in case something needs to be updated), not something every uploader could do. (What are the use cases? I'm open to new insights.)
We need structured data for media assets, but as you start to support different data types and relationships I always feel bad that we don't adopt a more general solution.
Additionally, moving this data out of wikitext has some consequences. To list a few:
- You need separate api entry points for bots and applications to update and read things
- You need magic non-standard variables calls to access this data within wikitext for formatting and display
- Your wiki-page dumps don't include the data for that asset
- The data is not easily integrated into template flows
- The data lacks flexibility and is tied to a given interface and 'input/output' that is not easily generalized.
To me it seems we are so far down the rabbit hole of wikitext encapsulating both data and representation that it will probably be better to concentrate on a general solution than to build lots of one-off solutions. If we do build one-off solutions, we should ideally try to keep some consistency with other structured data types (like categories).
That being said there are of course some advantages of moving things out of wikitext and there is no perfect solution. Lacking a general solution, one-off solutions are better than ~no solution ;).
If there are parser functions for getting at the data, templates should be able to do all sorts of fun things. How wise that is, I leave to others. ;)
I've got three main API-related interests:
- Foreign API repo clients need access to this information in imageinfo output so they can retrieve and replicate it.
- Upload tools need a way to query the available license options, and a way to pass the machine-friendly license & other metadata at upload time.
- Client tools may also need a way to edit the machine-readable info later.
Retrieving the info is probably pretty straightforward, but the format needs to be fleshed out and written. Pushing the data back up with an upload or edit request may be a little trickier.
Before we forget the plans from the Hackathon: from what I remember there were a few paths we considered as options, but we hadn't decided yet.
Schema ideas
There were two main concepts:
1) mw_revision to mw_file_props
- Link between revisions and file properties:
  - mw_revision gets a rev_fileprops_id which connects to mw_file_props.fp_id (just like revision.rev_text_id connects to mw_text.old_id)
- Identification of sets in mw_file_props:
  - mw_file_props has an fp_id column which is "primary" but not a unique or auto-incrementing integer, since properties are in sets but also revisioned. fp_id is a counter for every 'set' and version of each 'set' (like revision but spanning multiple rows).
- File page view (current or oldid):
  - MediaWiki looks up the correct entry in revision as usual with page.page_latest or the oldid from the URL. There it finds rev_text_id and rev_fileprops_id, and it grabs the rows with those ids from both tables.
- Search in "current" fileprops, e.g. list of works by X:
  - SELECT * FROM ?? WHERE ????? AND fp_key='author' AND fp_value_text='John Doe'
  - FIXME: the FROM and WHERE clauses are still open.
- Good:
  - Most like the other tables in core. Search is probably good too.
- Bad:
  - So far we haven't come up with a decent way to generate this counter for fp_id. That is actually the only problem; if it weren't for that, this would be the preferred schema.
  - Not sure if the search query is good or not.
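A small sketch of concept 1, in Python with sqlite3, showing the permanent-link lookup and the "no duplicate rows when nothing changed" requirement. To sidestep the counter problem it borrows the idea from option #5 (fp_id is the id of the revision that changed the props); that is one possible way out, not something decided here. Helper names and sample data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_revision (
    rev_id           INTEGER PRIMARY KEY,
    rev_text_id      INTEGER NOT NULL,  -- points at mw_text.old_id (not modelled)
    rev_fileprops_id INTEGER NOT NULL   -- points at mw_file_props.fp_id
);
CREATE TABLE mw_file_props (
    fp_id INTEGER NOT NULL, fp_key TEXT NOT NULL, fp_value_text TEXT
);
CREATE INDEX fp_id_idx ON mw_file_props (fp_id);
""")

def save_revision(conn, rev_id, text_id, new_props=None, parent_rev_id=None):
    """Store a revision; reuse the parent's props set if props did not change."""
    with conn:
        if new_props is not None:
            fp_id = rev_id  # assumption borrowed from option #5
            conn.executemany(
                "INSERT INTO mw_file_props VALUES (?, ?, ?)",
                [(fp_id, key, value) for key, value in new_props],
            )
        else:
            row = conn.execute(
                "SELECT rev_fileprops_id FROM mw_revision WHERE rev_id = ?",
                (parent_rev_id,),
            ).fetchone()
            if row is None:
                raise ValueError("text-only edit needs an existing parent revision")
            fp_id = row[0]
        conn.execute(
            "INSERT INTO mw_revision VALUES (?, ?, ?)", (rev_id, text_id, fp_id)
        )

def props_for_revision(conn, rev_id):
    """What a permanent link (oldid=rev_id) would show as file properties."""
    return conn.execute(
        "SELECT fp.fp_key, fp.fp_value_text"
        " FROM mw_revision AS r"
        " JOIN mw_file_props AS fp ON fp.fp_id = r.rev_fileprops_id"
        " WHERE r.rev_id = ?"
        " ORDER BY fp.rowid",
        (rev_id,),
    ).fetchall()

save_revision(conn, 10, 1, new_props=[("author", "John Doe"), ("license", "GFDL")])
save_revision(conn, 11, 2, parent_rev_id=10)  # wikitext-only edit
save_revision(conn, 12, 2, new_props=[("author", "John Doe"), ("license", "CC-BY-SA-3.0")])
```

Revision 11 shares props set 10, so a wikitext-only edit adds no rows to mw_file_props, and old permanent links keep resolving to the set they were saved with.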
2) mw_file_props to mw_revision
- Link between revisions and file properties:
  - mw_file_props has an fp_rev_id that is identical to the rev_id in mw_revision
- Identification of sets in mw_file_props:
  - mw_file_props has an fp_rev_id column (see previous point).
- File page view (current or oldid):
  - MediaWiki looks up the correct entry in revision as usual with page.page_latest or the oldid from the URL. There it finds rev_text_id but no id for file_props, since that link is reversed. It looks up the rows in file_props where fp_rev_id equals rev_id.
- Search in "current" fileprops, e.g. list of works by X:
  - SELECT * FROM ?? WHERE ????? AND fp_key='author' AND fp_value_text='John Doe'
  - FIXME: the FROM and WHERE clauses are still open.
- Good:
  - No schema change to mw_revision required
  - Sets are identified by the rev_id at which they were created. This allows an easy "File_props last edited by X on Y", separate from the last modification of text.old_id, by checking the rev_user of the rev_id in fp_rev_id.
- Bad:
  - If a revision changes the wikitext but not the file properties, then to keep the link between 'revision' and 'file_props' alive we have to either:
    - A) Update the current set of rows in file_props to the new fp_rev_id.
      - Means that fp_rev_id will no longer indicate when the set was last changed. Not a big deal.
      - Means that the set rows are touched even though nothing is being changed (bot edits will cause changes in file properties when only wikitext is changed)
    - B) Duplicate the rows in file_props
      - Dupes are never good.
      - Means that fp_rev_id will no longer indicate when the set was last changed. Not a big deal.
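To show what variant B of concept 2 costs, here is a minimal Python/sqlite3 sketch. The helper names and sample data are made up, and fp_value_text is kept from the earlier draft for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mw_file_props (
    fp_rev_id     INTEGER NOT NULL,  -- identical to mw_revision.rev_id
    fp_key        TEXT NOT NULL,
    fp_value_text TEXT
);
CREATE INDEX fp_rev_id_idx ON mw_file_props (fp_rev_id);
""")

def save_props(conn, rev_id, new_props=None, parent_rev_id=None):
    """Variant B: a wikitext-only edit copies the parent's rows to the new id."""
    with conn:
        if new_props is not None:
            conn.executemany(
                "INSERT INTO mw_file_props VALUES (?, ?, ?)",
                [(rev_id, key, value) for key, value in new_props],
            )
        else:
            conn.execute(
                "INSERT INTO mw_file_props (fp_rev_id, fp_key, fp_value_text)"
                " SELECT ?, fp_key, fp_value_text FROM mw_file_props"
                " WHERE fp_rev_id = ? ORDER BY rowid",
                (rev_id, parent_rev_id),
            )

def props_for_revision(conn, rev_id):
    """Reverse lookup: file_props rows whose fp_rev_id equals the revision id."""
    return conn.execute(
        "SELECT fp_key, fp_value_text FROM mw_file_props"
        " WHERE fp_rev_id = ? ORDER BY rowid",
        (rev_id,),
    ).fetchall()

def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM mw_file_props").fetchone()[0]

save_props(conn, 10, new_props=[("author", "John Doe"), ("license", "GFDL")])
save_props(conn, 11, parent_rev_id=10)  # e.g. a bot edit that only touches wikitext
```

After one wikitext-only edit the table already holds four rows for two distinct properties, and fp_rev_id 11 no longer tells you that the props were last changed in revision 10.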
- Krinkle 01:12, 16 January 2011 (UTC)