Extension talk:WikiGrok/Claim suggestions
Discuss stuff here :)
Is it possible to reliably extract date suggestions from articles?
tl;dr
At worst, we can semi-reliably extract a year (of birth/death for people, of formation for bands, and of release for albums) from categories, infoboxes, persondata, and lead sentences, in order of increasing complexity. At best, we can extract the full date from infoboxes, persondata, and lead sentences.
Infoboxes and persondata
Both infoboxes and persondata use templates that typically have dates conforming to MOS:YEAR. Extracting dates from infoboxes and persondata then involves some "parsing" of the article's source, probably with a regular expression or two.
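As a rough sketch of the "regular expression or two" idea, the snippet below pulls year/month/day out of `{{birth date|…}}`-style templates found in infobox wikitext. The function name `extract_template_dates` and the exact set of template spellings handled are assumptions for illustration; free-text persondata fields such as `DATE OF BIRTH = 14 March 1879` are not covered here.

```python
import re

# Matches e.g. {{birth date|1879|3|14}}, {{Birth date and age|df=yes|1950|1|2}},
# {{death date and age|1955|4|18|1879|3|14}} (the first three numbers are used).
DATE_TEMPLATE = re.compile(
    r"\{\{\s*(birth|death)[ _]date(?:[ _]and[ _]age)?\s*\|"
    r"(?:\s*[dm]f\s*=\s*\w+\s*\|)*"          # optional leading df=/mf= flags
    r"\s*(\d{1,4})\s*"                        # year
    r"(?:\|\s*(\d{1,2})\s*)?"                 # optional month
    r"(?:\|\s*(\d{1,2})\s*)?",                # optional day
    re.IGNORECASE,
)


def extract_template_dates(wikitext):
    """Return {'birth': (y, m, d), 'death': (y, m, d)}; m and d may be None."""
    found = {}
    for match in DATE_TEMPLATE.finditer(wikitext):
        kind = match.group(1).lower()
        year, month, day = (int(g) if g else None for g in match.group(2, 3, 4))
        # Keep the first occurrence of each kind; later ones are ignored.
        found.setdefault(kind, (year, month, day))
    return found
```

Only the first birth and the first death template are kept, on the assumption that the infobox appears before any other use of these templates in the article source.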
Categories
Birth and death categories are a good source of birth/death year suggestions. There are well-formed categories for births and deaths in most years from the 17th century BC, e.g. Category:1646_BC_deaths, to the 21st century, e.g. Category:2014_births.
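A minimal sketch of how such category titles could be turned into year suggestions follows. The function name `year_from_category` and the convention of returning BC years as negative integers are assumptions made for this example.

```python
import re

# Matches "Category:2014_births", "Category:1646_BC_deaths", with "_" or " ".
CATEGORY_YEAR = re.compile(
    r"^(?:Category:)?(\d{1,4})(?:[ _](BC))?[ _](births|deaths)$"
)


def year_from_category(title):
    """Return ('birth' | 'death', year) or None if the title does not match.

    BC years are returned as negative numbers (a convention of this sketch).
    """
    match = CATEGORY_YEAR.match(title)
    if match is None:
        return None
    year = int(match.group(1))
    if match.group(2):
        year = -year
    kind = match.group(3)[:-1]  # 'births' -> 'birth', 'deaths' -> 'death'
    return kind, year
```

Decade and century categories such as `Category:1990s_births` deliberately do not match, since they cannot supply a single year.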
Lead sentences
There isn't any requirement that birth and death dates in lead sentences follow a format or appear in a specific position. Biographical articles tend to have the dates following the subject's name, possibly separated by an em dash. Like infoboxes and persondata, extracting dates from lead sentences will require a handful of regular expressions that match a variety of date formats and positions. This shouldn't be a huge performance hit because we're only ever trying to match one sentence.
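To make the "handful of regular expressions" concrete, here is a sketch that looks at the first parenthetical in a lead sentence and tries three date shapes in turn: day-month-year, month-day-year, and a bare year. The names `parse_date` and `dates_from_lead` are hypothetical, and the sketch assumes plain text (wikitext markup already stripped) with English month names.

```python
import re

MONTHS = ("January February March April May June July "
          "August September October November December").split()
MONTH_RE = "|".join(MONTHS)

# Tried in order; the most specific shapes come first.
DATE_PATTERNS = [
    re.compile(rf"(?P<day>\d{{1,2}})\s+(?P<month>{MONTH_RE})\s+(?P<year>\d{{3,4}})"),
    re.compile(rf"(?P<month>{MONTH_RE})\s+(?P<day>\d{{1,2}}),\s*(?P<year>\d{{3,4}})"),
    re.compile(r"(?P<year>\d{3,4})"),
]


def parse_date(text):
    """Return (year, month, day) with None for missing parts, or None."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            parts = match.groupdict()
            month, day = parts.get("month"), parts.get("day")
            return (int(parts["year"]),
                    MONTHS.index(month) + 1 if month else None,
                    int(day) if day else None)
    return None


def dates_from_lead(sentence):
    """Return (birth, death) taken from the first parenthetical, if any."""
    paren = re.search(r"\(([^()]*)\)", sentence)
    if not paren:
        return None, None
    # Split on an en/em dash, or on a hyphen surrounded by spaces.
    halves = re.split(r"\s*[–—]\s*|\s+-\s+", paren.group(1), maxsplit=1)
    birth = parse_date(halves[0])
    death = parse_date(halves[1]) if len(halves) > 1 else None
    return birth, death
```

For example, `dates_from_lead("Albert Einstein (14 March 1879 – 18 April 1955) was a theoretical physicist.")` yields `((1879, 3, 14), (1955, 4, 18))`. Parentheticals that begin with something other than dates (pronunciations, native-script names) would need extra handling.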
Generating a corpus of suggestions
There are three approaches to ensuring that WikiGrok has a good corpus of suggestions, which can all share the same extraction code:
1. Bulk extract suggestions from articles for eligible items
2. Extract suggestions when the article is saved by listening to the PageContentSaveComplete or CategoryAfterPageAdded hooks
3. Extract suggestions when suggestions are requested during a WikiGrok game
Approaches 2 and 3 are complementary. If we only used approach 2, then it'd take a non-trivial amount of time to generate a good corpus of suggestions.
Cheap suggestions
We currently have one "cheap claim suggestion" (one that only requires looking at the existing Wikidata claims for that item): Extension:MobileFrontend/WikiGrok/Claim suggestions#Also an author?. Here are some other ideas:
Country of origin is US, but original language not set as English: 14549 pages
http://wdq.wmflabs.org/api?q=claim%5B495:30%5D%20AND%20noclaim%5B364:1860%5D
Country of origin is France, but original language is not set to French: 13407 pages
http://wdq.wmflabs.org/api?q=claim%5B495:142%5D%20AND%20noclaim%5B364:150%5D
Country of origin is UK, but original language not set as English: 4998 pages
http://wdq.wmflabs.org/api?q=claim%5B495:145%5D%20AND%20noclaim%5B364:1860%5D
Country of origin is Germany, but original language is not set to German: 854 pages
http://wdq.wmflabs.org/api?q=claim%5B495:183%5D%20AND%20noclaim%5B364:188%5D
Politician (P106:Q82955), country of citizenship US (P27:Q30), no member of political party set (P102) [We can ask Democrat or Republican randomly]: ~5800 pages
If someone is a painter we can ask which genres they painted in (5 choices: history painting, portraits, genre painting, landscapes, still life). ~13661 pages
- A few other cheap claim suggestions:
- Adding language to an episode: item is an episode (Q1983062) but does not have language (P407) added. Suggest some (or all) languages the article is available in.: 11216 pages
- http://wdq.wmflabs.org/api?q=claim[31:1983062]%20AND%20noclaim[407]
- Adding original language to a book: item is some sort of book (Q571) but does not have original language of this work (P364) added. Suggest some (or all) languages the article is available in.: 45591 pages
- http://wdq.wmflabs.org/api?q=claim[31:(tree[571][][279])]%20AND%20noclaim[364]
- Adding original language to a creative work: item is some sort of work (Q386724), has a country of origin (P495) but does not have original language of this work (P364) added. A mapping table of languages to suggest for each country of origin needs to be created: 51936 pages
- http://wdq.wmflabs.org/api?q=claim[31:(tree[386724][][279])]%20AND%20claim[495]%20and%20noclaim[364]
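As an illustration of the country-to-language mapping table mentioned above, the sketch below holds only the four pairs already listed in this section and builds the matching WDQ query URL for each. The names `COUNTRY_TO_LANGUAGE`, `missing_language_query`, and `wdq_url` are made up for this example, and a real table would need many more entries (and probably several candidate languages per country).

```python
from urllib.parse import quote

WDQ_API = "http://wdq.wmflabs.org/api"

# Country of origin (P495) item id -> original language (P364) item id.
# Only the pairs listed in this section are included.
COUNTRY_TO_LANGUAGE = {
    30: 1860,   # US -> English
    142: 150,   # France -> French
    145: 1860,  # UK -> English
    183: 188,   # Germany -> German
}


def missing_language_query(country, language):
    """WDQ query: country of origin is set, but this original language is not."""
    return f"claim[495:{country}] AND noclaim[364:{language}]"


def wdq_url(query):
    """Percent-encode the query, keeping ':' literal as in the URLs above."""
    return f"{WDQ_API}?q={quote(query, safe=':')}"


if __name__ == "__main__":
    for country, language in COUNTRY_TO_LANGUAGE.items():
        print(wdq_url(missing_language_query(country, language)))
```

The same table could be read in the other direction at suggestion time: given an item's P495 value, look up the language to offer as the P364 suggestion.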
Expensive suggestions
For expensive claims, we will need to pre-generate data and then update that data as needed. One strategy for doing that would be as follows:
Create a maintenance script that runs through all expensive campaigns for all articles. This could potentially be run once a month as a job for the job queue. (It could also be assigned to its own special queue similar to Parsoid jobs.) To prevent us from asking questions that have already been answered, we could run a check on links update (which is triggered when an associated Wikidata item is updated). The check would use the Simple Query data to see if the item already had the appropriate properties set for any of the campaigns. If so, it would remove the entries from the wikigrok_questions table.
We may also want to limit the number of page links that are examined for LinkedProp Campaigns. For example, only looking at the first 10 or 20 links. Kaldari (talk) 00:01, 18 November 2014 (UTC)