User:OrenBochman/Search/Analytics

Analytics for Search

Search engine queries provide a direct view into the wants and needs of Wikipedia's readers. A second view comes from the edits made by editors. However, edits may be spread over time and rarely provide a succinct expression of such needs. It is therefore at the point of search that vital questions can be answered.

For this project, analytics can be broken into two areas:

  • Improving search experience.
  • Exposing interesting and useful information on MediaWiki wikis that would not be available elsewhere.
    • This information could be wrapped into a service and integrated into the wiki's API or delivered as an extension.

Improving Search

  • Search click-through rate & zero results -- How often does search fail to meet the user's needs?
  • Top queries -- What are the most commonly requested articles?
  • Slow queries -- Which queries take the longest to answer?
  • User click ranking -- How often is the top result clicked? How deep into the results do users browse?
  • Top facets -- When navigating via facets, which facets are used most? (This prioritizes facet precalculation.)
  • Final destination -- How do search results correlate with the articles browsed in a user's click stream?
  • Disambiguation bounce -- Do people like disambiguation pages, or do they prefer search to match the most relevant page?
  • Query profiling -- Can queries be profiled and clustered to provide more intelligent responses?
  • Personalization -- How can a user's search and browsing history be used to serve them more satisfying results?
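Several of the questions above reduce to simple aggregates over a search log. The following is a minimal sketch in Python, assuming a hypothetical log record (SearchEvent) with the query text, the number of results returned, and the 1-based rank of the clicked result, if any. None of these names come from an existing MediaWiki log format; they are placeholders for whatever the real logging provides.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    """One logged search. This record layout is hypothetical."""
    query: str
    result_count: int
    clicked_rank: Optional[int] = None  # 1-based rank, None if nothing was clicked


def summarize(events, top_n=3):
    """Compute basic search-quality aggregates from a list of SearchEvent."""
    total = len(events)
    if total == 0:
        raise ValueError("no search events to summarize")
    zero_results = sum(1 for e in events if e.result_count == 0)
    clicks = [e.clicked_rank for e in events if e.clicked_rank is not None]
    # Normalize case and whitespace so "Paris" and "paris " count as one query.
    queries = Counter(e.query.strip().lower() for e in events)
    return {
        "zero_result_rate": zero_results / total,
        "click_through_rate": len(clicks) / total,
        # Share of clicks that landed on the first result.
        "top_result_share": clicks.count(1) / len(clicks) if clicks else 0.0,
        "top_queries": queries.most_common(top_n),
        # How deep users click: rank -> number of clicks at that rank.
        "click_rank_histogram": dict(sorted(Counter(clicks).items())),
    }


# Made-up sample data, for illustration only.
events = [
    SearchEvent("Paris", 120, 1),
    SearchEvent("paris", 120, 3),
    SearchEvent("qwzx", 0),
    SearchEvent("Mercury", 40),
]
summary = summarize(events)
for name, value in summary.items():
    print(name, value)
```

Slow queries, facet usage, and click-stream correlation would need extra fields (response time, facet identifiers, session identifiers) that this sketch leaves out.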

Content Analytics

The gist of these analytic tools is to model certain aspects of a wiki, identify issues, and provide insight into resolving them.

  • Content Signatures
    • Does a given page have a fingerprint of unique phrases that differentiates it from all others?
    • Failing that, is there a weaker fingerprint that aggregates its most distinctive phrases (the rarest shared phrases according to a scaling law)?
    • Does a given category have such a signature of unique phrases that differentiates it from all others?
  • Information Content
    • How notable is a given page? Objectively, how much new information does it contribute?
    • Do large collections of pages duplicate 75% of their content? (How about 95%?)
    • How does a given article compare with a gold standard of the top-rated articles?
    • Based on a historically created collection, which articles would currently be suggested for an offline educational wiki?
      • Are they long, short, interesting, boring?
  • The Missing Link
    • How useful (relevant, interesting, arbitrary) are the links in a given article?
    • What would be the best replacements for the worst links (normalized for search queries and browsing histories, based on interwiki convergence)?
    • External Links
      • Are external links accessible to general readers (vs. paid registration)?
      • Are they stable over time?
      • Are they from reliable sources (Alexa/Certificates/External Site Directory)?
  • Writing Style
    • Is a paragraph well structured semantically? To what extent do the first and last sentences summarize the paragraph's body?
  • The Element Of Time
    • Hype -- Which articles remain notable over time? Which articles about upcoming games or movies are mostly hype?
  • Digital Rights
    • Are the most-read articles plagiarized on other sites without attribution?
    • If we change articles in ways imperceptible to human readers, how will these changes propagate into the plagiarized copies?
  • Vandalism
    • Which features make an edit most likely legitimate, and with what degree of confidence?
    • Which features make an edit most likely illegitimate, and with what degree of confidence?
    • Based on a time regression model, what is the expected interval until a given page is vandalized?
    • Can the model provide advice to reduce this time?
  • Human Element
    • Editors
      • Which authors are most reliable?
      • To what extent does this hold over time and across subjects?
    • Admins
      • How effective are users/admins at alienating new users?
      • To what extent are votes disruptive to a wiki's day-to-day business?
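
The content-signature and duplication questions above can be approached with word shingles (overlapping n-word phrases). The following is a minimal sketch in Python, assuming plain-text page bodies held in a dictionary. The function names, the shingle length of 3, and the sample corpus are all illustrative assumptions, not part of any existing tool, and exact set arithmetic like this would not scale to a full wiki without hashing or sampling.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word phrases in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a, b):
    """Fraction of the shingles in a that also occur in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def fingerprint(title, corpus, n=3):
    """Shingles of corpus[title] that appear in no other page of the corpus."""
    own = shingles(corpus[title], n)
    others = set()
    for other_title, text in corpus.items():
        if other_title != title:
            others |= shingles(text, n)
    return own - others


# Made-up sample corpus, for illustration only.
corpus = {
    "A": "the quick brown fox jumps over the lazy dog",
    "B": "the quick brown fox jumps over the sleeping cat",
    "C": "an unrelated sentence about search analytics",
}

overlap = containment(shingles(corpus["A"]), shingles(corpus["B"]))
print(f"A duplicates {overlap:.0%} of its phrases from B")
print("duplicate at 75% threshold:", overlap >= 0.75)
print("fingerprint of A:", sorted(fingerprint("A", corpus)))
```

A page whose fingerprint is empty has no phrase of its own at this shingle length, which is one possible reading of "contributes no new information". Category signatures could be sketched the same way by concatenating the pages of each category before shingling.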