Wikimedia Research/Showcase/Archive/2020/03

March 2020

Theme: Topic modeling

March 18, 2020 Video: YouTube

Big Data Analysis with Topic Models: Evaluation, Interaction, and Multilingual Extensions

By Jordan Boyd-Graber, University of Maryland

A common information need is to understand large, unstructured datasets: millions of e-mails during e-discovery, a decade's worth of science correspondence, or a day's tweets. In the last decade, topic models have become a common tool for navigating such datasets, even across languages. This talk investigates the foundational research that enables successful tools for these data exploration tasks: how to know when you have an effective model of the dataset; how to correct bad models; how to measure topic model effectiveness; and how to detect framing and spin using these techniques. After introducing topic models, I argue that traditional measures of topic model quality, borrowed from machine learning, are inconsistent with how topic models are actually used. In response, I describe interactive topic modeling, a technique that enables users to impart their insights and preferences to models in a principled, interactive way. I will then address measuring topic model effectiveness in real-world tasks.

Topic Classification for Wikipedia

By Isaac Johnson, Wikimedia Foundation

This talk will provide a high-level overview of how the Wikimedia Foundation is approaching the challenges of topic classification and topic modeling for Wikipedia. It will cover why the ability to model topics matters to Wikipedia readers and editors, describe some of the existing technologies (ORES articletopic API; Wikidata-based topic API), and outline future work in this space. (Presentation slides)