Core Platform Team/Initiatives/Hash Checking

Initiative Vision

Vision:
  • effective detection and takedown of terrorism and child protection multimedia content across the Wikimedia ecosystem
Target Group(s):
  • members of the Wikimedia movement
  • members of the Trust and Safety team
Needs:
  • prevent the exposure to and exploitation of terrorism and child protection multimedia content across the Wikimedia ecosystem
Product:
  • automated capability to detect terrorism and child protection multimedia content on upload and during scan
  • notification to Trust and Safety of detection of terrorism and child protection multimedia content
  • automated takedown of child protection multimedia content
Aligned Goals:
  • Build the capabilities and capacity to effectively address harassment challenges in our online communities (Anti-Harassment Program MTP-Y1). (David Rochford)

Initiative Description

Summary

The program aims to prevent exposure to, and exploitation of, terrorism and child protection multimedia content. It will do so by developing and deploying hash checking functionality that automates the detection and takedown of such content across the Wikimedia ecosystem.

Significance and Motivation

Reacting to societal and regulatory pressures, big for-profit platforms have built two shared open hash corpora to facilitate more effective platform-wide and cross-platform policing of the most problematic types of content: child protection and terrorism. These tools have been the industry baseline since Microsoft released PhotoDNA for child protection in 2014; it is now also used by Adobe, Facebook, Google, Twitter, and the National Center for Missing & Exploited Children. Building on that system, in late 2016 Facebook, Microsoft, Twitter, and YouTube expanded the shared ISP hash corpus to tackle terrorism challenges.

Outcomes
  • automated capability to detect terrorism and child protection multimedia content on upload and during scan
  • notification to Trust and Safety of detection of terrorism and child protection multimedia content
  • automated takedown of child protection multimedia content
Baseline Metrics

Detection and takedown of terrorism and child protection multimedia content is currently a purely manual process.

Target Metrics

Detection and takedown of terrorism and child protection multimedia content is automated.

Stakeholders
  • members of the Wikimedia movement
  • members of the Trust and Safety team
Known Dependencies/Blockers

None given

Epics, User Stories, and Requirements

Epic 1: Minimum Viable Product

Personas in this epic
  • Reader: a reader of Wikimedia content
  • T&S team member: a member of the Trust and Safety team responsible for assessment of child protection and terrorism concerns
  • Developer: a MediaWiki core or extension developer
User Stories
ID | User Story | Priority | Notes
1 | As a Reader of Wikimedia content, I want to limit my exposure to child protection or terrorism content. | Must Have |
2 | As a T&S team member, I want to be notified by email when multimedia files containing child protection or terrorism content are uploaded to Wikimedia wikis. | Must Have |
3 | As a T&S team member, I want to be notified by email when periodic scans of previously uploaded multimedia content detect files containing child protection or terrorism content. | Optional |
4 | As a T&S team member, I want child protection multimedia content to be removed automatically before I am forced to view it. | Must Have | Terrorism content will not trigger an automatic takedown. A configurable flag should determine whether the tool automatically takes down content or only flags it, so that both modes can be tested.
5 | As a Developer, I want to be able to request asynchronous analysis of multimedia files to determine whether they contain child protection or terrorism content. | Must Have | An extension would be created to process analysis requests using the job queue. The extension would use PhotoDNA to compare the files with hashes in the industry-wide databases for child protection and terrorism content. The extension would pull directly from the hash corpus using the API, if feasible.
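
Taken together, user stories 2, 4, and 5 imply a small decision rule: every match notifies Trust and Safety, only child protection matches are eligible for automatic takedown, and a flag switches between takedown and flag-only behavior. The following Python sketch illustrates that rule behind a simple in-process queue. It is an illustration under stated assumptions, not a design for the extension: the names AnalysisJob, decide_actions, process_queue, and auto_takedown_enabled are hypothetical, the hash lookup is a caller-supplied stub rather than a real PhotoDNA call, and an actual extension would presumably use the MediaWiki job queue instead of queue.Queue.

```python
from dataclasses import dataclass
from enum import Enum
from queue import Queue
from typing import Callable, Dict, List, Optional


class Category(Enum):
    CHILD_PROTECTION = "child_protection"
    TERRORISM = "terrorism"


class Action(Enum):
    NOTIFY_TS = "notify_trust_and_safety"
    TAKEDOWN = "takedown"


@dataclass(frozen=True)
class AnalysisJob:
    """Hypothetical queued request to analyze one uploaded file."""
    file_name: str


def decide_actions(match: Optional[Category],
                   auto_takedown_enabled: bool) -> List[Action]:
    """Map a hash-match result to the actions described in the user stories."""
    if match is None:
        return []  # no match: nothing to do
    actions = [Action.NOTIFY_TS]  # stories 2 and 3: always notify T&S
    # Story 4: only child protection content is taken down automatically,
    # and only when the (hypothetical) flag enables takedown mode.
    if match is Category.CHILD_PROTECTION and auto_takedown_enabled:
        actions.append(Action.TAKEDOWN)
    return actions


def process_queue(jobs: "Queue[AnalysisJob]",
                  lookup: Callable[[str], Optional[Category]],
                  auto_takedown_enabled: bool) -> Dict[str, List[Action]]:
    """Drain the queue, calling the stubbed hash lookup for each job."""
    results: Dict[str, List[Action]] = {}
    while not jobs.empty():
        job = jobs.get()
        results[job.file_name] = decide_actions(
            lookup(job.file_name), auto_takedown_enabled
        )
    return results
```

Running the same queue once with auto_takedown_enabled set to True and once with it set to False is one way to compare the two modes that the note on story 4 asks to be testable.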

Open Questions

  1. The PhotoDNA documentation mentions a maximum rate limit of 5 requests per second. Is that of concern?
  2. The PhotoDNA documentation also mentions a PhotoDNA high volume tier for customers sending more than 10 million transactions per month. Is that a tier that you feel we will eventually fall into and, if so, has anybody explored this option?
  3. The current plan is to queue a request for hash checking an image after the image has been uploaded. Images flagged as of concern for child protection would be deleted.
    • Is it of concern that these images could then be undeleted by a user?
    • If the latency of hash checking turns out to be minimal (which it may not be), would it be preferable to prevent the images from being uploaded at all? That is, rather than upload then delete, the hash check would run synchronously during upload, potentially preventing the image from ever being stored. Performance is unlikely to be sufficient for this, but we are trying to assess what the optimal behavior would be.
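
To put questions 1 and 2 in proportion, the two figures quoted from the PhotoDNA documentation can be related with simple arithmetic. Assuming a 30-day month (an assumption made here for illustration), a sustained 5 requests per second allows at most 12,960,000 requests per month, and 10 million requests per month corresponds to an average of roughly 3.86 requests per second. The Python sketch below shows that arithmetic together with a minimal client-side throttle that spaces requests evenly. The names max_monthly_requests, average_rate, and Throttle are hypothetical, and nothing here calls the PhotoDNA service.

```python
import time
from typing import Callable, Optional

SECONDS_PER_DAY = 86_400


def max_monthly_requests(rate_per_second: int, days: int = 30) -> int:
    """Upper bound on requests per month at a sustained per-second rate."""
    return rate_per_second * SECONDS_PER_DAY * days


def average_rate(requests_per_month: int, days: int = 30) -> float:
    """Average requests per second implied by a monthly volume."""
    return requests_per_month / (SECONDS_PER_DAY * days)


class Throttle:
    """Spaces calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request may be sent; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._interval
        return delay
```

A job-queue worker could call Throttle(5).wait() before each hash check. Whether 5 requests per second is enough depends on upload volume and on the periodic rescans described in user story 3, which this sketch does not estimate.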

Documentation Links

Phabricator

T245595

Plans/RFCs

None given

Other Documents

None given
