ORES/RCFilters

Vocabulary

  • Model: software that predicts a certain attribute of an edit or page. For RCFilters, we use the damaging and goodfaith models, which predict how likely an edit is to be damaging or made in good faith.
  • Outcome: possible values for the attribute that the model predicts. For damaging and goodfaith, the only outcomes are true (is damaging / in good faith) and false (is not damaging / not in good faith). Some other models have more than two outcomes (e.g. articlequality), but RCFilters only uses true/false models right now.
  • Score: a number between 0 and 1 returned by a model. The higher the score, the higher the likelihood of a true outcome. But this does not necessarily correspond to a probability! If edit A has a damaging score of 0.9 and edit B has a damaging score of 0.3, that means the model thinks A is more likely to be damaging than B, but it doesn't mean that there's a 90% chance that A is damaging. It doesn't even have to mean that A is more likely to be damaging than not damaging.
  • Filter: a feature in RCFilters that lets you display only those edits that match certain criteria. The ORES integration in RCFilters typically provides the following filters:
    • For damaging ("contribution quality"): very likely good (likelygood), may have problems (maybebad), likely have problems (likelybad), very likely have problems (verylikelybad)
    • For goodfaith ("user intent"): very likely good faith (likelygood), may be bad faith (maybebad), likely bad faith (likelybad), very likely bad faith (verylikelybad)
    • Note that for goodfaith, the likelygood filter looks for a true outcome (high scores) and the *bad filters look for a false outcome (low scores), but for damaging this is reversed (because there, true outcomes are "bad")
  • Threshold: A cutoff value for scores. Filters are implemented as score >= T (when looking for true outcomes) or score <= 1-T (when looking for false outcomes), and the number T is called the threshold.
    • Note how thresholds are reversed for false outcomes: when the ORES API reports a 0.123 threshold for a false outcome, that means score <= 0.877. This is a bit confusing, but this definition has advantages, like higher thresholds always corresponding to narrower score ranges. Another way you can think of it is that the false model is a mirror image of the true model, with falseScore = 1 - trueScore, and you're working with thresholds on falseScore (since trueScore <= 1-T is equivalent to falseScore >= T).
  • Filter rate: The percentage of edits, out of all edits, that are not returned by the filter.
  • Precision: The expected percentage of results of a filter that truly match the filter. For example, when we say "the precision of the likelygood filter is 95%", that means that we expect 95% of the edits returned by the likelygood filter to actually be good (and the other 5% to be false positives).
  • Recall: The expected percentage of the truly matching population that is returned by a filter. For example, when we say "the recall of the likelybad filter is 30%", that means that of all the bad edits that exist, 30% are found by the likelybad filter.
  • Precision/recall at threshold: When we say "the precision at threshold 0.687 is 60%", that means that we expect 60% of the edits with score >= 0.687 to be true positives (and 40% to be false positives). When we say "the recall at threshold 0.123 is 80%", that means we expect 80% of all edits that are truly damaging/goodfaith to have scores above 0.123. (For a toy worked example of these definitions, see the sketch right after this list.)
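
To make these definitions concrete, here is a toy worked example in Python. The scores, labels, and helper names are invented for illustration; they do not come from any real model.

# Toy worked example of the definitions above, with made-up scores and labels.
# Each pair is (model score, whether the edit is actually damaging).
sample = [(0.95, True), (0.80, True), (0.70, False), (0.40, False),
          (0.30, True), (0.10, False), (0.05, False), (0.02, False)]

def matches(score, threshold, outcome):
    # A filter looking for the true outcome at threshold T matches score >= T;
    # a filter looking for the false outcome matches score <= 1 - T.
    return score >= threshold if outcome else score <= 1 - threshold

def precision_recall(threshold, outcome=True):
    returned = [label for score, label in sample if matches(score, threshold, outcome)]
    relevant = [label for score, label in sample if label == outcome]
    true_positives = sum(1 for label in returned if label == outcome)
    precision = true_positives / len(returned) if returned else None
    recall = true_positives / len(relevant) if relevant else None
    return precision, recall

# At threshold 0.75 for the true outcome, this toy filter returns two edits,
# both damaging, so precision is 1.0; it finds 2 of the 3 damaging edits,
# so recall is about 0.67.
print(precision_recall(0.75, outcome=True))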

More about precision and recall

For more about the definitions of precision and recall, see the Wikipedia article on the subject.

Precision and recall trade off against each other: at lower thresholds (wider score ranges), recall will be high but precision will be low, and as the threshold is increased (the score range is narrowed), precision will increase but recall will decrease. At the extremes, at threshold 0 (score >= 0, so all scores) recall is by definition 100% but precision will be low, and at threshold 1 (only edits with the highest possible score) precision will generally be 100% but recall will be low. The increase in precision isn't strictly monotonic, though: a small increase in the threshold can cause a decrease in precision (recall, by contrast, can only go down as the threshold goes up). The ORES UI tool lets you graph the precision/recall curve for a model (with precision and recall on the Y axis, and threshold on the X axis) by selecting "graph threshold statistics".
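
The curve is just the per-threshold precision/recall computation swept over a grid of thresholds. Below is a minimal sketch of that computation; the ORES UI works from the model's labeled test data, whereas the scores and labels here are the same made-up sample as in the earlier sketch.

# Rough sketch: precision/recall at a grid of thresholds, from parallel lists
# of scores and boolean labels (made-up data, for illustration only).
def pr_curve(scores, labels, steps=101):
    points = []
    positives = sum(labels)
    for i in range(steps):
        t = i / (steps - 1)
        returned = [label for score, label in zip(scores, labels) if score >= t]
        if not returned or positives == 0:
            continue  # precision or recall is undefined here
        precision = sum(returned) / len(returned)
        recall = sum(returned) / positives
        points.append((t, precision, recall))
    return points

scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10, 0.05, 0.02]
labels = [True, True, False, False, True, False, False, False]
for t, precision, recall in pr_curve(scores, labels, steps=5):
    print(t, precision, recall)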

When we say that precision or recall is "low", that's relative. In the damaging model, for instance, only a small portion of edits (say 4%) are damaging, and the overwhelming majority are non-damaging. That means that for damaging=true the precision at threshold 0 is 4%, but for damaging=false the precision at threshold 0 is 96%(!). For the false outcome, increasing the threshold therefore increases the precision somewhat, but not by much (it doesn't have much room to grow), and this is why the likelygood filters tend to have such high precision requirements (99.5% is standard, and 99.7% or 99.8% are sometimes used).

Getting precision and recall data from the ORES API

The ORES API lets you request statistics (including precision and recall) for every threshold; this is the data behind precision/recall graphs like the ones described above. The API also lets you ask for the best threshold that meets a certain minimum precision or recall. Requesting statistics.thresholds.true."maximum recall @ precision >= 0.6" gets us a threshold where the precision is near 60%[1]; the response tells us that this is at threshold 0.754, where the precision is 60.1% and the recall is 18.5%:

{
  "itwiki": {
    "models": {
      "damaging": {
        "statistics": {
          "thresholds": {
            "true": [
              {
                "!f1": 0.981,
                "!precision": 0.968,
                "!recall": 0.995,
                "accuracy": 0.964,
                "f1": 0.282,
                "filter_rate": 0.988,
                "fpr": 0.005,
                "match_rate": 0.012,
                "precision": 0.601,
                "recall": 0.185,
                "threshold": 0.754
              }
            ]
          }
        }
      }
    }
  }
}

Similarly, we can request statistics.thresholds.true."maximum filter_rate @ recall >= 0.9" and find the best threshold with a recall of at least 90%. Note that these queries may return null if the requested precision or recall is unsatisfiable.
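
For scripting, such a query can be made with an ordinary HTTP request. The sketch below assumes the ORES v3 scores endpoint and its model_info parameter; double-check the endpoint shape and parameter names against the live API documentation before relying on them.

import requests

# Sketch only: the endpoint shape and the model_info parameter are assumptions
# based on the ORES v3 API; verify against the API documentation.
def threshold_stats(wiki, model, outcome, query):
    url = "https://ores.wikimedia.org/v3/scores/{}/".format(wiki)
    params = {
        "models": model,
        "model_info": 'statistics.thresholds.{}."{}"'.format(outcome, query),
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    thresholds = response.json()[wiki]["models"][model]["statistics"]["thresholds"][outcome]
    return thresholds[0]  # may be None if the requested condition is unsatisfiable

# The two queries discussed above:
print(threshold_stats("itwiki", "damaging", "true", "maximum recall @ precision >= 0.6"))
print(threshold_stats("itwiki", "damaging", "true", "maximum filter_rate @ recall >= 0.9"))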

Threshold configuration in the ORES extension

The threshold for each filter can be configured as a number, but we always configure it as a condition based on precision or recall, as above. This makes the configuration resilient to small changes in the models, makes it easier to reuse default values between models, and makes the configuration more comprehensible (because it's based on precision/recall values, rather than meaningless threshold values). The default configuration is:

"wgOresFiltersThresholds": {
    "damaging": {
        "likelygood": { "min": 0, "max": "maximum recall @ precision >= 0.995" },
        "maybebad": { "min": "maximum filter_rate @ recall >= 0.9", "max": 1 },
        "likelybad": { "min": "maximum recall @ precision >= 0.6", "max": 1 },
        "verylikelybad": { "min": "maximum recall @ precision >= 0.9", "max": 1 }
    },
    "goodfaith": {
        "likelygood": { "min": "maximum recall @ precision >= 0.995", "max": 1 },
        "maybebad": { "min": 0, "max": "maximum filter_rate @ recall >= 0.9" },
        "likelybad": { "min": 0, "max": "maximum recall @ precision >= 0.6" },
        "verylikelybad": false
    }
}

This configuration technically specifies a score range, rather than a threshold; the code interpreting this config derives the outcome (true or false) based on whether "min": 0 (false) or "max": 1 (true) is set. In the future, we may want to change this format to one that specifies threshold+outcome pairs instead of min+max pairs, to align it more closely to ORES's view of the world (right now it's optimized for database queries instead).
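
As an illustration of that interpretation (this is not the ORES extension's actual code), the derivation described above looks roughly like this:

# Illustration only, not the ORES extension's actual code: how a min/max
# filter definition from wgOresFiltersThresholds maps to an outcome and a
# threshold query.
def interpret_filter(filter_def):
    if filter_def is False:
        return None  # filter disabled
    if filter_def["min"] == 0:
        # Range starts at 0: scores are bounded from above, i.e. the filter
        # looks for the false outcome; "max" holds the threshold query.
        return {"outcome": "false", "query": filter_def["max"]}
    if filter_def["max"] == 1:
        # Range ends at 1: scores are bounded from below, i.e. the filter
        # looks for the true outcome; "min" holds the threshold query.
        return {"outcome": "true", "query": filter_def["min"]}
    raise ValueError("expected either min=0 or max=1")

print(interpret_filter({"min": 0, "max": "maximum recall @ precision >= 0.995"}))
# -> {'outcome': 'false', 'query': 'maximum recall @ precision >= 0.995'}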

Note that the defaults are the same between damaging and goodfaith (other than that true and false are reversed), but verylikelybad is disabled for goodfaith. We may want to reevaluate this inconsistency in the future.

Disabling a filter

As the default example shows, a filter can be disabled by setting its value to false.

Pitfall: if you disable the damaging likelybad filter, you will have to change $wmgOresDefaultSensitivityLevel as well. This setting controls the default values of the oresDamagingPref and rcOresDamagingPref preferences. The values it takes are weird, ancient aliases for the filter names:

$wmgOresDefaultSensitivityLevel    Filter name
hard                               maybebad
soft (default)                     likelybad
softest                            verylikelybad

If $wmgOresDefaultSensitivityLevel refers to a filter that is disabled, you will get lots of errors (see T165011), so you have to make sure that the filter it refers to exists. In practice this is only an issue if you disable the damaging likelybad filter, which is uncommon. If you do so, change this setting to 'hard', unless damaging maybebad doesn't exist either (even rarer), in which case you should set it to 'softest'.

Disabling a model

Sometimes, you may want to deploy only the damaging model and not the goodfaith model; this usually happens when the goodfaith model is unusably bad but the damaging model is usable. To do that, you have to change $wgOresModels for that wiki, and because our config is annoying, you'll have to duplicate the wiki's entire OresModels config and change goodfaith.enabled to false (example).
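
For illustration, the resulting override has roughly the following shape (a sketch only: a wiki's real wgOresModels entry may list more models and different values, so copy the wiki's actual config; the only intended change is setting goodfaith.enabled to false):

"wgOresModels": {
    "damaging": { "enabled": true },
    "goodfaith": { "enabled": false }
}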

Choosing thresholds for a new model

See also ORES/Thresholds

When enabling ORES for RCFilters on a new wiki, a suitable threshold configuration needs to be written for that wiki's damaging and goodfaith models. The defaults are designed to work most of the time, but models vary in properties and in quality. For example, the eswikibooks goodfaith=true model can't reach 99.5% precision, so we configured its likelygood filter for 99% precision instead. The srwiki damaging=true model can reach 90% precision, but only with very low recall (1.6%), so we configured its verylikelybad filter to use 75% precision instead (where recall is 14.8%) and moved the likelybad filter down from 60% to 45% precision to match.

Selecting these thresholds is not an exact science. Generally, we use this jsfiddle to generate a table of precision/recall pairs for frequently-used settings, look at how well the defaults work, and decide where to deviate from them; a rough sketch of that workflow follows the list below. Here are some rules of thumb for choosing filter thresholds:

  • We generally avoid using thresholds at the edges, meaning thresholds where recall=1, precision=1, or the threshold itself is 0 or 1.
  • For maybebad, we mainly care about high recall, and low precision is OK (but not too low). The default is recall>=0.9, but if that yields a precision below 0.15, we use precision>=0.15 instead. (This is equivalent to using the higher threshold / narrower score range of the two.)
  • For verylikelybad, we mainly care about high precision, and low recall is OK (but not too low). The default is precision>=0.9, but if that yields a recall below 0.1, we'll use a lower level, often precision>=0.8 or precision>=0.75. Sometimes, if the recall is very high, we might move up to precision >= 0.95.
  • For likelybad, we aim for moderately high precision with moderate recall. The default is precision>=0.6, and we generally only adjust it if the recall is too low (below 0.2), or if verylikelybad was moved down. For example, if verylikelybad is moved down to precision>=0.75, we typically move likelybad to precision>=0.45.
  • For likelygood, we aim for high precision (relative to an already-high baseline rate, which is often as high as 98%). The default is precision >= 0.995, and we adjust it downwards (usually to 0.99) if the model can't achieve that precision (or has terrible recall there), or upwards to 0.997 or 0.998 if the recall is extremely high (well over 0.9).
  • We aim for the threshold range of likelygood to not overlap with the threshold range of likelybad (meaning, there are no score values that are matched by both filters), because we don't want the same edit to be marked as both "likely good" and "likely bad". Overlap between likelygood and maybebad is OK.
  • If two filters end up very close to each other, or if there is no good precision/recall pair for a filter, we might drop a filter. For example, if high precision can't be achieved, verylikelybad might be dropped; or if 90% recall also comes with unusually high precision (say over 50%), we might drop maybebad.
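
A rough Python equivalent of that table-generating workflow is sketched below. The jsfiddle itself is JavaScript; the endpoint and model_info parameter here are assumptions based on the ORES v3 API, and the query list only mirrors the defaults and common fallback levels mentioned in this section.

import requests

# Outcomes below are the ones used for the damaging model; for goodfaith,
# true and false are swapped.
QUERIES = [
    ("likelygood",    "false", "maximum recall @ precision >= 0.995"),
    ("likelygood",    "false", "maximum recall @ precision >= 0.99"),
    ("maybebad",      "true",  "maximum filter_rate @ recall >= 0.9"),
    ("maybebad",      "true",  "maximum recall @ precision >= 0.15"),
    ("likelybad",     "true",  "maximum recall @ precision >= 0.6"),
    ("likelybad",     "true",  "maximum recall @ precision >= 0.45"),
    ("verylikelybad", "true",  "maximum recall @ precision >= 0.9"),
    ("verylikelybad", "true",  "maximum recall @ precision >= 0.75"),
]

def print_table(wiki, model="damaging"):
    for name, outcome, query in QUERIES:
        params = {
            "models": model,
            "model_info": 'statistics.thresholds.{}."{}"'.format(outcome, query),
        }
        response = requests.get("https://ores.wikimedia.org/v3/scores/{}/".format(wiki),
                                params=params)
        response.raise_for_status()
        stats = response.json()[wiki]["models"][model]["statistics"]["thresholds"][outcome][0]
        if stats is None:
            print("{:15} {:45} unsatisfiable".format(name, query))
        else:
            print("{:15} {:45} threshold={:.3f} precision={:.3f} recall={:.3f}".format(
                name, query, stats["threshold"], stats["precision"], stats["recall"]))

print_table("itwiki")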

Deploying ORES+RCFilters to a new wiki

  • Write a config patch that does the following things: (example config patch)
    • Set wmgUseOres to true for the relevant wiki
    • Set wgOresUiEnabled to true for the relevant wiki
    • Add a wgOresFiltersThresholds setting for the wiki with the necessary overrides (and comments indicating which filters use the default setting)
    • If necessary, add a wgOresModels setting disabling goodfaith (see the section on disabling goodfaith above)
    • If necessary (i.e. if damaging likelybad is disabled), add a wmgOresDefaultSensitivityLevel setting for the wiki (see the pitfall warning above)
  • Merge the config patch and git pull it onto the deployment host
  • Run mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=xyzwiki ORES (where xyzwiki is the name of the wiki)
  • Run mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=xyzwiki
    • This takes a while, and it's OK to abort it early with Ctrl+C and re-run it later, but it needs to run at least briefly, so that entries for the damaging and goodfaith models are inserted into the ores_model table
  • Sync the config change

User interface and user-facing documentation

Once enabled, the ORES filters appear in the filter dropdown on Special:RecentChanges and Special:Watchlist. For more information and screenshots, see the user-facing documentation for RCFilters generally, and for the damaging and goodfaith filters specifically.

You can also visit Special:ORESModels to view the thresholds used for each filter, and their precision and recall values. This can be useful for debugging or analysis as well, since it can be hard to figure out what actually happens just by looking at the config.

Recalibrating models

As models get rebuilt, and especially as they get retrained, settings that were once appropriate become inappropriate. On some wikis, entire filters have disappeared because their threshold settings are no longer satisfiable. We need to periodically recalibrate all models, but we've never yet done this.

Footnotes

  1. Technically, it asks for the threshold with the highest recall out of the thresholds where the precision is at least 60%. This is often the same thing as "the first threshold where precision exceeds 60%", but it can be different if that's near a point where both precision and recall go up, or where precision goes down.
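
In code terms, assuming you had the full per-threshold statistics locally, the selection rule amounts to:

# Illustration of the selection rule described in this footnote: among all
# thresholds whose precision is at least the requested minimum, pick the one
# with the highest recall. `rows` is assumed to be a list of per-threshold
# statistics dicts with "threshold", "precision" and "recall" keys.
def maximum_recall_at_precision(rows, min_precision):
    candidates = [row for row in rows if row["precision"] >= min_precision]
    if not candidates:
        return None  # unsatisfiable, like the null the ORES API returns
    return max(candidates, key=lambda row: row["recall"])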