ORES/Model info/Statistics
This page contains an overview of the model fitness statistics that ORES presents with classifier models.
Custom documentation of metrics
Example scenario
Let’s assume a total of 100 edits, of which 35 are damaging – an unrealistically high ratio of damaging edits, but useful for illustration purposes – leaving us with the following labels (or actual values): 35 positives and 65 negatives, as visualized by Figure 1.1, where each edit is represented by one editor.
Figure 1.1: Total of 100 edits, represented by 100 editors, divided into actual positives in green and actual negatives in red.

A binary classifier might now predict 40 positives, of which 30 actually are positive, and 60 negatives, of which 55 actually are negative.
This also means that 10 non-damaging edits have been predicted to be damaging and 5 damaging edits have been predicted not to be damaging. Figure 1.2 illustrates this state by marking predicted positives with a hazard symbol and predicted negatives with a sun symbol. Referring to the confusion matrix, we have:
- 30 true positives (damaging edits correctly predicted as damaging)
- 10 false positives (non-damaging edits wrongly predicted as damaging)
- 55 true negatives (non-damaging edits correctly predicted as non-damaging)
- 5 false negatives (damaging edits wrongly predicted as non-damaging)

Figure 1.2: Edits divided into TP, FN, TN and FP.

We will get back to this example scenario in the definitions of metrics.
Confusion Matrix
As we are dealing with a binary damaging classifier, there are four different classification cases:
- Correctly classifying an edit as damaging – a true positive
- Wrongly classifying an edit as damaging – a false positive
- Correctly classifying an edit as good – a true negative
- Wrongly classifying an edit as good – a false negative
Popular representations of such cases are confusion matrices such as the one in Figure 1.3. Throughout this documentation, the abbreviations TP, FP, TN and FN will be used to denote the four mentioned cases.
Figure 1.3: Confusion matrix of a binary classifier. Predicted positives in red, predicted negatives in blue, consistent with PreCall’s design.
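As a minimal sketch (plain scikit-learn, not part of ORES itself), the counts of the example scenario can be reproduced by building label and prediction arrays that match Figure 1.2 and passing them to a confusion matrix routine:

```python
# Minimal sketch: reproduce the example scenario's confusion matrix with
# scikit-learn (not ORES code). 1 = damaging, 0 = non-damaging.
from sklearn.metrics import confusion_matrix

# 35 actually damaging edits: 30 predicted damaging (TP), 5 predicted non-damaging (FN).
# 65 actually non-damaging edits: 10 predicted damaging (FP), 55 predicted non-damaging (TN).
y_true = [1] * 35 + [0] * 65
y_pred = [1] * 30 + [0] * 5 + [1] * 10 + [0] * 55

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)  # 30 10 55 5
```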
Metrics Overview
By performing optimization queries, we can tell ORES that we want a specific metric to be greater than or equal to, or less than or equal to, a specified value while maximizing or minimizing a second one. The following table gives a quick definition of each metric and, where possible, its value in terms of the confusion matrix; a short code sketch evaluating these formulas for the example scenario follows the table:
Metric – Quick definition – Value

- recall – Ability to find all relevant cases – TP / (TP + FN)
- precision – Ability to find only relevant cases – TP / (TP + FP)
- f1 – Harmonic mean of recall and precision – 2 · precision · recall / (precision + recall)
- fpr – Probability of a false alarm – FP / (FP + TN)
- rocauc – Measure of classification performance (area under the ROC curve)
- prauc – Measure of classification performance (area under the precision-recall curve)
- accuracy – Portion of correctly predicted data – (TP + TN) / Total
- match_rate – Portion of observations predicted to be positive – (TP + FP) / Total
- filter_rate – Portion of observations predicted to be negative – 1 − match_rate = (TN + FN) / Total
- !recall – Negated recall – TN / (TN + FP)
- !precision – Negated precision – TN / (TN + FN)
- !f1 – Negated f1 – 2 · !precision · !recall / (!precision + !recall)
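As promised above, here is a short sketch (plain Python, not the ORES implementation) evaluating the closed-form metrics from the table for the example scenario’s confusion matrix; rocauc and prauc are left out because they require per-edit scores rather than hard predictions (see their sections below):

```python
# Sketch: the table's closed-form metrics for the example scenario's counts.
tp, fp, tn, fn = 30, 10, 55, 5
total = tp + fp + tn + fn

recall = tp / (tp + fn)                              # ~0.857
precision = tp / (tp + fp)                           # 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.8
fpr = fp / (fp + tn)                                 # ~0.154
accuracy = (tp + tn) / total                         # 0.85
match_rate = (tp + fp) / total                       # 0.4
filter_rate = 1 - match_rate                         # 0.6, equals (tn + fn) / total
neg_recall = tn / (tn + fp)                          # !recall ~0.846
neg_precision = tn / (tn + fn)                       # !precision ~0.917
neg_f1 = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)  # !f1 ~0.880
```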
Detailed definition of metrics
recall
Recall (TP / (TP + FN)), the true positive rate (tpr) or “sensitivity” of a model, is the ability of that model to find all relevant cases within the dataset. To us, a relevant case means a damaging edit: the model’s ability to identify those is measured as the ratio of actual positives that are predicted as such. In terms of numbers for our example, that would be 30 / (30 + 5) ≈ 0.86.
precision
Precision (TP / (TP + FP)), also known as the positive predictive value of a model, is the ability of the model to find only relevant cases within the dataset. We are interested in how good the model is at only predicting edits to be damaging that actually are. Therefore, we want the ratio of true positives to all edits predicted to be positive: 30 / (30 + 10) = 0.75.
f1
The f1-score, the harmonic mean of recall and precision, is a metric from 0 (worst) to 1 (best) that serves as an accuracy evaluation metric. It is defined by 2 · precision · recall / (precision + recall). Note that unlike the average of recall and precision, the harmonic mean punishes extreme values. Referring to the example scenario, we get 2 · 0.75 · (30/35) / (0.75 + 30/35) = 0.8.
fpr
The false positive rate (FP / (FP + TN)) answers the question ‘what portion of all actual negatives is wrongly predicted as positive?’ and can be described as the probability of a false alarm. In our example, a false alarm would be predicting an edit as damaging that isn’t. As a result we get 10 / (10 + 55) ≈ 0.15.
rocauc
The area under the ROC curve, a measure between 0.5 (worthless) and 1.0 (perfect separation of positives and negatives), can be described as the probability of ranking a random positive higher than a random negative and serves as a measure of classification performance. The receiver operating characteristic (ROC) curve itself is used to visualize the performance of a classifier, plotting the tpr versus the fpr as a function of the model’s threshold for classifying a positive. Assuming that a threshold of 0.5 produced the previous results, one point on our ROC curve would be (fpr, tpr) = (0.15, 0.86). Doing this for every threshold of interest results in the ROC curve, and the area under the curve (auc) is a way of quantifying its performance.
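As a minimal sketch of how such a curve can be computed (scikit-learn, not ORES code), with per-edit probability scores that are invented purely for illustration:

```python
# Sketch: ROC curve and rocauc with scikit-learn. The scores are made up for
# illustration; in practice ORES would supply real per-edit probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 35 + [0] * 65)
# Invented "probability of damaging" scores, tending higher for actual positives.
scores = np.concatenate([rng.uniform(0.4, 1.0, 35), rng.uniform(0.0, 0.6, 65)])

# One (fpr, tpr) point per threshold; the ROC curve plots tpr against fpr.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # rocauc: area under the ROC curve
```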
prauc
Similarly to the rocauc, the area under the precision-recall curve evaluates a classifier’s performance. The main difference, however, is that the PR curve plots precision versus recall and does not make use of true negatives. It is therefore favorable to use prauc over rocauc if true negatives are unimportant to the general problem or if there are a lot more negatives than positives, since in the latter case differences between models are more notable when the vast number of true negatives is left out. The point on the PR curve of our example for the standard threshold of 0.5 is (precision, recall) = (0.75, 0.86). To construct the PR curve, it would be necessary to do this for every threshold of interest. Again, calculating the area under the curve is a way to quantify its performance and therefore the model’s performance as well.
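Analogously, a minimal sketch of the PR curve and its area, reusing the same invented scores as in the rocauc sketch above:

```python
# Sketch: precision-recall curve and prauc with scikit-learn (illustrative scores).
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = np.array([1] * 35 + [0] * 65)
scores = np.concatenate([rng.uniform(0.4, 1.0, 35), rng.uniform(0.0, 0.6, 65)])

# One (recall, precision) point per threshold; prauc is the area under that curve.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(auc(recall, precision))  # prauc: area under the precision-recall curve
```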
accuracy
Accuracy ((TP + TN) / Total) measures the ratio of correctly predicted data: positives and negatives. In the example, this is the proportion of correctly predicted damaging edits and correctly predicted non-damaging edits to the total of edits and is given by (30 + 55) / (35 + 65) = 0.85.
match_rate
The match rate ((TP + FP) / Total) is the ratio of observations predicted to be positive. For our damaging classifier, this is the ratio of edits predicted to be damaging, which is given by (30 + 10) / (35 + 65) = 0.4.
filter_rate
The filter rate (1 − match_rate = (TN + FN) / Total) is the ratio of observations predicted to be negative and is the complement of the match rate. In the example, the filter rate describes the ratio of edits predicted to be non-damaging, given by 1 − match_rate = (55 + 5) / (35 + 65) = 0.6.
!<metric>
Any metric with an exclamation mark is the same metric computed for the negative class:
- !recall = TN / (TN + FP), the ability of a model to predict all negatives as such
- !precision = TN / (TN + FN), the ability of a model to only predict negatives as such
- !f1 = 2 · !recall · !precision / (!recall + !precision), the harmonic mean of !recall and !precision
Note that these metrics are also particularly useful for multi-class classifiers, as they permit queries to reference all but one class; e.g. in the ORES itemquality model, the recall for all classes except the “E” class comes down to the !recall of the “E” class.
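For the binary case, these negated metrics are simply the ordinary metrics with the negative class treated as the positive one, as in this small sketch (scikit-learn, not ORES code) based on the example scenario:

```python
# Sketch: "!" metrics as ordinary metrics with the negative class as pos_label.
from sklearn.metrics import recall_score, precision_score, f1_score

y_true = [1] * 35 + [0] * 65
y_pred = [1] * 30 + [0] * 5 + [1] * 10 + [0] * 55

print(recall_score(y_true, y_pred, pos_label=0))     # !recall    = 55 / 65 ~ 0.846
print(precision_score(y_true, y_pred, pos_label=0))  # !precision = 55 / 60 ~ 0.917
print(f1_score(y_true, y_pred, pos_label=0))         # !f1 ~ 0.880
```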