User:TJones (WMF)/Notes/Some Thoughts on the Math of Scoring

April 2017 — See TJones_(WMF)/Notes for other projects.

Introduction

About a year ago David and I had an email exchange about the math behind various kinds of scoring while we were talking about making changes to elements of our Elasticsearch scoring. I spent some time thinking very mathy thoughts about all this, and we had some back and forth on it, and it helped us think more clearly about scoring—not just for Elasticsearch, but in general.

Some of this came up again recently, so as a 10% project I’ve dusted off my notes, folded in some of the bits from our conversation, and cleaned it up and wikified it to share, and for future reference.

Key insights

Weights for any operation (+ or *) need to be the next hyperoperation up (* or ^).
Multiplication, like min, is looking for a way to fail—while addition, like max, is looking for a way to win.
We can create more rational scores by combining components in a reasonable, hierarchical fashion.
We could probably normalize score components better.
Empirical distribution functions can smooth out weird distributions.

Components of a Score

I think the basic steps to creating a scoring function are:

normalizing the components
shaping and weighting the components
combining the components

Normalizing the Components

We need to normalize scoring components so that we know that everything is working on roughly the same scale. Normalizing inputs between 0 and 1 allows us to use multiplicative, additive, exponential, min/max, and other methods more readily.

We need to look more closely at the distributions of the potential score components. We often normalize by dividing by the approximate max value, or we take a log when the values span many orders of magnitude.

In the non-log cases, that means we are assuming a uniform distribution and approximating the cumulative distribution function (CDF) with a straight line from 0,0 to 1,1. Even for a uniform distribution over a shorter range this isn’t accurate.

For examples (and pictures!) of various probability distributions and their cumulative distribution functions, see these wiki pages:

Empirical Distribution Function

We could more accurately reflect the true distribution (without having to determine what it actually is) with an empirical distribution function (EDF). An EDF is a cumulative distribution function based on the actual distribution of your data, and it can be created or approximated without having to find the underlying function that gives the right distribution, or even assuming that one exists. It readily maps values to percentiles.

An EDF eliminates some of the problems of imparting meaning to the raw numbers behind a score component by effectively converting everything to a percentile rank. For example, are twice as many page views twice as valuable? For 1 vs 2? 2500 vs 5000 vs 10000? 5M vs 10M?

Interestingly, transformations like log don’t have a huge effect on the empirical distribution function. Suppose you have 10 empirical data points, and #5 is 8 and #6 is 32. You can build an empirical distribution function on the original values, or on the binary log of the values. In either case, #5 is 5/10 and gets a score of 0.50, and #6 is 6/10 and gets a score of 0.60. The difference comes when you interpolate between them. In the linear EDF, 16 would be ⅓ of the way between 8 and 32, so it would get a score of 0.533. In the log EDF, 16 is ½ way between 8 and 32, and so would get a score of 0.55. With a lot of relatively close data points in the EDF, the difference might be even smaller.
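The 8-vs-32 example can be checked with a few lines of Python. This is only a sketch: the function name `edf_score` and the other eight sample values are made up for illustration, and the only assumption about the data is that point #5 is 8 and point #6 is 32.

```python
import bisect
import math

def edf_score(sorted_values, x, transform=None):
    """Score x against a sorted sample, interpolating linearly between neighbours.

    The k-th smallest of n sample points gets a score of k/n. `transform` must be
    monotonically increasing (e.g. math.log2) and defined for every value used.
    """
    f = transform if transform is not None else (lambda v: v)
    points = [f(v) for v in sorted_values]
    fx = f(x)
    n = len(points)
    if fx < points[0]:
        return 0.0
    if fx >= points[-1]:
        return 1.0
    i = bisect.bisect_right(points, fx)   # points[i - 1] <= fx < points[i]
    lo, hi = points[i - 1], points[i]
    return (i + (fx - lo) / (hi - lo)) / n

# Hypothetical sample: only the 5th (8) and 6th (32) values come from the text.
sample = [1, 2, 3, 5, 8, 32, 50, 80, 100, 200]

print(round(edf_score(sample, 16), 3))             # 0.533 (linear EDF)
print(round(edf_score(sample, 16, math.log2), 3))  # 0.55  (log EDF)
```

The two versions agree exactly at the sample points and differ only in between them.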

Of course that’s a lot of data to hold onto, so there are numerous ways to approximate an EDF. 3- or 4-segment approximations seem common after a brief literature review, though more complex formulations are available. It’s straightforward to approximate with n segments for any value of n.

Some refs for approximations to empirical cumulative distribution functions:

“Nonparametric Estimates of Cumulative Distribution Functions and Their Inverses” at MathWorks
“The best piecewise linearization of nonlinear functions” from Applied Mathematics
“Risk Measure Preserving Piecewise Linear Approximation of Empirical Distributions” from Philipp Arbenz’s website

A quick-and-dirty implementation of a piecewise-linear approximation to an empirical distribution function is available in the RelForge repo on GitHub. An example of the results of this algorithm, fitted to actual historical “incoming link counts” data, is also available on Desmos. The original data set had 5M points, and can be very accurately estimated (within about half a percentile point) with a 20-point piecewise linear (PWL) approximation. Good approximations can also be had using a 1% sample (only 50K data points). A 15-point PWL approximation gives errors less than one percentile point, and even a 10-point approximation is within 2.5 percentile points.
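To give a feel for what a PWL approximation looks like in code, here is a deliberately naive sketch. It is not the RelForge algorithm: it simply places knots at evenly spaced ranks of the sorted data, which is an assumption made for brevity, and the accuracy figures quoted above do not necessarily apply to it. The names `fit_pwl_edf` and `pwl_score` are hypothetical.

```python
import bisect

def fit_pwl_edf(values, n_points=20):
    """Return (value, percentile) knots taken at evenly spaced ranks (n_points >= 2)."""
    data = sorted(values)
    n = len(data)
    knots = {}
    for j in range(n_points):
        idx = round(j * (n - 1) / (n_points - 1))
        # If a value repeats, keep the highest percentile seen for it.
        knots[data[idx]] = max((idx + 1) / n, knots.get(data[idx], 0.0))
    return sorted(knots.items())

def pwl_score(knots, x):
    """Map x to an approximate percentile by interpolating between knots."""
    xs = [value for value, _ in knots]
    if x < xs[0]:
        return 0.0
    if x >= xs[-1]:
        return knots[-1][1]
    i = bisect.bisect_right(xs, x)        # xs[i - 1] <= x < xs[i]
    (x0, p0), (x1, p1) = knots[i - 1], knots[i]
    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

# Usage on made-up, uniformly spread data.
knots = fit_pwl_edf(range(1, 1001), n_points=11)
print(round(pwl_score(knots, 500), 3))
```

A better fit would choose knot positions to minimize the error against the full EDF rather than spacing them evenly.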

If an EDF is too much, we should at least try to investigate the empirical shape of the distribution and cumulative distribution and account for them better.

Min/Max (or 1st and 99th Percentile) Normalization

Another issue that comes up sometimes is that you want to normalize a variable (like a similarity score generated by Elasticsearch) that doesn’t really have proper bounds. For example, there may be no guaranteed maximum score, though they rarely go above 300; and while the guaranteed minimum score is definitely 0, you never actually see scores below, say, 10.

If you have access to a good chunk of real-world data, and an EDF is too much for your application, you can still get a better straight-line estimate by running a bunch of data and seeing what the highest and lowest scores in practice actually are. It is probably good to check the 1st and 99th percentile scores, too, to make sure that the absolute min and max aren’t weird outliers. Using your high and low scores (actual max and min, or 99th and 1st percentiles) you can linearly normalize anything like this:

$\frac{x - \text{score}_{\text{low}}}{\text{score}_{\text{high}} - \text{score}_{\text{low}}}$

You should expressly cap the input to be between $\text{score}_{\text{low}}$ and $\text{score}_{\text{high}}$ (or cap the output to be between 0 and 1) to deal with future outliers.
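A minimal Python sketch of this normalization, with the cap applied to the input. The function names are hypothetical, and using `statistics.quantiles` (Python 3.8+) to find the 1st and 99th percentiles is just one convenient option.

```python
import statistics

def percentile_bounds(scores):
    """Return approximate (1st percentile, 99th percentile) of observed scores."""
    cuts = statistics.quantiles(scores, n=100)  # 99 cut points
    return cuts[0], cuts[-1]

def normalize_linear(x, score_low, score_high):
    """Linearly map x to [0, 1], capping inputs outside [score_low, score_high]."""
    capped = min(max(x, score_low), score_high)
    return (capped - score_low) / (score_high - score_low)

# Usage with the illustrative bounds from the text (10 and 300).
print(normalize_linear(155, 10, 300))   # 0.5
print(normalize_linear(1000, 10, 300))  # 1.0 (future outlier, capped)
```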

Sigmoid Saturation Function

Another nice shaping function, if you have some intuitions about your distribution and where the important parts of it are, is this function:

$\frac{x^{a}}{(x^{a} + k^{a})}$

It’s an S-shaped function that only takes on values between 0 and 1. When x == k the output value is 0.50, and the parameter a determines how steep the slope is. For a > 1, the sigmoid gets steeper as a increases. For values of a in the range (0,1], the curve is roughly logarithmic shaped (though it still maxes out at 1). When a == 0, you get a flat line at 0.5. And for values of a < 0, you have the same kinds of curves, except the value starts high at 1 and drops to 0.
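As a quick sketch, the function is a one-liner in Python. The name `saturate` is made up here, and the code assumes x > 0 and k > 0 (at x == 0 a negative exponent would divide by zero).

```python
def saturate(x, k, a):
    """Sigmoid saturation x^a / (x^a + k^a), assuming x > 0 and k > 0."""
    xa = x ** a
    return xa / (xa + k ** a)

print(saturate(5, 5, 2))              # 0.5, because x == k
print(saturate(10, 5, 2))             # 0.8
print(saturate(10, 5, 0))             # 0.5, flat line when a == 0
print(round(saturate(10, 5, -2), 3))  # 0.2, negative a flips the curve
```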

An interactive version is available on Desmos.

Combining Components

I’m jumping ahead to combining because it informs the methods for shaping and weighting.

David and I had a lot of discussion of whether multiplication or addition of scores work well. I think I have come upon a useful set of intuitions that can guide when and how to use addition vs multiplication.

For this discussion, assume that higher scores are better, and so the highest score “wins”.

Consider min and max. Max is looking for a way to win—if you get a high score in A or B or C, you get a high score. You only need to excel in one dimension. Min is the opposite: it’s looking for a way to fail—no matter how well you do at A or B, if you don’t do well at C, too, you don’t do well.

Similarly, addition is looking for ways to win, multiplication is looking for ways to fail (when restricted to the range [0,1]).
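A tiny Python sketch makes the win/fail intuition concrete. The two score triples are invented for illustration: one document is mediocre everywhere, the other is excellent in two dimensions and poor in the third.

```python
import math

def combine_all(scores):
    """Combine [0, 1] component scores four different ways."""
    return {
        "min": min(scores),
        "product": math.prod(scores),
        "mean": sum(scores) / len(scores),
        "max": max(scores),
    }

balanced = combine_all((0.6, 0.6, 0.6))
spiky = combine_all((1.0, 1.0, 0.1))

for method in ("min", "product", "mean", "max"):
    winner = "balanced" if balanced[method] > spiky[method] else "spiky"
    print(method, round(balanced[method], 3), round(spiky[method], 3), winner)
```

With these numbers, min and product both prefer the balanced document (the one weak component sinks the other), while mean and max both prefer the spiky one.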

Multiplication

Multiplication is smoother than min, but it is also looking for a way to fail. Obviously if any component is 0, the final score is 0, which is utter failure, no matter how awesome the other pieces are.

Given a linear distribution from 0 to 1 for each of two components, multiplying them looks like this:

[Graph: $x \cdot y$]

You can weight them, in the sense of changing how badly you have to do along one axis to fail, by using exponents or roots.

Here one dimension is squared (more likely to fail—on the x axis), the other has the square root taken (less likely to fail—on the y axis).

[Graph: $x^{2} \cdot \sqrt{y}$]

Here both have square roots applied, making it harder to fail in both directions, but still poor performance in either will limit the maximum overall score.

[Graph: $\sqrt{x} \cdot \sqrt{y}$]

Note: These three graphs were generated in R. Desmos doesn’t allow use of images of their graphs, but here’s a link to an interactive version with some weights to recreate the graphs above.

Geometric Mean—Normalizing the Product

When multiplying n components together, each with a 0–1 range, the cumulative effect is to depress everything any distance from the “high point”, where all values are 1. To counteract that, you can take the geometric mean rather than the simple product: that is, the nth root for n elements in the product.

If you weight the components, you could take the weighted geometric mean, taking the root that is the sum of all the exponents. In three dimensions, this is:

$(A^{α} \cdot B^{β} \cdot C^{γ})^{\frac{1}{α + β + γ}}$
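A minimal Python sketch of the weighted geometric mean, assuming all components are already normalized to [0, 1] and all weights are positive. The function name is hypothetical.

```python
import math

def weighted_geometric_mean(components, weights):
    """(A^alpha * B^beta * ...)^(1 / (alpha + beta + ...)) for components in [0, 1]."""
    product = math.prod(c ** w for c, w in zip(components, weights))
    return product ** (1 / sum(weights))

# The plain product of three 0.5s is 0.125; the geometric mean pulls it back to 0.5.
print(round(weighted_geometric_mean([0.5, 0.5, 0.5], [1, 1, 1]), 3))  # 0.5
# A zero in any component is still utter failure.
print(weighted_geometric_mean([0.0, 1.0, 1.0], [1, 2, 3]))            # 0.0
```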

Addition

Addition is more like max, but smoother and—duh—more additive. Basic graphs are pretty boring, since they are all flat planes tilted this way or that, but obviously weights make certain components contribute more to the score (and increase the chance of “winning”).

I like normalizing additive weights back to 0–1 with the weighted mean. Given components A, B, and C all 0–1, and weights α, β, and γ, this gives a result from 0–1:

$\frac{α A + β B + γ C}{α + β + γ}$
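The same thing as a short Python sketch (hypothetical function name, positive weights assumed):

```python
def weighted_mean(components, weights):
    """(alpha*A + beta*B + ...) / (alpha + beta + ...) for components in [0, 1]."""
    return sum(w * c for c, w in zip(components, weights)) / sum(weights)

# Excelling only in the most heavily weighted component still earns half the points.
print(weighted_mean([1.0, 0.0, 0.0], [2, 1, 1]))  # 0.5
```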

Shaping and Weighting

For the multiplicative examples above, shaping and weighting are done via the exponents used for the geometric mean, if any. Multiplicative weights (e.g., 2x) don’t work with multiplicative combining because when applied to any component they have the same net overall effect—i.e., we need hyperoperations!

Hyperoperations

While messing around with multiplication and addition, I realized that the only way to have an effect on a particular kind of arithmetical combining is to use the next hyperoperation on the components.

When multiplying components you can’t weight them by multiplying any of them by a weight for the same reason that when adding components, you can’t weight them by adding a weight to any of them. You have to step up to the next hyperoperation. When adding components, you need to weight them by multiplying each by a weight. When multiplying components, you weight them by applying an exponent to each component.

You normalize your weights by doing the inverse next-hyperoperation with the sum of the weights. So if you use multiplicative weights α, β, and γ when adding, you need to divide by α + β + γ. If you use exponential weights α, β, and γ when multiplying, you need to take the (α + β + γ)th root.

When α, β, and γ are all one, that’s just the arithmetic mean (for addition) or geometric mean (for multiplication).
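The hyperoperation point is easy to see by checking rankings. In this Python sketch (two invented documents, hypothetical names), a same-level "weight" never changes the order of results, while a next-level weight can.

```python
def ranking(docs, score_fn):
    """Document names ordered from best to worst under score_fn(x, y)."""
    return sorted(docs, key=lambda name: score_fn(*docs[name]), reverse=True)

# Made-up (x, y) component scores for two documents.
docs = {"a": (0.9, 0.3), "b": (0.5, 0.6)}

product_plain = ranking(docs, lambda x, y: x * y)
product_times2 = ranking(docs, lambda x, y: (2 * x) * y)     # same order: every score doubles
product_exponent = ranking(docs, lambda x, y: x * y ** 0.5)  # order can change

sum_plain = ranking(docs, lambda x, y: x + y)
sum_plus5 = ranking(docs, lambda x, y: x + (y + 5))          # same order: every score shifts by 5
sum_times2 = ranking(docs, lambda x, y: x + 2 * y)           # order can change

print(product_plain, product_times2, product_exponent)
print(sum_plain, sum_plus5, sum_times2)
```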

Interestingly, only addition and multiplication are commutative among the normal hyperoperations, though there are commutative hyperoperations, which have $a^{\ln(b)}$ instead of normal exponentiation as the next operation after multiplication.

Exponential Shaping

Note that the exponential shaping— $x^{a}$ vs $x^{\frac{1}{a}}$ —is not symmetric. The functions can be flipped along the axis from 0,1 to 1,0:

$x^{a} \to 1 - (1 - x)^{\frac{1}{a}}$
$x^{\frac{1}{a}} \to 1 - (1 - x)^{a}$

You could average them together to get a nice smooth symmetric curve, but there are probably better and more efficient options if you need that kind of symmetry.

An interactive graph of this flip is also on Desmos.
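A short Python sketch of the flip, with hypothetical function names. The check at the end relies on the fact that reflecting a point (x, y) across the line from 0,1 to 1,0 gives (1 − y, 1 − x).

```python
def shape(x, a):
    """Exponential shaping x^a on [0, 1]."""
    return x ** a

def flipped(x, a):
    """x^a reflected across the line from (0, 1) to (1, 0): 1 - (1 - x)^(1/a)."""
    return 1 - (1 - x) ** (1 / a)

# A point (x, y) on the original curve lands on (1 - y, 1 - x) on the flipped curve.
x, a = 0.3, 2
y = shape(x, a)
print(round(y, 4), round(flipped(1 - y, a), 4), round(1 - x, 4))
```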

Another Abstract Example

On a normalized range of [0,1], exponents can be used to shape the score curve for a component, and then the weighted arithmetic mean can be used to weight the components in relation to each other.

For example:

$z = \frac{(2 \cdot x^{5.5} + 1 \cdot y^{\frac{1}{3}})}{2 + 1}$

Additive components can be shaped on the range [0,1] with exponents and roots. In the graph below, one dimension (the y axis) is emphasized with a ⅓ exponent, making the points easier to get, while the other dimension (the x axis) is de-emphasized with a 5.5 exponent, making the points harder to get.

The combination is additive, with the y dimension getting a weight of 1, and the x dimension getting a weight of 2.

So overall, along the y axis, points are easy to get, but aren’t worth as much; along the x axis, points are hard to get, but are worth more.

[Graph: $z = \frac{(2 \cdot x^{5.5} + 1 \cdot y^{\frac{1}{3}})}{2 + 1}$]

The graph above was generated in R. An interactive version with the default weights used here is on Desmos.
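For anyone who prefers numbers to graphs, here is the same function as a Python sketch (the function name is made up), evaluated at a few points to show the "easy but cheap" vs "hard but valuable" trade-off.

```python
def abstract_score(x, y):
    """z = (2 * x^5.5 + 1 * y^(1/3)) / (2 + 1), for x and y in [0, 1]."""
    return (2 * x ** 5.5 + 1 * y ** (1 / 3)) / (2 + 1)

print(round(abstract_score(0.5, 0.0), 3))  # 0.015: halfway along x earns almost nothing
print(round(abstract_score(0.0, 0.5), 3))  # 0.265: halfway along y earns most of y's share
print(round(abstract_score(1.0, 0.0), 3))  # 0.667: maxing out x is worth 2/3
print(round(abstract_score(0.0, 1.0), 3))  # 0.333: maxing out y is worth 1/3
```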

A Semi-Concrete Example—Combining Similarity and Popularity

I think it makes sense to think of combining many sub-scores recursively, possibly using different methods. The example below is thrown together to illustrate various options with a semi-plausible example, not as a serious proposal for a particular scoring function.

Suppose we have a query-matching score relating a document to a query (e.g., “similarity” based on TF/IDF or BM25), and two “popularity” elements, page views (popularity with readers), and PageRank (importance to editors).

We decide that something is only “popular” if it is popular in both dimensions, so we combine the popularity elements in a way aimed to “fail”—that is, most documents get a medium-to-lowish score. We also want to ignore outliers at the top of what is likely a power-law distribution.

Say median page views are 5,000, and we want to pay attention to moderate increases in popularity, so we cap effective page views at 25,000 (5x typical popularity), and normalize it as effective_pageviews/25,000 (rounding any score over 1 back down to 1 as “maximally popular”).

Say median PageRank is 3.5. We want to cap PageRank at 7 (2x importance to editors), and normalize PageRank as effective_PageRank/7 (and if anything manages to get a PageRank over 7, we also normalize it to 1).

We can take the geometric mean of the normalized page views and PageRank (multiply them and take the square root). Now we have a single “popularity” score based on page views and PageRank.

Say that in a sample of 5,000 queries, the similarity scores for returned results have 1st and 99th percentile scores of 23.1 and 2467.9 (ignoring one score of 0.000000001 and one score of 3,489,620,564,305,634,890.0004, along with a few others in a more typical range). An EDF is overkill, so we linearly normalize the similarity score in the range 23.1–2467.9.

We can combine the popularity score with the normalized similarity score (clamped to [0,1]), and weight them so that max popularity can overcome, say, 25% of the similarity score. That would be additive weights of 4 for similarity and 1 for popularity. Then re-normalize by dividing the result by 5.

$\frac{4 \cdot \text{clamp}\left(0, 1, \frac{\text{simScore} - 23.1}{2467.9 - 23.1}\right) + \sqrt{\frac{\min(25000, \text{pageViews})}{25000} \cdot \frac{\min(7, \text{PageRank})}{7}}}{4 + 1}$
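Spelled out as a Python sketch, using the made-up constants from this example (the function names are hypothetical, and this is no more a serious scoring proposal than the formula is):

```python
import math

def clamp(low, high, x):
    """Restrict x to the range [low, high]."""
    return max(low, min(high, x))

def popularity(page_views, page_rank):
    """Geometric mean of capped, normalized page views and PageRank."""
    views = min(25000, page_views) / 25000
    rank = min(7, page_rank) / 7
    return math.sqrt(views * rank)

def combined_score(sim_score, page_views, page_rank):
    """Weighted mean: weight 4 for similarity, weight 1 for popularity."""
    similarity = clamp(0, 1, (sim_score - 23.1) / (2467.9 - 23.1))
    return (4 * similarity + popularity(page_views, page_rank)) / (4 + 1)

# A document with median popularity and the lowest in-range similarity.
print(round(combined_score(23.1, 5000, 3.5), 3))  # 0.063
# Maximum popularity alone is worth 0.2, a quarter of similarity's 0.8.
print(combined_score(0, 25000, 7))                # 0.2
```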

This seems like a reasonable, rational way to hierarchically create scores from various components (even if this example isn’t a great scoring function).