10 July, 2015

A Beta-Binomial Derivation of the 37-37 Shrinkage Rule

A rule of thumb is that to estimate a team's "true" ability $\theta_i$, you should add 74 games of 0.500 ball - that is,

$\hat{\theta_i} = \dfrac{w_i + 37}{n_i + 74}$

where $\hat{\theta_i}$ is the estimate of team $i$'s true winning proportion, $w_i$ is the number of wins of team $i$, and $n_i$ is the number of games team $i$ has played so far. Notice that the number of shrinkage games stays the same no matter what $n_i$ is - if the team has played a full season ($n_i = 162$), shrink by 74 games of 0.500 ball. If the team has only played 10 games ($n_i = 10$), shrink by 74 games of 0.500 ball.
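To make the rule concrete, here's a minimal Python sketch (the function name is mine):

    def shrunk_estimate(wins, games, shrink_games=74):
        # Add shrink_games worth of 0.500 ball to the team's record.
        return (wins + shrink_games / 2) / (games + shrink_games)

    print(shrunk_estimate(90, 162))  # 0.538 (raw proportion: 90/162 = 0.556)
    print(shrunk_estimate(7, 10))    # 0.524 (raw proportion: 7/10 = 0.700)

Notice how little the full-season team moves, and how strongly the 10-game team gets pulled toward 0.500.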

In this post I'm going to derive a very similar result as the posterior expectation of a binomial model with a beta prior, and try to give a less mathematical explanation of why the rule works no matter what $n_i$ is.

The code I used to generate the images in this post may be found on my GitHub.

The Beta-Binomial Model


First off, let's assume that the number of wins $w_i$ follows a binomial distribution with number of games played $n_i$ and true winning proportion $\theta_i$.

$w_i \sim Bin(n_i, \theta_i)$

Furthermore, let's assume that the winning proportions themselves follow a beta distribution. Traditionally the beta distribution has parameters $\alpha$ and $\beta$, but I'm going to use the parametrization $\mu = \alpha/(\alpha + \beta)$ and $M = \alpha + \beta$. This makes $\mu$ the mean of the $\theta_i$ and $M$ a parameter controlling the variation - how spread out the $\theta_i$ are.

The reason I'm doing this is that we know what $\mu$ is - mathematically, we must have $\mu = 0.5$. Why? Because in a system like baseball, every win by one team represents a loss by another team. The wins and losses cancel out, the scales remain balanced, and the average $\theta_i$ must be equal to 0.5.

$\theta_i \sim Beta(0.5, M)$

If we knew $M$, we could just apply Bayes' rule to obtain the posterior distribution of the $\theta_i$ - but there's no intuitive value for it like there is for $\mu$. Thankfully, we have a way to get $M$.
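Before we do that, here's a quick simulation sketch of the full two-level model (using numpy, and jumping ahead to the value $M = 73$ that we'll estimate below):

    import numpy as np

    rng = np.random.default_rng(42)
    mu, M = 0.5, 73              # M is the value estimated in the next section
    n_games, n_teams = 162, 100000

    # Draw each team's true winning proportion, then its record given that proportion.
    theta = rng.beta(mu * M, (1 - mu) * M, size=n_teams)
    wins = rng.binomial(n_games, theta)

    print(np.var(wins / n_games))  # close to 0.07^2 = 0.0049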

Estimating M


Often, rather than working with the observed win totals $w_i$, people work with the observed win proportions $w_i/n_i$ (and in fact, we're going to use some data in that form in a bit). In a two-level model like this, we can calculate the variance of the observed win proportions as


$Var\left(\dfrac{w_i}{n_i}\right)  = E\left[Var\left(\dfrac{w_i}{n_i} \biggr |  \theta_i\right)\right] + Var\left(E\left[\dfrac{w_i}{n_i} \biggr | \theta_i\right]\right)$

The left-hand side - $Var\left(w_i/n_i\right)$ - is the variance of the observed win proportions. I'm going to call this the total variance.

The first term on the right - $E\left[Var\left(w_i/n_i| \theta_i\right)\right]$ - is the average amount of variance of a team's observed winning proportion around its true winning proportion $\theta_i$. I'm going to call this the within-team variance (this is what others have referred to as "luck").

This can be calculated as

$E\left[Var\left(\dfrac{w_i}{n_i} \biggr | \theta_i\right)\right] = E\left[\dfrac{\theta_i(1-\theta_i)}{n_i}\right]  = \dfrac{1}{n_i}(E[\theta_i(1-\theta_i)])$

$ = \left(\dfrac{1}{n_i}\right)\left(0.5(1-0.5) \right)\left(\dfrac{M}{M+1}\right) = \dfrac{0.25M}{n_i(M+1)}$

I'm skipping a bit of gory math here - $E[\theta_i(1-\theta_i)]$ can be found by noting that multiplying the $Beta(0.5, M)$ density by $\theta_i(1-\theta_i)$ produces the kernel of another beta density.
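If you'd rather skip the kernel trick, the same result follows from the standard beta moments, using $E[\theta_i^2] = Var(\theta_i) + E[\theta_i]^2$ together with $Var(\theta_i) = 0.25/(M+1)$ (which appears again below):

$E[\theta_i(1-\theta_i)] = E[\theta_i] - E[\theta_i^2] = 0.5 - \left(\dfrac{0.25}{M+1} + 0.25\right) = \left(0.25\right)\left(\dfrac{M}{M+1}\right)$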

The second term on the right - $Var(E[w_i/n_i | \theta_i])$ - is the variance of the $\theta_i$ themselves. It represents the natural variation in true winning proportions among all teams. I'm going to call this the between-team variance (this is what others have referred to as "talent").

This can be calculated as

$Var\left(E\left[\dfrac{w_i }{n_i} \biggr | \theta_i\right]\right) = Var(\theta_i) = \dfrac{0.5(1-0.5)}{M+1} = \dfrac{0.25}{M+1}$

Hence, the total variation in observed winning proportion is

$Var\left(\dfrac{w_i}{n_i}\right) = \dfrac{0.25M}{n_i(M+1)} +\dfrac{0.25}{M+1}$

Based on historical data, sports analyst Tom Tango suggests that the correct value of $Var(w_i/n_i)$ for teams that have played at least 160 games is $Var(w_i/n_i) =  0.07^2$. I'll trust him that this is accurate.
 
Notice from the formula above that the $Var(w_i/n_i)$ value is linked to the number of games used to estimate it - it's the within-team variance plus the between-team variance, and the within-team variance shrinks as $n_i$ grows while the between-team variance stays constant. This is why it's important to use a point estimate based on observations with the same number of games - with $n_i$ held constant in the formula above, it becomes a function solely of $M$.

What we're going to do is assume that $n_i = 162$ for all the teams behind the $Var(w_i/n_i) = 0.07^2$ value. This isn't technically true, but it's true for most of them, and the difference in within-team variation between a team that has played 160 games and one that has played 163 is very small, so we can safely ignore it. Then, using Tom Tango's value, this sets up the equation

$0.07^2 = \dfrac{0.25M}{162 (M+1)} +\dfrac{0.25}{M+1}$ 

Doing a bit of algebra yields the value of M

$M = \dfrac{0.25*162-0.07^2*162}{0.07^2*162-0.25} = 73.01618$

This is close enough that $M = 73$ can be used as the variance parameter for the distribution of the $\theta_i$. It's only one game smaller than the value of $74$ used in the rule of thumb, and the difference probably derives from different distributional choices used when calculating the between-team variance.
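As a quick numerical check, the algebra can be verified in a couple of lines of Python (variable names are mine):

    total_var, n = 0.07 ** 2, 162

    # Rearranging 0.07^2 = 0.25 M / (n (M + 1)) + 0.25 / (M + 1) for M:
    M = (0.25 * n - total_var * n) / (total_var * n - 0.25)
    print(M)  # about 73.016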

As a side note, the variance of observed win proportions in $n_i$ games is given by 

$Var\left(\dfrac{w_i}{n_i}\right) = \dfrac{0.25(73)}{n_i(74)} +\dfrac{0.25}{74}$

which implies that at $n_i = M = 73$ games, the within-team variance (luck) is equal to the between-team variance (talent).
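Here's a small sketch of that luck/talent split at a few values of $n_i$:

    def within_team(n, M=73):   # luck
        return 0.25 * M / (n * (M + 1))

    def between_team(M=73):     # talent
        return 0.25 / (M + 1)

    for n in (10, 73, 162):
        print(n, within_team(n), between_team())
    # At n = 73, luck and talent are both 0.25/74, about 0.00338.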

 

Bayesian Estimator


Now that we have $M$, we can treat the $Beta(0.5, 73)$ distribution as the prior distribution for $\theta_i$ and use Bayes' rule to get the posterior distribution of the $\theta_i$:

$\theta_i  | w_i \sim Beta(w_i + 0.5*73, n_i - w_i + 0.5*73)$

(Here I'm using the traditional $\alpha$ and $\beta$ parametrization to define the beta distribution above)

And if we use $\hat{\theta_i} = E[\theta_i | w_i]$, that gives us the famous

$\hat{\theta_i} = \dfrac{w_i + 0.5*73}{w_i + 0.5*73 + n_i - w_i + 0.5*73} = \dfrac{w_i + 36.5}{n_i + 73}$
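In code, the posterior mean estimator is a one-liner (a sketch; the function name is mine):

    def posterior_mean(wins, games, M=73, mu=0.5):
        # Posterior mean of theta_i under the Beta(mu = 0.5, M = 73) prior.
        return (wins + mu * M) / (games + M)

    print(posterior_mean(6, 10))    # about 0.512, for the 6-4 team discussed below
    print(posterior_mean(96, 160))  # about 0.569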

I want to add that this is not the only possible estimator that can be derived from the posterior - you could take the posterior mode rather than the mean, and calculate

$\hat{\theta_i} = \dfrac{w_i + 36.5 -1}{n_i + 73 - 2} = \dfrac{w_i + 35.5}{n_i + 71}$

That is, add 71 games of 0.500 ball to the team's record in order to shrink it, and it would still be a statistically justified estimator. (Since the beta posterior here is bell-shaped and nearly symmetric, the mean and the mode will remain very close - so these two estimates should closely coincide.)
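To see just how close the two estimators are, compare them for a 6-4 team:

    w, n, M = 6, 10, 73
    mean = (w + 0.5 * M) / (n + M)          # (6 + 36.5) / (10 + 73)
    mode = (w + 0.5 * M - 1) / (n + M - 2)  # (6 + 35.5) / (10 + 71)
    print(mean, mode)  # about 0.5120 and 0.5123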

You could also use the posterior to calculate a credible interval for $\theta_i$ by taking quantiles from the posterior distribution - see my previous post on credible intervals for doing this in the beta-binomial situation. For example, if you have a team that has won $w_i = 6$ out of $n_i = 10$ games (for an observed 0.60 winning proportion), a 95% interval estimate for $\theta_i$ is given as $(0.405, 0.618)$. Similarly, if you have a team that has won $w_i = 96$ out of $n_i = 160$ games (again, for an observed 0.60 winning proportion), a 95% interval estimate for $\theta_i$ is given as $(0.505, 0.632)$.



Above is the posterior distribution for $\theta_i$ for the 6-4 team. The solid vertical lines represent the boundaries of the 95% credible interval and the dashed line represents $\hat{\theta_i}$.
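If you'd like to reproduce the interval estimates above, here's a sketch using scipy's beta distribution (the helper name is mine, and I'm assuming equal-tailed intervals):

    from scipy.stats import beta

    def credible_interval(wins, games, M=73, mu=0.5, level=0.95):
        # Equal-tailed interval from the Beta posterior.
        a = wins + mu * M
        b = games - wins + (1 - mu) * M
        return beta.interval(level, a, b)

    print(credible_interval(6, 10))    # roughly (0.405, 0.618)
    print(credible_interval(96, 160))  # roughly (0.505, 0.632)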

Why Does this Happen?


And by that, I mean - why do you add the same approximately 74 "shrinkage" games, no matter what the actual number of games played is?

As I write this post, the major league baseball season is currently underway. I'm going to ask you to estimate $\theta_i$, the true winning proportion of team $i$. That doesn't sound tough, right? First, you'll want to know what team I'm thinking of.

Here's the thing, though: I'm not going to tell you which team I'm thinking of.

So how in the world can you guess how good a team is without knowing anything about it? Use what you know about baseball! How good is the average team? The average is 0.500, right? So let's start there. Your guess for $\theta_i$ is 0.500.

Okay, so now I want you to think of how much variation there is in $\theta_i$. What range do you think $\theta_i$ could possibly be in? I'm guessing most people would agree with me that most teams are 60 to 100 win teams over the course of the season.

That sounds like a pretty reasonable estimate - and I'll add that there's a better chance of being near 81 wins than near 60 or 100. This corresponds to a winning proportion of between approximately 0.370 and 0.617.

I'm not going to show the math, but this range can and should be adjusted a little bit based on historical data. When we calculate the correct range, it's the same as the range of observed win proportions you would expect from a 0.500 team that has played 74 (or 73, by my calculation) games. Check it: the within-team standard deviation (luck) is $\sqrt{0.5(1-0.5)/74} = 0.058$, and two standard deviations below and above 0.500 give a range of (0.384, 0.616), or roughly between 62 and 100 wins over a full season.

Okay, so now you've used your baseball knowledge to determine that without knowing anything about the team, you can guess how good it is by assuming it has gone 37-37. Now, let's get some information about team $i$. I'll start by telling you that the team went 0-2 in its first two games. Now try to estimate $\theta_i$.

The raw point estimate is $\hat{\theta_i} = 0/2 = 0$. But you don't really believe that the team has a true winning proportion of 0, do you? That would mean it never wins a single game the entire season. No team has ever won zero games, or even come close. And you just told me that most teams finish with between 60 and 100 wins!

But you shouldn't throw away the 0-2 information either - that's information about the team as it is now, rather than about the hypothetical average team. What you want to do is mix your guess with your current information. Before I told you anything about the team, your guess represented a team that was 37-37. Now that you know the team is 0-2, just add those games to what you thought before.

Your new estimate for the team's true winning proportion is $\hat{\theta_i} = (0+37)/(2+74) = 0.487$. You've moved a little bit towards zero to reflect the new information, but not by much - two games is very little additional knowledge. So little, in fact, that you're better off leaning heavily on the 37-37 you figured before you even knew anything about the team.

Now how about if I told you the team played 100 games and went 60-40? Great - we do the same thing, mixing our guess about the average team with the new information, by taking $\hat{\theta_i} = (60+37)/(100+74) = 0.557$.

As the number of games the team has played increases, the information you had about the hypothetical average team stays constant - but the information you have about the current team grows. So adding 37-37 to whatever the current record is means that the current information slowly overtakes your guess - which is exactly what you want. If the number of games your hypothetical average team had played were growing too, then the information from your guess and the information from your observations would grow at the same rate - and your current information would never overtake your guess! That's not what you want - so keeping 37-37 constant is the correct thing to do.
