16 May, 2015

The James-Stein Estimator


If you want to make a prediction, an important question is "How do I judge accuracy?" There are several different ways to answer that.

For example, suppose you want to take the batting averages of a group of players in April and use them to predict their final batting averages at the end of the regular season. How do you want to judge how close you got? You could take the average of the differences between your predictions and the actual batting averages. You could take the maximum difference between your prediction and the actual batting average. You could even focus on one side and acknowledge that hey, you don't care if a player overperforms relative to the prediction - you just want to minimize how much a player can underperform. Each of these ways of judging your predictions could potentially lead to a different prediction method.

The first method - summing or averaging some function of the distances between the predicted and actual values - is common, and it comes with an estimator that is far from obvious.

The James-Stein Estimator


Let's say you have a set of observations $x_i$ that represent occurrences of some phenomenon, each of which has a parameter $\theta_i$ that controls some underlying aspect of that phenomenon - a baseball example would be to let $\theta_i$ be the "true" batting average of player $i$, if we model each player as having a true, constant batting average representing his ability.

A common problem, then, is how to estimate $\theta_i$ with some estimator $\hat{\theta}_i$. The answer to that, as discussed above, depends on how you are going to measure the accuracy of your estimate. A common approach is to use a squared loss function

$L_i(\theta_i, \hat{\theta}_i) = (\hat{\theta}_i - \theta_i)^2$

This strikes a pretty good balance between bias and variance, and makes a lot of intuitive sense - squaring the distance makes everything positive, so when you sum the losses you can look for a minimum, and it makes being very wrong a lot worse than being a little wrong.

How, then, do you measure the accuracy of the whole set of estimators? You could just sum the losses.

$L({\theta}, \hat{\theta}) = \displaystyle \sum_i (\hat{\theta}_i - \theta_i)^2$
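
For concreteness, a direct R translation of this total loss, with the estimates and the (hypothetical) true values passed in as vectors:

```r
# Total squared loss: sum of squared differences between estimates and truth.
total_squared_loss <- function(theta_hat, theta) sum((theta_hat - theta)^2)

total_squared_loss(c(0.30, 0.25, 0.27), c(0.32, 0.24, 0.26))  # made-up example
```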

In 1961, statisticians Willard James and Charles Stein showed a surprising result: first, let $x_i$ follow a normal distribution with known variance and mean $\theta_i$

$x_i \sim N(\theta_i, 1)$

where $i = 1, 2, ..., k$, for $k \ge 4$ (the number of $\theta$s has to be four or more - this is a hard rule that can't be broken)*. In the absence of any other assumptions about the form of the data, the simplest estimator of $\theta_i$ is $\hat{\theta}_i = x_i$ (in the case of multiple observations per group, the sample mean is also normal, so we can assume a single observation per group without loss of generality). This estimator tends to work well and is, again, pretty intuitive - the baseball version would be to just use April's batting average to predict the final batting average.

However, James and Stein showed that a better estimator is given by

$\tilde{\theta}_i = \overline{x} + \left(1 - \dfrac{k-3}{\sum (x_i - \overline{x})^2}\right)(x_i - \overline{x})$

The effect of this estimator is to shrink each estimate $\hat{\theta}_i$ towards the overall mean of the data $\overline{x}$, by an amount determined by how spread out the $x_i$ are around that mean and by how far each individual $x_i$ sits from it. On average, using $\tilde{\theta}_i$ actually works better in terms of the total squared loss defined above than using the naive estimator $\hat{\theta}_i$. In some cases, it works much better.
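
If you'd rather see that formula as code, here is a minimal R sketch (the function name james_stein is just my own label for it):

```r
# Shrink each observation towards the overall mean, as in the formula above.
james_stein <- function(x) {
  k      <- length(x)                         # number of groups; needs k >= 4
  xbar   <- mean(x)                           # overall mean to shrink towards
  shrink <- 1 - (k - 3) / sum((x - xbar)^2)   # common shrinkage factor
  xbar + shrink * (x - xbar)                  # pull each x_i towards xbar
}
```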

Baseball


How do we apply this to baseball? Professors Bradley Efron and Carl Morris gave a classic example. The batting averages of 18 players through their first 45 at-bats of the 1970 baseball season are given below. The problem is to use the first 45 at-bats to predict each player's batting average for the remainder of the season.

\begin{array}{l c c} \hline
\textrm{Player} & x_i & \theta_i \\ \hline
\textrm{Clemente} & .400 & .346 \\
\textrm{F. Robinson} & .378 & .298\\
\textrm{F. Howard} & .356 & .276\\
\textrm{Johnstone} & .333 & .222\\
\textrm{Barry} & .311 & .273\\
\textrm{Spencer} & .311 & .270\\
\textrm{Kessinger} & .289 & .263\\
\textrm{L. Alvarado} & .267 & .210\\
\textrm{Santo} & .244 & .269\\
\textrm{Swoboda} & .244 & .230\\
\textrm{Unser} & .222 & .264\\
\textrm{Williams} & .222 & .256\\
\textrm{Scott} & .222 & .303\\
\textrm{Petrocelli} & .222 & .264\\
\textrm{E. Rodriguez} & .222 & .226\\
\textrm{Campaneris} & .200 & .285\\
\textrm{Munson} & .178 & .316\\
\textrm{Alvis} & .156 & .200\\ \hline
\end{array}

Here, $x_i$ is the batting average over the first 45 at-bats and $\theta_i$ is the batting average for the remainder of the season.

The first 45 at-bats were used because the James-Stein estimator requires equal variances, and the variance depends on the number of at-bats you look at. The distribution of this data is clearly not normal - batting averages are binomial proportions. Hence, the transformation $f(x_i) = \sqrt{n}\arcsin(2x_i - 1)$ (in this case, $n = 45$) is used. The transformed data then fit the James-Stein model - they follow a $N(\theta_i, 1)$ distribution, at least approximately.
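
As a quick R sketch, with the 18 April averages copied from the table above:

```r
# Variance-stabilizing arcsine transformation of the April batting averages.
n <- 45
x <- c(0.400, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
       0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.200, 0.178, 0.156)

z <- sqrt(n) * asin(2 * x - 1)  # transformed values, approximately N(theta_i, 1)
round(z, 3)                     # matches the "Transformed x_i" column below
```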

The maximum likelihood estimate of each player's remainder batting average would be $\hat{\theta}_i = x_i$. Applying the transformation to both $x_i$ and $\theta_i$, the total squared loss using the naive estimates is $17.56$. However, if the James-Stein estimator is applied to the transformed $x_i$, the total squared loss is $5.01$ - an efficiency of $3.5 = \dfrac{17.56}{5.01}$.


\begin{array}{c c c c c} \hline
x_i & \textrm{Transformed }x_i & \textrm{JS transformed }x_i & \textrm{Back-transformed JS Estimates} & \theta_i\\ \hline
        0.400     & -1.351   &    -2.906   &         0.290 & 0.346\\
        0.378     & -1.653   &    -2.969   &         0.286 & 0.298\\
        0.356     & -1.960   &    -3.033   &         0.282 & 0.276\\
        0.333     & -2.284   &    -3.101   &         0.277 & 0.222\\
        0.311     & -2.600   &    -3.167   &         0.273 & 0.273\\
        0.311     & -2.600   &    -3.167   &         0.273 & 0.270\\
        0.289     & -2.922   &    -3.235   &         0.268 & 0.263\\
        0.267     & -3.252   &    -3.304   &         0.264 & 0.210\\
        0.244     & -3.606   &    -3.378   &         0.259 & 0.269\\
        0.244     & -3.606   &    -3.378   &         0.259 & 0.230\\
        0.222     & -3.955   &    -3.451   &         0.254 & 0.264\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.256\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.303\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.264\\
       0.222      &-3.955    &   -3.451    &        0.254 & 0.226\\
      0.200      &-4.317     &  -3.526     &       0.249 & 0.285\\
       0.178     & -4.694    &   -3.605    &        0.244 & 0.316\\
       0.156     & -5.090   &    -3.688    &        0.239 & 0.200\\ \hline
\end{array}

(R code for these calculations may be found on my GitHub.)
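
The code there is more polished, but a rough sketch of the calculation - transform both columns, shrink on the transformed scale, compare losses, and back-transform with the inverse of the transformation, $x = \frac{\sin(z/\sqrt{n}) + 1}{2}$ - looks something like this:

```r
n <- 45
arcsine_transform <- function(p) sqrt(n) * asin(2 * p - 1)
inverse_transform <- function(z) (sin(z / sqrt(n)) + 1) / 2   # undo the transform

x     <- c(0.400, 0.378, 0.356, 0.333, 0.311, 0.311, 0.289, 0.267, 0.244,
           0.244, 0.222, 0.222, 0.222, 0.222, 0.222, 0.200, 0.178, 0.156)
theta <- c(0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
           0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.285, 0.316, 0.200)

james_stein <- function(z) {                  # same estimator as sketched earlier
  k <- length(z); zbar <- mean(z)
  zbar + (1 - (k - 3) / sum((z - zbar)^2)) * (z - zbar)
}

z      <- arcsine_transform(x)
z_true <- arcsine_transform(theta)
z_js   <- james_stein(z)

sum((z - z_true)^2)                # naive (MLE) total squared loss, about 17.6
sum((z_js - z_true)^2)             # James-Stein total squared loss, about 5.0
round(inverse_transform(z_js), 3)  # back-transformed James-Stein estimates
```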

In fact, the James-Stein estimator dominates the maximum likelihood estimator whenever accuracy is measured by the total squared error - outside of statistics speak, that means that no matter what your actual $\theta$s happen to be, on average, the James-Stein estimator will work better.
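
A small simulation (entirely my own illustration, with an arbitrary choice of true means) makes that average-loss advantage easy to see for at least one set of $\theta$s:

```r
# Fix a set of true means, repeatedly draw x_i ~ N(theta_i, 1), and compare the
# average total squared loss of the naive and James-Stein estimates.
set.seed(1)

james_stein <- function(x) {                  # same estimator as above
  k <- length(x); xbar <- mean(x)
  xbar + (1 - (k - 3) / sum((x - xbar)^2)) * (x - xbar)
}

theta  <- seq(-2, 2, length.out = 18)         # arbitrary "true" means
losses <- replicate(10000, {
  x <- rnorm(length(theta), mean = theta, sd = 1)
  c(mle = sum((x - theta)^2), js = sum((james_stein(x) - theta)^2))
})
rowMeans(losses)   # the James-Stein average loss comes out smaller
```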

So what, then, is a situation where you would not want to use the James-Stein estimator? The answer is if you're concerned about the worst-case scenario for individual players. The James-Stein estimator - and shrinkage estimation in general - sacrifices accuracy at the individual level in order to gain accuracy at the group level. Players like Roberto Clemente, who we know are very good, are going to get shrunk the hardest. If the only player you care about is Roberto Clemente, then using his raw April batting results is going to have the best average performance - even though at the group level, the James-Stein estimator will still "win" most of the time.


Shrinkage Estimation


In general, techniques that "shrink" estimates towards the mean will work better for group prediction than looking at each prediction individually. The James-Stein estimator, however, requires very specific assumptions - normality and equality of variances - that may be hard to meet. In future posts, I will discuss some ways to perform shrinkage estimation in other scenarios.

*Stein's estimator originally shrunk towards a fixed common value and used $k-2$ in the shrinkage factor - however, when the data are shrunk towards the estimated overall mean, $k-3$ must be used as a correction.