29 June, 2015

Likelihood Ratio Intervals for a Batting Average

I've previously discussed the central limit theorem and Wald theory as methods for giving intervals for parameters - in the simple case of identical, independent at-bats, with probability $\theta$ of getting a hit, both give the same result, which is

$\hat{\theta} \pm z^* \sqrt{\dfrac{\hat{\theta}(1-\hat{\theta})}{n}}$

though for more complicated problems, the two methods may not necessarily give the same result.

Those aren't the only methods for deriving confidence intervals for a batting average. In this post I'm going to derive another type of interval based, again, on the likelihood function, but using it in a different fashion than before.

All the code used to generate the images in this post may be found on my GitHub.


Likelihood Ratio


Statistical theory says that, in the case of a simple one-parameter model, a function of the ratio of two likelihoods (specifically, the likelihood at a candidate value $\theta_0$ and the likelihood at the maximum likelihood estimator $\hat{\theta}$) approximately follows a known distribution when $\theta_0$ is the true value and the sample is large:

$\Delta(\theta_0) = -2 \log\left(\dfrac{L(\theta_0)}{L(\hat{\theta})}\right) = -2[\ell(\theta_0) - \ell(\hat{\theta})] \sim \chi^2_1$

where $L(\theta)$ and $\ell(\theta)$ are the likelihood and log-likelihood functions as defined in an earlier post. This statement can be inverted to get an interval for $\theta$. For a $\chi^2_1$ distribution, the $0.95$ quantile (that is, the value $k$ so that $P(\chi^2_1 \le k) = 0.95$) is at $k = 3.84$. Hence, by taking the set of $\theta_0$ so that

$ -2[\ell(\theta_0) - \ell(\hat{\theta})] \le 3.84$

you get a 95% confidence interval for $\theta$.
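As a quick sanity check on the $3.84$ cutoff: a $\chi^2_1$ variable is the square of a standard normal, so its quantiles can be recovered from normal quantiles. Here is a minimal Python sketch using only the standard library. The function names are my own, chosen just for illustration.

```python
from math import erf, sqrt
from statistics import NormalDist

def chi2_1_cdf(k):
    """P(chi^2_1 <= k), using the fact that chi^2_1 = Z^2 for standard normal Z."""
    # P(Z^2 <= k) = P(|Z| <= sqrt(k)) = erf(sqrt(k) / sqrt(2))
    return erf(sqrt(k / 2.0))

def chi2_1_quantile(p):
    """Value k such that P(chi^2_1 <= k) = p."""
    # P(|Z| <= z) = p  means  z is the (1 + p) / 2 quantile of Z
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z

cutoff = chi2_1_quantile(0.95)
print(round(cutoff, 3))            # 3.841
print(round(chi2_1_cdf(3.84), 3))  # 0.95
```

The squared normal quantile $1.96^2 \approx 3.84$ is the same $z^*$ that appears in the Wald/CLT interval, which is why the two approaches are closely related in large samples.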

Batting Averages


Recall that for the batting average model with $P(x_i = 1) = \theta$ representing a hit, $P(x_i = 0) = 1-\theta$ representing a non-hit, and independent and identical at-bats, the log-likelihood function was

$\ell(\theta) = \sum x_i \log(\theta) + (n - \sum x_i) \log(1-\theta)$

which was maximized at the maximum likelihood estimator

$\hat{\theta} = \dfrac{\sum x_i}{n}$

Let's go back to having a player who gets $15$ hits in $n = 50$ at-bats. The maximum likelihood estimator for the batting average is then $\hat{\theta} = 15/50 = 0.300$.

Plugging $\sum x_i = 15$, $n = 50$, and $\theta = \hat{\theta} = 0.3$ into the log-likelihood equation gives $\ell(\hat{\theta}) = -30.543$. The function $\Delta$ then has the formula

$\Delta(\theta_0) = -2[\ell(\theta_0) - (-30.543)] = -2\ell(\theta_0) - 61.086$
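The two functions involved are short enough to write out directly. This is a minimal Python sketch, assuming $0 < \sum x_i < n$ so that no logarithm of zero appears. The names `log_lik` and `delta` are my own labels for $\ell$ and $\Delta$.

```python
from math import log

def log_lik(theta, hits, n):
    """Log-likelihood for `hits` hits in `n` independent, identical at-bats."""
    return hits * log(theta) + (n - hits) * log(1.0 - theta)

def delta(theta0, hits, n):
    """Likelihood ratio statistic comparing theta0 to the MLE hits / n."""
    theta_hat = hits / n
    return -2.0 * (log_lik(theta0, hits, n) - log_lik(theta_hat, hits, n))

hits, n = 15, 50
print(round(log_lik(hits / n, hits, n), 2))  # -30.54
print(round(delta(0.25, hits, n), 2))        # 0.64
print(delta(0.15, hits, n) > 3.84)           # True
```

So $\theta_0 = 0.25$ sits comfortably inside the 95% interval, while $\theta_0 = 0.15$ is outside it.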

And so a 95% confidence interval is given by the set of $\theta_0$ such that $\Delta(\theta_0) \le 3.84$. This inequality is not easy to solve by hand, so the easiest option is to solve it numerically. Graphing the function $\Delta(\theta_0)$ with a horizontal line at $3.84$ shows the interval as the stretch of the curve that lies below the line.


You need to figure out where the curve crosses the line. There are a few ways to do this, but the easy way (which I did) is to just calculate $\Delta(\theta_0)$ over a fine grid of values (for a proportion, $\theta$ is confined to the range from $0$ to $1$, so this is easy) and figure out which values on either side of $\hat{\theta}$ are closest to $3.84$. Doing so gave me a 95% confidence interval of $(0.185, 0.435)$.
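The grid approach can be sketched in a few lines of Python. This is an illustration under my own assumptions (a grid of step $0.001$, the rounded cutoff $3.84$, and $0 < \sum x_i < n$), not necessarily the exact code behind the plots.

```python
from math import log

def log_lik(theta, hits, n):
    """Log-likelihood for `hits` hits in `n` independent, identical at-bats."""
    return hits * log(theta) + (n - hits) * log(1.0 - theta)

def delta(theta0, hits, n):
    """Likelihood ratio statistic comparing theta0 to the MLE hits / n."""
    theta_hat = hits / n
    return -2.0 * (log_lik(theta0, hits, n) - log_lik(theta_hat, hits, n))

def lr_interval(hits, n, cutoff=3.84, steps=1000):
    """On each side of the MLE, pick the grid value whose delta is closest to the cutoff."""
    theta_hat = hits / n
    grid = [i / steps for i in range(1, steps)]  # 0.001, ..., 0.999
    below = [t for t in grid if t < theta_hat]
    above = [t for t in grid if t > theta_hat]

    def closest(values):
        return min(values, key=lambda t: abs(delta(t, hits, n) - cutoff))

    return closest(below), closest(above)

print(lr_interval(15, 50))  # (0.185, 0.435)
```

Splitting the grid at $\hat{\theta}$ matters: $\Delta$ decreases down to zero at $\hat{\theta}$ and increases afterwards, so there is exactly one crossing on each side.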

Advantages/Disadvantages


Unlike the Wald and CLT-based intervals, this interval does not rely on the sampling distribution of $\hat{\theta}$ being approximately normal (though the $\chi^2_1$ cutoff is itself a large-sample approximation), and so it behaves more sensibly for small sample sizes. In fact, if you use it for $1$ hit in $n = 3$ at-bats, the graph of the delta function is noticeably asymmetric around $\hat{\theta} = 1/3$, and the method gives a 95% confidence interval of $(0.023, 0.839)$ - not a very useful interval, true, but 3 at-bats is almost no information!

(And note here that if you attempted to do a Wald/CLT interval, you would get $(-0.200, 0.867)$, which includes impossible negative values for the batter's true average.)
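To compare the two methods side by side for the small-sample case, here is a Python sketch that finds the crossing points by bisection rather than a grid (one of the other ways to do it). The helper names, the $10^{-9}$ offset from the boundaries, and the fixed iteration count are my own choices for illustration.

```python
from math import log, sqrt

def log_lik(theta, hits, n):
    """Log-likelihood for `hits` hits in `n` independent, identical at-bats."""
    return hits * log(theta) + (n - hits) * log(1.0 - theta)

def delta(theta0, hits, n):
    """Likelihood ratio statistic comparing theta0 to the MLE hits / n."""
    theta_hat = hits / n
    return -2.0 * (log_lik(theta0, hits, n) - log_lik(theta_hat, hits, n))

def bisect_root(f, lo, hi, iterations=100):
    """Bisection, assuming f(lo) and f(hi) have opposite signs."""
    lo_positive = f(lo) > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == lo_positive:
            lo = mid  # root lies between mid and hi
        else:
            hi = mid  # root lies between lo and mid
    return (lo + hi) / 2.0

def lr_interval(hits, n, cutoff=3.84, eps=1e-9):
    """Likelihood ratio interval: where delta crosses the cutoff on each side of the MLE."""
    theta_hat = hits / n
    g = lambda t: delta(t, hits, n) - cutoff
    return bisect_root(g, eps, theta_hat), bisect_root(g, theta_hat, 1.0 - eps)

def wald_interval(hits, n, z=1.96):
    """Wald/CLT interval: estimate plus or minus z standard errors."""
    theta_hat = hits / n
    half_width = z * sqrt(theta_hat * (1.0 - theta_hat) / n)
    return theta_hat - half_width, theta_hat + half_width

print([round(x, 3) for x in lr_interval(1, 3)])    # [0.023, 0.839]
print([round(x, 3) for x in wald_interval(1, 3)])  # [-0.2, 0.867]
```

The likelihood ratio interval can never leave the range from $0$ to $1$, because $\Delta$ grows without bound as $\theta_0$ approaches either boundary.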

What the interval does depend on, however, is the curvature of the likelihood function at the maximum likelihood estimator $\hat{\theta}$. This is easy to check in one-dimensional cases, but not as easy when the dimensionality of the problem grows.

Another advantage is that the likelihood-based interval is invariant to transformation - that is to say, if you used this method to obtain an interval for a batter's odds of getting a hit, and transformed it back onto the average scale, you would get the same interval as if you had originally done it on the average scale. This is not true for the Wald/CLT intervals.
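The invariance claim can be checked numerically for the $15$-for-$50$ example. In the Python sketch below, the odds are $\omega = \theta/(1-\theta)$, so the log-likelihood on the odds scale is $\sum x_i \log(\omega) - n \log(1+\omega)$. The Wald interval on the odds scale uses a delta-method standard error, which is my own assumption about how one would build it. All names are illustrative.

```python
from math import log, sqrt

HITS, N, CUTOFF, Z = 15, 50, 3.84, 1.96

def bisect_root(f, lo, hi, iterations=200):
    """Bisection, assuming f(lo) and f(hi) have opposite signs."""
    lo_positive = f(lo) > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def lr_interval(log_lik, mle, lo, hi):
    """Likelihood ratio interval for any one-parameter log-likelihood."""
    g = lambda p: -2.0 * (log_lik(p) - log_lik(mle)) - CUTOFF
    return bisect_root(g, lo, mle), bisect_root(g, mle, hi)

def ll_theta(theta):
    """Log-likelihood on the batting average scale."""
    return HITS * log(theta) + (N - HITS) * log(1.0 - theta)

def ll_odds(omega):
    """The same log-likelihood written in terms of the odds omega."""
    return HITS * log(omega) - N * log(1.0 + omega)

def to_average(omega):
    """Map odds back to the batting average scale."""
    return omega / (1.0 + omega)

theta_hat = HITS / N
odds_hat = theta_hat / (1.0 - theta_hat)

# Likelihood ratio intervals: directly, and via the odds scale
lr_direct = lr_interval(ll_theta, theta_hat, 1e-9, 1.0 - 1e-9)
lr_via_odds = tuple(to_average(w) for w in lr_interval(ll_odds, odds_hat, 1e-9, 1e6))

# Wald intervals: directly, and via the odds scale (delta-method standard error)
se_theta = sqrt(theta_hat * (1.0 - theta_hat) / N)
se_odds = se_theta / (1.0 - theta_hat) ** 2
wald_direct = (theta_hat - Z * se_theta, theta_hat + Z * se_theta)
wald_via_odds = tuple(to_average(odds_hat + s * Z * se_odds) for s in (-1, 1))

print([round(x, 3) for x in lr_direct])      # [0.185, 0.435]
print([round(x, 3) for x in lr_via_odds])    # [0.185, 0.435]
print([round(x, 3) for x in wald_direct])    # [0.173, 0.427]
print([round(x, 3) for x in wald_via_odds])  # [0.145, 0.408]
```

The two likelihood ratio intervals agree because $\Delta$ takes the same value at $\omega_0$ as at the corresponding $\theta_0$. The two Wald intervals disagree because symmetry around the estimate on one scale is not symmetry on the other.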
