25 May, 2015

A Quick Check for Independence between Runs Scored and Runs Allowed


Is there any relationship between the number of runs scored in an inning and the number of runs allowed in an inning? Intuitively, it seems like there shouldn't be - but perhaps choosing to increase the rate of runs scored per inning means sacrificing defense, and therefore increasing the rate of runs allowed per inning.

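Below is a table counting innings by runs scored (RS) and runs allowed (RA) for all major league baseball teams in 2014. Each cell is the number of innings in which a team scored the row's number of runs and allowed the column's number of runs.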


        RA
RS       0     1     2     3     4     5     6     7     8     9
   0 23628  4446  2042   935   408   143    66    32    12     5
   1  4446   862   421   188    69    39    11     3     2     2
   2  2042   421   178    94    25    17     7     2     0     0
   3   935   188    94    30    11     5     2     2     0     0
   4   408    69    25    11     0     1     2     0     1     0
   5   143    39    17     5     1     2     0     0     0     0
   6    66    11     7     2     2     0     0     0     0     0
   7    32     3     2     2     0     0     0     0     0     0
   8    12     2     0     0     1     0     0     0     0     0
   9     5     2     0     0     0     0     0     0     0     0



This table does not include all innings: when a game ended with only half of the ninth inning played, that half inning has no corresponding runs scored/allowed pair. Also note that this table double counts innings, since one team's RS is another team's RA and vice versa (which is why the table is symmetric). I will look at specific teams in a moment to address this.

So, for example, in 2014 there were 23,628 innings in which a team both scored 0 runs and gave up 0 runs. How can we use this to test for a relationship between runs scored and runs allowed? A chi-squared test could work, but many of the cells contain 0s, which is not ideal. Fisher's exact test assumes that the row and column totals are fixed, which is clearly not the case here, and it is more commonly used with small samples and small contingency tables (though it is still valid for larger ones).

The solution I chose is to combine innings with 5, 6, 7, 8, or 9 runs into a single category that I called "5+", for both runs scored and runs allowed. The contingency table is then

                            RA
                RS      0     1     2     3     4    5+
                  0 23628  4446  2042   935   408   258
                  1  4446   862   421   188    69    57
                  2  2042   421   178    94    25    26
                  3   935   188    94    30    11     9
                  4   408    69    25    11     0     4
                 5+   258    57    26     9     4     2 
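The pooling step is mechanical enough to sketch in code. The analysis in this post was run in R; the snippet below is a plain Python illustration only, and the names `FULL` and `collapse` are hypothetical, made up for the example.

```python
# Inning counts from the full table above: rows are runs scored (0-9),
# columns are runs allowed (0-9).
FULL = [
    [23628, 4446, 2042, 935, 408, 143, 66, 32, 12, 5],
    [4446, 862, 421, 188, 69, 39, 11, 3, 2, 2],
    [2042, 421, 178, 94, 25, 17, 7, 2, 0, 0],
    [935, 188, 94, 30, 11, 5, 2, 2, 0, 0],
    [408, 69, 25, 11, 0, 1, 2, 0, 1, 0],
    [143, 39, 17, 5, 1, 2, 0, 0, 0, 0],
    [66, 11, 7, 2, 2, 0, 0, 0, 0, 0],
    [32, 3, 2, 2, 0, 0, 0, 0, 0, 0],
    [12, 2, 0, 0, 1, 0, 0, 0, 0, 0],
    [5, 2, 0, 0, 0, 0, 0, 0, 0, 0],
]


def collapse(table, cap):
    """Pool every row and column with index >= cap into one 'cap+' category."""
    size = cap + 1
    pooled = [[0] * size for _ in range(size)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            pooled[min(i, cap)][min(j, cap)] += count
    return pooled


COLLAPSED = collapse(FULL, 5)
for row in COLLAPSED:
    print(row)
```

Pooling only moves counts between cells, so the total number of (double-counted) innings is the same before and after.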


This is still a bit awkward - a few cells have counts less than 10* - but overall much, much better than before. Performing a chi-squared test on this contingency table (I chose to simulate the p-value in R) yields a p-value of approximately 0.178 - not significant at any conventional $\alpha$.
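For readers who want to see what the test statistic is built from, here is a minimal Python sketch of the Pearson chi-squared statistic for the pooled table. It is not the R call used for the p-value above; it only computes the statistic and its degrees of freedom, and the names `TABLE` and `chi_squared_statistic` are hypothetical.

```python
# Pooled league-wide table: rows are runs scored (0, 1, 2, 3, 4, 5+),
# columns are runs allowed (0, 1, 2, 3, 4, 5+).
TABLE = [
    [23628, 4446, 2042, 935, 408, 258],
    [4446, 862, 421, 188, 69, 57],
    [2042, 421, 178, 94, 25, 26],
    [935, 188, 94, 30, 11, 9],
    [408, 69, 25, 11, 0, 4],
    [258, 57, 26, 9, 4, 2],
]


def chi_squared_statistic(table):
    """Return the Pearson chi-squared statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if runs scored and runs allowed were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_squared_statistic(TABLE)
print(f"X^2 = {statistic:.2f} on {dof} degrees of freedom")
```

Because the table double counts innings, the statistic is computed on twice the real number of innings, which is part of the motivation for the team-by-team check in the next section.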

Individual Teams


As mentioned before, the league-wide table was cheating: I combined the data for all major league teams, so every result was double counted. One way to address this is to look at each team individually, which has the added bonus of addressing the potential issue that any dependence was washed out by the pooling. Below is a table of innings by runs scored and runs allowed for the Atlanta Braves in the 2014 season.

                              RA
                RS    0   1   2   3   4   5   6
                  0 809 193  61  28   6   6   2
                  1 125  33  13   3   1   0   0
                  2  62  15   1   1   0   0   0
                  3  26   4   4   0   1   1   0
                  4  14   1   3   0   0   0   0
                  5   2   3   0   0   0   0   0
                  6   2   1   0   0   0   0   0
                  7   0   0   0   1   0   0   0


Again, not all innings were accounted for, since games that ended with only half of the ninth inning played would not have a corresponding RS or RA. To reduce the number of cells with zeroes, innings with 4 or more runs were combined into a single category.


                       RA
                    RS    0   1   2   3  4+
                      0 809 193  61  28  14
                      1 125  33  13   3   1
                      2  62  15   1   1   0
                      3  26   4   4   0   2
                     4+  18   5   3   1   0



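The simulated p-value for the chi-squared test with 100,000 simulations is 0.308 - still not significant. (Simulated p-values vary slightly from run to run, which is presumably why the ATL entry below reads 0.309.) In fact, computing the p-value for each team's contingency table with 100,000 simulations yields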
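The idea behind a simulated p-value can be sketched with a permutation test. This is a plain Python stand-in, not the R routine used for the numbers in this post: it shuffles the runs-allowed labels across innings, which keeps both margins fixed while breaking any link between RS and RA. The function names, the seed, and the choice of 2,000 simulations (rather than 100,000, to keep pure Python fast) are all assumptions for illustration.

```python
import random

# Collapsed Braves table from above: rows are runs scored (0, 1, 2, 3, 4+),
# columns are runs allowed (0, 1, 2, 3, 4+).
BRAVES = [
    [809, 193, 61, 28, 14],
    [125, 33, 13, 3, 1],
    [62, 15, 1, 1, 0],
    [26, 4, 4, 0, 2],
    [18, 5, 3, 1, 0],
]


def chi_squared(table, expected):
    """Pearson chi-squared statistic for a table given its expected counts."""
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


def simulated_p_value(table, n_sims, seed=0):
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Shuffling keeps the margins fixed, so the expected counts never change.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    observed_stat = chi_squared(table, expected)

    # One label per inning; shuffling the RA labels breaks any RS/RA link.
    rs_labels = [i for i, total in enumerate(row_totals) for _ in range(total)]
    ra_labels = [j for j, total in enumerate(col_totals) for _ in range(total)]

    at_least_as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(ra_labels)
        shuffled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rs_labels, ra_labels):
            shuffled[i][j] += 1
        # Small tolerance guards against floating-point ties.
        if chi_squared(shuffled, expected) >= observed_stat - 1e-9:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (1 + at_least_as_extreme) / (n_sims + 1)


p_value = simulated_p_value(BRAVES, n_sims=2000)
print(f"simulated p-value: {p_value:.3f}")
```

With only a couple of thousand simulations the estimate is noisier than the 100,000-simulation figure quoted above, so expect agreement to roughly two decimal places at best.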



\begin{array}{c | c}\hline
Team & p-value \\ \hline
         ATL&       0.309\\
         ARI&       0.559\\
         BAL&       0.627\\
         BOS&       0.151\\
         CHC&       0.774\\
         CHW&       0.735\\
         CIN&       0.298\\
         CLE&       0.166\\
         COL&       0.986\\
        DET&       0.977\\
        HOU&       0.665\\
        KCR&       0.913\\
        LAA&       0.512\\
        LAD&       0.789\\
        MIA&       0.497\\
        MIL&       0.905\\
        MIN&       0.429\\
        NYM&       0.873\\
        NYY&       0.337\\
        OAK&       0.813\\
        PHI&       0.760\\
        PIT&       0.986\\
        SDP&       0.849\\
        SEA&       0.115\\
        SFG&       0.206\\
        STL&       0.902\\
        TBR&       0.629\\
        TEX&       0.126\\
        TOR&       0.597\\ \hline
\end{array}


There are no p-values below 0.1, so it looks reasonable to assume that runs scored and runs allowed per inning are independent.
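As a quick sanity check on that reading of the table, the p-values can be scanned programmatically. The dictionary below just transcribes the values listed above; the variable names are hypothetical.

```python
# Simulated p-values transcribed from the table above.
P_VALUES = {
    "ATL": 0.309, "ARI": 0.559, "BAL": 0.627, "BOS": 0.151, "CHC": 0.774,
    "CHW": 0.735, "CIN": 0.298, "CLE": 0.166, "COL": 0.986, "DET": 0.977,
    "HOU": 0.665, "KCR": 0.913, "LAA": 0.512, "LAD": 0.789, "MIA": 0.497,
    "MIL": 0.905, "MIN": 0.429, "NYM": 0.873, "NYY": 0.337, "OAK": 0.813,
    "PHI": 0.760, "PIT": 0.986, "SDP": 0.849, "SEA": 0.115, "SFG": 0.206,
    "STL": 0.902, "TBR": 0.629, "TEX": 0.126, "TOR": 0.597,
}

# Team with the smallest p-value, and every team under the 0.1 threshold.
smallest_team = min(P_VALUES, key=P_VALUES.get)
below_threshold = [team for team, p in P_VALUES.items() if p < 0.1]

print(f"{len(P_VALUES)} teams listed")
print(f"smallest p-value: {smallest_team} ({P_VALUES[smallest_team]:.3f})")
print(f"teams below 0.1: {below_threshold}")
```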

This is definitely not the only - or even the most appropriate - way to answer the question of independence between runs scored and runs allowed, but it represents an ad-hoc test of the sort that can be useful as a sanity check before proceeding to other analyses.

*The chi-squared test for independence actually places its requirement on the expected cell counts (a common rule of thumb is that they be at least 5), but looking at observed cell counts is useful as an approximation.
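To make the footnote concrete, the expected counts for the pooled league-wide table can be computed directly from its margins. This is a small Python sketch with hypothetical names; the thresholds of 10 (observed) and 5 (expected) are the ones mentioned in the text and in the rule of thumb, not anything special to the data.

```python
# Pooled league-wide table: rows are runs scored, columns are runs allowed.
LABELS = ["0", "1", "2", "3", "4", "5+"]
TABLE = [
    [23628, 4446, 2042, 935, 408, 258],
    [4446, 862, 421, 188, 69, 57],
    [2042, 421, 178, 94, 25, 26],
    [935, 188, 94, 30, 11, 9],
    [408, 69, 25, 11, 0, 4],
    [258, 57, 26, 9, 4, 2],
]

row_totals = [sum(row) for row in TABLE]
col_totals = [sum(col) for col in zip(*TABLE)]
n = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [(i, j) for i in range(len(LABELS)) for j in range(len(LABELS))]
observed_under_10 = [(i, j) for i, j in cells if TABLE[i][j] < 10]
expected_under_5 = [(i, j) for i, j in cells if expected[i][j] < 5]

print(f"cells with observed count < 10: {len(observed_under_10)}")
for i, j in expected_under_5:
    print(f"RS={LABELS[i]}, RA={LABELS[j]}: expected {expected[i][j]:.2f}")
```

The two lists do not coincide exactly, which is why observed counts are only an approximation to the condition the test actually cares about.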
