Batting average on balls in play (BABIP), or how often a batter is awarded a hit for a batted ball (excluding home runs and fouls), has been receiving a great deal of attention recently. The reason is the now-widespread belief that once a batter puts a ball in play, whether the ball is lined right at an infielder or just a foot to his left (and thus out of reach) is largely a matter of chance. As a result, a player whose hits happen to avoid fielders more often than those of other players will post a higher BABIP, but this is often due more to luck than skill, and this luck-dependent statistic can have dramatic effects on a player’s more traditional statistics.
For example, let’s look at New York Yankee second baseman Robinson Cano’s BABIPs and triple slash rates over the past three years:
Year: AVG/OBP/SLG (BABIP)
2007: 0.306/0.353/0.488 (0.329)
2008: 0.271/0.305/0.410 (0.283)
2009: 0.318/0.351/0.509 (0.317)
While it isn’t clear which BABIP is most representative of Cano’s true BABIP, it is obvious that fluctuations in BABIP can have rather large effects on more traditional statistics. This assumes, of course, that Cano did not suddenly become a worse player in 2008 and then improve again in 2009, which could also explain the 2008 drop in production.
Examining a player’s swings in performance in certain years and looking at the BABIPs during those years leads to similar results. BABIP seems rather fickle, bouncing around frequently, and it’s usually difficult to predict a randomly moving target. The fluctuations of BABIPs appear rather random, but I wondered: is it possible that these random fluctuations, when looked at over the entire population of major league players, are similarly random each year? If so, what sort of implications could that have for predicting BABIP?
Well, as it turns out, there is a name for processes that exhibit similar statistical properties over time. A process is said to be stationary if all aspects of its behavior are unchanged by shifts in time. In particular, a weakly stationary process is one whose mean, variance, and covariance (but not necessarily skewness and kurtosis) are unchanged by time shifts. Real-life examples of stationary processes include the changes in stock prices, but not the stock prices themselves. A quick refresher: the mean of a process or distribution is a fancy way of saying its average, while the variance quantifies how “spread out” the values are around that average.
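To make the stock-price example concrete, here is a rough sketch in Python (illustrative only, not part of the original analysis): a simulated random walk stands in for a stock price, and we compare its window-by-window means and variances against those of its one-step changes.

```python
import random
import statistics

def window_stats(series, n_windows=4):
    """Split a series into equal windows; return each window's mean and variance."""
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    return ([statistics.mean(w) for w in windows],
            [statistics.pvariance(w) for w in windows])

rng = random.Random(42)
changes = [rng.gauss(0, 1) for _ in range(4000)]  # daily "price changes"

# The "price" itself is the running total of the changes (a random walk).
prices, total = [], 0.0
for c in changes:
    total += c
    prices.append(total)

change_means, change_vars = window_stats(changes)
price_means, _ = window_stats(prices)

# The changes look the same in every window (stationary); the prices do not.
print(max(change_means) - min(change_means))  # small
print(max(price_means) - min(price_means))    # much larger
```

The changes have a roughly constant mean and variance across windows, while the random walk’s window means drift far apart, which is exactly the distinction between a stationary process and a non-stationary one.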
I first sought to determine whether BABIP is a stationary process. To do this, my plan was to take all player seasons in which the player was of a certain age, then examine the BABIPs for those players as they aged. To test for stationarity, I would look at the statistical distribution of BABIPs for these players at each age; specifically, I would look at the mean of the BABIPs, as well as their variance, and hope to find that they remained constant over time. The reason for following a fixed cohort (players who all played at the starting age) is to keep the pool of players in consideration the same from year to year. This ensures that changes in BABIP are due to the players aging, not to a changing mix of players.
I used the standard and advanced batting statistics from BaseballReference.com from 1984 to 2008. I only included hitters who played at least 40 games in a given year and averaged 3.1 plate appearances per game. The 40 was somewhat arbitrarily chosen, and, while I’m not sure how many games I should have required, I don’t think the particular number changes the results of the study. After creating a list of player seasons that met these qualifications for each year between 1984 and 2008, I found all players who played at a certain age and then collected their BABIPs for each year following that year.
An example of this is the following: below are the BABIPs for players who played a year at age 19 (the table is cut off at age 26 due to spacing issues):
Player             Age 19   Age 20   Age 21   Age 22   Age 23   Age 24   Age 25   Age 26
*Ken Griffey, Jr.  0.289    0.315    0.347    0.310    0.298    0.311    0.260    0.291
Ivan Rodriguez     0.301    0.296    0.297    0.298    0.314    0.304    0.339    0.349
Alex Rodriguez     0.295    0.382    0.328    0.324    0.281    0.333    0.325    0.290
Adrian Beltre      0.232    0.314    0.309    0.294    0.273    0.253    0.325    0.281
B.J. Upton         0.336    --       0.313    0.393    0.344    --       --       --
Justin Upton       0.287    0.332    --       --       --       --       --       --
Average            0.290    0.328    0.319    0.324    0.302    0.300    0.312    0.303

Next, I computed the variance at each age, first by calculating the squared difference between a player’s BABIP and the mean BABIP for all players at that age:

       Age 19   Age 20    Age 21   Age 22   Age 23   Age 24    Age 25   Age 26
Var    0.0009   0.00086   0.0003   0.0013   0.0006   0.00086   0.0009   0.0007

The time series of the mean and variance of BABIPs can be plotted:
Note that the mean and variance are quite noisy. The mean fluctuates between 0.282 and 0.325, while the standard deviation (the square root of the variance) moves from nearly 0 to 0.034. The noisiness might partly reflect that players at these ages are still physically maturing and may be getting faster, but it is more likely due to the small number of seasons we have to work with. I chose age 19 for this example primarily to show how the calculations were carried out, and the numbers here don’t seem to suggest that BABIP is a stationary process. Now, let’s look at ages for which we have more data.
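For reference, the “Average” and “Var” rows in the cohort tables above can be reproduced directly; here is a quick Python sketch (not the original MATLAB code), hard-coding the age-19 cohort. The variances match when computed as population variances, i.e., dividing by N rather than N-1.

```python
# BABIPs by age (19-26) for the age-19 cohort above; None marks a missing season.
cohort = {
    "Ken Griffey, Jr.": [0.289, 0.315, 0.347, 0.310, 0.298, 0.311, 0.260, 0.291],
    "Ivan Rodriguez":   [0.301, 0.296, 0.297, 0.298, 0.314, 0.304, 0.339, 0.349],
    "Alex Rodriguez":   [0.295, 0.382, 0.328, 0.324, 0.281, 0.333, 0.325, 0.290],
    "Adrian Beltre":    [0.232, 0.314, 0.309, 0.294, 0.273, 0.253, 0.325, 0.281],
    "B.J. Upton":       [0.336, None, 0.313, 0.393, 0.344, None, None, None],
    "Justin Upton":     [0.287, 0.332, None, None, None, None, None, None],
}

def column(table, age_idx):
    """All non-missing BABIPs at one age index."""
    return [row[age_idx] for row in table.values() if row[age_idx] is not None]

def mean(values):
    return sum(values) / len(values)

def pvariance(values):
    """Population variance (divide by N), which matches the article's numbers."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Reproduce the "Average" and "Var" rows, skipping missing seasons.
for age_idx, age in enumerate(range(19, 27)):
    vals = column(cohort, age_idx)
    print(age, round(mean(vals), 3), round(pvariance(vals), 4))
```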
There were 945 seasons by players 27 years of age between 1984 and 2008. A graph of the average of their future BABIPs between ages 27 and 36 can be seen as the blue line below (I stopped at 36 since there were only 100 player seasons at age 36):
As you can see, the mean and variance of BABIP are fairly stable. The mean here never drops below 0.292 and never rises above 0.299, while the variance remains tightly between 0.00125 and 0.00150. Fitting a linear trend to the variance graph gives a slope of 0.000001, suggesting that the variance is largely constant over time. A closer look at the actual distribution of BABIPs can be found in Appendix A, which follows the main post.
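That trend line is an ordinary least-squares fit; a minimal sketch of the slope calculation is below. The variance values here are made up for illustration (kept inside the observed 0.00125-0.00150 band), since the actual series lives in the graph and isn’t reproduced in the text.

```python
def ols_slope(ys):
    """Least-squares slope of ys against time indices 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical variance-by-age values inside the observed 0.00125-0.00150 band.
variances = [0.00133, 0.00141, 0.00128, 0.00147, 0.00136,
             0.00129, 0.00144, 0.00138, 0.00131, 0.00140]
print(ols_slope(variances))  # near zero, i.e., essentially flat
```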
Examining the results for other ages shows similar patterns, which suggests stationarity of BABIP. By accepting this result, we’re saying that all baseball players of a certain age see their BABIPs vary similarly around the same BABIP as they grow older. While some may argue that players are more likely to see their BABIPs drop as they get older, it is possible that this is offset by a player’s increased experience or willingness to train harder in order to stay in the game at an older age. As the Red Queen said in Lewis Carroll’s Through the Looking-Glass, “it takes all the running you can do, to keep in the same place.”
Since our process appears to be stationary, we can now fit the data to certain parsimonious models; here, parsimonious refers to models without excess variables. One type of model we can now use is the autoregressive (AR) model. An autoregressive process of order p is one whose forecasted observations are modeled as a weighted average of its previous p observations plus some error. In other words, autoregressive processes follow a regression model in which Y_{t} is the “dependent” or “response” variable and past values Y_{i}, where 0 < i < t, are the “independent” or “predictor” variables.
The simplest autoregressive process is the AR(1) process, which states that under certain restrictions:
Y_{t} – u = a * (Y_{t-1} – u) + e_{t}
for all t, where |a| < 1 and e_{t} is normally distributed with mean zero and variance s^2 (also called white noise). The AR stands for autoregressive, while the 1 denotes that only the Y_{t-1} term is used to predict Y_{t}. A possible interpretation of the term a * (Y_{t-1} – u) is that it carries “memory” of the past into the current value of the process. Other models, such as moving average (MA) models and autoregressive moving average (ARMA) models, can be used to fit the data as well; some information about them is detailed in Appendix B.
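A quick simulation illustrates the AR(1) form. The parameters here (u = 0.300, a = 0.5, sigma = 0.02) are illustrative only, not fitted to any BABIP data; the point is that the process keeps returning to its long-run mean u.

```python
import random

def simulate_ar1(u, a, sigma, n, seed=0):
    """Simulate Y_t - u = a * (Y_{t-1} - u) + e_t with Gaussian white noise e_t."""
    rng = random.Random(seed)
    y, out = u, []
    for _ in range(n):
        y = u + a * (y - u) + rng.gauss(0, sigma)
        out.append(y)
    return out

# Illustrative parameters only (not fitted to any real BABIP series).
ys = simulate_ar1(u=0.300, a=0.5, sigma=0.02, n=10000)
sample_mean = sum(ys) / len(ys)
sample_var = sum((y - sample_mean) ** 2 for y in ys) / len(ys)
print(round(sample_mean, 2))  # hovers near the long-run mean 0.3
```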
If you remember from above, the one aspect of weak stationarity we haven’t examined is the covariance. A way to test whether or not we can use an AR model is to look at the autocorrelation function (ACF), which measures the correlation between observations separated by a given lag. So long as the autocorrelation function decreases as the lag increases and is not significantly different from zero at lags greater than the order of our model, we can use an AR model.
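Here is a sketch of computing a sample ACF (again in Python, with an illustrative AR(1) coefficient of 0.7 rather than anything fitted to baseball data). For an AR(1) process with coefficient a, the ACF should decay roughly geometrically: a, a^2, a^3, and so on.

```python
import random

def acf(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    m = sum(series) / n
    var = sum((x - m) ** 2 for x in series)
    cov = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return cov / var

# Simulate an AR(1) process with coefficient 0.7 (an illustrative value).
rng = random.Random(1)
y, ys = 0.0, []
for _ in range(10000):
    y = 0.7 * y + rng.gauss(0, 1)
    ys.append(y)

acf1, acf2, acf3 = acf(ys, 1), acf(ys, 2), acf(ys, 3)
# Expect roughly 0.7, 0.49, 0.34: a geometric decay toward zero.
print(round(acf1, 2), round(acf2, 2), round(acf3, 2))
```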
The autocorrelation functions are not shown here, but an example of using AR models can be seen by fitting AR(2) models to the BABIP data for player-seasons of certain ages for which there is sufficient data. To use this model to predict 2009 BABIPs, we would take players who were 27 in 2008, plug their age-25 and age-26 BABIPs into the model, and obtain predicted 2009 BABIPs. In short, these models, each fit for players of a certain age, are based on a certain number of previous years’ BABIPs.
Using MATLAB gives us the following AR(2) models:
Age 21: Y_{t} = 1 – 1.139 * (1 – Y_{t-1}) + 0.138 * (1 – Y_{t-2})
Age 22: Y_{t} = 1 – 0.7559 * (1 – Y_{t-1}) – 0.2435 * (1 – Y_{t-2})
Age 23: Y_{t} = 1 – 0.5237 * (1 – Y_{t-1}) – 0.4766 * (1 – Y_{t-2})
Age 24: Y_{t} = 1 – 0.3756 * (1 – Y_{t-1}) – 0.6241 * (1 – Y_{t-2})
Age 25: Y_{t} = 1 – 0.2386 * (1 – Y_{t-1}) – 0.7612 * (1 – Y_{t-2})
Age 26: Y_{t} = 1 – 0.3228 * (1 – Y_{t-1}) – 0.6775 * (1 – Y_{t-2})
Age 27: Y_{t} = 1 – 0.8248 * (1 – Y_{t-1}) – 0.1756 * (1 – Y_{t-2})
Age 28: Y_{t} = 1 – 0.672 * (1 – Y_{t-1}) – 0.3287 * (1 – Y_{t-2})
Age 29: Y_{t} = 1 – 1.004 * (1 – Y_{t-1}) + 0.001366 * (1 – Y_{t-2})
Age 30: Y_{t} = 1 – 0.9495 * (1 – Y_{t-1}) – 0.05216 * (1 – Y_{t-2})
Age 31: Y_{t} = 1 – 0.86 * (1 – Y_{t-1}) – 0.142 * (1 – Y_{t-2})
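The fitting itself was done in MATLAB, but evaluating a fitted model is simple arithmetic. Here is a sketch that reproduces two of the predictions in the table below, writing each model as Y_{t} = 1 + c1 * (1 – Y_{t-1}) + c2 * (1 – Y_{t-2}) with signed coefficients (so the age-26 model above has c1 = -0.3228 and c2 = -0.6775):

```python
def predict_babip(c1, c2, babip_1yr_ago, babip_2yr_ago):
    """Evaluate Y_t = 1 + c1 * (1 - Y_{t-1}) + c2 * (1 - Y_{t-2})."""
    return 1 + c1 * (1 - babip_1yr_ago) + c2 * (1 - babip_2yr_ago)

# Age-26 model applied to Robinson Cano's 2008 and 2007 BABIPs.
cano = predict_babip(-0.3228, -0.6775, 0.283, 0.329)
print(round(cano, 3))  # 0.314

# Age-29 model applied to Mark Teixeira's 2008 and 2007 BABIPs.
teixeira = predict_babip(-1.004, 0.001366, 0.316, 0.342)
print(round(teixeira, 3))  # 0.314
```

Both values match the predicted 2009 BABIPs for those players in the table below.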
Let’s use these preliminary models on the 2009 New York Yankees and 2009 Tampa Bay Rays and see how their 2009 BABIPs match up with those predicted by these models (older players are not predicted since models weren’t fitted due to sample size issues):
Player               2007 BABIP   2008 BABIP   2009 Predicted BABIP   2009 Actual BABIP
Mark Teixeira (29)   0.342        0.316        0.314                  0.284
Robinson Cano (26)   0.329        0.283        0.314                  0.314
Melky Cabrera (25)   0.295        0.271        0.286                  0.278
Nick Swisher (28)    0.301        0.249        0.266                  0.272
Dioner Navarro (25)  0.249        0.318        0.266                  0.235
Carlos Pena (31)     0.297        0.298        0.296                  0.236
Jason Bartlett (29)  0.300        0.332        0.330                  0.390
Carl Crawford (27)   0.374        0.297        0.310                  0.361
B.J. Upton (24)      0.393        0.344        0.375                  0.316
Gabe Gross (29)      0.243        0.279        0.272                  0.333
Lastly, some questions I anticipate being asked, along with my thoughts:
How is this different from xBABIP?
xBABIP, a description of which can be found here, is a pure linear regression model. Its regressors are various characteristics of a player, including line drive percentage, a measure of plate discipline, and contact rate, all of which must be measured and input into the model; its authors found that all of their variables together explained about 35% of the variation in a hitter’s BABIP. Mainly, it is a descriptive model that uses data from one year to explain BABIP in that year alone.
Here, we showed that BABIP for the population of all baseball players remained stationary over time, which allowed us to use models like autoregressive models to predict future BABIP. These models only require a certain number of previous years’ BABIPs to come up with an estimate for the next year’s BABIP. Unlike xBABIP, this model is predictive in nature and attempts to forecast future BABIPs.
Is this better than xBABIP?
There’s still a great deal of work to be done before anything conclusive can be said. Going back to look at additional data and attempting to fit better models is a start, but I think there could be an empirical reason (i.e., experience) why the mean and variance of player BABIPs would stay the same over time, which would mean that this model rests on sound assumptions.
What’s next?
I plan to go back and collect all of the yearly data from BaseballReference, then be more careful about which player-seasons are counted and why. Then, I’ll have to look more closely at how many player-seasons I want to fit my data to before trying to determine the best stationary model to use.
Also, there might be an argument for separating players based on their speed. Players who are speedsters in their early twenties might drop off quickly as they hit their thirties, so their mean BABIPs might actually decrease as they age. If that is the case, then stationarity is violated, meaning these models will no longer hold. We’ve briefly looked at Bill James’s Speed Score and examined the “fastest” players, in particular those with Speed Scores two standard deviations above the mean in any given year, but the resulting sample size is laughably small. We will have to consider relaxing the requirements for the “fastest” players, or building in other factors as well.
If anyone has any suggestions regarding any of these, please feel free to comment.
This concludes the very introductory look into stationarity of BABIP and its possible implications for finding a model that best predicts future BABIP. The appendix, which includes information on the actual distribution of BABIPs over time as well as other possible stationary models, follows, and we’ll announce any adjustments or new findings as they come.
dj
Appendix A: Distribution of BABIPs
While we found that the mean and variance of BABIPs remain constant over time, I was curious about the actual distribution of BABIPs, so I looked at the seasons at ages 27 through 37 for players who played at age 27, and I binned the BABIPs by hundredths; that is, BABIPs greater than 0.310 and less than 0.320 were put into the “0.310 bin”, with bins for below 0.180, between 0.180 and 0.190, and so on up to between 0.410 and 0.420, and finally between 0.420 and 1.000. The histograms for the player-seasons between ages 27 and 30 are below:
The results look surprisingly normal, but why just guess when you can test it? Using a “mean” and “variance” obtained by averaging the means and variances over ages 27 through 37, I demeaned and scaled the BABIPs for each player for each year, then used the Kolmogorov-Smirnov test for normality by testing the transformed BABIPs against the standard normal distribution. The results are below. The first column is the age of the BABIPs being tested; the second column indicates whether the null hypothesis of normality is rejected at the 5% significance level (1 for rejection, 0 otherwise); and the third column is the p-value, the probability that such a distribution of values could occur if truly drawn from a normal distribution.
Age   Reject H_0?   p-value
27    0             0.0997
28    0             0.0869
29    0             0.0352
30    0             0.2688
31    0             0.8715
32    0             0.7591
33    0             0.4093
34    0             0.8195
35    0             0.5836
36    0             0.9466
37    0             0.1074
Here, there is no evidence for rejecting the null hypothesis, which means there is no evidence to say that the BABIPs at each age from 27 onward are not normally distributed.
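For the curious, the statistic behind this test is easy to compute by hand. The actual results above came from MATLAB’s kstest; below is a standard-library Python sketch of the one-sample Kolmogorov-Smirnov statistic against the standard normal, applied to a toy three-point dataset (purely illustrative).

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks_statistic(standardized):
    """KS statistic D: largest gap between the empirical CDF and the normal CDF."""
    xs = sorted(standardized)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = normal_cdf(x)
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Toy data, already demeaned and scaled, for illustration only.
print(round(ks_statistic([-1.0, 0.0, 1.0]), 3))  # 0.175
```

The test then compares D against a critical value that depends on the sample size; a large D means the empirical distribution strays too far from the normal CDF.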
We can also use the Lilliefors test for normality. There is no demeaning or scaling here, as the test examines the data against a normal distribution with unknown parameters. The results are below (the format follows from above; note that the reported p-values are capped at 0.5):
Age   Reject H_0?   p-value
27    1             0.001
28    0             0.2168
29    1             0.0438
30    0             0.0907
31    0             0.1457
32    0             0.5
33    0             0.0984
34    0             0.5
35    0             0.5
36    0             0.4022
37    0             0.2851
While normality is rejected for ages 27 and 29, the p-value at the latter age is 0.0438, only just below the 5% significance level. In general, this seems to suggest that not only does BABIP have a constant mean and variance, but each value in the process is also normally distributed.
Appendix B: Other Stationary Models
Another model that requires stationary data is called the moving average (MA) model. While AR models show correlation at all lags, MA models only have correlation at short lags. The simplest moving average model is the MA(1) process, which states that:
Y_{t} – u = e_{t} – h * e_{t-1}
where e_{t} and e_{t-1} are white noise variables.
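A quick simulation (with illustrative values u = 0.3 and h = 0.6, not fitted to anything) shows the MA(1) signature mentioned above: the sample ACF is near the theoretical value -h / (1 + h^2) at lag 1 and near zero at all longer lags.

```python
import random

def sample_acf(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    m = sum(series) / n
    var = sum((x - m) ** 2 for x in series)
    cov = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return cov / var

# Simulate Y_t - u = e_t - h * e_{t-1} with illustrative u = 0.3, h = 0.6.
rng = random.Random(7)
u, h = 0.3, 0.6
e_prev = rng.gauss(0, 1)
ys = []
for _ in range(20000):
    e = rng.gauss(0, 1)
    ys.append(u + e - h * e_prev)
    e_prev = e

lag1, lag2 = sample_acf(ys, 1), sample_acf(ys, 2)
# Theory: acf(1) = -h / (1 + h^2), about -0.44 here; acf(k) = 0 for k >= 2.
print(round(lag1, 2), round(lag2, 2))
```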
Sometimes, however, you might want to fit a model to have properties of both AR and MA models. For that, we have autoregressive moving average (ARMA) models. An ARMA(p,q) model can be written as:
(Y_{t} – u) = a_{1} * (Y_{t-1} – u) + … + a_{p} * (Y_{t-p} – u) + e_{t} – h_{1} * e_{t-1} – … – h_{q} * e_{t-q}
Note that an ARMA(1,0) model reduces to an AR(1) model, while an ARMA(0,1) model reduces to an MA(1) model. Thus, we can use statistical software to fit our data for BABIPs following a given year to an ARMA model, then repeat the process for all given years.
If the process itself is not stationary but its differences are, one can use an autoregressive integrated moving average (ARIMA) model, which is essentially an ARMA model applied to the differenced series.