Tuesday, April 03, 2007

Protection Exists

The sabermetric community has generally been successful at challenging baseball’s conventional wisdom. However, Bill James has noted a troubling historical pattern: sabermetric research often labels a phenomenon (e.g., clutch hitting) as non-existent when it cannot be measured or detected. James suggests that previous methods may be flawed and that we may be “underestimating the fog” (note: that's a PDF link). Similarly, baseball statisticians may simply have too few observations for what they are trying to measure. Millions of pitches and at bats may not be sufficient given the statistical methods we use and the subtle trends we are trying to find. Perhaps if baseball games lasted 100 innings and we had data sets containing 100 million at bats, we would be able to detect more of the game’s mysteries.

My suggestion?
A production function – and its inputs – can provide clearer methods for measurement.

Take protection as an example. Is a batter’s success at the plate affected by having Barry Bonds rather than Neifi Perez in the on-deck circle? Previous research has answered this question by comparing a batter’s outcomes across on-deck hitters of varying quality. Such analysis consistently arrives at the same conclusion – protection does not exist:
  • J.C. Bradbury, in his recent The Baseball Economist, states, “Protection is a myth.”
  • In Baseball Prospectus’ Baseball Between the Numbers, James Click concludes, “Batting performance does not change significantly with the quality of the following batter...”

Considering James’ fog argument, such prior research methods may be flawed. Bradbury’s regression analysis attempts to measure the effect of the on-deck hitter's quality on the current batter's outcome (his regression model has the on-deck hitter’s OPS on the right-hand side and the current batter’s outcome on the left-hand side). This approach is intuitive; in fact, my initial instinct might be to perform similar research. However, at-bat outcomes involve many moving parts (where the ball lands, the reaction of the defense, and luck, to name a few), and Bradbury is trying to measure the effect of one outcome-based rate (OPS) on another outcome. Thus, any noise or randomness in the data enters on both sides of the regression and is compounded in the findings. In discussing an earlier attempt to develop a new statistic (platoon differential), James summarizes the challenges:

“… the result embodies not just all of the randomness in two original statistics, but all of the randomness in four original statistics. Unless you have extremely stable ‘original elements’ – original statistics stabilized by hundreds of thousands of trials – then the result is, for all practical purposes, just random numbers.

We ran astray because we have been assuming that random data is proof of nothingness, when in reality random data proves nothing.”

How can we get around these issues?
Through the lens of a production function, we can analyze this same protection problem as a process involving inputs as well as outputs. For the “protection process,” inputs can describe what is going on within a pitch before the batter reacts to it (my data set tells me the location coordinates, velocity, pitch type, etc. of every pitch thrown in MLB since 2002). If protection exists, a batter’s experience at the plate would be distinctly different – in several advantageous ways – when a great hitter is standing in the on-deck circle. For example, we would expect a batter in front of Barry Bonds to see more pitches within the strike zone (the pitcher will nibble less) and more fastballs (pitcher has more control). Both MLB players and empirical research would agree that these two inputs provide a significant advantage to the batter.

My regression analysis attempts to measure the effect of a better on-deck hitter in this way, i.e., the effect on the current batter's experience in seeing both more pitches in the strike-zone and more fastballs. Specifically, either “strike-zone location: yes or no” or “fastball: yes or no” is my dependent variable (left-hand side). On the right-hand side, I include the OPS of the next hitter and then control for everything within the situation: the pitcher, the batter, pitch type, count, outs, runners, ballpark, and year.
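The specification described above can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not say whether the model is a linear probability model, a probit, or a logit, so ordinary least squares on a 0/1 outcome is an assumption here. The data are synthetic, and the variable names (next_ops, three_ball, pitcher), the single count control, and the pitcher dummies are hypothetical stand-ins for the full set of controls.

```python
import numpy as np

def fit_linear_probability(y, X):
    """OLS of a 0/1 outcome on X; returns coefficients and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

# Synthetic pitch-level data (NOT the author's data set).
rng = np.random.default_rng(0)
n_pitches = 200_000
n_pitchers = 20

next_ops = rng.uniform(0.600, 1.100, n_pitches)   # OPS of the on-deck hitter
pitcher = rng.integers(0, n_pitchers, n_pitches)  # pitcher identity
three_ball = rng.integers(0, 2, n_pitches)        # toy control for the count

# Assumed data-generating process: a small positive "protection" effect.
TRUE_EFFECT = 0.05
pitcher_effect = rng.normal(0.0, 0.03, n_pitchers)
p_zone = 0.45 + TRUE_EFFECT * next_ops + pitcher_effect[pitcher] + 0.10 * three_ball
in_zone = (rng.random(n_pitches) < p_zone).astype(float)

# Right-hand side: next batter's OPS, the count control, and one dummy per
# pitcher (the full set of dummies plays the role of the intercept).
pitcher_dummies = np.eye(n_pitchers)[pitcher]
X = np.column_stack([next_ops, three_ball, pitcher_dummies])

beta, se = fit_linear_probability(in_zone, X)
ops_coef, ops_se = beta[0], se[0]
print(f"OPS of next batter: {ops_coef:.4f} ({ops_se:.4f})")
```

The dummy variables are what "control for the pitcher" means in practice: the OPS coefficient is identified only from variation in on-deck hitters faced by the same pitcher. The same pattern would extend to batter, ballpark, and year dummies.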

Using per-pitch data from 2002 through 2006, the results show that better on-deck hitters have a positive and statistically significant effect on both the strike-zone location and fastball inputs. Hence, protection does exist insofar as a pitcher adjusts his approach and a batter enjoys multiple advantages when a good hitter is on deck.

Effect on a pitch being located within the strike-zone:

                        Coefficient    (Std Error)
OPS of next batter      .0169          (.0029)


Effect on a pitch being a fastball:

                        Coefficient    (Std Error)
OPS of next batter      .0107          (.0029)
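
As a quick check on what "significant" means for the two coefficients reported above, each estimate can be divided by its standard error. The sketch below assumes the reported standard errors are conventional and that a large-sample normal approximation applies; the helper name z_and_p is hypothetical.

```python
from math import erfc, sqrt

def z_and_p(coef, se):
    """Return the z-statistic and two-sided p-value under a standard normal."""
    z = coef / se
    p = erfc(abs(z) / sqrt(2))
    return z, p

# Coefficient and standard error on "OPS of next batter" from the two tables.
reported = {
    "in strike zone": (0.0169, 0.0029),
    "fastball": (0.0107, 0.0029),
}

stats = {name: z_and_p(coef, se) for name, (coef, se) in reported.items()}
for name, (z, p) in stats.items():
    print(f"{name}: z = {z:.2f}, two-sided p = {p:.2g}")
```

The ratios come out to roughly 5.83 and 3.69, both well beyond the usual 1.96 threshold for significance at the 5% level.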



The protection production function seems to tell us conflicting stories. The "input" findings show that protection exists, but the "output" evidence suggests that protection does not exist. So, which answer is correct? In addition to the potential randomness issue discussed earlier, outputs suffer from one other relative disadvantage – the sheer volume of data being studied is different. Analysis at the per-pitch level (inputs) employs about four times the number of observations as analysis at the per-at-bat level (outputs). Thus, while prior research may (or may not) point us in the right direction, I would argue that the production function's inputs push us much closer to the truth.
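
To see why four times the data matters, recall that the standard error of an estimated rate shrinks with the square root of the sample size. The sketch below uses a hypothetical at-bat count and a hypothetical rate, and it assumes independent observations. Pitches within the same at bat are not independent, so the factor of two should be read as an upper bound on the gain rather than a precise figure.

```python
import math

def standard_error_of_rate(p, n):
    """Standard error of an observed proportion p over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

at_bats = 900_000      # hypothetical number of at bats in a multi-year sample
pitches = 4 * at_bats  # "about four times the number of instances"
rate = 0.5             # hypothetical underlying rate; it cancels in the ratio

se_at_bat = standard_error_of_rate(rate, at_bats)
se_pitch = standard_error_of_rate(rate, pitches)
ratio = se_at_bat / se_pitch

print(f"per-at-bat SE: {se_at_bat:.5f}")
print(f"per-pitch SE:  {se_pitch:.5f}")
print(f"ratio:         {ratio:.2f}")
```

Under these assumptions, quadrupling the sample halves the standard error, so an effect too small to distinguish from noise at the at-bat level can become detectable at the pitch level.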

Lastly, moving beyond this discussion of protection, I want to be clear about my broader argument. The sabermetric community will benefit as it moves away from its relatively strict reliance on outcomes and outputs. Events on the field in any sport involve a great many processes. While outcome data (e.g., much of what you find online at great sites such as Retrosheet and Baseball-Reference) have generally been more widely available, a full picture of economic analysis in the future will rely much more heavily on whole processes and their inputs.

If you would like to contact me directly, please email me at kkovash at gmail.com.