Sunday, January 31, 2021

Splitting defensive credit between pitchers and fielders (Part III)

(This is part 3. Part 1 is here; part 2 is here.)

UPDATE, February 1, 2021: Thanks to Chone Smith in the comments, who pointed out a discrepancy. I investigated and found an error in my code. I've updated this post -- specifically, the root mean square error and the final equation. The description of how everything works remains the same.

------

Last post, we estimated that in 2018, Phillies fielders were 3 outs better than league average when Aaron Nola was on the mound. That estimate was based on the team's BAbip and Nola's own BAbip.

Our first step was to estimate the Phillies' overall fielding performance from their BAbip. We had to do that because BAbip is a combination of both pitching and fielding, and we had to guess how to split those up. To do that, we just used the overall ratio of fielding BAbip to overall BAbip, which was 47 percent. So we figured that the Phillies fielders were -24, which is 47 percent of their overall park-adjusted -52.

We can do better than that, because, at least for recent years, we have actual fielding data to use instead of that estimate. Statcast tells us that the Phillies fielders were -39 outs above average (OAA) for the season*. That's 75 percent of the overall park-adjusted -52, not 47 percent ... but still well within the range of typical team-to-team variation.

(*The published estimate is -31, but I'm adding 25 percent (per Tango's suggestion) to account for games not included in the OAA estimate.)

So we can get much more accurate by starting with the true zone fielding number of -39, instead of the weaker estimate of -24.

-------

First, let's convert the -39 back to BAbip, by dividing it by 3903 BIP. That gives us ... almost exactly -10 points.

The SD of fielding talent is 6.1. The SD of fielding luck in 3903 BIP is 3.65. So it works out that luck is 2.6 of the 10 points, and talent is the remaining 7.3. (That's because 2.6 = 10 * 3.65^2/(3.65^2 + 6.1^2).)

We have no reason (yet) to believe Nola is any different from the rest of the team, so we'll start out with an estimate that he got team average fielding talent of -7.3, and team average fielding luck of -2.6.

Nola's BAbip was .254, in a league that was .296. That's an observed 41-point benefit. But, with fielders that averaged -.0073 in talent and -.0026 in luck, in a park that was +.0025, that +41 becomes +48.5.

That's what we have to break down.
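If you want to follow along in code, here's a quick Python sketch of the arithmetic so far. The variable names are mine, and the numbers are the ones from the text:

team_oaa = -39                  # Statcast OAA, bumped 25% for missing games
team_bip = 3903
team_points = team_oaa / team_bip * 1000        # about -10 points of BAbip

sd_fielding_talent = 6.1
sd_fielding_luck = 3.65                         # for 3903 BIP

luck_share = sd_fielding_luck**2 / (sd_fielding_luck**2 + sd_fielding_talent**2)
team_luck = team_points * luck_share            # about -2.6 points
team_talent = team_points - team_luck           # about -7.3 points

# Nola: 41 points better than league, then adjusted for his fielders'
# estimated talent and luck, and for the park.
park = 2.5
nola_adjusted = 41 - team_talent - team_luck - park    # about +48.5
print(team_luck, team_talent, nola_adjusted)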

Here's Nola's SD breakdown, for his 519 BIP. We will no longer include fielding talent in the chart, because we're using the fixed team figure for Nola, which is estimated elsewhere and not subject to revision. But we keep a reduced SD for fielding luck relative to team, because that's different for every pitcher.

9.4 fielding luck
7.6 pitching talent
17.3 pitching luck
1.5 park
--------------------
21.2 total

Converting to percentages:

20% fielding luck
13% pitching talent
67% pitching luck
1% park
--------------------
100% total

Using the above percentages, the 48.5 becomes:

+ 9.5 points fielding luck
+ 6.3 points pitching talent
+32.5 points pitching luck
+ 0.2 points park
-------------------
+48.5 points
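Those numbers are just the adjusted 48.5 multiplied by each component's share of the total variance. Here's the same thing as a sketch, for anyone who wants to reproduce it:

sds = {"fielding luck": 9.4, "pitching talent": 7.6,
       "pitching luck": 17.3, "park": 1.5}
total_var = sum(sd ** 2 for sd in sds.values())

adjusted = 48.5
for name, sd in sds.items():
    share = sd ** 2 / total_var
    print(name, round(100 * share), "%,", round(share * adjusted, 1), "points")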

Adding back in the -7.3 points for observed Phillies talent, -2.6 for Phillies luck, and 2.5 points for the park, gives

-7.3 points fielding talent [0 - 7.3]
+6.9 points fielding luck [+9.5 - 2.6]
+6.3 points pitching talent
+32.5 points pitching luck
+2.7 points park [0.2 + 2.5]
-----------------------------------------
41 points

Stripping out the two fielding rows:

-7.3 points fielding talent
+6.9 points fielding luck
-----------------------------
-0.4 points fielding

The conclusion: instead of hurting him by 10 points, as the raw team BAbip might suggest, or helping him by 6 points, as we figured last post ... Nola's fielders only hurt him by 0.4 points. That's less than a fifth of a run. Basically, Nola got league-average fielding.

--------

Like before, I ran this calculation for all the pitchers in my database. Here are the correlations to actual "gold standard" OAA behind the pitcher:

r=0.23 assume pitcher fielding BAbip = team BAbip
r=0.37 BAbip method from last post
r=0.48 assume pitcher OAA = team OAA
r=0.53 this method

And the root mean square error:

13.7 assume pitcher fielding BAbip = team BAbip
11.3 BAbip method from last post
10.2 assume pitcher OAA = team OAA
10.0 this method

-------

Like in the last post, here's a simple formula that comes very close to the result of all these manipulations of SDs:

F = 0.8*T + 0.2*P

Here, "F" is fielding behind the pitcher, which is what we're trying to figure out. "T" is team OAA/BAbip. "P" is player BAbip compared to league.

Unlike the last post, here the team *does* include the pitcher you're concerned with. We had to do it this way because presumably we don't have OAA data for the team without the pitcher. (If we did, we'd just subtract it from the team total and get the pitcher's number directly!)

It looks like 20% of a pitcher's discrepancy is attributable to his fielders. That number is for workloads similar to those in my sample -- around 175 IP. It does vary with playing time, but only slightly. At 320 IP, you can use 19% instead. At 40 IP, you can use 22%. Or, just use 20% for everyone, and you won't be too far wrong.
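In code, the shortcut is a one-liner. (The sign convention in this sketch is mine: both inputs and the output are in points of BAbip, with positive meaning better-than-average defensive results.)

def fielding_behind_pitcher(team_oaa_points, pitcher_points_vs_league, w=0.20):
    """F = (1-w)*T + w*P, with w about 0.19 to 0.22 depending on workload."""
    return (1 - w) * team_oaa_points + w * pitcher_points_vs_league

# Nola 2018: team fielders about -10 points (OAA per BIP), Nola about +41
# points better than league. The result is close to zero -- basically
# league-average fielding behind him, matching the conclusion above.
print(fielding_behind_pitcher(-10, 41))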

-------

Full disclosure: the real life numbers for 2017-19 are different. The theory is correct -- I wrote a simulation, and everything came out pretty much perfect. But on real data, not so perfect.

When I ran a linear regression to predict OAA from team and player BIP, it didn't come out to 20%. It came out to only about 11.5%. The 95% confidence interval only brings it up to 15% or 16%.

The same thing happened for the formula from the last post: instead of the predicted 26%, the actual regression came out to 17.5%.

For the record, these are the empirical regression equations, all numbers relative to league:

F = 0.23*(Team BAbip without pitcher) + 0.175*P
F = 0.92*(Team OAA/BIP including pitcher) + 0.115*P

Why so much lower than expected? I'm pretty sure it's random variation. The empirical estimate of 11.5% is very sensitive to small variations in the seasonal balance of pitching and fielding luck vs. talent -- so sensitive that the difference between 11.5% and 20% is not statistically significant. Also, the actual number changes from year to year because of that variation. So, I believe that the 20% number is correct as a long-term average, but for the seasons in the study, the actual number is probably somewhere between 11.5% and 20%.

I should probably explain that in a future post. But, for now, if you don't believe me, feel free to use the empirical numbers instead of my theoretical ones. Whether you use 11.5% or 20%, you'll still be much more accurate than using 100%, which is effectively what happens when you use the traditional method of assigning the overall team number equally to every pitcher.
















posted by Phil Birnbaum @ 1/31/2021 02:47:00 PM 5 comments

Monday, January 11, 2021

Splitting defensive credit between pitchers and fielders (Part II)

(Part 1 is here. This is Part 2. If you want to skip the math and just want the formula, it's at the bottom of this post.)

------

When evaluating a pitcher, you want to account for how good his fielders were. The "traditional" way of doing that is, you scale the team fielding to the pitcher. Suppose a pitcher was +20 plays better than normal, and his team fielding was -5 for the season. If the pitcher pitched 10 percent of the team innings, you might figure the fielding cost him 0.5 runs, and adjust him from +20 to +20.5.

I have argued that this isn't right. Fielding performance varies from game to game, just like run support does. Pitchers with better ball-in-play numbers probably got better fielding during their starts than pitchers with worse ball-in-play numbers.

By analogy to run support: in 1972, Steve Carlton famously went 27-10 on a Phillies team that was 32-87 without him. Imagine how good he must have been to go 27-10 for a team that scored only 3.22 runs per game!

Except ... in the games Carlton started, the Phillies actually scored 3.76 runs per game. In games he didn't start, the Phillies scored only 3.03 runs per game.

The fielding version of Steve Carlton might be Aaron Nola in 2018. A couple of years ago, Tom Tango pointed out the problem using Nola as an example, so I'll follow his lead.

Nola went 17-6 for the Phillies with a 2.37 ERA, and gave up a batting average on balls in play (BAbip) of only .254, against a league average of .295 -- that, despite an estimate that his fielders were 0.60 runs per game worse than average. If you subtract 0.60 from Nola's stat line, you wind up with Nola's pitching equivalent to an ERA in the 1s. As a result, Baseball-Reference winds up assigning Nola a WAR of 10.2, tied with Mike Trout for best in MLB that year.

But ... could Nola really have been hurt that much by his fielders? A BAbip of .254 is already exceptionally low. An estimate of -0.60 runs per game implies his BAbip with average fielders would have been .220, which is almost unheard of.

(In fairness: the Phillies' 0.60 DRS fielding estimate, which comes from Baseball Info Solutions, is much, much worse than estimates from other sources -- three times the UZR estimate, for instance. I suspect there's some kind of scaling bug in recent BIS ratings, because, roughly, if you divide DRS by 3, you get more realistic numbers, and standard deviations that now match the other measures. But I'll save that for a future post.)

So Nola was almost certainly hurt less by his fielders than his teammates were, the same way Steve Carlton was hurt less by his hitters than his teammates were. But, how much less?

Phrasing the question another way: Nola's BAbip (I will leave out the word "against") was .254, on a team that was .306, in a league that was .295. What's the best estimate of how his fielders did?

I think we can figure that out, extending the results in my previous post.

------

First, let's adjust for park. In the five years prior to 2018, BAbip for both teams combined was .0127 ("12.7 points") better at Citizens Bank Park than in Phillies road games. Since only half of Phillies games were at home, that's 6.3 points of park factor. Since there's a lot of luck involved, I regressed 60 percent to the mean of zero (with a limit of 5 points of regression, to avoid ruining outliers like Coors Field), leaving the Phillies with 2.5 points of park factor.
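Here's the park-factor arithmetic as a small sketch, in case the recipe isn't clear. The 60 percent regression and the 5-point cap are the rules of thumb from the paragraph above; the function name is mine:

def park_factor(home_minus_road_points):
    """Park factor in points of BAbip, from the home-vs-road difference."""
    half = home_minus_road_points / 2.0            # only half the games are at home
    regression = min(0.60 * abs(half), 5.0)        # regress 60% toward zero, capped at 5 points
    return (abs(half) - regression) * (1 if half >= 0 else -1)

print(park_factor(12.7))    # about 2.5 points for Citizens Bank Park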

Now, look at how the Phillies did with all the other pitchers. For non-Nolas, the team BAbip was .3141, against a league average of .2954. Take the difference, subtract the park factor, and the Phillies were 21 points worse than average.

How much of those 21 points came from below-average fielding talent? To figure that out, here's the SD breakdown from the previous post, but adjusted. I've bumped luck upwards for the lower number of PA, dropped park down to 1.5 since we have an actual estimate, and increased the SD of pitching because the Phillies had more high-inning guys than average:

6.1 points fielding talent
3.9 points fielding luck
5.6 points pitching talent
6.8 points pitching luck
1.5 points park
---------------------------
11.5 points total

Of the Phillies' 21 points in BAbip, what percentage is fielding talent? The answer: (6.1/11.5)^2, or 28 percent. That's 5.9 points.

So, we assume that the Phillies' fielding talent was 5.9 points of BAbip worse than average. With that number in hand, we'll leave the Phillies without Nola and move on to Nola himself.
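That 28 percent is just fielding talent's share of the total variance in the chart above. As a quick sketch:

sds = {"fielding talent": 6.1, "fielding luck": 3.9,
       "pitching talent": 5.6, "pitching luck": 6.8, "park": 1.5}
total_var = sum(sd ** 2 for sd in sds.values())

talent_share = sds["fielding talent"] ** 2 / total_var      # about 0.28
phillies_points = 21                                        # park-adjusted, non-Nola
print(round(talent_share, 2), round(talent_share * phillies_points, 1))   # 0.28, 5.9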

-------

On the raw numbers, Nola was 41 points better than the league average. But, we estimated, his fielding was about 6 points worse, while his park helped him by 2.5 points, so he was really 44.5 points better.

For an individual pitcher with 700 BIP, here's the breakdown of SDs, again from the previous post:

6.1 fielding talent
7.6 fielding luck
7.6 pitching talent
15.5 pitching luck
3.5 park
---------------------
20.2 total

We have to adjust all of these for Nola.

First, fielding talent goes down to 5.2. Why? Because we estimated it from other data, so it has less variance than if we had just used the all-time average. (A simulation suggests that we multiply the 6.1 by the ratio, from the "team without Nola" case, of (SD without fielding talent) to (SD with fielding talent).)

Fielding luck and pitching luck increase because Nola had only 519 BIP, not 700.

Finally, park goes to 1.5 for the same reason as before.

5.2 fielding talent
10.0 fielding luck
7.6 pitching talent
17.3 pitching luck
1.5 park
--------------------
22.1 total

Convert to percentages:

5.5% fielding talent
20.4% fielding luck
11.8% pitching talent
61.3% pitching luck
0.5% park
---------------------
100% total

Multiply by Nola's 44.5 points:

2.5 fielding talent
9.1 fielding luck
5.3 pitching talent
27.3 pitching luck
0.2 park
--------------------
44.5 total

Now we add in our previous estimates of fielding talent and park, to get back to Nola's raw total of 41 points:
-3.4 fielding talent [2.5-5.9]
9.1 fielding luck
5.3 pitching talent
27.3 pitching luck
2.7 park [0.2+2.5]
------------------------------
41 total

Consolidate fielding and pitching:

5.6 fielding
32.6 pitching
2.7 park
-------------
41 total

Conclusion: The best estimate is that Nola's fielders actually *helped him* by 5.6 points of BAbip. That's about 3 extra outs in his 519 BIP. At 0.8 runs per out, that's 2.4 runs, in 212.1 IP, for about 0.24 WAR or 10 points of ERA.
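For anyone who wants to check the unit conversions in that last sentence, here's the arithmetic as a sketch. The 0.8 runs per out and roughly 10 runs per win are the usual rules of thumb:

babip_points = 5.6                 # estimated help from the fielders
bip = 519
ip = 212 + 1/3                     # 212.1 in baseball notation

outs = babip_points / 1000 * bip   # about 3 extra outs
runs = outs * 0.8                  # about 2.3 runs
era_points = runs / ip * 9 * 100   # about 10 points of ERA
war = runs / 10                    # roughly a quarter of a win
print(round(outs, 1), round(runs, 1), round(era_points), round(war, 2))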

Baseball-Reference had him at 60 points of ERA; we have him at 10. Our estimate brings his WAR down from 10.3 to 9.1, or something like that. (Again, in fairness, most of that difference is the weirdly-high DRS estimate of 0.60. If DRS had him at a more reasonable 0.20, we'd have adjusted him from 9.4 to 9.1, or something.)

-------

Our estimate of +3 outs is ... just an estimate. It would be nice if we had real data instead. We wouldn't have to do all this fancy stuff if we had a reliable zone-based estimate specifically for Nola.

Actually, we do! Since 2017, Statcast has been analyzing batted balls and tabulating "outs above average" (OAA) for every pitcher. For Nola, in 2018, they have +2. Tom Tango told me Statcast doesn't have data for all games, so I should multiply the OAA estimate by 1.25.

That brings Statcast to +2.5. We estimated +3. Not bad!

But Nola is just one case. And we might be biased in the case of Nola. This method is based on a pitcher of average talent. Nola is well above average, so it's likely some of the difference we attributed to fielding is really due to Nola's own BAbip pitching tendencies. Maybe instead of +3, his fielders were really +1 or something.

So I figured I'd better test other players too.

I found all pitchers from 2017 to 2019 who had Statcast estimates, with at least 300 BIP for a single team. There were a few players whose names didn't quite match between Statcast and my Lahman database, so I just let those go instead of fixing them. That left 342 pitcher-seasons. I assume almost all of them were starters.

For each pitcher, I ran the same calculation as for Nola. For comparison, I also did the "traditional" estimate where I gave the pitcher the same fielding as the rest of the team. Here are the correlations to the "gold standard" OAA:

r=0.37 this method
r=0.23 traditional

Here are the approximate root-mean-square errors (lower is better):

11.3 points of BAbip this method
13.7 points of BAbip traditional

This method is meant to be especially relevant for a pitcher like Nola, whose own BAbip is very different from his team's. Here are the root-mean-squared errors for pitchers who, like Nola, had a BAbip at least 10 plays better than their team's:

9.3 points this method
11.9 points traditional

And for pitchers at least 10 plays worse:

9.3 points this method
10.9 points traditional

------

Now, the best part: there's an easy formula to get our estimates, so we don't have to use the messy sums-of-squares stuff we've been doing so far.

We found that the original estimate for team fielding talent was 28% of observed-BAbip-without-pitcher. And then, our estimate for additional fielding behind that pitcher was 26% of the difference between that pitcher and the team. In other words, if the team's non-Nola BAbip (relative to the league) is T, and Nola's is P,

Fielders = .28T + .26(P-.28T)

The coefficients vary by numbers of BIPs. But the .28 is pretty close for most teams. And, the .26 is pretty close for most single-season pitchers: luck is 25% fielding, and talent is about 30% fielding, so no matter your proportion of randomness-to-skill, you'll still wind up between 25% and 30%.

Expanding that out gives an easier version of the fielding adjustment, which I'll print bigger.

------

Suppose you have an average pitcher, and you want to know how much his fielders helped or hurt him in a given season. You can use this estimate:

F = .21T + .26P

Where:

T is his team's BAbip relative to league for the other pitchers on the team, and

P is the pitcher's BAbip relative to league, and

F is the estimated BAbip performance of the fielders, relative to league, when that pitcher was on the mound.
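Here's the formula as a couple of lines of Python, for anyone who wants to plug in their own numbers. One caution about signs, which is my reading of the conventions here: T and P are both in points of BAbip relative to league, so a negative result means the fielders allowed fewer hits than average -- that is, they helped.

def fielders_babip_vs_league(team_points, pitcher_points):
    """F = .21T + .26P, all in points of BAbip relative to league."""
    return 0.21 * team_points + 0.26 * pitcher_points

# Nola 2018, using the park-adjusted numbers from above: the non-Nola
# Phillies were about +21 points vs. league, Nola about -38.5.
print(fielders_babip_vs_league(21, -38.5))   # about -5.6: the fielders helped by ~5.6 points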


-----

Next: Part III, splitting team OAA among pitchers.





posted by Phil Birnbaum @ 1/11/2021 04:03:00 PM 2 comments

Tuesday, December 29, 2020

Splitting defensive credit between pitchers and fielders (Part I)

(Update, 2020年12月29日: This is take 2. I had posted this a few days ago, but, after further research, I tweaked the numbers and this is the result. Explanations are in the text.)

-----

Suppose a team has a good year in terms of opposition batted ball quality. Instead of giving up a batting average on balls in play (BAbip) of .300, their opponents hit only .280. In other words, they were .020 better than average in turning (inside-the-park) batted balls into outs.

How much of those "20 points" was because of the fielders, and how much was because of the pitcher?

Thanks to previous work by Tom Tango, Sky Andrecheck, and others, I think we have what we need to figure this out. If you don't want to see the math or logic, just head to the last section of this post for the two-sentence answer.

------

In 2003, a paper called "Solving DIPS" (by Erik Allen, Arvin Hsu, Tom Tango, et al.) did a great job of trying to establish what factors affect BAbip, and in what proportion. I did my own estimation in 2015 (having forgotten about the previous paper). I'll use my breakdown here.

Looking at a large number of actual team-seasons, I found that the observed SD of BAbip was 11.2 points. I estimated the breakdown of SDs as:


7.7 fielding talent
2.5 pitching staff talent
7.1 luck
2.5 park
--------------------------
11.0 total

(If you haven't seen this kind of chart before, the "total" doesn't actually add up to the components unless you square them all. That's how SDs work -- when you have two independent variables, the SD of their sum is the square root of the sum of their squares.)
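If that's not clear, here's the two-line version:

import math
sds = [7.7, 2.5, 7.1, 2.5]      # fielding talent, pitching talent, luck, park
print(math.sqrt(sum(s ** 2 for s in sds)))    # about 11 -- the "total" line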

OK, this is where I update a bit from the numbers in the previous version of this post.

First, I'm bumping the SD of park from 2.5 points to 3.5 points, to match Tango's numbers for 1999-2002. Second, I'm bumping luck to 7.3, since that's the theoretical value (as I'll calculate later). Third, I'm bumping the pitching staff to 4.3, because after checking, it turns out I made an incorrect mathematical assumption in the previous post. Finally, fielding talent drops to 6.1 to make it all add up. So the new breakdown:


6.1 fielding talent
4.3 pitching staff talent
7.3 luck
3.5 park
--------------------------
11.0 total


----

We can use that chart to break the team's 20-point advantage into its components. But ... we can't yet calculate how much of that 20 points goes to the fielders, and how much to the pitchers. Because, we have an entry called "luck". We need to know how to break down the luck and assign it to either side.

Your first reaction might be -- it's luck, so why should we care? If we're looking to assign deserved credit, why would we want to assign randomness?

But ... if we want to know how the players actually performed, we *do* want to include the luck. We want to know that Roger Maris hit 61 home runs in 1961, even if it's undoubtedly the case that he played over his head in doing so. In this context, "luck" just means the team did somewhat better or worse than their actual talent. That's still part of their record.

Similarly here. If a team gets lucky in opponent BAbip, all that means is they did better than their talent suggests. But how much of that extra performance was the pitchers, giving up easier balls in play? And how much was the fielders, making more and better plays than expected?

That's easy to figure out if we have zone-type fielding stats, calculated by watching where the ball is hit (and sometimes how fast and at what angle), and figuring out the difficulty of every ball, and whether or not the fielders were able to turn it into an out. With those stats, we don't have to risk "blaming" a fielder for not making a play on a bloop single he really had no chance on.

So where we have those stats, and they work, we have the answer right there, and this post is unnecessary. If the team was +60 runs on balls in play, and the fielders' zone ratings add up to +30, that's half-and-half, so we can say that the 20-point BAbip advantage was 10 points pitching and 10 points fielding.

But for seasons where we don't have the zone rating, what do we do, if we don't know how to split up the luck factor?

Interestingly, it will be the stats compiled by the Zone Rating people that allow us to calculate estimates for the years in which we don't have them.

------

Intuitively, the more common "easy outs" and "sure hits" are, the less fielders matter. In fact, if *all* balls in play were 0% or 100%, fielding performance wouldn't matter at all, and fielding luck wouldn't come into play. All the luck would be in what proportion the pitcher split between 0s and 100s.

On the other hand, if all balls in play were exactly the league average of 30%, it would be the other way around. There would be no difference in the types of hits pitchers gave up, which means there would be no BAbip pitching luck at all. All the luck would be in whether the fielders handled more or fewer than 30% of the chances.

So: the more BIP are "near-automatic" hits or "near-automatic" outs, the more pitchers matter. The more BIP that could go either way, the more fielders matter.

That means we need to know the distribution of ball-in-play difficulty. And that's data we wouldn't have without the Zone Rating systems that now keep track of it.

The data I'm using comes from Sky Andrecheck, who actually published it in 2009, but I didn't realize what it could do until now. (Actually, I'm repeating some of Sky's work here, because I got his data before I saw his analysis of it. See also Tango's post at his old blog.)

Here's the distribution. Actually, I tweaked it just a tiny bit to make the average work out to .300 (.29987) instead of Sky's .310, for no other reason than I've been thinking .300 forever and didn't want to screw up and forget I need to use .310. Either way, the results that follow would be almost the same.


43.0% of BIP: .000 to .032 chance of a hit*
23.0% of BIP: .032 to .140 chance of a hit
10.3% of BIP: .140 to .700 chance of a hit
4.7% of BIP: .700 to 1.000 chance of a hit
19.0% of BIP: 1.000 chance of a hit
---------------------------------------------
overall average: really close to .300

(*Within a group, the probability is uniform, so anything between .032 and .140 is equally likely once that group is selected.)


The SD of this distribution is around .397. Over 3900 BIP, which I used to represent a team-season, it's .00636. That's the SD of pitcher luck.

The random binomial SD of BAbip over 3900 BIP is the square root of (.3)(1-.3)/3900, which comes out to .00733. That's the SD of overall luck.

Since var(overall luck) = var(pitcher luck) + var(fielder luck), we can solve for fielder luck, which turns out to be .00367.


6.36 points pitcher luck (.00636)
3.67 points fielder luck (.00367)
--------------------------------
7.33 points overall luck (.00733)
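Here's the whole luck calculation as a runnable sketch, using the distribution above (uniform within each bucket). The bucket list and variable names are mine:

import math

# (weight, low, high): chance-of-a-hit buckets from the chart above.
buckets = [(0.430, 0.000, 0.032),
           (0.230, 0.032, 0.140),
           (0.103, 0.140, 0.700),
           (0.047, 0.700, 1.000),
           (0.190, 1.000, 1.000)]

mean = sum(w * (lo + hi) / 2 for w, lo, hi in buckets)                   # about .300
second = sum(w * (lo*lo + lo*hi + hi*hi) / 3 for w, lo, hi in buckets)   # E[p^2] for each uniform piece
sd_per_bip = math.sqrt(second - mean ** 2)                               # about .397

bip = 3900
pitcher_luck = sd_per_bip / math.sqrt(bip)                   # about .0064
overall_luck = math.sqrt(mean * (1 - mean) / bip)            # about .0073
fielder_luck = math.sqrt(overall_luck**2 - pitcher_luck**2)  # about .0037
print(round(100 * pitcher_luck**2 / overall_luck**2),
      round(100 * fielder_luck**2 / overall_luck**2))        # roughly 75 / 25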

If you square all the numbers and convert to percentages, you get


75.3 percent pitcher luck
24.7 percent fielder luck
--------------------------
100.0 percent overall luck

So there it is. BAbip luck is, on average, 75 percent pitching and 25 percent fielding. Of course, it varies randomly around that, but those are the averages.

What does that mean in practice? Suppose you notice that a team from the past, which you know has average talent in both pitching and fielding, gave up 20 fewer hits than expected on balls in play. If you were to go back and watch re-broadcasts of all 162 games, you'd expect to find that the fielders made 5 more plays than expected, based on what types of balls in play they were. And, you'd expect to find that the other 15 missing hits were the result of balls having been hit a bit easier to field than average.

Again, we are not estimating talent here: we are estimating *what happened in games*. This is a substitute for actually watching the games and measuring balls in play, or having zone ratings, which are based on someone else actually having done that.

------

So, now that we know the luck breaks down 75/25, we can take our original breakdown, which was this:


6.1 fielding talent
4.3 pitching staff talent
7.3 luck
3.5 park
--------------------------
11.0 total

And split up the 7.3 points of luck as we calculated:


6.36 pitching luck
3.67 fielding luck
--------------------------
7.3 total luck

And substitute that split back in to the original:


6.1 fielding talent
3.67 fielding luck
4.3 pitching staff talent
6.36 pitching staff luck
3.5 park
--------------------------
11.0 total

Since talent+luck = observed performance, and talent and luck are independent, we can consolidate each pair of "talent" and "luck" by summing their squares and taking the square root:


7.1 fielding observed
7.7 pitching observed
3.5 park
----------------------
11.0 total

Squaring, taking percentages, and rounding, we get

42 percent fielding
48 percent pitching
10 percent park
--------------------
100 percent total
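For anyone who wants to reproduce the 42/48/10 split, here's the consolidation in a few lines, using the 75.3/24.7 luck split calculated earlier:

import math

fielding_talent, pitching_talent, luck, park = 6.1, 4.3, 7.3, 3.5
pitching_luck = luck * math.sqrt(0.753)
fielding_luck = luck * math.sqrt(0.247)

fielding = math.sqrt(fielding_talent**2 + fielding_luck**2)   # about 7.1
pitching = math.sqrt(pitching_talent**2 + pitching_luck**2)   # about 7.7

total_var = fielding**2 + pitching**2 + park**2
for name, sd in (("fielding", fielding), ("pitching", pitching), ("park", park)):
    print(name, round(100 * sd**2 / total_var), "percent")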

If you're playing in an average park, or you're adjusting for park some other way, it doesn't apply here, and you can say


47 percent fielding
53 percent pitching
---------------------
100 percent total

So now we have our answer. If you see a team's stats one year that show them to have been particularly good or bad at turning batted balls into outs, on average, after adjusting for park, 47 percent of the credit goes to the fielders, and 53 percent to the pitchers.

But it varies. Some teams might have been 40/60, or 60/40, or even 120/-20! (The latter result might happen if, say, the fielders saved 24 hits, but the pitchers gave up harder BIPs that cost 4 extra hits.)

How can you know how far a particular team is from the 47/53 average? Watch the games and calculate zone ratings. Or, just rely on someone else's reliable zone rating. Or, start with 47/53, and adjust for what you know about how good the pitching and fielding were, relative to each other. Or, if you don't know, just use 47/53 as your estimate.

To verify empirically whether I got this right, find a bunch of published Zone Ratings that you trust, and see if they work out to about 42 percent of what you'd expect if the entire excess BAbip was allocated to fielding. (I say 42 percent because I assume zone ratings correct for park.)

(Actually, I ran across about five years of data, and tried it, and it came out to 39 percent rather than 42 percent. Maybe I'm a bit off, or it's just random variation, or I'm way off and there's lots of variation.)

-------

So what we've found so far:

-- Luck in BAbip belongs 25% to fielders, 75% to pitchers;

-- For a team-season, excess performance in observed BAbip belongs 42% to fielders, 48% to pitchers, and 10% to park.

-------

That 42 percent figure is for a team-season only. For an individual pitcher, it's different.

Here's the breakdown for an individual pitcher who allows 700 BIP for the season.


6.1 fielding talent
7.6 pitching talent
17.3 luck
3.5 park
---------------------------
20.2 total

The SD of pitching talent is larger now, because you're dealing with one specific pitcher, rather than the average of all the team's pitchers (who will partially offset each other, reducing variability). Also, luck has jumped from 7.3 points to 17.3, because of the smaller sample size.

OK, now let's break up the luck portion again:


6.1 fielding talent
7.6 fielding luck
7.6 pitching talent
15.5 pitching luck
3.5 park
---------------------------
20.2 total

And consolidating:


9.75 observed fielding
17.3 observed pitching
3.5 park
---------------------------
20.2 total

Converting to percentages, and rounding:


23% observed fielding
73% observed pitching
3% park
---------------------------
100% total

If we've already adjusted for park, then

24% observed fielding
76% observed pitching
---------------------------
100% total


So it's quite different for an individual pitcher than for a team season, because luck and talent break down differently between pitchers and fielders.

The conclusion: if you know nothing specific about the pitcher, his fielders, his park, or his team, your best guess is that 25 percent of his BAbip (compared to average) came from how well his fielders made plays, and 75 percent came from what kind of balls in play he gave up.

------

Here's the two-sentence summary. On average,

-- For teams with 3900 BIP, 47 percent of BABIP is fielding and 53 percent is pitching.

-- For starters with 700 BIP, 24 percent of BABIP is fielding and 76 percent is pitching.

------

Next: Part II, where I try applying this to pitcher evaluation, such as WAR.





posted by Phil Birnbaum @ 12/29/2020 03:56:00 PM 1 comments

Thursday, September 22, 2011

The Bayesian Cy Young

At Fangraphs, Dave Cameron and Eric Seidman have a nice discussion (hat tip: Tango) on who's the better Cy Young candidate: Clayton Kershaw, or Roy Halladay?

Part of the discussion hinges on BABIP: batting average on balls in play. As Voros McCracken discovered years ago, pitchers generally don't differ much in what happens when a non-home-run ball is hit off them. Most of the overall differences between pitchers, then, are due partly to the fielders behind them, but mostly to luck.

So far in 2011, Clayton Kershaw has a BABIP of .272, which Eric describes as "absurdly low." Still, Eric thinks it might actually be skill rather than luck, since .272 isn't that much different from what Kershaw allowed in previous years. Dave argues that Kershaw's three seasons is still a fairly small sample size, and points out that most of his BABIP advantage comes from his record at home (he's about average on the road).

Anyway, my point isn't to weigh in to which one is right -- they do a fine job hashing things out in their discussion. What I want to talk about is something they both seem to agree on: that it's important whether the BABIP is luck or skill. If it's luck, that reduces Kershaw's Cy Young credentials. If it's skill, he's a better candidate.

Seems reasonable, and I don't necessarily disagree. But let's see where that logic leads.

Because, there are other kinds of luck, or factors that pitchers can't control. For instance, there's park (which is usually already adjusted for in WAR, the statistic Eric and Dave cite most in this debate).

There's also quality of opposition batting. It's probably not too hard, if you have good data, to figure out how much either of the pitchers gained by being able to pitch to inferior hitters. You could also check if one of them had the platoon advantage more often. And, if one of them pitched more at home than the other one did.

We'd probably all agree, right, that you'd want to adjust for those kinds of things if we had the information? To be clear, I'm not criticizing Dave or Eric for not spending hours figuring this stuff out. I'm just saying that if you have the data, it's relevant in comparing the pitchers.

There are other things too, that eventually we'll be able to figure out, that we can't right now because (as far as I know) the research hasn't been done. Suppose Kershaw throws a pitch at a certain speed, with a certain break, on a certain count. And, someday, we'll know that kind of pitch is swung on and missed 30% of the time, called a ball 5% of the time, called a strike 10% of the time, fouled off 10% of the time, and hit in play 45% of the time with an OPS of .850. Maybe, overall, that pitch is worth (say) +0.05 runs (in favor of the pitcher).

Once we have that kind of information, we can check for "batter swing luck". If it turns out that batters just randomly happened to go +0.03 on that pitch from Kershaw this season, instead of +0.05, we should credit him the extra 0.02, right? He delivered a certain performance, and the batters just happened to get a bit lucky on it, as if his BABIP was too high. (This measure would probably substitute for BABIP: it includes balls in play, but also home runs, swings-and-misses, and walk potential.)

So we'd adjust Kershaw and Halladay for how lucky the batters were on those swings.

That's not unrealistic, and it'll probably eventually happen, to some degree of accuracy. Here's one that probably won't, at least not for a few decades, but it works as a thought experiment.

Imagine we hook a probe to every batter's brain, so on every pitch we can tell if he's guessing fastball or curve, and if he's guessing inside or outside. After a couple of years of analyzing this data, we figure that when he guesses right, it's worth +0.1 runs (for the batter), when he guesses half-right, it's worth 0, and when he guesses wrong, it's -0.1.

That again, is something out of the control of the pitcher (especially if both batter and pitcher are randomizing using game theory). So you'd want to control for it, right? If Halladay is having a good year just because batters were unlucky enough to guess right only 23% of the time instead of 25%, you have to adjust, just like you'd adjust for a lucky BABIP.

This will change the definition of "batter swing luck," but not replace it. First, the batter may have been lucky enough to guess right, which is worth something. Then, he might have been lucky enough to get better than expected wood on the ball even controlling for the fact that he guessed right.

So you've got lots of sources of luck:

-- park
-- day/night
-- distribution of batters
-- platoon luck
-- BABIP luck
-- batter swing luck
-- batter guess luck

You'd want to adjust for all of these. Right now, as I understand WAR, we're adjusting for park and BABIP.

What about the others? Well, we can't really adjust for those. We *want* to, but we can't.

So, we make do with just park and BABIP. Still, no matter how many decimal places we go to with the debate on Kershaw/Halladay, we're still only going to have our best guess.

At least we can argue that if all the other things are random, we should still be unbiased. Right?

Well, not really. From a Bayesian standpoint, we have a pretty good idea who had more luck. It's much more likely to be Kershaw.

Why? Because Halladay's performance is much more consistent with his career than Kershaw's. Kershaw's a good pitcher, but wasn't expected to be *that* good. Halladay, on the other hand, is having a typical Halladay season. Well, a bit better than typical, but not much.

I'd be willing to bet a lot of money that if you found 50 pitchers who had a better-than-career season, by at least (say) 1.5 WAR, you would find that those 50 pitchers had above-average BABIP luck. It stands to reason. I won't make a full statistical argument, but here's a quick oversimplification of one.

A pitcher can have his talent go up or down from year to year. He can have his luck go up or down from year to year. That's four combinations. Only three of them are possibly consistent with a big improvement in WAR: talent up/luck up; talent up/luck down; talent down/luck up. Two of those have his luck going up. So, two times out of three, the pitcher was lucky.
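Here's a toy simulation of that argument, if you want to see the direction of the effect. The SDs are made up; only the sign of the answer matters:

import random

random.seed(1)
luck_in_breakout_seasons = []
for _ in range(200000):
    talent_change = random.gauss(0, 1.0)   # year-to-year change in true talent
    luck = random.gauss(0, 1.5)            # single-season luck
    if talent_change + luck > 2.0:         # a much-better-than-career season
        luck_in_breakout_seasons.append(luck)

# Average luck in those breakout seasons comes out well above zero.
print(sum(luck_in_breakout_seasons) / len(luck_in_breakout_seasons))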

The argument applies to *all* sources of luck. Even after taking BABIP into account, if a pitcher's adjusted performance is still above his career average, he's still more likely to have had good luck than bad, in other ways (batter swings, say).

I don't have an easy way to quantify this, but still I'd give you better-than-even odds that, stripping out all the above, Halladay is performing better than Kershaw -- even after adjusting for park and BABIP.

If you have two players with similar, outstanding performances, the player with the better expectation of talent is probably the one who's actually having the better year. To believe that Kershaw was really likely to have had a better year than Halladay, you really need him to have put up *much* better numbers. Either that, or you need a way to actually work out all the luck, and prove that the residual still favors Kershaw.

I should emphasize that I am NOT talking about talent here. I think most people would agree that Halladay is still more talented than Kershaw, but would nonetheless argue Kershaw might still be having the better season.

But, what I'm saying is, no, I bet Kershaw is NOT having a better season, even if his numbers look better. I'm saying that it's likely that Kershaw *is actually not pitching better*. If we had the data, it's more likely than not that we'd see that batters are just having bad luck -- not only are they (perhaps) hitting the ball directly to fielders, as BABIP suggests, but they're probably swinging and missing at hittable pitches.

---------

Another way to look at it: if two pitchers have mostly the same results, but one has better stuff, what does that mean? It means that the pitcher with the better stuff must have been unluckier than the pitcher with the worse stuff. In other words, the batters facing the better stuff must have been luckier.

We don't know for sure, of course, that Halladay had better stuff than Kershaw. But history suggests that's more likely. And so, the odds are on the side of Kershaw having been luckier than Halladay. How much so?

I don't know. One mitigating factor is that Kershaw is young, so you'd expect more of his improvement to be real. But, still, a small improvement is more likely than a large improvement, so the odds are still on the side of positive luck over negative luck.

---------

Does that take some of the fun out of the Cy Young? I think it certainly does make it a little bit less entertaining, at least until we have better data. That's because, as long as we remain ignorant of a significant amount of luck, it requires a much bigger hurdle to award the honor to anyone other than Halladay.

This is a bit counterintuitive, but it's true. Suppose a good but not great pitcher -- Matt Cain, say -- has almost exactly the same stat line as Roy Halladay, including BABIP, but is actually better in some categories. Perhaps he has a couple of extra strikeouts, and a couple fewer walks.

From the usual arguments, there would be absolutely no debate that Cain's season is better, right? He's better than Halladay in some categories, and the same as Halladay in all the others.

But ... if you're trying to bet on which player actually pitched better after removing all the luck, you'd still have to go with Halladay.

-----

UPDATE: on his blog, Tango writes,

Aside to Phil: Marcel had Kershaw with a 3.07 ERA for 2011, and Halladay at 3.04. So, while you make great points in your article, you didn’t have the right examples! Sabathia and Verlander would have been better examples.

Oops! I'll just leave it the way it is for now, but point taken.


posted by Phil Birnbaum @ 9/22/2011 12:54:00 PM 3 comments

Saturday, July 16, 2011

Minority pitchers succeed with fewer called strikes

I'm scheduled to talk about umpires and racial bias in a couple of weeks at JSM in Miami. I was hoping not to have to repeat the same old things I've been talking about for the last few years, so I decided to see if there's anything new I could find. And I think I've got something, maybe. Well, I thought I had something, and it's interesting, but I now think it might be a false alarm with respect to umpires and race.

First, a quick review (and I promise it'll be quick). The Hamermesh study of racial bias (.pdf) was based on a chart that looked like this:


Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 31.88 31.27 31.27
Hspnc Umpire-- 31.41 32.47 28.29
Black Umpire-- 31.22 31.21 32.52
--------------------------------
All Umpires -- 31.83 31.30 31.32


The numbers are the percentage of called pitches (not swung at) that the umpire called a strike. This chart is based only on low-attendance games (less than 30,000 fans), where the study's authors found the strongest effect. It's my attempt to reproduce their results from Retrosheet data (the authors didn't provide the equivalent chart to what I have here).

If you look at the chart, you will see that, for any race of pitcher or umpire, the largest percentage of pitches are strikes exactly when the race of the umpire matches the race of the pitcher. The original study did a big regression and found that this result is indeed statistically significant. They concluded that umpires are biased in favor of pitchers of their own race (or biased against pitchers of a different race).

That's all I'm going to say about that; if you want to see my arguments, you can go here. Now, I'm going to take a different route.

--------

Let's start by ignoring umpires for now, and just looking at the pitchers. The bottom row of the chart shows the overall called strike percentage of the pitchers. Let me repeat them here for clarity:

31.83 percent strikes -- white pitchers
31.30 percent strikes -- hispanic pitchers
31.32 percent strikes -- black pitchers

It looks like there are real differences between the pitchers. Now, it's *possible* that the entire effect is actually caused by biased umpires, but nobody really believes that, including the authors of the original study. Different pitchers have different attributes, and it's probably just that the white pitchers are such that they happen to throw more called strikes than the minority pitchers.

Moreover, it would appear that the white pitchers happen to be *better* than the minority pitchers, since their strike percentage is higher. In fact, I think I may have said this a few times in the past, that the white pitchers were more successful.

I was wrong. Actually, it's the minority pitchers who performed better, *despite* the fact that their called pitches were less likely to be strikes.

Here are the opposition batting records for each of the three groups of pitchers, normalized to 600 PA:

------------AB--H--2B-3B-HR-BB-SO---avg-RC27
--------------------------------------------
White .... 543 147 30 3 17 51 099 0.271 5.02
Hispanic . 541 141 28 3 17 53 108 0.261 4.71
Black .... 546 145 28 3 14 48 106 0.266 4.57

The white pitchers performed the worst, striking out fewer batters and allowing more hits and runs. The last column of the batting record is "runs created per 27 outs."

What's going on? How is it that the minority pitchers did so much better despite having fewer called strikes? My first reaction was this: perhaps the relationship between called strikes and performance is *negative*. That is, maybe having lots of called strikes means you're throwing lots of pitches right down the middle of the plate, and you're getting hammered. Logically possible, right?

But it doesn't seem to be true. I ran a regression of Component ERA vs. Called Strike Percentage for starting pitchers with 100 IP or more, and the relationship goes the way you'd think: the higher the called strikes, the lower the ERA and the more successful the pitcher. In fact, it's a pretty strong relationship: every 0.1 percentage point in called strike percentage (example: from 31.83 percent to 31.93 percent) lowers ERA by 0.11. That's almost exactly what you'd expect knowing that the difference between a ball and a strike is approximately .14 runs.

So how is it that those pitchers bucked the relationship, and had a better performance despite fewer called strikes?

I think I was able to find the answer: they compensated by having more pitches swung at. As it turns out, the benefit of an extra percentage point in pitches swung at is also positive: an increase of 0.1 percent lowers ERA by 0.13 points.

Here are the numbers for pitches swung at:

44.99 percent pitches swung at -- White
45.52 percent pitches swung at -- Hispanic
46.84 percent pitches swung at -- Black

These are large differences, more than comparable to the differences in called strike percentage.

(By the way, keep in mind that the denominators of the two measures are different. Pitches swung at is (swung at and missed + foul balls + put in play) divided by total pitches. Called strike percentage is (called strikes) / (called strikes + balls).)
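To make the two denominators concrete, here's how I'd compute the two rates from raw counts (the function and argument names are mine):

def swing_pct(swung_and_missed, fouls, in_play, total_pitches):
    """Percentage of all pitches the batter offered at."""
    return 100.0 * (swung_and_missed + fouls + in_play) / total_pitches

def called_strike_pct(called_strikes, balls):
    """Percentage of called (non-swung-at) pitches that were strikes."""
    return 100.0 * called_strikes / (called_strikes + balls)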

Here's the same 3x3 chart as earlier, but this time using swinging percentage:

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 45.01 45.51 46.92
Hspnc Umpire-- 44.63 45.61 43.59
Black Umpire-- 44.77 46.05 46.82
--------------------------------
All Umpires -- 44.99 45.52 46.84

Just like in the original chart, the numbers are higher when the umpire's race matches the pitcher's race (with the exception of black pitchers facing white umpires).

Now, I suppose you could argue that these differences, also, could be attributed to umpire bias. It's possible that, knowing that more umpires are biased against them, minority pitchers have to throw down the middle to compensate. That results in batters swinging the bat more.

The problem with that theory is that the minority pitchers *improved* under this (alleged) injustice. If it's really racist bias, shouldn't they have gotten worse? Because, if the racism actually made them compensate in such a way that they got better, why wouldn't they compensate all the time, not just for umpires of the opposite race?

If you want to hold on to the hypothesis that it's umpire bias, you have to assume that the bias backfired, and that the pitchers, in their ignorance, didn't realize that there was a way to pitch better than they were already pitching. That seems farfetched.

---------

So, the minority pitchers have a *lower* percentage of called strikes, but a *higher* percentage of swinging strikes. When I saw that, I thought it might be normal: the more batters swing, the fewer strikes remain to be called by the umpire. But, again, that turns out not to be the case. There's a strong positive relationship between called strike percentage and swinging strike percentage, with a correlation coefficient of .23 (this is for 1,350 starting pitcher seasons of 100+ IP, 2000-2009).

Why, then, are the black and hispanic pitchers bucking the trend? The only thing I can think of is that even though the correlation between called strikes and swinging strikes is positive, maybe there are certain types of pitchers who go the opposite way. For instance, maybe there are three types of pitchers:

1. Pitchers who throw right down the middle. They get a lot of swings, and, when the batter doesn't swing, it's very likely to be a strike.

2. Pitchers with poor control. They don't get a lot of swings, and, when the batter doesn't swing, it's likely to be a ball.

3. Pitchers who normally throw right down the middle, but like to waste pitches frequently (or throw a certain type of pitch that sometimes goes awry). They get a lot of swings, but, when the batter doesn't swing, it's one of those waste pitches and likely to be a ball.

Types 1 and 2 would show a positive correlation between swings and called strikes. Type 3 would show a negative correlation. If there are a lot more types 1 and 2 than type 3, the overall correlation would be positive.

So, maybe black and hispanic pitchers are more likely to be Type 3. Any other explanations?

---------

BTW, my first reaction was that this all had to do with count. In "Scorecasting," the authors found that umpires were reluctant to call a third strike or a fourth ball on a close pitch. That would explain the observations perfectly, like this: The minority pitchers get more strikeouts. So they get more two-strike pitches. Therefore, they get more batters swinging on those pitches, and also fewer called strikes on those pitches. That's enough to give us the results we saw.

Alas, the beautiful theory doesn't hold up. I reran the tables, but looking only at 0-0 pitches. Again, (a) the minority pitchers had more swings, and (b) on the remaining pitches, the minority pitchers got fewer called strikes. Numbers available on request.

---------

So what is it that the minority pitchers have in common that gives them this unusual combination of low called strikes and high swinging strikes? I don't know, but I bet someone reading this can tell me.

For the ten black pitchers in the study, I looked at their tendencies from 2000 to 2009 (even though the study was only 2004 to 2006). The difference between their swinging strike percentage and their called strike percentage was 16.04, well above the average of 13.60. What is it about them, as a group, that would explain that?

Arthur Rhodes
CC Sabathia
Darren Oliver
Dontrelle Willis
Edwin Jackson
Ian Snell
Jerome Williams
LaTroy Hawkins
Ray King
Tom Gordon

I'd give you the hispanic pitchers -- I think there's about 30 of them -- but I don't have a list handy.

---------

In any case, and getting back to the issue of umpire bias ...

This is where the false alarm comes in. When I saw that a higher called strike percentage means different things for different pitchers, I thought we might have an explanation: rather than the umpires calling more unmerited strikes, maybe it was just those pitchers pursuing a different strategy. Maybe they were occasionally deciding to pitch how the average white pitcher does -- whatever that is -- and getting more called strikes, but without a change in performance.

Alas, that's not true. *Between* races of pitchers, increased called strike percentage didn't mean better performance. But *within* races of pitchers, it did.

Here's the original 3x3 chart, but with RC27 instead of called strike percentage:

Pitcher ------ Whte Hspn Blac
------------------------------
White Umpire-- 4.97 4.77 4.49
Hspnc Umpire-- 5.15 4.59 5.88
Black Umpire-- 5.47 4.20 5.39
-----------------------------
All Umpires -- 5.02 4.71 4.57

With the exception of the bottom-right cell and the bottom-center cell, the RC27 figures match the order of the called strike figures (see the very first chart of this post). It does seem like, as a characteristic of their style, black and hispanic pitchers successfully sacrifice called strikes in exchange for swinging strikes ... but when they *do* get those called strikes from certain umpires, they do even better.

So, pitchers *do* seem to benefit from extra called strikes, once you control for who the pitcher is. So we still have the same problem we had at the beginning.

------

That problem, still, appears to be that when the pitcher was hispanic, hispanic umpires called around 40 too many strikes out of 2,864 called pitches.

40 pitches doesn't seem like a lot over three years ... but it's only over the equivalent of about 30 or 40 team-games (1,349 PA). I don't really see an argument for how those 40 pitches could have been miscalled. It can't be anything the original study controlled for ... like home/road, starter/reliever, score, identity of the pitcher, etc. It would have to be an interaction of some of those things. Like, for instance, pitcher A throws a lot of inside sliders, and umpire B likes to call those strikes, and B happened to randomly umpire a lot of A's games.

But I don't see how the numbers work out. It's still 40 pitches in 30 games. With three hispanic umpires and 30 hispanic pitchers, that's 90 possible combinations. Some are more likely than others -- we're only looking at pitchers in front of 30,000 or fewer fans, which concentrates them a bit among certain teams -- but still, 90 combinations over 30 games makes it unlikely that one or two pairs would dominate to the tune of 40 pitches.

So, I thought I had an explanation ... but, after all this, I don't think I do. I still suspect that the result is just random, and not racial bias or any other explanation, but ... that's just my opinion.

Still, I need to think some more. Now that we know that more called strikes does not *always* lead to improved performance, and that it depends on the pitcher ... can you see any arguments that I'm missing, for what else might be happening?

-------

UPDATE: OK, one more theory I thought of. Suppose pitcher style varies from game to game. Take, for instance, a hispanic pitcher. Some games, he pitches one way, and gets few called strikes and lots of swinging strikes. Other games, and independently of the umpire, he consciously decides to pitch differently, and he gets more called strikes and fewer swinging strikes.

In that case, pitches are no longer independent -- it's *games* that are independent. That means that you have to use a different statistical technique, like cluster sampling. The bottom line, there, is that the SD goes way up. The results stay the same, but the confidence interval widens and the statistical significance disappears.

So, if there's evidence that pitchers' expected percentages change on a game-by-game basis (that is, the *expectations* have to change due to pitcher behavior, not just the outcome of the game fluctuating because of random variation), that probably negates the statistical significance, which is the only reason to suspect umpire bias.



posted by Phil Birnbaum @ 7/16/2011 04:15:00 PM 7 comments

Monday, August 09, 2010

Do pitchers perform worse after a high-pitch start?

Last week, J.C. Bradbury and Sean Forman released a study to check whether throwing a lot of pitches affects a pitcher's next start. The paper, along with a series of PowerPoint slides, can be found at JC's blog, here.

There were several things that the study checked, but I'm going to concentrate on one part of it, which is fairly representative of the whole.

The authors tried to predict a starting pitcher's ERA in the following game, based on how many pitches he threw this game, and a bunch of other variables. Specifically:

-- number of pitches
-- number of days rest
-- the pitcher's ERA in this same season
-- the pitcher's age

It turned out that, controlling for the other three factors, every additional pitch thrown this game led to a .007 increase in ERA the next game.

Except that, I think there's a problem.

The authors included season ERA in their list of controls. That's because they needed a way to control for the quality of the pitcher. Otherwise, they'd probably find that throwing a lot of pitches today means you'll pitch well next time -- since the pitcher who throws 130 pitches today is more likely to be Nolan Ryan than Josh Towers.

So, effectively, they're comparing every pitcher to himself that season.

But if you compare a pitcher to himself that season, then it's guaranteed that an above-average game (for that pitcher) will be more likely to be followed by a below-average game (for that pitcher). After all, the entire set of games has to add up to "average" for that pitcher.

This is easiest to see if you consider the case where the pitcher only starts two games. If the first game is below his average, the second game absolutely must be above his average. And if the first game is above his average, the second game must be below.

The same thing holds for pitchers with more than two starts. Suppose a starter throws 150 innings, and gives up 75 runs, for an ERA of 4.50. And suppose that, today, he throws a 125-pitch complete game shutout.

For all games other than this one, his record will be 141 innings and 75 earned runs, for a 4.79 ERA. So, in his next start, you'd expect him, in retrospect, to be significantly worse than his season average of 4.50. That difference isn't caused by the 125 pitches. It's just the logical consequence that if this game was above the season average, the other games combined must be below the season average.
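Here's that example as a couple of lines of code, using runs allowed rather than earned runs (which is what my numbers use anyway):

def ra_without_game(season_ip, season_runs, game_ip, game_runs):
    """Season RA with one particular start removed."""
    return 9.0 * (season_runs - game_runs) / (season_ip - game_ip)

# 150 IP and 75 runs (4.50), minus a nine-inning shutout:
print(round(ra_without_game(150, 75, 9, 0), 2))    # 4.79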

Now, high pitch counts are associated with above-average games, and low pitch counts are associated with bad starts. So, since a player should be below average after a good start, and a high pitch start was probably a very good start, then it follows that a player should be below his average after a high pitch start. Similarly, he should be above his average after a low-pitch start. That's just an artifact of the way the study was designed, and has nothing to do with the player's arm being tired or not.

How big is the effect over the entire study? I checked. For every starter from 2000 to 2009 starting on fewer than 15 days' rest, I computed how much his season ERA would have been higher or lower had that start been eliminated completely. Then I grouped the starts by number of pitches. The results:

Pitches -- change in RA if the start is omitted
05-14: -0.09
15-24: -0.41
25-34: -0.50
35-44: -0.64
45-54: -0.50
55-64: -0.38
65-74: -0.19
75-84: -0.08
85-94: +0.01
95-104: +0.05
105-114: +0.06
115-124: +0.07
125-134: +0.07
135-144: +0.08
145-154: +0.06

(Note: even though I'm talking about ERA, I included unearned runs too. I really should say "RA", but I'll occasionally keep on saying "ERA" anyway just to keep the discussion easier to follow. Just remember: JC/Sean's data is really ERA, and mine is really RA.)
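
(For the record, the computation is simple enough to sketch. This is illustrative Python, not my actual program -- `starts` is a hypothetical pandas DataFrame with one row per start, and the column names are invented:)

# `starts` is assumed already loaded, with hypothetical columns:
# pitcher_season, ip, runs, pitches
g = starts.groupby("pitcher_season")
total_ip = g["ip"].transform("sum")
total_runs = g["runs"].transform("sum")

season_ra = 9 * total_runs / total_ip
ra_without_start = 9 * (total_runs - starts["runs"]) / (total_ip - starts["ip"])

# negative = the pitcher's season RA drops if this start is removed
starts["delta"] = ra_without_start - season_ra

# bucket by pitch count: 5-14, 15-24, 25-34, ...
starts["bucket"] = ((starts["pitches"] - 5) // 10) * 10 + 5
print(starts.groupby("bucket")["delta"].mean().round(2))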

To read one line off the chart: if you randomly found a game in which a starter threw only 50 pitches, and eliminated that game from his record, his season ERA would drop by half a run, 0.50. That's because a 50-pitch start is probably a bad outing, so eliminating it is a big improvement.

That's pretty big. A pitcher with an ERA of 4.00 *including* that bad outing might be 3.50 in all other games. And so, if he actually pitches to an ERA of around 3.50 in his next start, that would be just as expected by the logic of the calculations.

It's also interesting to note that the effect is very steep up to about 90 pitches, and then it levels off. That's probably because, past 90, whether the pitcher keeps going depends more on his perceived ability to handle the workload, and less on how many runs he's giving up on this particular day.

Finally, if you take the "if this game were omitted" ERA difference for every game, and regress it against the number of pitches, what do you get? You get that every extra pitch corresponds to a .006 increase in ERA next game -- very close to the .007 that JC and Sean found in their study.
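
Continuing the sketch above, that regression is a one-liner (same hypothetical `starts` frame):

import numpy as np

slope, intercept = np.polyfit(starts["pitches"], starts["delta"], 1)
print(round(slope, 4))   # in my data, this slope came out around .006 per pitch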

-----

So, that's an argument that suggests the result might be just due to the methodology, and not to arm fatigue at all. To be more certain, I decided to try to reproduce the result. I ran a regression to predict next game's ERA from this game's pitches, and the pitcher's season ERA (the same variables JC and Sean used, but without age and year, which weren't found to be significant). I used roughly the same database they did -- 1988 to 2009.

My result: every extra pitch was worth .005 of ERA next game. That's a bit smaller than the .007 the authors found (more so when you consider that theirs really is ERA, and mine includes unearned runs), but still consistent. (I should mention that the original study didn't do a straight-line linear regression like I did -- the authors investigated transformations that might have wound up with a curved line as best fit. However, their graph shows a line that's almost straight -- I had to hold a ruler to it to notice a slight curve -- so it seems to me that the results are indeed similar.)

Then, I ran the same regression, but, this time, to remove the flaw, I used the pitcher's ERA for that season but adjusted *to not include that particular game*. So, for instance, in the 50-pitch example above, I used 3.50 instead of 4.00.
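
In the earlier sketch, that adjusted figure is just the ra_without_start quantity, so the corrected regression looks something like this (column names still hypothetical, including next_game_ra for the following start's RA):

import statsmodels.formula.api as smf

# season RA with the current start removed, from the earlier sketch
starts["season_ra_excl"] = ra_without_start

fit = smf.ols("next_game_ra ~ pitches + season_ra_excl", data=starts).fit()
print(fit.params["pitches"], fit.pvalues["pitches"])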

Now, the results went the other way! In this regression, every additional pitch this game led to a .003 *decrease* in runs allowed next game. Moreover, the result was only marginally significant (p=.07).

So, there appears to be a much weaker relationship between pitch count and future performance when you choose a better version of ERA, one that's independent of the other variables in the regression.

However, there's still some bias there, and there's one more correction we can make. Let me explain.

-----

In 2002, Mike Mussina allowed 103 runs in 215.2 innings of work, for an RA of 4.30.

Suppose you took one of Mussina's 2002 starts, at random. On average, what should his RA that game be?

The answer is NOT 4.30. It's much higher. It's 4.89. That is, if you take Mussina's RA for every one of his 33 starts, and you average all those numbers out, you get 4.89.

Why? Because the ERA calculation, the 4.30, weights all Mussina's innings equally. But when we ask about his average ERA in a game, we want to treat all *games* equally, not innings. The July 31 game, where he pitched only 3 innings and had an RA of 21.00, gets the same weight in the per-game average as his 9-inning shutout of August 28, with an RA of 0.00.

In ERA, the 0.00 gets three times the weight of the 21.00, because it covered three times as many innings. But when we ask about ERA in a given game, we're ignoring innings and just looking at games. So the 0.00 gets only equal weight to the 21.00, not three times.

Since pitchers tend to pitch more innings in games where they pitch better, ERA gives a greater weight to those games. And that's why overall ERA is lower than averaging individual games' ERAs.
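
A two-game version of the example makes the gap obvious:

# the two games mentioned above: 3 IP / 7 R (RA 21.00), and the 9-inning shutout
games = [(3, 7), (9, 0)]                                   # (innings, runs)

innings_weighted = 9 * sum(r for _, r in games) / sum(ip for ip, _ in games)
per_game_average = sum(9 * r / ip for ip, r in games) / len(games)

print(round(innings_weighted, 2))    # 5.25 -- the ERA-style, innings-weighted figure
print(round(per_game_average, 2))    # 10.5 -- the average of the two games' RAs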

The point is: The study is trying to predict ERA for the next game. The best estimate for ERA next game is *not* the ERA for the season. That's because, as we just saw, the overall season ERA is too low to be a proper estimate of a single game's ERA. Rather, the best estimate of a game's ERA is the overall average of the individual game ERAs.

So, in the regression, instead of using plain ERA as one of the predictor variables, why not use the player's average game ERA that season? That would be more consistent with what we're trying to predict. In our Mussina example, instead of using 4.30, we'll use 4.89.

Except, of course, that we'll subtract out the current game from the average game ERA. So, if we're predicting the game after Mussina's shutout, we'll use the average game ERA from Mussina's other 32 starts, not including the shutout. Instead of 4.89, that works out to 5.04.
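
That 5.04 is just the 4.89 average with the shutout's 0.00 pulled out:

average_all_33 = 4.89
average_other_32 = (33 * average_all_33 - 0.00) / 32
print(round(average_other_32, 2))   # 5.04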

That is, I again ran a regression, trying to predict the next game's RA based on:

-- pitches thrown this game
-- pitcher's average game ERA this season for all games excluding this one.

When I did that, what happened?

The effect of pitches thrown disappeared, almost entirely. It went down to -.0004 in ERA, and wasn't even close to significant (p=.79). Basically, the number of pitches thrown had no effect at all on the next start.

-----

So I think what JC and Sean found is not at all related to arm fatigue. It's just a consequence of the fact that their model retroactively required all the starts to add up to zero, relative to that pitcher's season average. And so, when one start is positive, the other starts simply have to work out to be negative, to cancel out. That makes it look like a good start causes a bad start, which makes it look like a high-pitch start causes a bad start.

But that's not true. And, as it turns out, when we correct for the zero-sum situation, the entire effect disappears. So it doesn't look to me like the number of pitches thrown has any connection to subsequent performance.


UPDATE: I took JC/Sean's regression and added one additional predictor variable -- ERA in the first game, the same game whose pitch count is being used as a predictor.

Once you control for ERA that game, the number of pitches became completely non-significant (p=.94), and its effect on ERA was pretty much zero (-0.00014).
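
In the running sketch, that's one more term on the right-hand side (again, the column names, including era_this_game, are made up for illustration):

import statsmodels.formula.api as smf

# era_this_game: the pitcher's ERA in the current start (hypothetical column)
fit = smf.ols(
    "next_game_era ~ pitches + days_rest + season_era + age + era_this_game",
    data=starts,
).fit()
print(fit.params["pitches"], fit.pvalues["pitches"])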

That is: if you give up the same number of runs in two complete games, but one game takes you 90 pitches, and the other takes you 130 pitches ... well, there's effectively no difference in how well you'll pitch the following game.

That strongly supports the theory that the number of pitches is significant in the study's regression only because it acts as a proxy for runs allowed.



posted by Phil Birnbaum @ 8/09/2010 09:29:00 AM 16 comments
