Talking Chop Baseball Analysis Primer: Pitching


If you’re new to the whole baseball analytics thing, pitching valuation might throw you for a loop. For position players, there may be things in WAR that you’ve never considered, but the general ideas are probably pretty similar: outs bad, not-outs good, extra bases good, fielding good, running good. For pitching, though, the paradigm is different.

The touchstone pitching stat, for ages, was ERA. At first glance, ERA seems great -- it purports to tell you, “When this guy is on the mound, here’s how many runs you can expect to score in nine innings against him (unless that pesky defense intervenes in a way that leads to more runs).” That seems pretty useful, right? But, two key principles of analysis intersect rather uncomfortably with ERA. First, we have the premise that players should be credited only for their own contributions, rather than those of their teammates. A pitcher’s ERA isn’t dependent on himself in a vacuum, it’s also dependent on his fielders making outs behind him. That makes it very difficult to say that ERA is all about the pitcher. Second, it’s always important to think about descriptive versus projection-oriented stats. Yes, ERA describes what happened, but how good of a tool is it for forecasting future run prevention? Are there things that are better?

One general analytic idea about pitching is that you can get a better measure of effective pitching by developing some kind of metric that addresses these things. That is, you want a measure that captures only the pitcher’s own contributions, and (either directly because of this, or as a separate component of the measure, or a different measure altogether), is useful for forecasting future run prevention. This general idea culminated in the concept of DIPS -- defense-independent pitching statistics. These types of metrics are the primary measures of pitcher value today.

FIP, xFIP (and xwOBA too)

FIP, which stands for “fielding-independent pitching,” is basically an attempt to recreate the same type of number as ERA (that is, runs allowed per nine innings), but without actually using anything about runs allowed, and instead focusing only on things the pitcher can control. Specifically, this refers to:

  • Walks and hit-by-pitches
  • Strikeouts
  • Home runs

These are the only results that don’t depend on anyone other than the pitcher and the hitter (catcher framing aside). (In addition, Fangraphs also includes infield pops in its fWAR calculation, because they are so routine for fielders as to be essentially automatic outs; infield pops aren’t counted as strikeouts in FIP itself, even though the effect is the same.) The basic idea is that each of these four events has a value (walks, hit-by-pitches, and homers have a positive value, i.e., they contribute to runs allowed; strikeouts have a negative value because they contribute to run prevention), and by multiplying the number of times a pitcher is involved in each event by that value, you get a sense of the pitcher’s overall run prevention aptitude based on things he can control.
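To make this concrete, here is the standard FIP formula as Fangraphs publishes it, sketched in Python. The weights (13, 3, -2) are the published ones; the constant of 3.10 is an assumption for illustration — in practice it’s recalculated each season so that league-average FIP matches league-average ERA.

```python
def fip(hr, bb, hbp, k, ip, fip_constant=3.10):
    """Fielding-independent pitching: (13*HR + 3*(BB+HBP) - 2*K) / IP + constant.

    The constant (~3.10 here, an assumed placeholder) is set each season so
    that league-average FIP equals league-average ERA.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant

# A pitcher with 20 HR, 50 BB, 5 HBP, and 200 K over 180 IP:
print(round(fip(hr=20, bb=50, hbp=5, k=200, ip=180), 2))  # -> 3.24
```

Note how the 13x weight on homers dwarfs the other terms; that imbalance matters a lot in small samples.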

Like wRC+, FIP comes in an FIP- flavor that’s scaled to 100. Since the little tailing symbol is a minus and not a plus, that means that an FIP- of 101 reflects a pitcher with an FIP one percent worse (higher) than league average, and an FIP- of 85 reflects a pitcher with an FIP 15 percent better (lower) than league average.
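The scaling works like so, sketched with a hypothetical league FIP of 4.20; the published version also applies a park adjustment before dividing by the league figure, which is glossed over here.

```python
def fip_minus(player_fip, league_fip):
    """Index a FIP against league average: 100 is average, lower is better.

    Simplified sketch -- the real calculation park-adjusts the player's
    FIP before dividing by the league figure.
    """
    return round(100 * player_fip / league_fip)

print(fip_minus(4.24, 4.20))  # one percent worse than average -> 101
print(fip_minus(3.57, 4.20))  # 15 percent better than average -> 85
```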

The math of converting FIP to a WAR value, so that you can compare pitcher value to hitter value, is more tortured than the math behind all the different hitting components. Going over all of this in detail would take ages, and for those just interested in primer-level basics, I’m not sure it’s necessary. So, consider the bulleted list below a very glossed-over overview:

  • First, start with FIP. To get fWAR, you also need to include infield flies in FIP; infield flies are basically just strikeouts.
  • Scale this infield-fly version of FIP (just going to be referred to as FIP for the rest of this bulleted list) to league-average runs allowed per nine innings, just so everything is on the same scale.
  • Adjust FIP for different run environments across the two leagues, and for park (again, just so the comparison is fair across players who play in different leagues and parks).
  • Okay, here’s where it gets really complicated, even though the effect is not huge: remember that a basic rule of thumb is that there are 10 runs in a win? Well, by changing how many runs he allows to score, a pitcher changes how many runs are necessary to win a game he participates in. As a result, the better the pitcher, the fewer runs required to win a game, hence there’s a slight boost in value to better pitchers.
  • Once you’ve done all of the above, what you have is a measure of how many runs above average the pitcher is (based on his FIP, adjusted for league/park), as well as how runs convert to wins when the game involves this pitcher. Since you know both runs above average and the runs-to-wins relationship, you can basically get wins-above-average for the pitcher. Think back to position players, where all the components are calculated as runs (or wins) above average, and then replacement level is just 20 runs below average, meaning that wins above replacement is really just separated from wins above average by 20 runs (per 600 PAs). This is going to be relevant for the next step.
  • So, I threw in that note about 20 runs, but for pitchers, that’s not quite what replacement level is. Rather, there’s more math, and it’s based around the ideas that: (1) a replacement-level reliever is different than a replacement-level starter, in that a replacement-level reliever is worse; and (2) the way to determine what constitutes replacement level is by figuring out how often a team with a replacement-level starter or bullpen effort would win a game where everything else was average. Again, the math isn’t really worth getting into, I think the important part is just the idea that yes, even pitchers are compared to replacement level (that’s what WAR is), and replacement-level performance differs. The effect of this is that relievers and starters are compared against different baselines.
  • After all of this, the wins per game above replacement is scaled so that it’s actually just on a wins basis (and not a wins per game basis) by multiplying it by the number of innings the pitcher has pitched, divided by nine. This isn’t really interesting. Moving on.
  • One important thing here: fWAR includes kind of a fudge factor for relievers based on the leverage in which they pitch. The general idea is that this fudge factor intensifies a reliever’s WAR, such that a reliever who mostly pitches high leverage and has positive WAR gets a boost, and the same reliever with negative WAR would get a penalty. It’s basically a straight multiplication factor. In theory, this could explain why similar relievers may have different fWARs. In practice, though, it doesn’t make a very large difference, in part because relievers pitch so few innings that their fWAR is usually fractional anyway. The logic behind this multiplier or fudge factor is that a reliever’s value isn’t only associated with the outs he gets, but that he allows you to deploy your other relievers to get other outs in other situations. He’s not a lone warrior, but a link in a chain. So the better relievers get credit for the fact that other relievers can be used in less-dire situations, and worse relievers get lower value for this. I’m not sure I’m a huge fan of this adjustment, because I think it kind of violates some of the implicit WAR principles about crediting players for individual achievements only, but it’s so minor that I can’t get too worried about it.
  • Lastly, there’s a boring, tiny correction to make sure that total pitcher WAR sums to 430 in a season, which needs to happen because the original calculus didn’t start with distributing 430 WAR innately across any set of innings or pitchers.
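The bulleted pipeline above, minus the leverage multiplier and the final league-wide correction, can be sketched roughly as follows. Everything here is a simplification: the input is assumed to be the infield-fly-inclusive FIP, already on a runs-per-nine scale and adjusted for league/park; the .380 replacement-level winning percentage is the commonly cited baseline for starters (relievers use a higher one, around .470); and the dynamic runs-per-win line is only an approximation of the published math.

```python
def simple_fip_war(fip_ra9, league_ra9, ip, ip_per_game=6.0, repl_win_pct=0.380):
    """A heavily simplified sketch of the FIP-to-WAR pipeline above.

    Assumptions: fip_ra9 is infield-fly-inclusive FIP, already scaled to
    runs-allowed-per-nine and adjusted for league/park; repl_win_pct is
    the starter baseline (.380 assumed; relievers would be ~.470).
    """
    # Dynamic runs-per-win: a better pitcher lowers the run environment of
    # his own games, so each win "costs" slightly fewer runs.
    rpw = (((18 - ip_per_game) * league_ra9 + ip_per_game * fip_ra9) / 18 + 2) * 1.5
    # Wins above average per nine innings pitched.
    waa_per_9 = (league_ra9 - fip_ra9) / rpw
    # Shift from average to replacement level, then scale by innings.
    war_per_9 = waa_per_9 + (0.500 - repl_win_pct)
    return war_per_9 * (ip / 9)

# A starter with a 3.50 FIP (on the RA9 scale) in a 4.50 league over 180 innings:
print(round(simple_fip_war(3.50, 4.50, 180), 1))
```

The result (around 4.5 wins for a solidly above-average starter over a full season) lands where you’d expect, which is the point of the sketch rather than decimal-level accuracy.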

Annnyywaaayyyy… that’s how you get fWAR. To summarize, fWAR for pitchers is basically what happens if you give pitchers credit for strikeouts and infield pops, and subtract credit for walks, hit by pitches, and homers. All the other events when the team is in the field are credited to (or debited from) fielders.

As a result of this, there are many things that happen to pitchers that FIP doesn’t care about. FIP doesn’t care about actual runs scored, for one thing. The pitcher may not be responsible for them, so they’re not considered. It also doesn’t care about sequencing. In a baseball game, walk, walk, homer, strikeout, strikeout, strikeout is a really bad inning for a pitcher. Meanwhile, homer, walk, walk, strikeout, strikeout, strikeout is not great, for sure, but not disastrous. FIP considers these to be the same; other measures will look on these two innings very differently. At this point, you might be going, “Wait, that’s dumb. The pitcher was definitely responsible for those three runs. He walked those guys ahead of the homer, and he allowed the homer. One inning was definitely worse than the other.” I’m actually not sure there’s a great response to this point as stated. I think the fallback, or point of cover, is that pitcher strand rates tend to be fairly random, that is, not really in a pitcher’s control. I realize that sounds kind of wild, because I just said above that walks, homers, and strikeouts are in a pitcher’s control, and now I’m saying that the order of them is not. But what we see is that pitchers don’t really tend to differ in the order in which they allow things to happen; they only tend to differ in the frequency of those things. The order just ends up being kind of random in the grand scheme of things. With all this said, if someone could somehow create a pitching effectiveness measure that was still fielding-independent but factored in sequencing, that’d be interesting for sure.

The kerfuffle with sequencing aside, there’s another thing to know about FIP. In FIP, the penalty for allowing a homer is more than six times greater than the benefit a pitcher gets for striking a guy out, and more than four times greater than the penalty for walking a guy. Homers have a really big effect on FIP. In general, if you were looking at a pitcher over the long haul, this wouldn’t matter. Maybe the pitcher would allow a ton of homers and be awful. Or maybe the pitcher would actually be pitching really well, but just so happened to allow a smattering of homers anyway. His FIP would still rise, but for the sake of forward-looking predictions, you may not be inclined to hold those homers against him. For example, if I told you that a pitcher was only going to give up two homers all year, and that happened in his first game of the year, you wouldn’t look at his FIP and say, “Hey, this guy is awful,” because as the season went on, he wouldn’t allow any more homers. The reality is that pitcher homer rates depend in large part on something obvious: how many fly balls they allow. More fly balls equals more chances for homers. To address the problem that a pitcher might just get unlucky with how many homers he allowed in a short span, there’s something called xFIP (which you can think of as expected FIP). xFIP is really simple. It’s basically just FIP, but instead of debiting the pitcher for the homers he actually allowed, it debits him for how many homers he would have allowed if an average proportion of his fly balls had become homers. In small samples, a pitcher may have a sparkly or gnarly FIP, but his xFIP will give you insight into what his FIP would have been, had he experienced an average rate of fly balls leaving the yard.
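Here is a sketch of the swap xFIP makes, reusing the standard FIP weights; the 10.5 percent league HR/FB rate and the 3.10 constant are assumptions for illustration — both vary by season.

```python
def xfip(fb, bb, hbp, k, ip, lg_hr_fb=0.105, fip_constant=3.10):
    """xFIP: FIP with actual homers replaced by expected homers, i.e.
    fly balls times the league HR/FB rate (~10.5% assumed here).
    """
    expected_hr = fb * lg_hr_fb  # homers a pitcher "should" have allowed
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant

# A pitcher with 220 fly balls, 50 BB, 5 HBP, and 200 K over 180 IP:
print(round(xfip(fb=220, bb=50, hbp=5, k=200, ip=180), 2))  # -> 3.46
```

Whether his actual homer count was 10 or 35, his xFIP is the same; only his fly ball total moves the needle.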

There’s a lot here about FIP and xFIP, but one thing that hasn’t been touched on is why you should care. That’s perhaps burying the lede, but that’s going to be rectified now. It’s hard to answer that question more succinctly or emphatically than the table below, I think.

There’s a lot going on in this table. The way to read it is “what percent of [the column heading] in a given year is predicted by [the row heading] in the past year.” So, if you go to the very bottom table, where pitchers with 150 innings in two consecutive years were used, you get this:

  • ERA in the past year correlates with ERA in a future year by just six percent;
  • FIP in the past year correlates with ERA in a future year by 18 percent, three times as well as ERA does;
  • xFIP is similar to FIP, at 17 percent; and
  • xwOBA (which I adjusted for league here, but not for park) is a bit worse, at 14 percent or so.
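For the curious, each cell in a table like that is just a squared year-over-year correlation. Here’s a minimal sketch of computing one cell — predicting next-year ERA from this-year FIP — using made-up numbers for five hypothetical pitchers. (A toy sample this small and clean comes out far more correlated than real pitcher seasons do, which is exactly why the real percentages above are so low.)

```python
def r_squared(x, y):
    """Squared Pearson correlation between two paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov ** 2 / (vx * vy)

# Hypothetical pitchers: FIP in year one, ERA in year two.
fip_year1 = [3.10, 3.80, 4.20, 4.60, 5.00]
era_year2 = [3.40, 3.60, 4.50, 4.30, 4.90]
print(round(r_squared(fip_year1, era_year2), 2))
```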

Here are my main takeaways from this set of tables. First, and perhaps most importantly, ERA is always the worst at predicting future ERA. It’s not just a little worse, either; it’s consistently half as effective (or even less!) as other metrics at predicting ERA. Second, if all you care about is predicting ERA, xFIP tends to be really good (relatively speaking; more on that in a sec), unless you’re dealing with hefty sample sizes, where it’s closer to FIP. My understanding is that this occurs because xFIP is a blunt instrument: it just assumes every pitcher will have a league-average HR/FB rate. But, in reality, some pitchers manage fly ball contact to some extent. Or they might play in parks that suppress or inflate homers relative to the league rate. Or they throw hard, which tends to limit contact quality.

Next, we also see that for most samples, xFIP actually tends to predict everything really well. If you don’t have much to go off of, xFIP will not just tell you about the pitcher’s future run prevention, but his FIP, and his xwOBA. Only when the sample gets fairly large do the metrics kind of splinter off, with FIP best predicting future FIP, xFIP best predicting future xFIP, and xwOBA best predicting future xwOBA. Notably, though, ERA never best predicts ERA.

If you’re familiar with correlation measures (the one here is a simple r-squared), you might be wondering why the percentages displayed are so low. And that’s actually a great question! For ERA, they’re markedly low -- for any sample, the best you get is xFIP predicting about a quarter of future ERA. This is really because ERA is inherently unstable and driven by things other than the pitcher, namely fielding and sequencing (the same reason why ERA isn’t a great measure of pitcher skill in the first place). So even though xFIP predicts ERA better than other stuff, that doesn’t mean it predicts it well. The correlations are better for larger samples, but we’re still not anywhere near FIP or xFIP correlating with the same number one year later to a degree of, say, 60 percent, or better. Pitchers age, they get hampered by injuries, baseball happens. FIP and xFIP are leaps and bounds better than relying on ERA, but they’re not perfect. Player projection requires consideration of more than just one ERA estimator.

xwOBA makes a lot of sense as a forecasting measure for hitters, so you might be a little thrown by its relatively poor performance in the tables above. Further investigation shows that pitchers do tend to control (to some extent, some is on the batter, too!) the exit velocity they allow, but the specific launch angles pitchers allow tend to bounce around period to period. As a result, xwOBA doesn’t stick to itself, and doesn’t provide a lot of insight on a forecasting basis (though still more than ERA). The fact that xFIP predicts future xwOBA better than xwOBA itself in multiple innings cutoffs strongly suggests that if you want to look at one statistic to determine what a pitcher might do in the future, xFIP should get the nod, at least among the ones discussed here.

One final note: recall that fWAR is based on the pitcher’s FIP, while bWAR is based on the pitcher’s runs allowed, adjusted for team defense. While WAR is not really meant to be a forecasting stat in and of itself, what this means is that bWAR doesn’t stick to itself (i.e., predict itself in the future) as well as fWAR does, and that fWAR probably tells you more about a pitcher’s future bWAR than bWAR does. Combined with the fact that bWAR isn’t really a direct measure of a pitcher’s own contribution anyway, I’m not too sure how much use it really has. Fangraphs publishes a separate WAR metric, called RA9-WAR. This is basically a straight valuation of the pitcher’s runs-allowed rate, with no adjustment for defense. If you’re really dead-set on treating a pitcher’s runs-allowed rate as value, you can use either, depending on how much you care about the defensive adjustments done for bWAR.

SIERA, DRA and cFIP

ERA, FIP, xFIP, and xwOBA tend to be the ERA estimators that are discussed most often. But, there are some others, and they’re all worth a glance too.

SIERA stands for “Skill-Interactive ERA.” It’s a lot more complicated than something like FIP or xFIP, but its predictive power has generally been observed to be only a little better than xFIP. If you were really gunning for predictive value, SIERA might be worth looking at on that basis alone. But because it’s so complicated, it’s tough to quickly summarize. The main difference between SIERA and FIP/xFIP is that in SIERA, things are less linear. SIERA’s penalty for walks increases as a pitcher walks more batters. SIERA also likes pitchers who get more grounders (or more fly balls) as a perceived deliberate strategy, because groundball pitchers tend to get groundballs that are more easily fielded, and fly ball pitchers tend to have lower HR/FB rates than average (because if they didn’t, they’d be awful). One downside to current use of SIERA: the figures in it are already park- and league-adjusted, but they’re not set to 100 for average. As a result, it can be kind of hard to tell whether a guy has a good SIERA or a bad one; you have to compare players against one another, or build some kind of percentile distribution. Having a SIERA- easily available would be great.

DRA and cFIP are published by Baseball Prospectus. The concepts behind them are not too different from DRC+ for hitting, and the best way I can think of describing them is that they contextualize the events that occurred when a pitcher was on the mound in order to credit or debit him accordingly and come up with what his runs allowed should have been. In some ways, this is similar to what FIP and xFIP try to do; in other ways, it sort of isn’t. Once again, FIP and xFIP tell you a very specific thing that’s easy to put into words, while you can’t summarize all the different adjustments and calculations DRA makes in any easy fashion. Like DRC+, it’s worth a peek for another take on how effective a pitcher has been. Just be aware that beyond the number itself, it can be a little confusing to explain why a pitcher has one DRA or another.

cFIP, meanwhile, is the combination of the principles in DRA, i.e., contextualizing everything, and the FIP formula. So, it’s basically a way of judging pitching performance on the basis of contextualized strikeouts, walks, hit-by-pitches, and homers allowed. Whereas FIP counts every homer the same (and then is adjusted for park and league to calculate FIP-), cFIP doesn’t, because it adjusts the value of each based on the batter, umpire, catcher, park, and so on and so forth. It is my understanding that in the same way that FIP is more predictive than ERA even though it focuses on fewer inputs, cFIP is more predictive than DRA. In the early 2010s, cFIP was considerably better at predicting future pitcher performance than many other metrics (see here: https://www.fangraphs.com/tht/fip-in-context/). While I don’t know if this still holds, it does suggest it’s worth a long look if you are specifically concerned about forecasting.