**Title:** Predicting Strikeouts Using Velocity and Whiffs

**Date:** May 8, 2013

**Original Source:** Beyond the Boxscore

**Synopsis:** This article attempted to create a predictive model for pitcher strikeouts.

Back in October, I took a look at how different factors correlate with Strikeout Percentage (K%) to determine what areas may be predictive. The conclusion was that missing bats — or swinging strike percentage (SwStr%) — is all that really stands out as a singular, predictive trait.

(Technically, BP’s Whiff/Swing was slightly more predictive last time out. By converting SwStr% to “Whiff%” we can approximate this pretty well, though due to some differences between FanGraphs and BP numbers, they won’t be exactly the same.)
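The conversion isn't spelled out here, so the following is only a minimal sketch of one way to approximate it. It assumes the usual definitions: SwStr% is swinging strikes per pitch and Swing% is swings per pitch, so dividing one by the other gives whiffs per swing. The function name and the sample inputs are illustrative and not taken from the data set.

```python
def whiff_pct(swstr_pct: float, swing_pct: float) -> float:
    """Approximate whiffs per swing from two per-pitch rates.

    Assumes SwStr% = swinging strikes / pitches and Swing% = swings / pitches,
    so their ratio is swinging strikes / swings.
    Inputs are fractions (0.10 means 10%).
    """
    if swing_pct <= 0:
        raise ValueError("swing_pct must be positive")
    return swstr_pct / swing_pct


# Illustrative inputs only: a 10% SwStr% on a 45% Swing%.
example = whiff_pct(0.10, 0.45)
print(f"{example:.1%}")
```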

## Background

I wanted to revisit this study as Part One of a four-part look at how strikeouts and walks can be predicted. After all, since 2006, more than a quarter of all plate appearances have ended in one of these two outcomes, highlighting the rise of Three True Outcomes in baseball.

Today, I’ll revisit “Pitcher Strikeout Rates Explained,” following up with a look at how we can predict pitcher walk rate, and then both outcomes on the hitter side. The goals here are to better understand these non-in-play plate appearances as they continue to grab a larger share of outcomes, as well as identify over- and under-performers for fantasy purposes, eventually.

#### Method

Using FanGraphs’ custom leaderboards, I ran regressions for pitcher seasons from 2006 to 2012 where pitchers had at least 350 batters faced (a better cut-off criterion than innings pitched). I used 2006, as that’s sort of the dawning of the “modern era” of Three True Outcomes: before then, there were seasons that touched 25 percent of all plate appearances, but since then it has remained above that level. We’ve gotta cut it off somewhere! This gives us 1,189 pitcher seasons to examine.

I compared strikeout rate (K/PA) to a handful of potential indicators: SwStr%, Whiff%, share of pitches in the zone (Zone%), overall swing rate (Swing%), swing rate on pitches outside the zone (O-Swing%), first strike percentage (F-Strike%), fastball frequency (FA%, based on PITCHf/x), fastball velocity (vFA, also based on PITCHf/x), and slider frequency (SL%, based on PITCHf/x). I chose the slider to test whether how often a pitcher throws a given pitch type correlates with strikeouts. Since the slider had the highest pitch value in the sample, I figured if it wasn’t predictive, no pitch would be (I didn’t feel the need to test all pitch types).
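For anyone who wants to reproduce the single-variable comparisons, here is a minimal sketch of the R-squared calculation in plain Python. It is not the exact tool used for the study. It relies on the fact that, for a one-predictor least-squares fit with an intercept, R-squared equals the squared Pearson correlation. The function name and the toy numbers are made up for illustration.

```python
from math import fsum


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys.

    For a simple linear regression of ys on xs (with an intercept),
    this equals the regression's R-squared.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = fsum((x - mean_x) ** 2 for x in xs)
    syy = fsum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    return (sxy * sxy) / (sxx * syy)


# Toy data (not from the study): indicator values vs. K% for four pitcher seasons.
toy_whiff = [0.15, 0.18, 0.22, 0.25]
toy_k = [0.14, 0.17, 0.23, 0.24]
print(round(r_squared(toy_whiff, toy_k), 3))
```

Running this once per indicator column against K% would give a table like the one in the Results section.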

#### Results

| Category | R² |
| --- | --- |
| Whiff% | 0.695 |
| SwStr% | 0.676 |
| F-Strike% | 0.190 |
| vFA (pfx) | 0.180 |
| O-Swing% | 0.129 |
| Swing% | 0.029 |
| Zone% | 0.014 |
| SL% (pfx) | 0.013 |
| FA% (pfx) | 0.011 |
| UBB% | 0.008 |
| Whiff% + vFA | 0.709 |

Just like last time, we see that Whiff% is our best indicator for K%, with an R-squared of nearly .7, meaning that Whiff% explains just shy of 70 percent of the variance in K% for the pitchers in our sample. The only other factors that tell us anything at all are first strike percentage, fastball velocity, and the ability to get swings on pitches outside of the zone (which, really, is captured at least in part by Whiff%). However, if we were to combine Whiff%, first strike rate and fastball velocity into one metric, it would have an R-squared of .737, hardly more than Whiff% alone.

From here, we can take the formula that our regression spits out and develop an Expected Strikeout Percentage (xK%), using Whiff% to explain 69.5% of the value, with the league average K% for that season making up the remaining 30.5%. The table below shows some of the “luckier” and “unluckier” pitchers in the sample based on xK%-K% differential.

| Season | Name | Team | IP | K% | Whiff% | xK% | xK%-K% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2007 | Erik Bedard | Orioles | 182 | 30.20% | 24.34% | 21.60% | -8.60% |
| 2009 | Phil Hughes | Yankees | 86 | 27.40% | 21.76% | 20.11% | -7.29% |
| 2011 | Vance Worley | Phillies | 131.2 | 21.50% | 12.91% | 14.25% | -7.25% |
| 2011 | Cliff Lee | Phillies | 232.2 | 25.90% | 19.46% | 18.72% | -7.18% |
| 2012 | Stephen Strasburg | Nationals | 159.1 | 30.20% | 25.28% | 23.07% | -7.13% |
| 2009 | Tim Lincecum | Giants | 225.1 | 28.80% | 24.15% | 21.75% | -7.05% |
| 2011 | Cory Luebke | Padres | 139.2 | 27.80% | 22.66% | 20.91% | -6.89% |
| 2012 | Mike Fiers | Brewers | 127.2 | 25.10% | 18.28% | 18.29% | -6.81% |
| 2009 | Jake Peavy | – – – | 101.2 | 26.80% | 21.61% | 20.01% | -6.79% |
| 2006 | Ben Sheets | Brewers | 106 | 27.00% | 22.47% | 20.23% | -6.77% |
| 2007 | Mike Maroth | – – – | 116.1 | 9.30% | 14.32% | 14.75% | 5.45% |
| 2012 | Jeanmar Gomez | Indians | 90.2 | 11.90% | 17.03% | 17.43% | 5.53% |
| 2009 | Trevor Cahill | Athletics | 178.2 | 11.60% | 17.41% | 17.14% | 5.54% |
| 2007 | Lenny DiNardo | Athletics | 131.1 | 10.60% | 16.59% | 16.31% | 5.71% |
| 2008 | Josh Rupe | Rangers | 89.1 | 13.50% | 20.84% | 19.33% | 5.83% |
| 2006 | Chien-Ming Wang | Yankees | 218 | 8.40% | 13.72% | 14.25% | 5.85% |
| 2012 | Aaron Cook | Red Sox | 94 | 4.90% | 7.68% | 11.04% | 6.14% |
| 2006 | Runelvys Hernandez | Royals | 109.2 | 9.80% | 16.63% | 16.24% | 6.44% |
| 2012 | Derek Lowe | – – – | 142.2 | 8.60% | 13.61% | 15.09% | 6.49% |
| 2009 | Shairon Martis | Nationals | 85.2 | 9.00% | 15.78% | 16.02% | 7.02% |

Here we see that 2007 Erik Bedard was the ‘luckiest,’ in that he got the highest K% that wasn’t backed up by an equally strong Whiff%. In fact, we see several “extremely high K% seasons” on the list, which makes sense: those types of extreme seasons are less likely to be sustainable, and thus any regressed formula will call for them to be outliers. Shairon Martis, on the other hand, was the unluckiest in 2009, striking out just nine percent of batters despite having the Whiff% of a pitcher who would be expected to strike out about 16% of batters.
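As a rough sketch of how the xK% blend and the "luck" differential in the table could be computed: the first function takes the K% fitted by the Whiff% regression as an input, since the single-variable regression coefficients aren't listed here, and blends it with the league average at the 69.5/30.5 split. The function names are hypothetical, and all rates are assumed to be fractions.

```python
def expected_k_pct(fitted_k_pct: float, league_avg_k_pct: float,
                   weight: float = 0.695) -> float:
    """Blend a regression-fitted K% with the league-average K%.

    `weight` is the share given to the fitted value (0.695, the Whiff%
    R-squared); the remainder goes to the league average for that season.
    """
    return weight * fitted_k_pct + (1.0 - weight) * league_avg_k_pct


def k_differential(xk_pct: float, k_pct: float) -> float:
    """xK% minus actual K%; negative means the pitcher out-struck his xK%."""
    return xk_pct - k_pct


# 2007 Erik Bedard, using the xK% and K% values from the table above.
diff = k_differential(0.2160, 0.3020)
print(f"{diff:+.2%}")  # → -8.60%
```

Sorting a full list of pitcher seasons by this differential would give the two ends of the table.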

#### As a Predictive Model

In order to see if this xK% would be at all predictive, I ran two regressions and compared them. First, I compared Year-1 K% to Year-2 K%, and then Year-1 xK% to Year-2 K%. If xK% correlates more strongly with future strikeout percentage, it may be useful as a predictive tool moving forward. Our sample was now limited to 669 pitcher seasons, as this filters out all 2012 seasons (since we don’t have enough 2013 data yet) as well as any pitchers who failed to face 350 batters the following season.
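The year-to-year pairing described above can be sketched as follows. This is an assumption about the bookkeeping, not the actual spreadsheet work: each pitcher season is a dict with hypothetical keys `name`, `season`, `bf` (batters faced) and `k_pct`, and a pair is kept only when both seasons clear the 350 batters-faced minimum.

```python
def pair_consecutive_seasons(rows, min_bf=350):
    """Return (year_1_row, year_2_row) tuples for back-to-back qualifying seasons.

    Rows below `min_bf` batters faced are dropped before pairing, so a pitcher
    who misses the cutoff in either year contributes no pair for that year.
    """
    qualified = {
        (row["name"], row["season"]): row
        for row in rows
        if row["bf"] >= min_bf
    }
    pairs = []
    for (name, season), row in qualified.items():
        following = qualified.get((name, season + 1))
        if following is not None:
            pairs.append((row, following))
    return pairs


# Made-up pitchers for illustration only.
seasons = [
    {"name": "Pitcher A", "season": 2010, "bf": 700, "k_pct": 0.20},
    {"name": "Pitcher A", "season": 2011, "bf": 650, "k_pct": 0.21},
    {"name": "Pitcher B", "season": 2010, "bf": 400, "k_pct": 0.15},
    {"name": "Pitcher B", "season": 2011, "bf": 200, "k_pct": 0.16},  # below cutoff
    {"name": "Pitcher C", "season": 2012, "bf": 800, "k_pct": 0.25},  # no following year
]
pairs = pair_consecutive_seasons(seasons)
print(len(pairs))  # → 1
```

In real data, pitcher names are not unique, so a player ID would be a safer key than the name.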

| Metric | R² with Year+1 K% |
| --- | --- |
| Year 1 K% | 0.570 |
| Year 1 xK% | 0.438 |
| Year 1 Whiff% | 0.424 |
| Year 1 xxK% | 0.507 |
| Year 1 xxxK% | 0.748 |

Unfortunately, xK% does not do a better job predicting the next year’s strikeout rate than strikeout rate alone. Even if we replace the 30.5% league average factor we used initially with the pitcher’s own K% (assuming, perhaps, that the variance is explained partially by some inherent strikeout ability), the R-squared still doesn’t match K% alone. (This second attempt is “xxK%” in the table above.)

#### One More Try

Finally, I went back and used Whiff% and vFA together (R-squared of .71 with same-season K%), and used those two to explain 71% of K%, with the league average making up the remaining 29%. And guess what? An R-squared of .75! (This is “xxxK%” in the table above, or my third try at it.)

New xK% = (0.71 × ((0.9058 × Whiff%) + (0.0027 × vFA) − 0.2305)) + (0.29 × LgAvgK%)

*R-squared of .75, meaning this can explain 75% of the variance in the next year’s K%, beating the .57 we get from the previous year’s K% alone.*
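Here is the New xK% formula as a small Python function, for anyone who wants to plug in their own numbers. The coefficients come straight from the formula above. The units are an assumption on my part: Whiff% and the league-average K% are treated as fractions, and vFA as miles per hour. The function name and the sample inputs are illustrative.

```python
def new_xk_pct(whiff_pct: float, vfa_mph: float, league_avg_k_pct: float) -> float:
    """New xK%: 71% regression fit on Whiff% and vFA, 29% league-average K%.

    Assumed units: whiff_pct and league_avg_k_pct are fractions (0.25 = 25%),
    vfa_mph is average fastball velocity in miles per hour.
    """
    fitted = (0.9058 * whiff_pct) + (0.0027 * vfa_mph) - 0.2305
    return (0.71 * fitted) + (0.29 * league_avg_k_pct)


# Illustrative inputs only: 25% Whiff%, 95 mph fastball, 20% league-average K%.
print(f"{new_xk_pct(0.25, 95.0, 0.20):.2%}")
```

Because vFA enters the formula directly, it can't be evaluated for pitchers without a PITCHf/x fastball reading.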

The one caveat here is that our sample has been limited to 560 pitcher seasons, accounting for the loss of 2012 data, the attrition of pitchers not facing the minimum batters, and also eliminating pitchers who don’t throw a basic fastball, per PITCHf/x. Those are some weighty caveats, especially the last one.

#### Conclusions

I still think xK% has some value, and perhaps smarter people than myself can suggest some improvements to make the predictor more effective or based on fewer assumptions. As it stands, the *new* xK% predicted about 75% of the variance in year-after K%. With that in mind, here’s one final table, looking at pitchers who would have been expected to regress in 2013 based on their 2012 numbers, as well as their K% so far.

| Season | Name | K% | Whiff% | xK% | xK%-K% | 2013 K% | Change K% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2012 | Mike Fiers | 25.10% | 18.28% | 18.00% | -7.10% | 2.90% | -22.20% |
| 2012 | Stephen Strasburg | 30.20% | 25.28% | 24.00% | -6.20% | 23.90% | -6.30% |
| 2012 | Cliff Lee | 24.40% | 17.68% | 18.33% | -6.07% | 20.10% | -4.30% |
| 2012 | Brad Lincoln | 24.30% | 17.81% | 18.72% | -5.58% | 20.00% | -4.30% |
| 2012 | Max Scherzer | 29.40% | 25.52% | 23.85% | -5.55% | 35.10% | 5.70% |
| 2012 | Marco Estrada | 25.40% | 20.99% | 20.15% | -5.25% | 22.70% | -2.70% |
| 2012 | David Phelps | 23.20% | 17.57% | 18.07% | -5.13% | 26.50% | 3.30% |
| 2012 | David Price | 24.50% | 18.86% | 19.82% | -4.68% | 20.80% | -3.70% |
| 2012 | Gio Gonzalez | 25.20% | 21.23% | 20.92% | -4.28% | 25.00% | -0.20% |
| 2012 | R.A. Dickey | 24.80% | 24.11% | 20.79% | -4.01% | 18.50% | -6.30% |
| 2012 | Hector Noesi | 15.00% | 18.93% | 19.23% | 4.23% | 14.60% | -0.40% |
| 2012 | Blake Beavan | 10.50% | 12.47% | 14.79% | 4.29% | 14.30% | 3.80% |
| 2012 | Alex White | 13.90% | 17.52% | 18.24% | 4.34% | #N/A | |
| 2012 | Henderson Alvarez | 9.80% | 11.09% | 14.30% | 4.50% | #N/A | |
| 2012 | Josh Tomlin | 12.40% | 16.19% | 16.91% | 4.51% | #N/A | |
| 2012 | Dallas Keuchel | 10.10% | 13.16% | 14.73% | 4.63% | 12.80% | 2.70% |
| 2012 | Nick Blackburn | 9.20% | 11.38% | 13.99% | 4.79% | #N/A | |
| 2012 | Josh Roenicke | 14.10% | 19.21% | 19.48% | 5.38% | 16.90% | 2.80% |
| 2012 | Jeanmar Gomez | 11.90% | 17.03% | 17.60% | 5.70% | 15.80% | 3.90% |
| 2012 | Aaron Cook | 4.90% | 7.68% | 11.51% | 6.61% | #N/A | |

We see that most of last year’s extreme outliers have seen their K%s regress as we’d expect, although of course it’s extremely early still.
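To put a number on "most," here is a quick tally from the table above. It counts a pitcher as regressing in the expected direction when his 2012 xK%-K% gap and his change in K% have the same sign. Pitchers listed as #N/A for 2013 are left out. The same-sign rule is just one simple way to score it.

```python
# (name, 2012 xK%-K% gap, change in K% so far in 2013), in percentage points,
# copied from the table above; #N/A rows are omitted.
rows = [
    ("Mike Fiers", -7.10, -22.20),
    ("Stephen Strasburg", -6.20, -6.30),
    ("Cliff Lee", -6.07, -4.30),
    ("Brad Lincoln", -5.58, -4.30),
    ("Max Scherzer", -5.55, 5.70),
    ("Marco Estrada", -5.25, -2.70),
    ("David Phelps", -5.13, 3.30),
    ("David Price", -4.68, -3.70),
    ("Gio Gonzalez", -4.28, -0.20),
    ("R.A. Dickey", -4.01, -6.30),
    ("Hector Noesi", 4.23, -0.40),
    ("Blake Beavan", 4.29, 3.80),
    ("Dallas Keuchel", 4.63, 2.70),
    ("Josh Roenicke", 5.38, 2.80),
    ("Jeanmar Gomez", 5.70, 3.90),
]


def moved_as_expected(gap: float, change: float) -> bool:
    """True when K% moved in the direction the xK% gap pointed."""
    return gap * change > 0


hits = sum(moved_as_expected(gap, change) for _, gap, change in rows)
print(f"{hits} of {len(rows)}")  # → 12 of 15
```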

Finally, I’d like to apologize for the wordiness and length of this piece – I was finding new things out as I went, so some of it was written as my method developed.

It was also pointed out to me as I was finishing that Michael Barr tackled this with a similar methodology for FanGraphs+. I didn’t read the piece until after mine was done, so my apologies to Michael if there is any unnecessary overlap. His also narrowly beats mine with an R-squared of .76, but perhaps this is because his was done at the start of 2012 and mine uses a different data set.

Any suggestions for improvement or criticisms?