Predicting Strikeouts Using Velocity and Whiffs

Title: Predicting Strikeouts Using Velocity and Whiffs
Date: May 8, 2013
Original Source: Beyond the Boxscore
Synopsis: This article attempted to create a predictive model for pitcher strikeouts.

Back in October, I took a look at how different factors correlate with Strikeout Percentage (K%) to determine what areas may be predictive. The conclusion was that missing bats — or swinging strike percentage (SwStr%) — is all that really stands out as a singular, predictive trait.

(Technically, BP’s Whiff/Swing was slightly more predictive last time out. By converting SwStr% to “Whiff%” we can approximate this pretty well, though due to some differences between FanGraphs and BP numbers, they won’t be exactly the same.)
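As a rough sketch of that conversion: SwStr% is swinging strikes per pitch and Swing% is swings per pitch, so dividing the first by the second gives whiffs per swing. I'm assuming that is the conversion meant here; the function name and the sample inputs are hypothetical, not real pitcher data.

```python
def whiff_pct(swstr_pct: float, swing_pct: float) -> float:
    """Approximate whiffs per swing from two per-pitch rates (as fractions)."""
    if swing_pct <= 0:
        raise ValueError("swing_pct must be positive")
    return swstr_pct / swing_pct


# Hypothetical pitcher: 9% swinging strikes per pitch, 45% swings per pitch.
example = whiff_pct(0.09, 0.45)
print(f"Whiff%: {example:.1%}")
```

Both inputs need to come from the same source, since mixing FanGraphs and BP plate-discipline numbers would bake their classification differences into the ratio.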


I wanted to revisit this study as Part One of a four-part look at how strikeouts and walks can be predicted. After all, since 2006, more than a quarter of all plate appearances have ended in one of these two outcomes, highlighting the rise of Three True Outcomes in baseball.

Today, I’ll revisit “Pitcher Strikeout Rates Explained.” Later parts will look at how we can predict pitcher walk rate, and then both outcomes on the hitter side. The goals here are to better understand these non-in-play plate appearances as they continue to grab a larger share of outcomes and, eventually, to identify over- and under-performers for fantasy purposes.


Using FanGraphs’ custom leaderboards, I ran regressions for pitcher seasons from 2006 to 2012 where pitchers had at least 350 batters faced (a better cut-off criterion than innings pitched). I started with 2006, as that’s sort of the dawning of the “modern era” of Three True Outcomes: before then, there were seasons that touched 25 percent of all plate appearances, but since then the share has remained above that level. We’ve gotta cut it off somewhere! This gives us 1,189 pitcher seasons to examine.

I compared strikeout rate (K/PA) to a handful of potential indicators: SwStr%, Whiff%, overall pitches in the zone (Zone%), overall swing rate (Swing%), swing rate on pitches outside the zone (O-Swing%), first strike percentage (F-Strike%), unintentional walk rate (UBB%), fastball frequency (FA%, based on PITCHf/x), fastball velocity (vFA, also based on PITCHf/x), and slider frequency (SL%, based on PITCHf/x). I included slider frequency to test whether usage of a particular pitch type correlates with strikeouts. Since the slider had the highest pitch value in the sample, I figured that if it wasn’t predictive, no pitch would be (I didn’t feel the need to test all pitch types).


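| Category | R² with K% |
| --- | --- |
| Whiff% | 0.695 |
| SwStr% | 0.676 |
| F-Strike% | 0.190 |
| vFA (pfx) | 0.180 |
| O-Swing% | 0.129 |
| Swing% | 0.029 |
| Zone% | 0.014 |
| SL% (pfx) | 0.013 |
| FA% (pfx) | 0.011 |
| UBB% | 0.008 |
| Whiff% and vFA combined | 0.709 |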

Just like last time, we see that Whiff% is our best indicator for K%, with an R-squared of nearly .7, meaning that Whiff% explains just shy of 70 percent of the variance in K% among the pitchers in our sample. The only other factors that tell us anything at all are first strike percentage, fastball velocity, and the ability to get swings on pitches outside of the zone (which, really, is captured at least in part by Whiff%). However, if we were to combine Whiff%, first strike rate and fastball velocity into one metric, it would have an R-squared of .737, hardly more than Whiff% alone.
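For anyone who wants to reproduce a single-variable R-squared like the ones listed, here is a minimal sketch. For a one-predictor least-squares fit, R-squared equals the squared Pearson correlation, which is what the hypothetical `r_squared` helper computes. The six (Whiff%, K%) pairs are copied from the xK% table later in this article; they are extreme outlier seasons, so the result will not match the .695 from the full 1,189-season sample.

```python
def r_squared(xs, ys):
    """R-squared of a one-variable least-squares fit (squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)


# (Whiff%, K%) in percentage points: Bedard 2007, Lincecum 2009, Worley 2011,
# Wang 2006, Cook 2012, Martis 2009. A tiny, deliberately unrepresentative subset.
whiff = [24.34, 24.15, 12.91, 13.72, 7.68, 15.78]
k_pct = [30.20, 28.80, 21.50, 8.40, 4.90, 9.00]

r2 = r_squared(whiff, k_pct)
print(f"R-squared on this subset: {r2:.3f}")
```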

From here, we can take the formula that our regression spits out and develop an Expected Strikeout Percentage (xK%), using Whiff% to explain 69.5% of the value, with the league average K% for that season making up the remaining 30.5%. The table below shows some of the “luckier” and “unluckier” pitchers in the sample based on xK%-K% differential.

| Season | Name | Team | IP | K% | Whiff% | xK% | xK%-K% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2007 | Erik Bedard | Orioles | 182 | 30.20% | 24.34% | 21.60% | -8.60% |
| 2009 | Phil Hughes | Yankees | 86 | 27.40% | 21.76% | 20.11% | -7.29% |
| 2011 | Vance Worley | Phillies | 131.2 | 21.50% | 12.91% | 14.25% | -7.25% |
| 2011 | Cliff Lee | Phillies | 232.2 | 25.90% | 19.46% | 18.72% | -7.18% |
| 2012 | Stephen Strasburg | Nationals | 159.1 | 30.20% | 25.28% | 23.07% | -7.13% |
| 2009 | Tim Lincecum | Giants | 225.1 | 28.80% | 24.15% | 21.75% | -7.05% |
| 2011 | Cory Luebke | Padres | 139.2 | 27.80% | 22.66% | 20.91% | -6.89% |
| 2012 | Mike Fiers | Brewers | 127.2 | 25.10% | 18.28% | 18.29% | -6.81% |
| 2009 | Jake Peavy | – – – | 101.2 | 26.80% | 21.61% | 20.01% | -6.79% |
| 2006 | Ben Sheets | Brewers | 106 | 27.00% | 22.47% | 20.23% | -6.77% |
| 2007 | Mike Maroth | – – – | 116.1 | 9.30% | 14.32% | 14.75% | 5.45% |
| 2012 | Jeanmar Gomez | Indians | 90.2 | 11.90% | 17.03% | 17.43% | 5.53% |
| 2009 | Trevor Cahill | Athletics | 178.2 | 11.60% | 17.41% | 17.14% | 5.54% |
| 2007 | Lenny DiNardo | Athletics | 131.1 | 10.60% | 16.59% | 16.31% | 5.71% |
| 2008 | Josh Rupe | Rangers | 89.1 | 13.50% | 20.84% | 19.33% | 5.83% |
| 2006 | Chien-Ming Wang | Yankees | 218 | 8.40% | 13.72% | 14.25% | 5.85% |
| 2012 | Aaron Cook | Red Sox | 94 | 4.90% | 7.68% | 11.04% | 6.14% |
| 2006 | Runelvys Hernandez | Royals | 109.2 | 9.80% | 16.63% | 16.24% | 6.44% |
| 2012 | Derek Lowe | – – – | 142.2 | 8.60% | 13.61% | 15.09% | 6.49% |
| 2009 | Shairon Martis | Nationals | 85.2 | 9.00% | 15.78% | 16.02% | 7.02% |

Here we see that 2007 Erik Bedard was the ‘luckiest,’ in that his K% exceeded what his Whiff% supported by the widest margin. In fact, we see several extremely high K% seasons on the list, which makes sense: those types of extreme seasons are less likely to be sustainable, and any formula that regresses toward the league average will flag them as outliers. Shairon Martis, on the other hand, was the unluckiest in 2009, striking out just nine percent of batters despite having the Whiff% of a pitcher who would be expected to strike out about 16% of batters.
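The ranking itself is just a subtraction and a sort. Here is a small sketch using four rows copied from the table above; the `differential` helper and the tuple layout are my own naming for illustration.

```python
# (season, name, K%, xK%) in percentage points, copied from the table above.
rows = [
    (2006, "Chien-Ming Wang", 8.40, 14.25),
    (2007, "Erik Bedard", 30.20, 21.60),
    (2009, "Shairon Martis", 9.00, 16.02),
    (2012, "Stephen Strasburg", 30.20, 23.07),
]


def differential(k_pct, xk_pct):
    """xK% minus K%: negative means more strikeouts than the whiffs suggest."""
    return round(xk_pct - k_pct, 2)


# Most negative ("luckiest") first, most positive ("unluckiest") last.
ranked = sorted(rows, key=lambda r: differential(r[2], r[3]))
for season, name, k, xk in ranked:
    print(f"{season} {name:<18} {differential(k, xk):+.2f}")
```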

As a Predictive Model

In order to see if this xK% would be at all predictive, I ran two regressions and compared them. First, I compared Year-1 K% to Year-2 K%, and then Year-1 xK% to Year-2 K%. If xK% correlates more strongly with future strikeout percentage, it may be useful as a predictive tool moving forward. Our sample was now limited to 669 pitcher seasons, as this filters out all 2012 seasons (since we don’t have enough 2013 data yet) as well as any pitchers who failed to face 350 batters the following season.
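The pairing step can be sketched as below: keep only seasons that meet the 350 batters-faced minimum, then match each one to the same pitcher's following season if that one also qualifies. The dictionary keys and the sample rows are made up purely to exercise the function, and matching on name alone assumes no two pitchers share a name.

```python
MIN_TBF = 350  # batters-faced cut-off used throughout the article


def pair_consecutive_seasons(seasons):
    """Return (year1, year2) row pairs for pitchers who qualify in both years.

    `seasons` is a list of dicts with hypothetical keys:
    "name", "season", "tbf", "k_pct".
    """
    qualified = {
        (s["name"], s["season"]): s for s in seasons if s["tbf"] >= MIN_TBF
    }
    pairs = []
    for (name, season), row in qualified.items():
        following = qualified.get((name, season + 1))
        if following is not None:
            pairs.append((row, following))
    return pairs


# Made-up rows, not real pitcher data.
sample = [
    {"name": "Pitcher A", "season": 2010, "tbf": 800, "k_pct": 0.22},
    {"name": "Pitcher A", "season": 2011, "tbf": 750, "k_pct": 0.21},
    {"name": "Pitcher A", "season": 2012, "tbf": 300, "k_pct": 0.20},
    {"name": "Pitcher B", "season": 2011, "tbf": 400, "k_pct": 0.15},
]
pairs = pair_consecutive_seasons(sample)
print(len(pairs), "usable Year-1/Year-2 pair(s)")
```

In this toy data only Pitcher A's 2010 and 2011 seasons survive as a pair, which is the same kind of attrition that shrinks the real sample.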

| Metric | R² with Year+1 K% |
| --- | --- |
| Year 1 K% | 0.570 |
| Year 1 xK% | 0.438 |
| Year 1 Whiff% | 0.424 |
| Year 1 xxK% | 0.507 |
| Year 1 xxxK% | 0.748 |

Unfortunately, xK% does not do a better job predicting the next year’s strikeout rate than strikeout rate alone. Even if we replace the 30.5% league average factor we used initially with the pitcher’s own K% (assuming, perhaps, that the variance is explained partially by some inherent strikeout ability), the R-squared still doesn’t match K% alone. (This is “xxK%” in the table above.)

One More Try

Finally, I went back and used Whiff% and vFA together (R-squared of .71 with same-season K%), and used those two to explain 71% of K%, with the league average making up the remaining 29%. And guess what? An R-squared of .75! (This is “xxxK%” in the table above, or my third try at it.)

New xK% = (.71*((.9058*Whiff%)+(.0027*vFA)-.2305))+(.29*LgAvgK%)

That’s an R-squared of .75, meaning the new xK% can explain about 75% of the variance in the next year’s K%, compared with .57 for the previous year’s K% alone.
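Written out as code, the formula is a two-step blend: the regression estimate from Whiff% and vFA gets 71% of the weight, and the league-average K% gets the other 29%. The sketch below assumes Whiff% and the league average are entered as fractions (0.20 for 20%) and vFA in miles per hour, which is what the size of the coefficients suggests. The function name and the sample inputs are hypothetical.

```python
def new_xk_pct(whiff_pct, vfa_mph, lg_avg_k_pct):
    """Expected K% from the article's Whiff% + velocity formula.

    whiff_pct and lg_avg_k_pct are fractions (0.25 means 25%); vfa_mph is
    average fastball velocity in miles per hour (assumed units).
    """
    regression_estimate = 0.9058 * whiff_pct + 0.0027 * vfa_mph - 0.2305
    return 0.71 * regression_estimate + 0.29 * lg_avg_k_pct


# Hypothetical inputs, not a real pitcher: 20% Whiff%, 92 mph fastball,
# 19% league-average K%.
estimate = new_xk_pct(0.20, 92.0, 0.19)
print(f"xK%: {estimate:.1%}")
```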

The caveat here is that our sample has been limited to 560 pitcher seasons, accounting for the loss of 2012 data, the attrition of pitchers not facing the minimum number of batters, and the elimination of pitchers who don’t throw a basic fastball, per PITCHf/x. Those are some weighty limitations, especially the last one.


I still think xK% has some value, and perhaps smarter people than myself can suggest some improvements to make the predictor more effective or based on fewer assumptions. As it stands, the new xK% predicted about 75% of the variance in year-after K%. With that in mind, here’s one final table, looking at pitchers who would have been expected to regress in 2013 based on their 2012 numbers, as well as their K% so far.

| Season | Name | K% | Whiff% | xK% | xK%-K% | 2013 K% | Change in K% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2012 | Mike Fiers | 25.10% | 18.28% | 18.00% | -7.10% | 2.90% | -22.20% |
| 2012 | Stephen Strasburg | 30.20% | 25.28% | 24.00% | -6.20% | 23.90% | -6.30% |
| 2012 | Cliff Lee | 24.40% | 17.68% | 18.33% | -6.07% | 20.10% | -4.30% |
| 2012 | Brad Lincoln | 24.30% | 17.81% | 18.72% | -5.58% | 20.00% | -4.30% |
| 2012 | Max Scherzer | 29.40% | 25.52% | 23.85% | -5.55% | 35.10% | 5.70% |
| 2012 | Marco Estrada | 25.40% | 20.99% | 20.15% | -5.25% | 22.70% | -2.70% |
| 2012 | David Phelps | 23.20% | 17.57% | 18.07% | -5.13% | 26.50% | 3.30% |
| 2012 | David Price | 24.50% | 18.86% | 19.82% | -4.68% | 20.80% | -3.70% |
| 2012 | Gio Gonzalez | 25.20% | 21.23% | 20.92% | -4.28% | 25.00% | -0.20% |
| 2012 | R.A. Dickey | 24.80% | 24.11% | 20.79% | -4.01% | 18.50% | -6.30% |
| 2012 | Hector Noesi | 15.00% | 18.93% | 19.23% | 4.23% | 14.60% | -0.40% |
| 2012 | Blake Beavan | 10.50% | 12.47% | 14.79% | 4.29% | 14.30% | 3.80% |
| 2012 | Alex White | 13.90% | 17.52% | 18.24% | 4.34% | N/A | N/A |
| 2012 | Henderson Alvarez | 9.80% | 11.09% | 14.30% | 4.50% | N/A | N/A |
| 2012 | Josh Tomlin | 12.40% | 16.19% | 16.91% | 4.51% | N/A | N/A |
| 2012 | Dallas Keuchel | 10.10% | 13.16% | 14.73% | 4.63% | 12.80% | 2.70% |
| 2012 | Nick Blackburn | 9.20% | 11.38% | 13.99% | 4.79% | N/A | N/A |
| 2012 | Josh Roenicke | 14.10% | 19.21% | 19.48% | 5.38% | 16.90% | 2.80% |
| 2012 | Jeanmar Gomez | 11.90% | 17.03% | 17.60% | 5.70% | 15.80% | 3.90% |
| 2012 | Aaron Cook | 4.90% | 7.68% | 11.51% | 6.61% | N/A | N/A |

We see that most of last year’s extreme outliers have seen their K%s regress as we’d expect, although of course it’s extremely early still.

Finally, I’d like to apologize for the wordiness and length of this piece – I was finding new things out as I went, so some of it was written as my method developed.

It was also pointed out to me as I was finishing that Michael Barr tackled this with a similar methodology for FanGraphs+. I didn’t read the piece until after mine was done, so my apologies to Michael if there is any unnecessary overlap. His also narrowly beats mine with an R-squared of .76, but perhaps this is because his was done at the start of 2012 and mine uses a different data set.

Any suggestions for improvement or criticisms?

