Predicting Strikeouts Using Velocity and Whiffs

Date: May 8, 2013
Original Source: Beyond the Boxscore
Synopsis: This article attempted to create a predictive model for pitcher strikeouts.

Back in October, I took a look at how different factors correlate with Strikeout Percentage (K%) to determine what areas may be predictive. The conclusion was that missing bats — or swinging strike percentage (SwStr%) — is all that really stands out as a singular, predictive trait.

(Technically, BP’s Whiff/Swing was slightly more predictive last time out. By converting SwStr% to “Whiff%” we can approximate this pretty well, though due to some differences between FanGraphs and BP numbers, they won’t be exactly the same.)
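One way to sketch that conversion, assuming SwStr% is swinging strikes per pitch and Swing% is swings per pitch, so that dividing one by the other leaves whiffs per swing. The function name and the sample values are illustrative only, not taken from the data set.

```python
def whiff_rate(swstr_pct: float, swing_pct: float) -> float:
    """Approximate whiffs per swing from two per-pitch rates.

    swstr_pct: swinging strikes / total pitches, as a fraction (0.09 = 9%)
    swing_pct: swings / total pitches, as a fraction (0.45 = 45%)
    """
    if swing_pct <= 0:
        raise ValueError("swing_pct must be positive")
    return swstr_pct / swing_pct


# Hypothetical pitcher: 9% swinging strikes on a 45% swing rate.
example = whiff_rate(0.09, 0.45)
print(f"Whiff% is roughly {example:.1%}")
```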

Background

I wanted to revisit this study as Part One of a four-part look at how strikeouts and walks can be predicted. After all, since 2006, more than a quarter of all plate appearances have ended in one of these two outcomes, highlighting the rise of Three True Outcomes in baseball.

Today, I’ll revisit “Pitcher Strikeout Rates Explained,” following up with a look at how we can predict pitcher walk rate, and then both outcomes on the hitter side. The goals here are to better understand these non-in-play plate appearances as they continue to grab a larger share of outcomes and, eventually, to identify over- and under-performers for fantasy purposes.

Method

Using FanGraphs’ custom leaderboards, I ran regressions for pitcher seasons from 2006 to 2012 in which the pitcher faced at least 350 batters (a better cut-off criterion than innings pitched). I started with 2006 because that’s sort of the dawning of the “modern era” of Three True Outcomes: before then, there were seasons that touched 25 percent of all plate appearances, but since then the share has remained above that level. We’ve gotta cut it off somewhere! This gives us 1,189 pitcher seasons to examine.

I compared strikeout rate (K/PA) to a handful of potential indicators: SwStr%, Whiff%, overall pitches in the zone (Zone%), overall swing rate (Swing%), swing rate on pitches outside the zone (O-Swing%), first strike percentage (F-Strike%), fastball frequency (FA%, based on PITCHf/x), fastball velocity (vFA, also based on PITCHf/x), and slider frequency (SL%, based on PITCHf/x). I included the slider to test whether pitch-type usage correlates with strikeouts at all: since the slider had the highest pitch value in the sample, I figured that if it wasn’t predictive, no pitch would be (I didn’t feel the need to test all pitch types).
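For readers who want to reproduce this kind of comparison, here is a minimal sketch of a one-variable R-squared calculation with the 350-batter filter applied first. The pitcher seasons in it are made up, and the helper name r_squared is my own; the original numbers came from FanGraphs leaderboards, not from this code.

```python
def r_squared(xs, ys):
    """R-squared of a one-variable least-squares fit (squared Pearson r)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


MIN_TBF = 350  # batters-faced cut-off from the text

# Made-up pitcher seasons: (batters faced, indicator value, K%)
seasons = [
    (720, 0.24, 0.27),
    (650, 0.18, 0.19),
    (410, 0.15, 0.16),
    (120, 0.30, 0.35),  # dropped: under 350 batters faced
    (880, 0.21, 0.20),
    (505, 0.12, 0.13),
]

qualified = [(x, y) for tbf, x, y in seasons if tbf >= MIN_TBF]
indicator = [x for x, _ in qualified]
k_pct = [y for _, y in qualified]
r2 = r_squared(indicator, k_pct)
print(f"{len(qualified)} seasons, R-squared = {r2:.3f}")
```

Running the same function once per indicator column would give a table of R-squared values like the one in the results.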

Results

Indicator	R-squared
Whiff%	0.695
SwStr%	0.676
F-Strike%	0.190
vFA (pfx)	0.180
O-Swing%	0.129
Swing%	0.029
Zone%	0.014
SL% (pfx)	0.013
FA% (pfx)	0.011
UBB%	0.008
Whiff/vFA	0.709

Just like last time, we see that Whiff% is our best indicator for K%, with an R-squared of nearly .7, meaning that Whiff% explains just shy of 70 percent of the variance in K% for the league’s qualified pitchers. The only other factors that tell us anything at all are first strike percentage, fastball velocity, and the ability to get swings on pitches outside of the zone (which, really, is captured at least in part by Whiff%). However, if we were to combine Whiff%, first strike rate and fastball velocity into one metric, it would have an R-squared of .737, hardly more than Whiff% alone.

From here, we can take the formula that our regression spits out and develop an Expected Strikeout Percentage (xK%): the regression’s Whiff%-based estimate gets 69.5% of the weight (matching its R-squared), and the league-average K% for that season makes up the remaining 30.5%. The table below shows some of the “luckier” and “unluckier” pitchers in the sample based on xK%-K% differential.

Season	Name	Team	IP	K%	Whiff%	xK%	xK%-K%
2007	Erik Bedard	Orioles	182	30.20%	24.34%	21.60%	-8.60%
2009	Phil Hughes	Yankees	86	27.40%	21.76%	20.11%	-7.29%
2011	Vance Worley	Phillies	131.2	21.50%	12.91%	14.25%	-7.25%
2011	Cliff Lee	Phillies	232.2	25.90%	19.46%	18.72%	-7.18%
2012	Stephen Strasburg	Nationals	159.1	30.20%	25.28%	23.07%	-7.13%
2009	Tim Lincecum	Giants	225.1	28.80%	24.15%	21.75%	-7.05%
2011	Cory Luebke	Padres	139.2	27.80%	22.66%	20.91%	-6.89%
2012	Mike Fiers	Brewers	127.2	25.10%	18.28%	18.29%	-6.81%
2009	Jake Peavy	– – –	101.2	26.80%	21.61%	20.01%	-6.79%
2006	Ben Sheets	Brewers	106	27.00%	22.47%	20.23%	-6.77%

2007	Mike Maroth	– – –	116.1	9.30%	14.32%	14.75%	5.45%
2012	Jeanmar Gomez	Indians	90.2	11.90%	17.03%	17.43%	5.53%
2009	Trevor Cahill	Athletics	178.2	11.60%	17.41%	17.14%	5.54%
2007	Lenny DiNardo	Athletics	131.1	10.60%	16.59%	16.31%	5.71%
2008	Josh Rupe	Rangers	89.1	13.50%	20.84%	19.33%	5.83%
2006	Chien-Ming Wang	Yankees	218	8.40%	13.72%	14.25%	5.85%
2012	Aaron Cook	Red Sox	94	4.90%	7.68%	11.04%	6.14%
2006	Runelvys Hernandez	Royals	109.2	9.80%	16.63%	16.24%	6.44%
2012	Derek Lowe	– – –	142.2	8.60%	13.61%	15.09%	6.49%
2009	Shairon Martis	Nationals	85.2	9.00%	15.78%	16.02%	7.02%

Here we see that 2007 Erik Bedard was the “luckiest,” in that he got the highest K% that wasn’t backed up by an equally strong Whiff%. In fact, we see several extremely high K% seasons on the list, which makes sense: those types of extreme seasons are less likely to be sustainable, and thus any regressed formula will flag them as outliers. Shairon Martis, on the other hand, was the unluckiest in 2009, striking out just nine percent of batters despite having the Whiff% of a pitcher who would expect to strike out about 16% of batters.
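The differential itself is simple arithmetic. As a small sketch, here it is for a handful of rows copied from the table above (values in percentage points); the function name and the choice of rows are mine.

```python
# (season, name, K%, xK%) copied from the table above, in percentage points
rows = [
    (2009, "Shairon Martis", 9.00, 16.02),
    (2011, "Cliff Lee", 25.90, 18.72),
    (2006, "Chien-Ming Wang", 8.40, 14.25),
    (2007, "Erik Bedard", 30.20, 21.60),
    (2012, "Aaron Cook", 4.90, 11.04),
]


def differential(k_pct, xk_pct):
    """xK% minus K%: negative means more strikeouts than the whiffs suggest."""
    return round(xk_pct - k_pct, 2)


# Sort from "luckiest" (most negative) to "unluckiest" (most positive).
ranked = sorted(rows, key=lambda r: differential(r[2], r[3]))
for season, name, k, xk in ranked:
    print(f"{season} {name:<16} {differential(k, xk):+.2f}")
```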

As a Predictive Model

In order to see if this xK% would be at all predictive, I ran two regressions and compared them. First, I compared Year-1 K% to Year-2 K%, and then Year-1 xK% to Year-2 K%. If xK% correlates more strongly with future strikeout percentage, it may be useful as a predictive tool moving forward. Our sample was now limited to 669 pitcher seasons, as this filters out all 2012 seasons (since we don’t have enough 2013 data yet) as well as any pitchers who failed to face 350 batters the following season.
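The pairing step described above can be sketched roughly as follows, assuming seasons are keyed by pitcher and year. The pitchers, field names, and numbers here are hypothetical; only the 350-batter minimum comes from the text.

```python
MIN_TBF = 350  # batters-faced minimum, applied to both seasons


def year_pairs(seasons, metric):
    """Pair a season's metric with the same pitcher's next-season K%.

    seasons: dict mapping (pitcher, year) -> {"tbf": ..., "k_pct": ..., metric: ...}
    Returns a list of (Year-1 metric, Year-2 K%) tuples.
    """
    pairs = []
    for (pitcher, year), row in sorted(seasons.items()):
        following = seasons.get((pitcher, year + 1))
        if following is None:
            continue  # no next season on record (e.g. every 2012 season)
        if row["tbf"] < MIN_TBF or following["tbf"] < MIN_TBF:
            continue  # one of the two seasons misses the cut-off
        pairs.append((row[metric], following["k_pct"]))
    return pairs


# Hypothetical pitchers A, B, and C
seasons = {
    ("A", 2010): {"tbf": 700, "k_pct": 0.25, "xk_pct": 0.22},
    ("A", 2011): {"tbf": 680, "k_pct": 0.23, "xk_pct": 0.22},
    ("A", 2012): {"tbf": 300, "k_pct": 0.24, "xk_pct": 0.21},
    ("B", 2010): {"tbf": 500, "k_pct": 0.15, "xk_pct": 0.17},
    ("B", 2011): {"tbf": 520, "k_pct": 0.16, "xk_pct": 0.17},
    ("C", 2010): {"tbf": 450, "k_pct": 0.20, "xk_pct": 0.19},
    ("C", 2011): {"tbf": 400, "k_pct": 0.19, "xk_pct": 0.18},
}

k_pairs = year_pairs(seasons, "k_pct")
xk_pairs = year_pairs(seasons, "xk_pct")
print(k_pairs)
print(xk_pairs)
```

Each list of pairs can then be fed to a regression to get the R-squared against next-year K%.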

Metric	R-squared with Year-2 K%
Year-1 K%	0.570
Year-1 xK%	0.438
Year-1 Whiff%	0.424
Year-1 xxK%	0.507
Year-1 xxxK%	0.748

Unfortunately, xK% does not do a better job predicting the next year’s strikeout rate than strikeout rate alone. Even if we replace the 30.5% league average factor we used initially with the pitcher’s own K% (assuming, perhaps, that the variance is explained partially by some inherent strikeout ability), the R-squared still doesn’t match K% alone. (This second attempt is “xxK%” in the table above.)

One More Try

Finally, I went back and used Whiff% and vFA together (R-squared of .71 with same-season K%), and used those two to explain 71% of K%, with the league average making up the remaining 29%. And guess what? An R-squared of .75! (This is “xxxK%” in the table above, or my third try at it.)

New xK% = (.71*((.9058*Whiff%)+(.0027*vFA)-.2305))+(.29*LgAvgK%)
That R-squared of .75 means the new xK% can explain about 75% of the variance in the next year’s K%, beating the .57 we got from the previous year’s K% alone.
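The formula above translates directly into a few lines of code. This sketch assumes Whiff% and league-average K% are entered as fractions (0.25 for 25%) and vFA in miles per hour, which is the reading under which the coefficients give sensible outputs; the function name and the sample inputs are illustrative, not real pitcher data.

```python
def new_xk(whiff: float, vfa_mph: float, lg_avg_k: float) -> float:
    """New xK%: 71% regression estimate from Whiff% and vFA, 29% league K%.

    whiff, lg_avg_k: fractions (0.25 means 25%) -- assumed units
    vfa_mph: average fastball velocity in mph -- assumed units
    """
    regression_estimate = 0.9058 * whiff + 0.0027 * vfa_mph - 0.2305
    return 0.71 * regression_estimate + 0.29 * lg_avg_k


# Hypothetical pitcher: 25% Whiff%, 93.0 mph fastball, 19% league-average K%.
estimate = new_xk(whiff=0.25, vfa_mph=93.0, lg_avg_k=0.19)
print(f"New xK% = {estimate:.1%}")
```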

The one caveat here is that our sample has been limited to 560 pitcher seasons, accounting for the loss of 2012 data, the attrition of pitchers not facing the minimum batters, and the elimination of pitchers who don’t throw a basic fastball, per PITCHf/x. Those are some weighty caveats, especially the last one.

Conclusions

I still think xK% has some value, and perhaps people smarter than I am can suggest some improvements to make the predictor more effective or based on fewer assumptions. As it stands, the new xK% explained about 75% of the variance in year-after K%. With that in mind, here’s one final table, looking at pitchers who would have been expected to regress in 2013 based on their 2012 numbers, as well as their K% so far.

Season	Name	K%	Whiff%	xK%	xK%-K%	2013 K%	Change K%
2012	Mike Fiers	25.10%	18.28%	18.00%	-7.10%	2.90%	-22.20%
2012	Stephen Strasburg	30.20%	25.28%	24.00%	-6.20%	23.90%	-6.30%
2012	Cliff Lee	24.40%	17.68%	18.33%	-6.07%	20.10%	-4.30%
2012	Brad Lincoln	24.30%	17.81%	18.72%	-5.58%	20.00%	-4.30%
2012	Max Scherzer	29.40%	25.52%	23.85%	-5.55%	35.10%	5.70%
2012	Marco Estrada	25.40%	20.99%	20.15%	-5.25%	22.70%	-2.70%
2012	David Phelps	23.20%	17.57%	18.07%	-5.13%	26.50%	3.30%
2012	David Price	24.50%	18.86%	19.82%	-4.68%	20.80%	-3.70%
2012	Gio Gonzalez	25.20%	21.23%	20.92%	-4.28%	25.00%	-0.20%
2012	R.A. Dickey	24.80%	24.11%	20.79%	-4.01%	18.50%	-6.30%

2012	Hector Noesi	15.00%	18.93%	19.23%	4.23%	14.60%	-0.40%
2012	Blake Beavan	10.50%	12.47%	14.79%	4.29%	14.30%	3.80%
2012	Alex White	13.90%	17.52%	18.24%	4.34%	N/A	N/A
2012	Henderson Alvarez	9.80%	11.09%	14.30%	4.50%	N/A	N/A
2012	Josh Tomlin	12.40%	16.19%	16.91%	4.51%	N/A	N/A
2012	Dallas Keuchel	10.10%	13.16%	14.73%	4.63%	12.80%	2.70%
2012	Nick Blackburn	9.20%	11.38%	13.99%	4.79%	N/A	N/A
2012	Josh Roenicke	14.10%	19.21%	19.48%	5.38%	16.90%	2.80%
2012	Jeanmar Gomez	11.90%	17.03%	17.60%	5.70%	15.80%	3.90%
2012	Aaron Cook	4.90%	7.68%	11.51%	6.61%	N/A	N/A

We see that most of last year’s extreme outliers have seen their K%s regress as we’d expect, although of course it’s extremely early still.
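As a rough tally of that claim, the sketch below counts how many pitchers in the table moved in the direction xK% suggested, meaning the sign of Change K% matches the sign of xK%-K%. It uses only the 15 rows that have a 2013 K%, and this sign-matching rule is my own simple yardstick rather than part of the model.

```python
# (name, xK%-K%, Change K%) for every table row that has a 2013 K%
rows = [
    ("Mike Fiers", -7.10, -22.20),
    ("Stephen Strasburg", -6.20, -6.30),
    ("Cliff Lee", -6.07, -4.30),
    ("Brad Lincoln", -5.58, -4.30),
    ("Max Scherzer", -5.55, 5.70),
    ("Marco Estrada", -5.25, -2.70),
    ("David Phelps", -5.13, 3.30),
    ("David Price", -4.68, -3.70),
    ("Gio Gonzalez", -4.28, -0.20),
    ("R.A. Dickey", -4.01, -6.30),
    ("Hector Noesi", 4.23, -0.40),
    ("Blake Beavan", 4.29, 3.80),
    ("Dallas Keuchel", 4.63, 2.70),
    ("Josh Roenicke", 5.38, 2.80),
    ("Jeanmar Gomez", 5.70, 3.90),
]

# Same sign on both numbers means K% moved toward what xK% expected.
matches = [name for name, diff, change in rows if diff * change > 0]
misses = [name for name, diff, change in rows if diff * change <= 0]

print(f"{len(matches)} of {len(rows)} moved in the expected direction")
print("Exceptions:", ", ".join(misses))
```

That works out to 12 of 15, with Scherzer, Phelps, and Noesi as the exceptions so far.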

Finally, I’d like to apologize for the wordiness and length of this piece – I was finding new things out as I went, so some of it was written as my method developed.

It was also pointed out to me as I was finishing that Michael Barr tackled this with a similar methodology for FanGraphs+. I didn’t read the piece until after mine was done, so my apologies to Michael if there is any unnecessary overlap. His also narrowly beats mine with an R-squared of .76, but perhaps this is because his was done at the start of 2012 and mine uses a different data set.

Any suggestions for improvement or criticisms?

Blake Murphy Sports Writing

Personal sports writing archive of Blake Murphy.