Posted by: Peter | July 24, 2009

Hang ‘em high

I mentioned the other day that Anthony Swarzak seemed to throw all his pitches up in the strike zone. Today I thought I’d see if I could confirm that intuition.

The pitchf/x system includes a variable called pz, which indicates how far off the ground a pitch is, in feet, when it crosses home plate. I looked at all the pitchers who had thrown at least as many pitches as Swarzak this year (which as of his last start is 663), and calculated the average height of their pitches. This is a pretty crude measure, since it doesn’t distinguish between fastballs and breaking balls, or correct for the height of the batter, or anything like that. Still, it’s a good rough-and-ready measure of how much a guy pitches up or down in the zone.
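For anyone who wants to try this at home, here is a minimal sketch of the calculation in Python. It assumes the pitch data has already been boiled down to (pitcher name, pz) pairs; the function name and the toy rows are made up for illustration and are not real pitchf/x data.

```python
from collections import defaultdict

def average_pitch_height(pitches, min_pitches):
    """Rank pitchers by mean pz (feet above the ground at the plate).

    pitches: iterable of (pitcher_name, pz) pairs (hypothetical layout).
    Returns a list of (name, (pitch_count, mean_pz)), lowest mean first,
    keeping only pitchers with at least min_pitches pitches.
    """
    totals = defaultdict(lambda: [0, 0.0])  # name -> [count, sum of pz]
    for name, pz in pitches:
        totals[name][0] += 1
        totals[name][1] += pz
    averages = {
        name: (count, total / count)
        for name, (count, total) in totals.items()
        if count >= min_pitches
    }
    return sorted(averages.items(), key=lambda item: item[1][1])

# Toy rows for illustration only
sample = [("A", 2.0), ("A", 2.2), ("B", 3.0), ("B", 2.8), ("C", 1.5)]
ranking = average_pitch_height(sample, min_pitches=2)
```

With the real data, min_pitches would be 663, the first ten entries would be the low-ball guys and the last ones the high-ball guys.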

The bottom 10–that is, the guys whose pitches are lowest on average–are:

Name                 Pitches Avg. Height
Peter Moylan          676    1.949021
Derek Lowe           2059    1.968820
Brad Ziegler          693    2.036996
Armando Galarraga    1650    2.053054
Todd Coffey           717    2.053136
Joel Pineiro         1655    2.065794
Ian Snell            1413    2.078838
Billy Buckner         697    2.104546
Joel Hanrahan         733    2.119492
Matt Albers           674    2.129280

Not too surprising to see sinkerballers and submariners on this list. And the top 10, whose pitches fly the highest?

Name                 Pitches Avg. Height
Anthony Swarzak       663    2.910861
David Aardsma         782    2.795220
Justin Verlander     2170    2.760634
Chris Young          1299    2.758913
Rich Hill            1064    2.754677
Clayton Kershaw      1886    2.744506
Russ Springer         680    2.735875
Kevin Millwood       2186    2.734052
Barry Zito           1894    2.721184
Scott Baker          1783    2.719218
J.A. Happ            1477    2.717607

Evidently, it was not my imagination. This dude is leaving pitches up, in a big way.  And the gap between him and the next guy on the list is really big–almost an inch and a half. Here’s a picture to give a sense of what these numbers mean. It’s the average pitch height for Swarzak, the guy right behind him on the list, the guy right in the middle of the list, and the lowest-throwing guy on the list.

swarzakheight0723

I’m still not sure what all this means, but I can’t believe it’s sustainable.

For years now, we’ve heard people complaining about how sabermetrics and data analysis are taking the fun out of the game, and ruining it for everyone who just wants to enjoy the fun and the myth and the romance of it, without all those grubby numbers.

But to me, the objective and aesthetic sides of appreciating the game are linked. Below are a couple of examples of ways that examining the numbers helps me appreciate the beauty of this game, beyond the wins and losses.

Beating the odds

Looking over my post on Joe Mauer and hitting .400, it occurred to me that there was one other question I should have addressed. I noted how unlikely it is, in any given season, that somebody will hit .400. But we might also ask: what are the chances that, in all the years since Ted Williams did it, nobody got to .400 even once? To answer this just means combining all the year-by-year probabilities to figure out the chances that someone, sometime after 1941, would get to a specified batting average. Without going into the boring details, here’s the answer:

probs1941
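The "boring details" are short enough to sketch. Assuming each season is independent of the others, the chance that something happens at least once is one minus the chance that it never happens. The function name is mine, and the inputs in the comment are placeholders rather than the yearly probabilities behind the graph.

```python
import math

def prob_at_least_once(yearly_probs):
    """Chance an event occurs in at least one year, assuming independent years."""
    p_never = math.prod(1.0 - p for p in yearly_probs)
    return 1.0 - p_never

# Placeholder input: two coin-flip seasons give a 75% chance of at least one success
two_coin_flips = prob_at_least_once([0.5, 0.5])
```

Feeding in one probability per season since 1941, for each batting average threshold, gives one point on a curve like the one above.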

What I love about this graph is how serendipitously wonderful it is, in terms of the aesthetics of being a baseball fan. Mathematically, there’s nothing special about the number .4; it’s just a number. But to baseball fans, with our oddly numerological fixations, it seems like a magical threshold. So what good fortune for it to turn out that, when you run the probabilities, things turn out like they do above. To hit .380 is hard, but someone was liable to do it sometime since 1941, just by chance. To hit .390 is much harder, improbable even–but the fact that someone has managed to do it is merely surprising, not awe-inspiring. And on the other side, to hit .410 or .420 is so outlandishly improbable that we’ll almost certainly never see it (although on outlandish improbability, see below).

But .400 is something else altogether. It’s sitting right there on the edge of possibility. Maybe someday, someone will get there–but if they do, it will mean literally “beating the odds”, doing something truly improbable. In other words, the “statistical significance” of .400, as we say in the jargon of data analysis, is commensurate with the emotional significance we place on it as fans.

And now for something completely improbable

Finally, I want to return to the wonderful Stephen Jay Gould one more time. In another of his writings on baseball, Gould noted that most of the feats we celebrate in the game are actually not that statistically surprising–with one important exception:

Nothing ever happened in baseball above and beyond the frequency predicted by coin-tossing models. The longest runs of wins or losses are as long as they should be, and occur about as often as they ought to. Even the hapless Orioles, at 0 and 21 to start this season, only fell victim to the laws of probability (and not to the vengeful God of racism, out to punish major league baseball’s only black manager).

But “treasure your exceptions,” as the old motto goes. There is one major exception, and absolutely only one—one sequence so many standard deviations above the expected distribution that it should not have occurred at all. Joe DiMaggio’s fifty-six–game hitting streak in 1941. The intuition of baseball aficionados has been vindicated. Purcell calculated that to make it likely (probability greater than 50 percent) that a run of even fifty games will occur once in the history of baseball up to now (and fifty-six is a lot more than fifty in this kind of league), baseball’s rosters would have to include either four lifetime .400 batters or fifty-two lifetime .350 batters over careers of one thousand games. In actuality, only three men have lifetime batting averages in excess of .350, and no one is anywhere near .400 (Ty Cobb at .367, Rogers Hornsby at .358, and Shoeless Joe Jackson at .356). DiMaggio’s streak is the most extraordinary thing that ever happened in American sports. He sits on the shoulders of two bearers—mythology and science. For Joe DiMaggio accomplished what no other ballplayer has done. He beat the hardest taskmaster of all, a woman who makes Nolan Ryan’s fastball look like a cantaloupe in slow motion—Lady Luck.

Here’s another graph to put this one in perspective. Let’s take a couple of hypothetical hitters. Assume that the “true” talent level of one is that of a .300 hitter, while the other is really a .350 hitter. The latter would be a very, very, good hitter, of course, which should make it easier to sustain a hitting streak.

Now assume these guys get five official at-bats, every single night. This isn’t very realistic either, in that it doesn’t account for walks, and injuries, and low-offense games, and so on. So again, this assumption should make it easier to sustain a streak.

So given these assumptions, what’s the chance of sustaining a long streak? Again, it’s simple probabilities, calculated with the binomial distribution. If you’re a true .300 hitter, your chance of getting a hit is .3 for any given at bat. The chance of getting at least one hit over 5 at-bats is 1-(.7*.7*.7*.7*.7)=0.83, or 83% (.7 is the chance that you won’t get a hit on one at bat, and you multiply those probabilities to get the chance of going hitless over all five at-bats.) The chance of getting a hit in back-to-back games is 0.83*0.83=0.69 or 69%, and so on.
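The same arithmetic can be sketched in a few lines of Python. This follows the assumptions above (a fixed "true" average, five at-bats every game, every at-bat independent); the function names are just for illustration. Note that it gives the chance of hitting safely in each of a given run of games, not the chance that a streak turns up somewhere in a season.

```python
def game_hit_prob(avg, at_bats=5):
    """Chance of at least one hit in a game with a fixed number of at-bats."""
    return 1.0 - (1.0 - avg) ** at_bats

def streak_prob(avg, games, at_bats=5):
    """Chance of getting a hit in every one of `games` consecutive games."""
    return game_hit_prob(avg, at_bats) ** games

for avg in (0.300, 0.350):
    print(f"{avg:.3f} hitter: one game {game_hit_prob(avg):.4f}, "
          f"56 straight {streak_prob(avg, 56):.6f}")
```

Changing at_bats to 4 shows how much the five-at-bat assumption flatters the hitter.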

So without further ado:

probhitstreak

This is what Gould was talking about. You’re getting into crazy territory after 30 games, let alone 50. That streak isn’t just improbable. It’s wildly, absurdly, inhumanly impossible. No-one should ever have been able to do it. (Under these assumptions, a 56-game streak is about a 0.003% chance for the .300 hitter, and still only about 0.1% for the .350 hitter, if you’re wondering.) And so most likely, nobody will ever do it again.

I have to admit, I didn’t care that much about Joe DiMaggio before I learned about all this. I mean, I’m not a Yankee fan, and all this happened before I was born–hell,  before my father was born. But looking at those mind-boggling numbers, I find myself impressed that anyone could do what he did. This isn’t about Moneyball–hitting streaks don’t mean much, in terms of wins and losses. Rather, it goes to the part of the game that’s just beautiful and wonderful, and not entirely rational. The part, that is, that makes us want to watch it.

Posted by: Peter | July 21, 2009

Hitting the big time

Well, that was cool. I was certainly hoping to catch some attention with the last post, but I never expected such effusive praise from the godfather of Twins-blogging himself! Of course, now I feel like I actually need to start posting here regularly. Hopefully posts will be somewhat more common than .400 hitters from here on out.

I am not in any way interested in talking about what happened last night, though. So let’s look ahead instead. Kevin Slowey is still hurting, so we’re going to get another Anthony Swarzak start. I thought I’d dig into my pitchf/x database and see what Swarzak’s six starts so far this year have looked like. Below are charts of where his pitches crossed the plate, seen from the catcher’s perspective. The colors correspond to pitch type, classified using a cluster analysis of my own devising. I think Swarzak is actually supposed to have two fastballs, but I lumped them all together because it’s too hard to distinguish them without doing some data corrections that I don’t have time for right now. B, S and X refer to pitches that ended up as balls, strikes, or in play (for an out or a hit):

swarzakscout0720

My first reaction was: egads, this guy loves to hang out in the top of the strike zone! “Keeping the ball up” is not generally regarded as a winning strategy in major league baseball, so this is not a good sign for the future.

Next, just for fun and because I figured out how to do it, I thought I’d make plots of what Swarzak’s pitches look like coming in to a right-handed hitter, and to a lefty. See here for more on these plots.

swarzaklr0720

I love these plots, because they give those of us who don’t actually play the game some sense of what the mysterious “platoon split” is about. To a right-handed batter, Swarzak’s three pitches look very similar coming in, whereas a left-hander can easily pick up the path of the curveball and off-speed pitches.

Of course, so far this year Swarzak has had a crazy reverse split: a 1.63 WHIP against righties versus only 1.25 against lefties, which would seem to make a mockery of the above analysis. But that’s why they warn you about small sample sizes: I don’t think he’s going to sustain that any more than I think he’s going to keep sneaking those chest-high fastballs by hitters.

On the positive side for Swarzak, he’s going from getting shellacked by the AL’s best offense (826 OPS) to facing its worst (698 OPS). So maybe his luck will last for at least one more game.

Posted by: Peter | July 9, 2009

Could Joe Mauer hit .400?

Well, of course he could. But what we (and by we, I mean baseball nerds with way too much time on our hands) want to know is, how likely is it that Mauer will hit .400?

The conventional wisdom is that it’s no longer possible to hit .400–or at least it’s much more difficult than it used to be. The absence of any .400 hitters since Ted Williams would seem to confirm that diagnosis. But John Bonnes, the Twins Geek, has a provocative post up today arguing that in fact, it’s getting easier to hit .400. His evidence is simply that of all the players who have come close to .400 since Williams, the majority have done it in the past 15 years. On top of that, some–like George Brett and Tony Gwynn–have come within a few hits of the achievement. On the basis of this observation, the Geek says that maybe Mauer has more of a shot than we think.

This is, on the face of it, intriguing evidence. But the Geek is a numerically astute guy, and so I was a little disappointed that he didn’t mention a sabermetric classic on this topic: the late paleontologist Stephen Jay Gould’s essay on the disappearance of the .400 hitter (that’s not a link to the actual essay, which I couldn’t find online). Gould was arguing against people who thought that the decline of .400 hitting was due to the declining quality of hitters. Gould argued that paradoxically, the decline of .400 hitting was due to the fact that all players were actually getting better. Because the general level of play was higher, it was more difficult for any player to be so far ahead of the pack that they could hit .400.

Gould supported this argument by showing that batting averages had become less variable. That is, there are both fewer really good hitters and fewer really bad hitters, because everyone is more bunched together within a narrower range of batting averages. When he did this analysis originally, back in the 1980’s, he painstakingly put together the data by hand, while he was laid up in bed recovering from an illness. But today, of course, we have the statistics at our fingertips. So, using data from the Lahman database, I thought I’d extend Gould’s analysis and see what it has to say about the Twins Geek’s hypothesis.

All the graphs below are based only on players who meet the modern definition of qualifying for a batting title: 3.1 plate appearances per team game played. This wasn’t actually the rule used prior to 1957, but I applied it anyway for simplicity.

First off, here’s a picture that demonstrates what I mean when I talk about the decreasing variation in batting averages. It’s a comparison of the distribution of batting averages in 1900 and in 2000.

avgdist

You can see that the hitters in 1900 were more spread out than the hitters in 2000. There are more hitters with really high averages, but also more hitters with really low averages. Even though the average hitter had a higher batting average in 1900 (signified by the peak of the curve being farther to the right), there were still more hitters down around the .200 mark (the red line is above the black line at the left end.) Back in 1900, a “good glove, no hit” infielder could still find a starting job in a way he couldn’t today.

To get a general idea of how batting average has become less variable over time, we can look at the standard deviation of batting average by season. The standard deviation essentially measures how spread out the distribution of batting averages is. (Roughly speaking, it measures the typical distance from the mean.) The higher the standard deviation, the more spread out the batting averages are. See this graph, which is adapted from one that Gould originally produced:

sdavgtrend
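Each point on a graph like this comes from a calculation along the following lines. This is only a sketch: it assumes each hitter has been reduced to a (batting average, plate appearances) pair, applies the 3.1 plate appearances per team game cutoff described above, and uses the population standard deviation, which is my choice here rather than something Gould specified. The rows are toy values.

```python
from statistics import mean, pstdev

def qualifies(plate_appearances, team_games):
    """Modern batting-title rule: 3.1 plate appearances per team game."""
    return plate_appearances >= 3.1 * team_games

def season_spread(hitters, team_games):
    """Mean and standard deviation of batting average among qualifiers.

    hitters: iterable of (batting_average, plate_appearances) pairs.
    """
    avgs = [avg for avg, pa in hitters if qualifies(pa, team_games)]
    return mean(avgs), pstdev(avgs)

# Toy season: the last hitter falls short of the 502.2 PA cutoff and is dropped
toy = [(0.300, 600), (0.260, 550), (0.280, 510), (0.400, 100)]
season_mean, season_sd = season_spread(toy, team_games=162)
```

Running this once per season and plotting season_sd against the year gives the trend line.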

You can see that up through the early 1980’s, when Gould’s analysis was done, batting average was becoming less and less variable. This happened even as the average level of batting average bounced around between “pitcher-friendly” and “hitter-friendly” eras:

meanavgtrend

But if you look at those graphs, you’ll notice that something happened in the ’90’s: batting averages went up overall, and the variation in averages also went up. Whether that was because of expansion, or the steroid era, or whatever, I can’t say. But that’s not what I’m interested in explaining. I want to know how easy it is to hit .400 these days. Higher batting averages + more variability should equal a better chance of a .400 hitter. But how much better?

Fortunately, it’s possible to get an answer to this question that’s at least reasonably precise. If you look back at the first graph, you’ll see that the batting averages of all the qualifying hitters in any one season approximate a bell curve, or what’s called a normal distribution. And the nice thing about things that are normally distributed is that we can predict the probability that a normally distributed variable will take on any particular value. If we know what the average of all batting averages is, and we know the standard deviation of batting averages, we can predict the probability that a particular player will hit .400 or above.

This means that we can predict the probability of hitting .400 in each year. In order to smooth out year-to-year fluctuations in the mean and standard deviation of batting averages, I took the average of the previous five years. Then I calculated, for each year, the probability that some hitter would hit .400 or better that year. For comparison, I also calculated the probabilities for hitting .380 and .390. Keep in mind this is the probability of any hitter getting to .400 (based on the number of people who qualified for the batting title that year), not the probability of any one particular hitter doing it.
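Here is a minimal sketch of that calculation, using the normal distribution from Python's standard library. It assumes every qualifier is an independent draw from the same bell curve, which is my reading of the method rather than a detail spelled out above; the function names are for illustration.

```python
from statistics import NormalDist

def prob_single_hitter(threshold, league_mean, league_sd):
    """Chance that one qualifier hits at or above the threshold."""
    return 1.0 - NormalDist(league_mean, league_sd).cdf(threshold)

def prob_any_hitter(threshold, league_mean, league_sd, n_qualifiers):
    """Chance that at least one of n independent qualifiers gets there."""
    p_one = prob_single_hitter(threshold, league_mean, league_sd)
    return 1.0 - (1.0 - p_one) ** n_qualifiers
```

Calling prob_any_hitter(0.400, ...) with a season's smoothed mean, smoothed standard deviation and number of qualifiers gives that season's point on the .400 curve; swapping in 0.380 or 0.390 gives the comparison curves.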

probavg

The first thing I have to say here is that Twins Geek was really onto something. The chances of somebody hitting .400 jumped up in the last 15 years, to levels not seen since before World War II.

That’s the good news for Joe Mauer. The bad news is that this trend seems to have reversed itself, and things are back to the way they were in the 1980’s. If you look at the charts above, you’ll see that this is not necessarily because averages have come down overall (although they have, some), but because they’ve become less variable, more bunched together.

The other bad news for Joe, of course, is that even in the batting bonanza of the late 1990’s, the chances that anyone would hit .400 were never even 0.5%. It’s just a really hard thing to do. Which is all the more reason that despite all that I’ve said here, I’m going to keep on watching and rooting for Mauer along with the Twins Geek.

Posted by: Peter | April 13, 2009

Game 8: Ick.

Well, that wasn’t much fun. Kevin Slowey had a bad night, and judging from the pitch f/x data, the culprits were his curveball and changeup. Or the lack thereof:

Kevin Slowey's pitch locations


The filled circles are curveballs, and the triangles are changeups. Slowey threw a grand total of six curveballs and four changeups. And it’s probably just as well that he didn’t throw any more, since five of those ten pitches were turned into hits. But the fact remains that Slowey doesn’t have overpowering stuff, so he won’t be very successful if he has to rely on only his fastball and his slider.

Hopefully Slowey will find the feel for those offspeed pitches the next time around. Things look less good for Toronto’s Jesse Litsch, who came out of the game in the fourth inning with an arm injury of uncertain severity. Watching the game, it appeared that Litsch suddenly experienced some kind of pain and then immediately took himself out of the game. But look at this plot of the spin and speed of his pitches, by inning:

Jesse Litsch's pitches


It looks like something was already wrong with Litsch after the first inning–his velocity went down, his slider wasn’t breaking as hard, and his fastball had a different spin, more like a changeup. Maybe that explains why he got hit so hard.

Posted by: Peter | September 30, 2008

One Pitch

That’s what the Twins season comes down to: one pitch. Specifically, Nick Blackburn’s 76th pitch, which Jim Thome blasted over the centerfield wall to score the only run of game #163. Below is the entire at-bat, as it appeared from the perspective of the umpire, with estimated pitch trajectories calculated from pitch f/x:

Jim Thome's fateful at-bat in Twins game number 163

Jim Thome goes deep

The pitch was a hanging changeup, and it richly deserved to be hit a long way. Still, how heartbreaking to have the season end because of that one mistake.

Posted by: Peter | September 26, 2008

Series of the Year

Obviously, sweeping the White Sox was huge for the Twins’ season. But how huge? I decided to run a little simulation, to see what the team’s chances of going to the playoffs were before and after this series. For most of the relevant games (the Twins-Royals series, the Sox-Indians series, the possible Sox-Tigers makeup game) I assumed each game is a toss-up, with each team having a 50-50 chance of winning. That’s obviously not quite right, what with pitcher matchups, home-field advantage, the difference in opponent quality, and so on, but it’s a good rough guide. To account for the fact that the Twins-Sox series was at the dome and the one-game playoff would be at US Cellular, I gave the Twins a 60% chance of winning the home games, but only a 40% chance of winning a playoff on the road.

Granted, this isn’t as complex or as accurate as something like Baseball Prospectus’s Postseason Odds Report, but the overall playoff odds that come from my quick-and-dirty method are pretty similar. And doing my own simulation allows me to look a little deeper into the likely scenarios, beyond what the BP page shows.
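A bare-bones version of the "after the sweep" half of the simulation might look like the sketch below. Several things here are assumptions for illustration, not details from this post: the starting win totals (87 for the Twins, 86 for the Sox, with the Sox having one game in hand), the rule that the makeup game is played only when it can still change the outcome, and all of the names.

```python
import random

def simulate_after_sweep(twins_wins=87, sox_wins=86, games_left=3,
                         p_game=0.5, p_twins_playoff=0.4,
                         trials=100_000, seed=2008):
    """Monte Carlo sketch of the last weekend; starting records are assumed."""
    rng = random.Random(seed)
    twins_in = makeup_played = playoff_played = 0
    for _ in range(trials):
        # Each remaining scheduled game is treated as a coin flip.
        twins = twins_wins + sum(rng.random() < p_game for _ in range(games_left))
        sox = sox_wins + sum(rng.random() < p_game for _ in range(games_left))
        # The Sox have played one fewer game. The makeup game only matters
        # if they are level with the Twins in wins or exactly one win behind.
        if sox in (twins, twins - 1):
            makeup_played += 1
            sox += rng.random() < p_game
        if twins > sox:
            twins_in += 1
        elif twins == sox:
            # Tie after 162 games: one-game playoff on the road.
            playoff_played += 1
            twins_in += rng.random() < p_twins_playoff
    return {
        "twins_playoffs": twins_in / trials,
        "makeup_game": makeup_played / trials,
        "one_game_playoff": playoff_played / trials,
    }

results = simulate_after_sweep()
```

With these assumed inputs the exact answers work out to 39/64, 35/64 and 35/128, so a run should land near 61%, 55% and 27%. The "before the sweep" case needs extra bookkeeping for the head-to-head games, which this sketch leaves out.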

So after simulating the end of the season 100,000 times, here’s what I got:

Before the sweep:

  • Probability of the Twins making the playoffs: 19%
  • Probability that the Sox have to play their makeup game: 27%
  • Probability that the season is decided by a one-game playoff: 13%
  • Most probable outcome: Sox win the division by 1.5 games (18%)

And after the sweep?

  • Probability of the Twins making the playoffs: 61%
  • Probability that the Sox have to play their makeup game: 55%
  • Probability that the season is decided by a one-game playoff: 27%
  • Most probable outcome: Twins win the division by one game (28%)

To say that this series saved the Twins season is putting it lightly.

Posted by: Peter | July 24, 2008

Youth Movement

Dave Studeman has an article in the Hardball Times today asking whether veteran players “know how to win”. Unsurprisingly, at least to me, there’s no evidence that older players are somehow better at handling the pressure of pennant races. But what really caught my attention was the appendix to his article reproduced below:

References and Resources
As a reference, here is a list of each team’s Win Shares Age this year. Win Shares age is essentially a team’s age weighted by the contribution of each player (as measured by Win Shares). There are several youth movements to note: the Giants are 3.2 years younger than last year’s team, the Twins are 2.5 years younger, the Dodgers are 2.3 years younger and the Rangers are 2.1 years younger.

Team        WSAge
MIN         25.7
TB          25.7
ARI         26.2
OAK         26.4
FLA         26.7
WAS         27.0
LAN         27.0
CLE         27.2
KC          27.2
TEX         27.2
ATL         27.3
PIT         27.4
LAA         27.4
COL         27.4
MIL         27.8
CIN         27.9
BAL         28.0
SF          28.1
STL         28.1
SEA         28.3
CHA         28.5
NYN         28.6
SD          28.7
BOS         28.7
CHN         28.9
DET         29.2
PHI         29.2
TOR         29.3
HOU         30.8
NYA         31.6

This “Win Shares Age” statistic is a little convoluted, but as best I can tell it measures: a) how young a team is; and b) how much the younger players are contributing to the team’s on-field success. And the table shows that the Twins are tied with the Rays for the title of “best young team”. And that’s even with Carlos Gomez batting leadoff for half the year. I have to say, this makes me optimistic about the next few seasons, even if I’m still skeptical that this is a playoff year for the Twins.
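If the description in the appendix is taken at face value, the statistic is a weighted average: each player's age counts in proportion to his Win Shares. Here is a minimal sketch under that assumption. It is not necessarily the Hardball Times' exact formula, and the roster is invented.

```python
def win_shares_age(players):
    """Average age weighted by Win Shares.

    players: list of (age, win_shares) pairs (hypothetical layout).
    """
    total_ws = sum(ws for _, ws in players)
    return sum(age * ws for age, ws in players) / total_ws

# Invented roster: the 35-year-old contributes nothing, so he doesn't count
toy_roster = [(24, 10), (30, 5), (35, 0)]
toy_age = win_shares_age(toy_roster)
```

That is why a team can carry a few old bench players and still come out "young": only the players producing wins move the number.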

Also: oh boy, were the Giants ever old last year! They had the biggest youth movement in the majors and they’re still in the older half of the league.

Posted by: Peter | May 17, 2008

Announcifying

I’m watching the Twins-Rockies game on mlb.tv, and man is this Colorado broadcast team awful. An inning ago, the color guy talked over the play-by-play guy so he could finish some pointless story about eating on the road. And now they just got themselves totally confused about which Hernandez brother is which.

Announcer 1: “El Duque is ten years younger than him.”

Announcer 2: “What? What did you just say? This guy is ten years older than El Duque? He’s, like, 61.”

They went around like this for a while until they figured out that they had the whole thing ass backwards. Truly terrible. I didn’t think a TV baseball crew could be this bad without involving Hawk Harrelson.

For the record, the Detroit broadcast team is the best one I’ve seen while watching mlb.tv. That color guy really teaches you things about the game.

Also, I’ve always wondered when someone would do webstreaming alternative play-by-play that you could run concurrently with the video feed of a game while muting the official announcers. I feel like a lot of people would be psyched about that.

Posted by: Peter | May 10, 2008

Gomez Update: Walk On

I had already finished and posted my Carlos Gomez analysis when the G-man went ahead and became the winning run in tonight’s exciting game. So of course I couldn’t go to sleep without writing up a coda on this game.

As Dick & Bert noted, this was only Gomez’s fourth walk of the year(!), so that in itself is noteworthy. Beyond that, though, I wondered: how did Gomez’s performance accord with my analysis? Any good scientist will tell you, after all, that one of the best tests of a model is how well it fits new data.

First, here’s a plot of all the pitches Gomez saw tonight:

After having just spent all day with Gomez’s pitch data, this graph immediately looked really weird to me. So much so, in fact, that I went and loaded up Gameday just to make sure I hadn’t plotted the data wrong.

What’s so odd? Well, the Red Sox decided to pitch Gomez inside tonight. And if you look back at the pitch plots in my earlier post, you’ll see that virtually no-one has done that this year. It’s been away, away, away.

Other than that, though, tonight mostly seems in keeping with my analysis of Gomez’s recent transformation into a better hitter. He laid off the pitches low and away, just as he has done since April 23rd. And the Red Sox didn’t give him much offspeed stuff, which is also consistent with the recent data. When Gomez did swing at pitches out of the strike zone, they were high pitches–again, consistent with what we’ve seen lately.

But the big news, of course, was that our man worked a walk. Here’s how he did it against Jonathan Papelbon. He got nothing but fastballs; he fouled off the ones in the strike zone, and he let the other ones go for balls. Simple as that. Observe how it’s done (red means foul, black means ball):

Of course, Carlos’s new plate discipline could still be a fluke. But you can’t help but love this at bat!

Meanwhile, the other hero of the game was Mike Lamb, who blooped a single to bring in the winning runs. That was a welcome change from what Lamb has done for most of this season, which is make tons and tons of outs. In fact, just as Lamb was coming up in the ninth inning, I was thinking that I needed to start working out my next in-depth analysis, tentatively titled “Why Does Mike Lamb Suck So Much?” And I’m still planning on doing it. But maybe if we’re lucky, tonight was the beginning of Lamb’s Gomez-like transformation from scrub into impact player.
