August 2007

Note: links to updated versions of scripts and database structure are found at the end of this post. Also, I highly recommend using XAMPP (or Mac equivalent) to install all this software with one easy installer rather than installing Perl, MySQL, etc., piecemeal.

One of my hopes for this project is to share the pitch data in a way that facilitates the analysis of others. There are many others out there whose knowledge of statistics, data analysis, graphics, physics, and/or baseball far exceeds mine.

My pitch database itself is too large a file to share easily, so I’ll do the next best thing, which I hope may turn out to be a better thing after all: document the process I used to create it so that others can build a pitch database for themselves. This post may be an ongoing work as I both recreate what I have already done and lay out what work remains for me to do.

First of all, if you just want to get your feet wet using Microsoft Excel to analyze a single game’s worth of pitch-by-pitch XML data, Dr. Alan Nathan has laid out the steps for you at his Physics of Baseball site.

If you desire a whole season’s worth of pitch data (over half a million pitches) stored in a relational database, with visions of all sorts of wonderful analysis that would enable, follow me!

Downloading the Data

The first place to start is with downloading the XML data from Major League Baseball’s Gameday website. But if we’re going to download thousands of games, each with hundreds of pitches, that’s not something we want to do manually. Fortunately, we can leverage a very useful book by Joseph Adler, Baseball Hacks, published by O’Reilly in January 2006. Some parts of his hacks are outdated or nonfunctional, but other parts I found very useful for this project.

Whether or not you want to buy the book, the Perl scripts he wrote for the book can be downloaded from the examples section of O’Reilly’s website. Download and uncompress the file. It contains all the scripts from the book, divided by chapter and hack number. The first script of interest, the one for XML downloading, is found in Chapter 3. This script can be used after only a few minor modifications, as follows:

To download the 2007 season starting with April 2, change Line 35 from
$start = timelocal(0,0,0,20,6,105);
to
$start = timelocal(0,0,0,2,3,107);

Similarly, you need to change the end time. Uncomment Line 40 and comment Line 41:
$now = timelocal(0,0,0,$mday - 1,$mon,$year);
#$now = timelocal(0,0,0,3,10,105);

These statements determine the first and last dates between which the XML game files will be downloaded. The statements use the function timelocal(seconds, minutes, hours, days, months, years), where days is the day of the month 1-31, months is the month 0-11 (January=0, December=11), and years are counted from 1900 (2007 = 107).
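As a quick check on those conventions, here is a small sketch in Python (not the Perl of the spider script) that turns a calendar date into the argument order timelocal expects. The helper names timelocal_args and timelocal_call are made up for this illustration.

```python
from datetime import date

def timelocal_args(d):
    """Return (sec, min, hour, mday, mon, year) for midnight on date d,
    using the conventions described above: month 0-11, year minus 1900."""
    return (0, 0, 0, d.day, d.month - 1, d.year - 1900)

def timelocal_call(d):
    """Format the tuple as the text of a Perl timelocal() call."""
    return "timelocal(" + ",".join(str(v) for v in timelocal_args(d)) + ")"

# April 2, 2007 is the start date used in the edit to Line 35.
print(timelocal_call(date(2007, 4, 2)))
```

Running it for any other start date gives the text to paste into the script.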

Another change is to correct for the fact that MLB now puts player information in an XML file rather than a TXT file.

Change the players.txt in Lines 95, 96, 101, and 102 to players.xml. That is, change this:
if($gamehtml =~ m/<a href=\"players\.txt\"/ ) {
$plyrurl = "$dayurl/$game/players.txt";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.txt"
or die "could not open file $gamedir/players.txt: $|\n";
to this:
if($gamehtml =~ m/<a href=\"players\.xml\"/ ) {
$plyrurl = "$dayurl/$game/players.xml";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.xml"
or die "could not open file $gamedir/players.xml: $!\n";
(While editing Line 102, it is also worth changing $| to $! as shown, so that a failed open reports the system error message.)

Fortunately, you don’t need to change any code to download the pitch-by-pitch files since the filenames remain unchanged from the 2005 season, even though the data content has changed.
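The logic in that block is simple: look for a link to the file in the game's directory listing, and only then build the URL and fetch it. Here is the same check sketched in Python rather than Perl, with no network access. The function name, the sample listing, and the URL parts are all placeholders I made up for illustration.

```python
import re

def file_url_if_listed(listing_html, day_url, game, filename):
    """Return the URL for filename if the game's directory listing links to it,
    otherwise None. Mirrors the spider's m/<a href="..."/ test."""
    pattern = r'<a href="' + re.escape(filename) + r'"'
    if re.search(pattern, listing_html):
        return f"{day_url}/{game}/{filename}"
    return None

# Invented sample listing and placeholder URL parts, for illustration only.
listing = (
    '<li><a href="boxscore.xml">boxscore.xml</a></li>'
    '<li><a href="players.xml">players.xml</a></li>'
)
print(file_url_if_listed(listing, "http://example.invalid/day_02", "gid_sample", "players.xml"))
print(file_url_if_listed(listing, "http://example.invalid/day_02", "gid_sample", "players.txt"))
```

The second call finds no link, which is the case the original players.txt check runs into once MLB switched to players.xml.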

Before you can run it, though, you need to have Perl installed on your computer. You can download the Perl binaries and package manager for ActivePerl here.

Once you get Perl installed you should be ready to go. I made a separate directory to hold my game data. If you’re running from Windows, just open a command prompt window, cd to your game data directory, and then run the script.

I find that it works best to run the spider in off-peak times. Some people have reported connection problems when they try to run in peak times (during the business day or game time on the East Coast), but I haven’t seen any such problems off-peak. Plus, that’s a more respectful use of MLB’s bandwidth. If bandwidth cost is a factor for MLB, the less cost we impose on them, the more likely they are to continue to make the XML data available for free.

Creating the Database and Installing Software

The next steps are setting up a database and adapting the script to input the data into the database.

Here is a link to my PBP database structure. This is a MySQL database. If you are already familiar with MySQL, then this will give you a head start.

Also, here is a link to the code from my XML-to-database parser script. It’s still a work in progress, so please don’t mind the mess inside, but it works if you want to use it in its current state. I’ve fixed it up a bit from an earlier version I posted, cleaning up some of the subroutine calls. (Note: the copy now posted is yet a newer version that also processes umpires and fielding locations for balls in play.)

Running the XML-to-database parser script requires the Perl DBI and DBD::mysql packages to be installed. If you don’t have them already, open the Perl Package Manager (under the ActivePerl program group if you installed ActivePerl). Under the View menu, select All Packages, and look for DBI and DBD-mysql. If they aren’t listed as installed, click on them to select them, and then go to the Action menu and install them.

Now on to downloading and installing MySQL. Here is the MySQL 5.0 download site. If you are downloading for Windows, download the installer, and accept the default settings. You will need to set up a username and password for MySQL. Remember these since you will need to use them in the XML-to-database parser script.

MySQL has a command-line interface from which you can do everything you need. However, I don’t find it very easy to do things that way, and if you don’t either, you can install an administration interface. There are all sorts of them out there. I think MySQL even offers one on their site. The one I use is PHPMyAdmin, which is a web-based admin utility for MySQL.

Using PHPMyAdmin requires having a webserver (e.g., Apache or Microsoft IIS) running and having PHP installed. I’m going to assume for the moment that you either have an alternate administration interface for MySQL, or already have PHPMyAdmin installed. For me, the collective installation process for MySQL, Apache HTTP Server, PHP, and PHPMyAdmin was one of the biggest challenges of this project. Thus, I believe it deserves a full explanation. However, it’s something I’m just not getting around to fully documenting at this time, and I don’t want that to keep me from moving on to some other interesting aspects of the project since I know some people are already past this point. Although it isn’t the way I did it, XAMPP offers a packaged install of Perl, MySQL, Apache, PHP, and PHPMyAdmin, and I believe some people have had success with that installation path.

Once you have created your database and you’re updating it periodically by parsing the downloaded data, how do you use the data?

Analyzing the Data

The first step is running a query to get a data set from the database for analysis. Here is a link to an example query. Database queries are written in a language called SQL. The query in my example pulls all the fields from several tables for every pitch thrown this season by Jeremy Guthrie.
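To give a feel for the shape of such a query without relying on the linked file, here is a self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the column names, key names, and sample rows are hypothetical placeholders, not my actual database structure or real pitch data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL pitch database
conn.executescript("""
    -- Hypothetical, heavily simplified tables; the real schema has many more fields.
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE TABLE at_bats (ab_id INTEGER PRIMARY KEY, pitcher_id INTEGER);
    CREATE TABLE pitches (pitch_id INTEGER PRIMARY KEY, ab_id INTEGER, start_speed REAL);
    -- Placeholder rows, not real data.
    INSERT INTO players VALUES (1, 'Sample', 'Pitcher'), (2, 'Other', 'Pitcher');
    INSERT INTO at_bats VALUES (10, 1), (11, 2);
    INSERT INTO pitches VALUES (100, 10, 90.0), (101, 10, 80.0), (102, 11, 85.0);
""")

# Pull every field from the joined tables for one pitcher, selected by name.
QUERY = """
    SELECT *
    FROM pitches
    JOIN at_bats ON pitches.ab_id = at_bats.ab_id
    JOIN players ON at_bats.pitcher_id = players.player_id
    WHERE players.first = ? AND players.last = ?
    ORDER BY pitches.pitch_id
"""
rows = conn.execute(QUERY, ("Sample", "Pitcher")).fetchall()
print(len(rows), "pitches found")
```

The join-then-filter pattern is the point here; the table and field names in a real query need to match whatever database structure you created.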

You can export the data from a query to Microsoft Excel. PHPMyAdmin will save the query output to a text file with fields delimited by whatever character you choose. You can then import this file into Microsoft Excel for sorting and graphing.

If, however, you find the graphing capabilities of Excel a little limiting, as I do, you may want to explore the statistical package R. R is open-source software that has much more advanced graphing capabilities than Excel. You can also run your SQL queries directly from R and easily manipulate and partition your data sets. I’m still learning quite a bit about graphing in R as I go. The pitch speed graphs for my Brandon Webb article were made in R. They are nothing special in appearance, but they were much easier to make in R than they would have been to make in Excel from the same data set. Based on appearances, Joe P. Sheehan and a few of the other PITCHf/x researchers are also using R for their work. Joseph Adler has a chapter on installing and using R in Baseball Hacks.

Here is an example analysis using R to make a couple of plots from games by Brandon Webb: one like the pitch speed graphs in my post about Webb, and another graphing pitch speed versus horizontal break.

I’m just in the beginning stages myself here, so hopefully I can add more about analysis as time goes on, or you can take a look at what other people have been doing. Dr. Nathan’s analysis of a Jon Lester start shows a lot of promise in terms of classifying pitches by speed and spin direction.

What else?

I want to add umpire data and hit location data to the pitch database. One of these days I will get around to that. Harry Pavlidis at Cubs F/X has added umpire data to the database, so take a look at his work if you are interested in that. He has a copy of his script posted for you to use. Thanks, Harry!

Edit: the new version of the XML-to-database parser script adds umpire data and hit location data to the database. I have updated the PBP database structure to add the hit location fields to the at_bats table.

Note: Some of you have noted connection problems with the spider script. Kris Gardham informs me, “For those people getting connection problems, I think it’s more of a DNS thing than a MLB thing. Hard coding the IP of gd2 or into the script really seemed to speed things up.”

Edit: There are new fields available in the 2008 data. The sv_id field is a date-time stamp of when the pitch was thrown, the pitch_type is the MLBAM algorithm’s best guess at the pitch type, and type_confidence is the confidence value associated with that guess. Here is my new database structure for 2008 with these fields added to the pitch table. You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch. You will also need to change the spider script to download the game.xml file if you want to take advantage of the additional game info that I am parsing from that file.

The spider needs to be changed to download the game.xml file. Since the game.xml file is in the same directory as the boxscore.xml file, you can duplicate that section of the code and change “boxscore” to “game”, like so:

if($gamehtml =~ m/<a href=\"game\.xml\"/ ) {
$gameurl = "$dayurl/$game/game.xml";
$response = $browser->get($gameurl);
die "Couldn't get $gameurl: ", $response->status_line, "\n"
unless $response->is_success;
$infohtml = $response->content;
open GAME, ">$gamedir/game.xml"
or die "could not open file $gamedir/game.xml: $!\n";
print GAME $infohtml;
close GAME;
} else {
print "warning: no xml game file for $game\n";
}

UPDATE: See this tutorial for installing a PITCHf/x database on a Mac, based on my tutorial. The comments are worth reading even if you are using a PC. In the comments I included links to the latest versions of my code.

Spider script: (updated in May 2011 – with new code to address timeout issues, courtesy of Matthew Bultitude)

Database structure: (updated in May 2011)

New database structure with fielder info: (updated in September 2011)

Database parser script:

New database parser script that will handle fielder info:  (updated in September 2011)

Add-on script for updating balls and strikes:

Add-on script for updating fielder information by inning: (added in September 2011)


As a result of Brandon Webb’s scoreless streak, I’ve been wanting to take a look at him. Of course, it figures that by the time I actually get around to doing so, the streak is over, as of tonight. But I’ll start out with what is intended to be the first in a series on Webb.

Tonight I want to evaluate a statement by Arizona catcher Chris Snyder, as quoted by the Arizona Republic after Webb’s previous start on August 17th.

From behind the plate for all 42 innings, catcher Chris Snyder has had as good a perspective as anyone on Brandon Webb’s scoreless streak, and he believes one of the biggest keys to it has been Webb’s willingness to reach philosophical middle ground.

Instead of throwing his sinking, two-seam fastball exclusively or going to a heavy diet of off-speed stuff, Webb has found a perfect mix, Snyder said.

“It was either he was going to be conserving his pitches, throwing a lot of two-seamers and waiting until guys hit the ball on the ground,” Snyder said, “or he was going to have eight to 10 strikeouts a game, but his pitch count was going to be up.

“Over this span, he has met in the middle with that. The guys he’s getting out, he’s getting them out within three to four pitches. He’s doing a little of both.”

Has Webb really been mixing speeds more during his scoreless streak than he did earlier in the year? We don’t have many PITCHf/x starts for Webb from early in the year; just two, in fact, and then one additional start in July prior to embarking on the streak.

So let’s look at the pitch speeds in those. It’s pretty easy to separate the fastballs from the off-speed stuff in this format. The pitch speeds are on the vertical axis, and the horizontal axis shows the pitch sequence throughout each game.

Webb, April 18, 2007

Webb, April 30, 2007

Webb, July 15, 2007

The July 15 start looks like a perfect example of mixing in the off-speed pitches. April 18 pretty much also, but on the April 30 start he seemed to rely more on the fastball. The fastball speeds are all sitting around 90, although he seemed to tire a bit toward the end of the July 15 start. (Add a couple mph to the July 15 graph, since the speeds were measured at 40 feet rather than the normal 50 or 55 feet.)

As far as results go, on April 18 Webb gave up 7 hits and 1 run in 8 innings against the Padres, with 13 strikeouts. On April 30, he gave up 4 hits and 1 run in 7 innings against the Dodgers, with only 2 strikeouts. On July 15, he surrendered 11 hits and 4 runs in 5 and 2/3 innings against the Padres, with 10 strikeouts. And sure enough, most of the damage (6 hits and 3 runs) came in the 5th and 6th innings when Webb appears to have been tiring, based on his velocity.

Now we come to the streak.

Webb, July 25, 2007

Webb, July 31, 2007

Webb, August 5, 2007

Webb, August 11, 2007

Webb, August 17, 2007


Only in the July 31 start was Webb mixing pitches extensively. He relied pretty heavily on the fastball in the early going on August 5, and also threw more fastballs on July 25 and August 11. The start after which we have Snyder’s comments, August 17, is a pretty well-mixed start, I suppose. Again, velocity-wise, he’s sitting close to 90 in most starts.

All in all, it doesn’t seem to be much different than earlier in the year. Snyder did catch all these games that we show here. Of course, we have a limited selection from earlier in the year, so maybe we just happened to catch some games from that time period where Webb was mixing pitches while he was relying more on his fastball in games for which we don’t have data.

Snyder did make one other comment in the Republic article which is worth some further investigation, saying that Webb’s changeup had more break recently. Tune in next time…



The New York Yankees recently promoted 21-year-old Joba Chamberlain to the big league club. Chamberlain, a supplemental first-round pick in 2006, started the year with the Tampa club in the A-ball Florida State League. He dominated three levels of the minors, striking out 135 batters in 88 1/3 innings before earning a promotion to the majors on August 7th.

Chamberlain has made three appearances for New York, throwing five scoreless innings, allowing one hit and two walks, and striking out eight. He was a starting pitcher in the minors but is currently serving as the setup man for closer Mariano Rivera in the Yankees’ bullpen.

So far, he has only made one appearance in a stadium equipped with the PITCHf/x camera tracking system, this on August 10th in Cleveland. In a two-inning stint, he mowed down the Indians–six up, six down.

But what pitches did he throw? Mostly fastballs and sliders, showing the slider more to lefties than to righties. His fastball ran 95-99 mph, and his slider sat at 86-88 mph with a 3-inch break in toward a left-handed batter. He also tossed a couple of what appear to me to be change-ups, at 82-84 mph, with an 8-inch break away from a left-handed batter. (The movement on those pitches is quite different than his fastball, but it’s breaking the wrong way to be a curveball.)

The chart below shows the pitch speed versus horizontal break in inches, as seen from the catcher’s perspective.

Joba Chamberlain pitch speed versus break

If we had more data, we could look at which pitch he was using as his strikeout pitch (the slider in all four cases in Cleveland), what pitch he favored in different counts, or to lefties versus righties, or how he locates different pitches. However, with data for only 21 pitches, there’s not much of significance to be gathered along those avenues at this time.

Note: Here is an updated analysis of Joba Chamberlain.

The updated version of this catalog is now hosted by The Hardball Times.

In the week since I published and updated my catalog, there have been a few new articles published.

Our favorite PITCHf/x author, Joe P. Sheehan, has another good one in a long line of great articles.

  • On August 10, he published “Makin’ a Filter”, an article about an automated method for classifying all pitches into either fastball or off-speed and drawing some conclusions from the data about when each type of pitch is thrown.

Steve West at Go Rangers! has been cranking out the analysis:

Dr. Alan Nathan has published a very interesting paper on his Physics of Baseball site.

  • On August 8, he published “Analysis of PITCHf/x Pitched Baseball Trajectories” (PDF), a paper looking at Jon Lester’s start on August 3rd and classifying his pitches according to speed, spin magnitude, and spin direction. This technique shows important promise for pitch classification independent of release point measurement distance (y0). Plus there’s a lot of cool stuff on equations of motion for any other physicists out there.

We also welcome a new author to the PITCHf/x analysis fold in Chris Constancio.

Anthony has a short article up at Friar Watch.

Finally, ultxmxpx has updated his pitch classification page.

August 17 and August 25 updates:

Dan Fox has a new article up at Baseball Prospectus (subscription required).

Steve West has a new article at Go Rangers!

Dave Cameron has a new article at U.S.S. Mariner.

Joe P. Sheehan has a new article at Baseball Analysts.

Once again we extend a welcome to a new author, this time Harry at his new blog Cubs F/X.

September 1 update:

Dan Fox has a couple new articles at various places.

  • On August 25, the Rocky Mountain SABR chapter published “Jimenez Delivers”, an article about Ubaldo Jimenez’s August 25th start.
  • On August 27, the BP Unfiltered blog published “Changeup Challenged”, an article classifying Billy Wagner’s pitches.

Steve West has a new article at Go Rangers!

  • On August 30, he published “Danks but No Danks”, an article looking at pitcher John Danks and classifying his pitches.

Harry Pavlidis has been cranking out the analysis at Cubs F/X.

Once again, a hearty welcome goes out to another PITCHf/x researcher joining the fold: mb22414 from the Friar Forecast blog.

Cafe Whither has another Chinese-language article.

I’m running the first few queries on my pitch-by-pitch database. I’ve still got a few kinks to work out. I haven’t really spent much time validating the data other than on a couple games I picked for test cases. I want to parse a few more of the XML files and insert them into the database. Still, I’m excited! My database is finally up and running and giving me interesting PITCHf/x data!

To lead off, a question I’ve had, and one I’ve seen from others, is which stadia have the PITCHf/x system installed? There are nine parks that have had the system most of the year:

  • Seattle, 91% of pitches recorded with PITCHf/x
  • San Diego, 91%
  • Chicago White Sox, 91%
  • LA Anaheim, 86%
  • Toronto, 84%
  • LA Dodgers, 82%
  • Atlanta, 82%
  • Oakland, 80%
  • Texas, 77%

Other cities have been brought on line throughout the year, including the following teams:

  • Chicago Cubs, 33% of pitches recorded with PITCHf/x
  • Milwaukee, 22%
  • Boston, 21%
  • San Francisco, 19%
  • Arizona, 18%
  • Minnesota, 14%
  • St. Louis, 12%
  • Detroit, 12%
  • Houston, 10%
  • Cleveland, 9%
  • Cincinnati, 4%

Then there are four parks that have only a partial game’s worth of data:

  • Washington – 102 pitches
  • Colorado – 91 pitches
  • Kansas City – 50 pitches
  • Tampa Bay – 45 pitches

Finally, there are six parks that haven’t recorded any PITCHf/x data:

  • Baltimore
  • Florida
  • New York Mets
  • New York Yankees
  • Philadelphia
  • Pittsburgh

So, which pitchers have thrown the most pitches that have been recorded with the Enhanced Gameday PITCHf/x system this year?

  • Miguel Batista – 1,782
  • Dan Haren – 1,705
  • Kelvim Escobar – 1,627
  • Jarrod Washburn – 1,574
  • Jake Peavy – 1,548
  • John Lackey – 1,448
  • Joe Blanton – 1,428
  • Mark Buehrle – 1,397
  • Javier Vazquez – 1,397
  • Derek Lowe – 1,391
  • Chris Young – 1,356
  • Roy Halladay – 1,349
  • Kevin Millwood – 1,338
  • Tim Hudson – 1,323
  • Jon Garland – 1,294
  • Jered Weaver – 1,279
  • Felix Hernandez – 1,247
  • Brad Penny – 1,244
  • Chad Gaudin – 1,226
  • Shaun Marcum – 1,175
  • David Wells – 1,173
  • Kameron Loe – 1,147
  • John Smoltz – 1,147
  • Robinson Tejeda – 1,133
  • Jeff Weaver – 1,085
  • John Danks – 1,074
  • Greg Maddux – 1,029
  • Chuck James – 1,026
  • Justin Germano – 984
  • Mark Hendrickson – 979

That’s the top thirty. But what about notable relief pitchers that have an extensive history captured by the system this year?

  • Brandon Morrow – 606
  • J.J. Putz – 589
  • Heath Bell – 587
  • Scot Shields – 572
  • Joaquin Benoit – 547
  • C.J. Wilson – 530
  • Jonathan Broxton – 529
  • Rudy Seanez – 522
  • Oscar Villarreal – 502
  • Dustin Moseley – 499

And some of the big-name closers?

  • Francisco Rodriguez – 453
  • Takashi Saito – 430
  • Eric Gagne – 388
  • Bobby Jenks – 362
  • Trevor Hoffman – 345
  • Bob Wickman – 294

While I’m working on getting my pitch-by-pitch database up and running, I decided to have a little fun looking at the average game time weather conditions in the 30 major league ballparks, based on the weather information from Gameday.

The windiest park? AT&T/Pacbell/whatever it is these days in San Francisco. The calmest park? Besides the Metrodome (when they’re not turning on the air conditioning to blow opponents’ home run balls back into play), it is Safeco Field in Seattle. The following shows the average game time wind speed for each park. For parks with retractable roofs, the data is only for games when the roof was open at game time (and a wind speed was recorded).

Home Wind (mph) #Games
sfn 17 52
oak 14 58
nyn 12 54
bos 12 58
tor 12 31
tex 11 54
flo 11 57
cha 11 54
mil 11 31
was 10 58
phi 10 52
det 10 50
nya 10 58
hou 10 13
sdn 9 57
cle 9 60
chn 9 57
kca 9 55
sln 9 50
ari 9 29
atl 9 59
col 8 52
pit 8 58
bal 8 52
cin 8 53
ana 8 51
lan 7 58
tba 6 3
sea 4 48
min 0 0

The hottest park? Arizona’s Chase Field. The coldest park? San Francisco. Having experienced July in San Francisco and agreeing with Mark Twain’s sentiments, I can certainly believe that the new park is both the windiest and coldest on average. All thirty parks are listed below in order of average game time temperature.

Home Temp (°F)
ari 84
flo 81
tex 79
bal 78
atl 78
sln 77
was 75
cin 74
col 74
hou 74
kca 73
phi 73
tba 72
ana 72
mil 72
tor 71
det 71
pit 71
nya 71
nyn 71
lan 70
min 70
chn 69
cle 68
sdn 68
bos 67
cha 66
sea 64
oak 63
sfn 63

At Major League Baseball’s Gameday data website, the PITCHf/x data is included in the inning and pbp/pitchers XML files. What follows is an explanation of the attributes of the pitch element within these XML files.

At stadiums without the PITCHf/x system installed, the pitch element includes only five attributes:

  • des: a brief text description of the result of the pitch: Ball; Ball In Dirt; Called Strike; Foul; Foul (Runner Going); Foul Tip; Hit by Pitch; In play, no out; In play, out(s); In play, run(s); Intent Ball; Pitchout; Swinging Strike; Swinging Strike (Blocked).
  • id: a unique identification number per pitch within a game. The numbers increment by one for each pitch but are not consecutive between at bats.
  • type: a one-letter abbreviation for the result of the pitch: B, ball; S, strike (including fouls); X, in play.
  • x, y: the horizontal and vertical location of the pitch as it crossed home plate as input by the Gameday stringer using the old Gameday coordinate system. I’m not sure what units are used or where the origin is located. Note that the y dimension in the old coordinate system is now called the z dimension in the new PITCHf/x coordinate system detailed below.
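For anyone who wants to read these five attributes programmatically, here is a minimal Python sketch using the standard library's XML parser. The sample XML is invented: the enclosing atbat element and every attribute value are placeholders chosen for illustration, not real Gameday data, and read_pitches is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample; attribute values are placeholders, not real Gameday data.
SAMPLE = """
<atbat>
  <pitch des="Ball" id="3" type="B" x="100.00" y="150.00"/>
  <pitch des="Called Strike" id="4" type="S" x="110.00" y="140.00"/>
  <pitch des="In play, out(s)" id="5" type="X" x="105.00" y="145.00"/>
</atbat>
"""

# The one-letter result codes described in the list above.
RESULT_TYPES = {"B": "ball", "S": "strike (including fouls)", "X": "in play"}

def read_pitches(xml_text):
    """Return one dict per pitch element with the five basic attributes."""
    root = ET.fromstring(xml_text)
    pitches = []
    for p in root.iter("pitch"):
        pitches.append({
            "des": p.get("des"),
            "id": int(p.get("id")),
            "type": p.get("type"),
            "x": float(p.get("x")),
            "y": float(p.get("y")),
        })
    return pitches

for pitch in read_pitches(SAMPLE):
    print(pitch["id"], RESULT_TYPES[pitch["type"]], "-", pitch["des"])
```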

Stadiums with the PITCHf/x camera system have an additional twenty attributes recorded in the pitch element:

  • start_speed: the pitch speed, in miles per hour and in three dimensions, measured at the initial point, y0. Of the two speeds, this one is closer to the speed measured by a radar gun and what we are familiar with for a pitcher’s “velocity”.
  • end_speed: the pitch speed measured as it crossed the front of home plate.
  • sz_top: the distance in feet from the ground to the top of the current batter’s rulebook strike zone as measured from the video by the PITCHf/x operator. The operator sets a line at the batter’s belt as he settles into the hitting position, and the PITCHf/x software adds four inches up for the top of the zone.
  • sz_bot: the distance in feet from the ground to the bottom of the current batter’s rulebook strike zone. The PITCHf/x operator sets a line at the hollow of the knee for the bottom of the zone.
  • pfx_x: the horizontal movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
  • pfx_z: the vertical movement, in inches, of the pitch between the release point and home plate, as compared to a theoretical pitch thrown at the same speed with no spin-induced movement. This parameter is measured at y=40 feet regardless of the y0 value.
  • px: the left/right distance, in feet, of the pitch from the middle of the plate as it crossed home plate. The PITCHf/x coordinate system is oriented to the catcher’s/umpire’s perspective, with distances to the right being positive and to the left being negative.
  • pz: the height of the pitch in feet as it crossed the front of home plate.
  • x0: the left/right distance, in feet, of the pitch, measured at the initial point.
  • y0: the distance in feet from home plate where the PITCHf/x system is set to measure the initial parameters. This parameter was variously set at 40, 50, or 55 feet (and in a few instances 45 feet) from the plate at different times throughout the 2007 season as Sportvision experimented with optimal settings for the PITCHf/x measurements. Sportvision settled on 50 feet in the second half of 2007, and this value of y0=50 feet has been used since. Changes in this parameter impact the values of all other parameters measured at the release point, such as start_speed.
  • z0: the height, in feet, of the pitch, measured at the initial point.
  • vx0, vy0, vz0: the velocity of the pitch, in feet per second, in three dimensions, measured at the initial point.
  • ax, ay, az: the acceleration of the pitch, in feet per second per second, in three dimensions, measured at the initial point.
  • break_y: the distance in feet from home plate to the point in the pitch trajectory where the pitch achieved its greatest deviation from the straight line path between the release point and the front of home plate.
  • break_angle: the angle, in degrees, from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective.
  • break_length: the measurement of the greatest distance, in inches, between the trajectory of the pitch at any point between the release point and the front of home plate, and the straight line path from the release point and the front of home plate, per the MLB Gameday team. John Walsh’s article “In Search of the Sinker” has a good illustration of this parameter.
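Reading the definitions above together, start_speed should be the magnitude of the (vx0, vy0, vz0) velocity vector, converted from feet per second to miles per hour. That is my inference from the glossary, not something confirmed by the Gameday team. A minimal Python sketch of the conversion, with made-up attribute values and hypothetical helper names:

```python
import math

FPS_TO_MPH = 3600.0 / 5280.0  # feet per second to miles per hour

def speed_mph(vx0, vy0, vz0):
    """Magnitude of the initial velocity vector, converted to mph."""
    return math.sqrt(vx0 ** 2 + vy0 ** 2 + vz0 ** 2) * FPS_TO_MPH

def as_floats(attrs, names):
    """Convert the named XML attribute strings to floats."""
    return {name: float(attrs[name]) for name in names}

# Placeholder attribute strings, as they might be read from a pitch element.
pitch = {"vx0": "0.0", "vy0": "-132.0", "vz0": "0.0"}
velocity = as_floats(pitch, ("vx0", "vy0", "vz0"))
print(round(speed_mph(**velocity), 1), "mph")
```

Comparing this computed value against the reported start_speed for a few pitches is an easy sanity check on a freshly loaded database.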

Three new fields were added to the pitch element for 2008:

  • sv_id: a date/time stamp of when the PITCHf/x tracking system first detected the pitch in the air, in the format YYMMDD_hhmmss.
  • pitch_type: the most probable pitch type according to a neural net classification algorithm developed by Ross Paul of MLBAM.
  • type_confidence: the value of the weight at the classification algorithm’s output node corresponding to the most probable pitch type. This value is multiplied by a factor of 1.5 if the pitch is known by MLBAM to be part of the pitcher’s repertoire.
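The sv_id format is easy to turn into a proper date/time value. Here is a short Python sketch; the sample stamp is invented, parse_sv_id is a hypothetical helper name, and the time zone of the stamp is not specified above, so the result is left as a naive datetime.

```python
from datetime import datetime

def parse_sv_id(sv_id):
    """Parse an sv_id string in the YYMMDD_hhmmss format into a datetime."""
    return datetime.strptime(sv_id, "%y%m%d_%H%M%S")

# Invented example stamp, not taken from a real game file.
stamp = parse_sv_id("080401_193015")
print(stamp.isoformat())
```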

Resources for this glossary included the following:

Note: Shortly after I wrote this, I found that Dr. Alan Nathan has published a better glossary than mine at his excellent Physics of Baseball site.

His freshman physics lectures on the Physics of Baseball at the University of Illinois are also an excellent primer for understanding the calculations surrounding baseball trajectories.

If you need to convert the player IDs into names or Lahman database player IDs, you can consult my list of player IDs.

Another common question is about the meaning of the BRK and PFX numbers reported in the Gameday application. Here’s what I wrote about the subject on another website:

There are three main forces acting on a spinning baseball: gravity, drag, and the spin force (also called the Magnus or lift force).

The drag force mainly acts to slow a pitch down; it doesn’t have much effect on the movement/break of a pitch, except for very, very slowly spinning pitches (i.e., knuckleballs).

The force of gravity is the same on all pitches, but it has a greater effect on the movement of slow pitches because it has longer to act on them before they reach the plate. Curveballs and changeups drop more due to gravity than fastballs do because they are slower pitches.

Finally, the spin force acts differently on fastballs and curveballs, as the Gameday folks described. Because a fastball is thrown with backspin, the spin force pushes the ball up, counteracting to some extent the force of gravity that is pulling the ball down. This makes the fastball trajectory straighter. Because a curveball is thrown with topspin, the spin force pushes the ball down, reinforcing gravity, which is also pulling it down. This makes the curveball drop even more.

Thus, the curveball trajectory has a big bend and the fastball trajectory is relatively straight. The amount of bend in the trajectory is what is being measured by the BRK parameter on Gameday.

The amount of deflection by the spin force is what is being measured by the PFX parameter on Gameday. This PFX deflection is mostly upward for a fastball, meaning that it counteracts roughly 10 or so inches of the drop due to gravity, and the PFX deflection is mostly downward for a curveball, meaning that it adds an additional 6 or so inches of drop in addition to that from gravity.
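To put rough numbers on the gravity paragraph above, here is a back-of-the-envelope Python sketch. It assumes a pitch traveling at constant speed over 55 feet and ignores drag and the spin force entirely, so the results are only illustrative. The function name, the 55-foot distance, and the two speeds are my assumptions.

```python
G_FT_PER_S2 = 32.174  # standard gravity, in feet per second squared

def gravity_drop_inches(speed_mph, distance_ft=55.0):
    """Drop due to gravity alone for a pitch covering distance_ft at a
    constant speed_mph (no drag, no spin force)."""
    speed_fps = speed_mph * 5280.0 / 3600.0
    flight_time = distance_ft / speed_fps
    return 0.5 * G_FT_PER_S2 * flight_time ** 2 * 12.0

for mph in (90, 75):
    print(mph, "mph:", round(gravity_drop_inches(mph), 1), "inches of drop from gravity")
```

The slower pitch is in the air longer and so drops noticeably more from gravity alone, before any spin deflection of the kind PFX measures is added or subtracted.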

Additionally, here is a diagram, adapted from John Walsh, that illustrates the break parameters:

The updated version of this catalog is now hosted by The Hardball Times.

In the 2006 postseason, Major League Baseball introduced a feature they called Enhanced Gameday, based on the PITCHf/x camera system from Sportvision. Based in Mountain View, California, Sportvision is also the creator of the yellow first-down line that appears on football telecasts. The PITCHf/x cameras capture the speed and location of the pitched baseball throughout its flight to home plate. Dr. Alan Nathan has a good description of how the system works on his Physics of Baseball site. MLB has, for now, made this data accessible in the XML files freely available for download from its Gameday website.

This summer, a number of baseball analysts have begun using this new data to tackle questions heretofore out of the realm of the common researcher. Others have made brief summaries of the published research, but I haven’t seen a comprehensive catalog. What follows is an attempt at such a catalog. If I have overlooked anyone or left out any articles, please bring it to my attention. Order is not necessarily a reflection of importance, but I have tried to list the earliest and/or most groundbreaking work first.

(If you want a quick introduction to the field without having to read through all the articles below, I recommend John Walsh’s “In Search of the Sinker” and Joe P. Sheehan’s “More Fun with Enhanced Gameday”.)

First and foremost among these analysts is Joe P. Sheehan, who has published nine articles on the topic at Baseball Analysts, one at The Hardball Times, and one at his Juice of Jesus blog.

  • On February 28, he published “Fingerprinting Jeff Weaver”, an article about pitcher Jeff Weaver, classifying his pitches based on his performance in the 2006 playoffs with St. Louis.
  • On March 29, he published “Enhanced Gameday”, an article about various pitchers from the 2006 playoffs, looking at pitch location, release points, and classifying pitches for Mike Mussina.
  • On April 18, THT published “Another Look at Enhanced Gameday”, an article about pitch selection, velocity, and classifying pitches for Felix Hernandez and Kevin Millwood.
  • On April 19, he published “More Fun with Enhanced Gameday”, an article about consistency of pitch movement, location, and release points, and classifying pitches for John Lackey.
  • On April 26, he published “That Sinking Feeling”, an article about the two-seam fastball, examining pitchers Derek Lowe, Aaron Cook, Brandon Webb, and Carlos Silva.
  • On May 11, he published “Location, Location, Location”, tracking the location, BABIP, and swing percentage of pitches in different grid locations around the strike zone.
  • On May 25, he published “Dangerous Curves”, on the break of the curveball and pitchers Barry Zito and Rich Hill.
  • On June 14, he published “Ch-ch-ch-ch-changes…”, on the effectiveness of the changeup and pitchers Cole Hamels, Josh Beckett, Trevor Hoffman, and Johan Santana.
  • On June 28, he published “Is There Something in the Way It Moves?”, on pitcher Roy Halladay and the consistency of his stuff from start to start.
  • On July 13, he published “Under Pressure”, on pitchers Jake Peavy and Dan Haren and their pitch selection in high-pressure versus low-pressure situations.
  • On July 26, he published a collection of PITCHf/x-related notes under the title “Not an Article about Pitching at Altitude”. He discusses preliminary data about the effect of altitude on the break of pitches, and updates his BABIP charts from “Location, Location, Location”.

Another major researcher in the field is Dan Fox, author of the Schrodinger’s Bat column at Baseball Prospectus and the Dan Agonistes blog. Most of his work is available to BP subscribers only. His June-July series on Hernandez, Wakefield, and Matsuzaka is a great primer on classifying pitches.

A third important contributor is John Walsh, a nuclear physics professor at the Istituto Nazionale di Fisica Nucleare in Pisa, Italy. He has published several articles at The Hardball Times:

  • On June 6, THT published “In Search of the Sinker”, one of the definitive articles on classifying pitches. In this process John examines the sinking fastball and pitcher Randy Wolf.
  • On June 26, THT published “Schilling’s Aching Shoulder”, an article about detecting Curt Schilling’s shoulder injury in the Enhanced Gameday data.
  • On July 11, THT published “Strike Zone: Fact vs. Fiction”, an article about the strike zone, as called by the umpires, versus right-handed and left-handed batters.
  • On July 25, THT published John’s follow-up article “The Eye of the Umpire”, containing some refinements to the previous article. The pair of articles is an excellent beginning to strike zone analysis.

Next we come to John Beamer, author of a couple of articles at The Hardball Times.

Over at the Go Rangers! blog, Steve West is another PITCHf/x researcher.

  • On June 14, he published “Rotation Release Points”, an article about the release point consistency of the starting rotations of Texas, Anaheim, and Oakland.
  • On June 18, he published “Rangers Rotation Pitch Types”, an article classifying the pitch types of Texas pitchers Kevin Millwood, Brandon McCarthy, Kameron Loe, Vicente Padilla, and Robinson Tejeda.
  • On June 25, he published “Rangers Rotation Release Points Redux”, an article examining whether the Texas starters were tipping their pitches by varying release points.
  • On July 12, he published “Pitch Break Angle vs Length”, an article about classifying pitch types using these two new PITCHf/x parameters. This article is a useful complement to the pitch classification work of Fox, Walsh, and Beamer.
  • On July 30, he published “Do Pitchers Affect the Strike Zone”, a look at batted ball results plotted against pitch location in the strike zone.

Next up is Bill Ferris at the Detroit Tigers Weblog with a few articles on Tiger pitchers. Because Bill is so prolific a blogger, I may have missed something of his on the topic.

Louis Chao has one article on The Hardball Times. Hat tip to Whither for pointing me to his first Chinese-language article at Andre’s Baseball Blog.

  • On June 26, he published, in Chinese, “Diggin’ in on the Sinker”, an article looking at sinkerball pitchers. Putting this article through Google Translate to English makes it clear why that tool is still in beta. 😉
  • On July 12, THT published “Another Look at the Sinker”, an article comparing the two-seam and four-seam fastballs and looking at pitcher A.J. Burnett.

At the beansTown blog, Steve Calcagno has one article.

Over at the U.S.S. Mariner, Dave Cameron has one article using Enhanced Gameday.

A Devil Rays fan who posts around the web as ultxmxpx has a website where he posted some Enhanced Gameday pitch analysis.

  • On June 3, he posted “Shields/Kazmir pitch selection” as a topic on the RaysBaseball forum.
  • On July 10, he published this list of pitchers and contact rate on their various pitches, which he had classified.

Anthony has Enhanced Gameday stuff scattered here and there throughout his blog at Friar Watch.

There are also some other Chinese-language articles on this topic. I wish I could understand a bit more than what I can gather from Google translations, but I’ll present them here for those of you who do know Chinese.

Even if you’re an English speaker, you might be able to gather something from the Google translation and the graphs at Cafe Whither. Thanks to Whither for the first link.

And here are a couple of other Chinese-language articles.

I hope to contribute a little of my own work in the near future, but I felt the right place to start this discussion was with a recap of what has been done. Enjoy!

Update: New articles listed in this post.

Full list of articles by author.

Full list of articles by date.

Articles about pitchers listed by pitcher.