Note: links to updated versions of scripts and database structure are found at the end of this post. Also, I highly recommend using XAMPP (or Mac equivalent) to install all this software with one easy installer rather than installing Perl, MySQL, etc., piecemeal.

One of my hopes for this project is to share the pitch data in a way that facilitates the analysis of others. There are many others out there whose knowledge of statistics, data analysis, graphics, physics, and/or baseball far exceeds mine.

My pitch database itself is too large a file to easily share, so I’ll do the next best thing, what I hope may be a better thing after all, to try to document the process I used to create it so that others can create a pitch database for themselves. This post may be an ongoing work as I both recreate what I have already done and lay out what work remains for me to do.

First of all, if you just want to get your feet wet using Microsoft Excel to analyze a single game’s worth of pitch-by-pitch XML data, Dr. Alan Nathan has laid out the steps for you at his Physics of Baseball site.

If you desire a whole season’s worth of pitch data (over half a million pitches) stored in a relational database, with visions of all sorts of wonderful analysis that would enable, follow me!

Downloading the Data

The first place to start is with downloading the XML data from Major League Baseball’s Gameday website. But if we’re going to download thousands of games, each with hundreds of pitches, that’s not something we want to do manually. Fortunately, we can leverage a very useful book by Joseph Adler, Baseball Hacks, published by O’Reilly in January 2006. Some parts of his hacks are outdated or nonfunctional, but other parts I found very useful for this project.

Whether or not you want to buy the book, the Perl scripts he wrote for the book can be downloaded from the examples section of O’Reilly’s website. Download and uncompress the file. It contains all the scripts from the book, divided by chapter and hack number. The first script of interest for XML downloading is, found in Chapter 3. This script can be used after only a few minor modifications, as follows:

To download the 2007 season starting with April 2, change Line 35 from
$start = timelocal(0,0,0,20,6,105);
$start = timelocal(0,0,0,2,3,107);

Similarly, you need to change the end time. Uncomment Line 40 and comment Line 41:
$now = timelocal(0,0,0,$mday - 1,$mon,$year);
#$now = timelocal(0,0,0,3,10,105);

These statements determines the first and last dates between which the XML game files will be downloaded. The statements use the function timelocal(seconds, minutes, hours, days, months, years) where days is the day of the month 1-31, months is the month 0-11 (January=0, December=11), and years are since 1900 (2007 = 107).

Another change is to correct for the fact that MLB now puts player information in an XML file rather than a TXT file.

Change the players.txt in Lines 95, 96, 101, and 102 to players.xml:
if($gamehtml =~ m/<a href=\"players\.txt\"/ ) {
$plyrurl = "$dayurl/$game/players.txt";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.txt"
or die "could not open file $gamedir/players.txt: $|\n";
if($gamehtml =~ m/<a href=\"players\.xml\"/ ) {
$plyrurl = "$dayurl/$game/players.xml";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.xml"
or die "could not open file $gamedir/players.xml: $|\n";

Fortunately, you don’t need to change any code to download the pitch-by-pitch files since the filenames remain unchanged from the 2005 season, even though the data content has changed.

Before you can run it, though, you need to have Perl installed on your computer. You can download the Perl binaries and package manager for ActivePerl here.

Once you get Perl installed you should be ready to go. I made a separate directory to hold my game data. If you’re running from Windows, just open a command prompt window, cd to your game data directory, and then run the script.

I find that it works best to run the spider in off-peak times. Some people have reported connection problems when they try to run in peak times (during the business day or game time on the East Coast), but I haven’t seen any such problems off-peak. Plus, that’s a more respectful use of MLB’s bandwidth. If bandwidth cost is a factor for MLB, the less cost we impose on them, the more likely they are to continue to make the XML data available for free.

Creating the Database and Installing Software

The next steps are setting up a database and adapting the script to input the data into the database.

Here is a link to my PBP database structure. This is a MySQL database. If you are already familiar with MySQL, then this will give you a head start.

Also, here is a link to the code from my XML-to-database parser script. It’s still a work in progress, so please don’t mind the mess inside, but it works if you want to use it in its current state. I’ve fixed it up a bit from an earlier version I posted, cleaning up some of the subroutine calls. (Note: the copy now posted is yet a newer version that also processes umpires and fielding locations for balls in play.)

Running the XML-to-database parser script will require the Perl DBI::mysql package to be installed. If you don’t have that already, open the Perl Package Manager (under the Active Perl program group if you installed Active Perl). Under the View menu, select All Packages, and look for DBI and DBD-mysql. If they aren’t listed as installed, click on them to select them, and then go to the Action menu and Install them.

Now on to downloading and installing MySQL. Here is the MySQL 5.0 download site. If you are downloading for Windows, download the installer, and accept the default settings. You will need to set up a username and password for MySQL. Remember these since you will need to use them in the XML-to-database parser script.

MySQL has a command-line interface from which you can do everything you need. However, I don’t find it very easy to do things that way, and if you don’t either, you can install an administration interface. There are all sorts of them out there. I think MySQL even offers one on their site. The one I use is PHPMyAdmin, which is a web-based admin utility for MySQL.

Using PHPMyAdmin requires having a webserver (e.g., Apache or Microsoft IIS) running and having PHP installed. I’m going to assume for the moment that you either have an alternate administration interface for MySQL, or already have PHPMyAdmin installed. For me, the collective installation process for MySQL, Apache HTTP Server, PHP, and PHPMyAdmin was one of the biggest challenges of this project. Thus, I believe it deserves a full explanation. However, it’s something I’m just not getting around to fully documenting at this time, and I don’t want that to keep me from moving on to some other interesting aspects of the project since I know some people are already past this point. Although it isn’t the way I did it, XAMPP offers a packaged install of Perl, MySQL, Apache, PHP, and PHPMyAdmin, and I believe some people have had success with that installation path.

Once you have created your database and you’re updating it periodically by parsing the downloaded data, how do you use the data?

Analyzing the Data

The first step is running a query to get a data set from the database for analysis. Here is a link to an example query. Database queries are written in a language called SQL. The query in my example pulls all the fields from several tables for every pitch thrown this season by Jeremy Guthrie.

You can export the data from a query to Microsoft Excel. PHPMyAdmin will save the query output to a text file with fields delimited by whatever character you choose. You can then import this file into Microsoft Excel for sorting and graphing.

If, however, you find the graphing capabilities of Excel a little limiting, as I do, you may want to explore the statistical package R. R is an open-source software that has much more advanced graphing capabilities than Excel. You can also run your SQL queries directly from R and easily manipulate and partition your data sets. I’m still learning quite a bit about graphing in R as I go. The pitch speed graphs for my Brandon Webb article were made in R. They are nothing special in appearance, but they were much easier to make in R than they would have been to make in Excel from the same data set. Based on appearances, Joe P. Sheehan and a few of the other PITCHf/x researchers are also using R for their work. Joseph Adler has a chapter on installing and using R in Baseball Hacks.

Here is an example analysis using R to make a couple plots from games by Brandon Webb, one like the pitch speed graphs like I did in my post about Webb and another graphing the pitch speed versus horizontal break.

I’m just in the beginning stages myself here, so hopefully I can add more about analysis as time goes on, or you can take a look at what other people have been doing. Dr. Nathan’s analysis of a Jon Lester start shows a lot of promise in terms of classifying pitches by speed and spin direction.

What else?

I want to add umpire data and hit location data to the pitch database. One of these days I will get around to that. Harry Pavlidis at Cubs F/X has added umpire data to the database, so take a look at his work if you are interested in that. He has a copy of his script posted for you to use. Thanks, Harry!

Edit: the new version of the XML-to-database parser script adds umpire data and hit location data to the database. I have updated the PBP database structure to add the hit location fields to the at_bats table.

Note: Some of you have noted connection problems with the spider script. Kris Gardham informs me, “For those people getting connection problems, I think it’s more of a DNS thing than a MLB thing. Hard coding the IP of gd2 or into the script really seemed to speed things up.”

Edit: There are new fields available in the 2008 data. The sv_id field is a date-time stamp of when the pitch was thrown, the pitch_type is the MLBAM algorithm’s best guess at the pitch type, and type_confidence is the confidence value associated with that guess. Here is my new database structure for 2008 with these fields added to the pitch table. You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch. You will also need to change the spider script to download the game.xml file if you want to take advantage of the additional game info that I am parsing from that file.

The spider needs to be changed to download the game.xml file. Since the game.xml file is in the same directory as the boxscore.xml file, you can duplicate that section of the code and change “boxscore” to “game”, like so:

if($gamehtml =~ m/<a>get($gameurl);
die “Couldn’t get $gameurl: “, $response->status_line, “\n”
unless $response->is_success;
$infohtml = $response->content;
open GAME, “>$gamedir/game.xml”
or die “could not open file $gamedir/game.xml: $|\n”;
print GAME $infohtml;
close GAME;
} else {
print “warning: no xml game file for $game\n”;

UPDATE:  See this tutorial for installing a PITCHf/x database on a Mac, based on my tutorial:  The comments are worth reading even if you are using a PC.  In the comments I included links to the latest versions of my code.

Spider script: (updated in May 2011 – with new code to address timeout issues, courtesy of Matthew Bultitude)

Database structure: (updated in May 2011)

New database structure with fielder info: (updated in September 2011)

Database parser script:

New database parser script that will handle fielder info:  (updated in September 2011)

Add-on script for updating balls and strikes:

Add-on script for updating fielder information by inning: (added in September 2011)