pitch database


I have finally gotten around to publishing the 2008 updates to my pitch database parsing scripts.

There are new fields available in the 2008 data. The sv_id field is a date-time stamp of when the pitch was thrown, the pitch_type is the MLBAM algorithm’s best guess at the pitch type, and type_confidence is the confidence value associated with that guess. Starting in mid-May, there are also b_height and p_throws fields in the pitch element. I don’t currently use those fields. I get the pitcher throwing hand from the players’ information, and I don’t record the batter height at this time.
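The time stamp lends itself to measuring how long a pitcher takes between pitches. Here is a minimal Perl sketch of that idea. It assumes that sv_id values look like YYMMDD_HHMMSS (for example, 080605_190512), which is an assumption worth checking against your own data. The subroutine names sv_id_to_epoch and pitch_gaps are made up for this illustration and are not part of my parser script.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert an sv_id string (ASSUMED format: YYMMDD_HHMMSS) to epoch seconds.
# timegm is used so the gaps are not distorted by local daylight-saving shifts.
sub sv_id_to_epoch {
    my ($sv_id) = @_;
    my ($yy, $mon, $day, $hh, $min, $ss) =
        $sv_id =~ /^(\d{2})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})$/
        or return;    # missing or malformed stamp
    # timegm croaks on impossible dates; eval turns that into undef.
    return eval { timegm($ss, $min, $hh, $day, $mon - 1, 2000 + $yy) };
}

# Seconds between consecutive pitches, given sv_id values in the order thrown.
# Unparseable stamps are dropped, so a gap may span a skipped pitch.
sub pitch_gaps {
    my @times = grep { defined } map { scalar sv_id_to_epoch($_) } @_;
    return map { $times[$_] - $times[$_ - 1] } 1 .. $#times;
}

my @gaps = pitch_gaps('080605_190512', '080605_190533', '080605_190601');
print "@gaps\n";    # prints "21 28"
```

A real pace calculation would probably also want to leave out gaps that cross at-bats, innings, or pitching changes; this sketch does none of that filtering.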

Here is my new database structure for 2008 with these fields added to the pitch table. You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch.

I used the time stamp data to look at how quickly pitchers work, and I wrote an article on this topic at The Hardball Times. Several people have asked or been curious about the pitch time data for all the pitchers on their team. Here are the data that I compiled as of June 5.

Angels
20.0 Ervin Santana
20.3 Joe Saunders
21.0 John Lackey
22.2 Jered Weaver
22.7 Dustin Moseley
23.4 Jon Garland
20.5 Chris Bootcheck

20.7 Darren O’Day
20.7 Scot Shields
20.8 Justin Speier
21.8 Jose Arredondo
23.2 Darren Oliver
24.1 Francisco Rodriguez

Astros
19.3 Roy Oswalt
19.6 Wandy Rodriguez
21.0 Jack Cassel
21.3 Shawn Chacon
22.2 Chris Sampson
24.4 Brian Moehler
24.7 Brandon Backe

20.9 Oscar Villarreal
21.1 Dave Borkowski
22.0 Doug Brocail
22.7 Tim Byrdak
23.2 Wesley Wright
24.4 Geoff Geary
28.0 Jose Valverde

Athletics
17.6 Joe Blanton
18.9 Rich Harden
20.0 Justin Duchscherer
20.6 Chad Gaudin
20.9 Gregory Smith
21.4 Dana Eveland

18.7 Dallas Braden
21.4 Keith Foulke
22.4 Joey Devine
22.5 Santiago Casilla
22.8 Huston Street
23.4 Andrew Brown
25.3 Alan Embree

Blue Jays
19.9 Jesse Litsch
22.1 Roy Halladay
22.5 Shaun Marcum
24.6 Dustin McGowan
24.7 A.J. Burnett

19.5 Jesse Carlson
21.3 B.J. Ryan
22.8 Shawn Camp
24.0 Brian Tallet
24.4 Jeremy Accardo
26.0 Scott Downs
26.6 Jason Frasor

Braves
19.3 Chuck James
19.5 Jo-Jo Reyes
20.1 Tom Glavine
22.1 John Smoltz
22.1 Jair Jurrjens
22.5 Tim Hudson

21.1 Jorge Campillo
21.4 Jeff Bennett
22.1 Manny Acosta
22.5 Blaine Boyer
22.8 Will Ohman
23.0 Royce Ring
25.7 Chris Resop

Brewers
18.4 Ben Sheets
20.9 David Bush
22.0 Yovani Gallardo
22.2 Manny Parra
22.4 Carlos Villanueva
23.3 Jeff Suppan

21.0 Mitch Stetter
21.3 Brian Shouse
22.9 David Riske
23.4 Seth McClung
23.6 Salomon Torres
25.9 Eric Gagne
26.4 Guillermo Mota

Cardinals
19.9 Kyle Lohse
20.2 Braden Looper
20.7 Todd Wellemeyer
21.2 Brad Thompson
21.5 Joel Pineiro
22.0 Adam Wainwright

20.0 Kyle McClellan
20.7 Anthony Reyes
21.1 Michael Parisi
22.7 Randy Flores
23.0 Ryan Franklin
25.0 Ron Villone
25.4 Russ Springer
27.4 Jason Isringhausen

Cubs
18.8 Rich Hill
18.9 Sean Gallagher
19.0 Carlos Zambrano
20.1 Ryan Dempster
21.3 Jason Marquis
22.1 Ted Lilly

17.8 Jon Lieber
20.1 Carlos Marmol
20.7 Kerry Wood
21.5 Mike Wuertz
24.9 Kevin Hart
26.5 Bob Howry

Diamondbacks
20.9 Randy Johnson
21.5 Brandon Webb
21.7 Dan Haren
22.2 Micah Owings
22.6 Max Scherzer
23.2 Doug Davis
24.5 Edgar Gonzalez

20.5 Doug Slaten
24.1 Brandon Lyon
24.1 Brandon Medders
25.2 Tony Pena
25.3 Chad Qualls
25.6 Juan Cruz

Dodgers
18.4 Esteban Loaiza
18.6 Derek Lowe
20.0 Clayton Kershaw
21.9 Brad Penny
22.6 Chad Billingsley
23.4 Hiroki Kuroda

19.9 Cory Wade
22.5 Scott Proctor
22.8 Chan Ho Park
24.2 Takashi Saito
25.5 Hong-Chih Kuo
26.0 Jonathan Broxton
26.6 Joe Beimel

Giants
19.7 Matt Cain
20.4 Tim Lincecum
21.4 Barry Zito
21.7 Pat Misch
22.0 Jonathan Sanchez
22.4 Kevin Correia

21.9 Billy Sadler
22.6 Merkin Valdez
23.6 Brian Wilson
23.7 Keiichi Yabu
24.1 Brad Hennessey
25.5 Vinnie Chulk
26.2 Jack Taschner
27.0 Tyler Walker

Indians
20.2 Aaron Laffey
21.4 Jake Westbrook
21.5 Jeremy Sowers
21.7 Paul Byrd
21.8 Cliff Lee
23.3 Fausto Carmona
23.7 C.C. Sabathia

20.7 Jorge Julio
21.6 Jensen Lewis
24.2 Craig Breslow
26.9 Masa Kobayashi
29.1 Rafael Perez
32.0 Rafael Betancourt

Mariners
19.4 Carlos Silva
19.9 Jarrod Washburn
21.3 Felix Hernandez
23.2 Miguel Batista
24.5 Erik Bedard

17.9 R.A. Dickey
20.0 Ryan Rowland-Smith
21.9 Sean Green
22.2 Cha Seung Baek
22.4 Mark Lowe
22.7 Roy Corcoran
24.2 Brandon Morrow
27.9 J.J. Putz

Marlins
19.0 Scott Olsen
20.4 Andrew Miller
21.2 Burke Badenhop
22.4 Mark Hendrickson
23.7 Ricky Nolasco

20.4 Justin Miller
21.2 Doug Waechter
21.4 Kevin Gregg
22.8 Renyel Pinto
24.1 Logan Kensing
24.2 Matt Lindstrom
25.5 Taylor Tankersley

Mets
20.3 John Maine
20.8 Nelson Figueroa
22.4 Mike Pelfrey
22.7 Oliver Perez
22.7 Johan Santana
23.8 Claudio Vargas

21.3 Pedro Feliciano
21.7 Duaner Sanchez
22.8 Scott Schoeneweis
22.8 Joe Smith
22.9 Billy Wagner
23.5 Aaron Heilman
25.0 Jorge Sosa

Nationals
17.7 Jason Bergmann
19.6 Shawn Hill
20.0 John Lannan
20.2 Tim Redding
20.5 Matt Chico
22.9 Odalis Perez

20.5 Joel Hanrahan
21.8 Saul Rivera
21.9 Jon Rauch
22.3 Luis Ayala
23.8 Jesus Colome

Orioles
19.9 Adam Loewen
20.7 Jeremy Guthrie
20.9 Daniel Cabrera
22.0 Garrett Olson
22.4 Brian Burres
24.1 Steve Trachsel

21.3 Randor Bierd
22.0 Lance Cormier
22.1 Matt Albers
22.4 Jim Johnson
22.5 George Sherrill
23.7 Dennis Sarfate
24.0 Jamie Walker
24.6 Chad Bradford

Padres
19.5 Randy Wolf
19.6 Jake Peavy
20.2 Shawn Estes
20.7 Justin Germano
21.0 Greg Maddux
21.3 Chris Young
23.1 Wil Ledezma
28.1 Josh Banks

17.2 Glendon Rusch
19.6 Cla Meredith
20.4 Joe Thatcher
20.9 Mike Adams
23.1 Trevor Hoffman
24.0 Heath Bell
24.8 Bryan Corey

Phillies
19.2 Cole Hamels
19.8 Brett Myers
20.2 Adam Eaton
20.3 Jamie Moyer
21.2 Kyle Kendrick

19.1 Clay Condrey
19.6 Chad Durbin
21.2 Brad Lidge
22.4 Ryan Madson
25.1 Rudy Seanez
25.2 J.C. Romero
25.4 Tom Gordon

Pirates
20.0 Zach Duke
20.9 Phil Dumatrait
20.9 Matt Morris
21.0 Tom Gorzelanny
21.5 Paul Maholm
24.1 Ian Snell

20.2 John Grabow
21.5 Damaso Marte
21.5 Sean Burnett
22.5 Franquelis Osoria
22.8 Matt Capps
22.8 Evan Meek
26.0 Tyler Yates

Rangers
19.5 Sidney Ponson
21.3 Scott Feldman
21.3 Jason Jennings
21.4 Kason Gabbard
23.0 Douglas Mathis
23.4 Kevin Millwood
24.2 Vicente Padilla

19.5 Eddie Guardado
22.8 C.J. Wilson
23.7 Josh Rupe
24.1 Jamey Wright
24.6 Franklyn German
24.9 Frank Francisco
26.2 Joaquin Benoit

Rays
19.7 Andy Sonnanstine
22.5 James Shields
22.5 Edwin Jackson
22.6 Jason Hammel
22.8 Scott Kazmir
24.2 Matt Garza

21.7 Trever Miller
23.4 J.P. Howell
23.6 Gary Glover
25.8 Al Reyes
26.6 Troy Percival
26.9 Scott Dohmann
27.3 Dan Wheeler

Red Sox
18.9 Justin Masterson
19.3 Tim Wakefield
23.2 Bartolo Colon
23.5 Jon Lester
24.3 Daisuke Matsuzaka
26.0 Josh Beckett
26.5 Clay Buchholz

23.7 Craig Hansen
23.8 David Aardsma
25.5 Julian Tavarez
26.3 Mike Timlin
26.3 Manny Delcarmen
26.5 Javier Lopez
27.5 Hideki Okajima
28.4 Jonathan Papelbon

Reds
20.0 Bronson Arroyo
20.0 Matt Belisle
20.7 Johnny Cueto
20.9 Josh Fogg
21.1 Aaron Harang
23.2 Edinson Volquez

19.6 Mike Lincoln
19.7 Kent Mercker
20.6 Francisco Cordero
21.3 Jeremy Affeldt
21.4 Todd Coffey
21.9 Bill Bray
22.2 David Weathers
22.8 Jared Burton

Rockies
20.7 Franklin Morales
21.0 Mark Redman
21.1 Aaron Cook
21.6 Jeff Francis
21.9 Ubaldo Jimenez
23.3 Jorge De La Rosa
23.8 Gregory Reynolds

21.0 Alberto Arias
23.7 Taylor Buchholz
23.8 Brian Fuentes
24.9 Jason Grilli
25.3 Ryan Speier
25.7 Manny Corpas
25.7 Matt Herges
26.3 Kip Wells

Royals
20.0 Brian Bannister
20.3 John Bale
21.4 Zack Greinke
22.0 Brett Tomko
22.7 Gil Meche
23.0 Luke Hochevar
23.4 Kyle Davies

21.0 Joakim Soria
22.5 Ron Mahay
23.4 Yasuhiko Yabuta
24.6 Ramon Ramirez
26.0 Jimmy Gobble
27.1 Leo Nunez
29.0 Joel Peralta

Tigers
20.4 Justin Verlander
20.9 Nate Robertson
21.1 Dontrelle Willis
23.2 Armando Galarraga
24.2 Kenny Rogers
24.7 Jeremy Bonderman

21.0 Todd Jones
22.1 Aquilino Lopez
22.7 Zach Miner
24.5 Freddy Dolsi
25.9 Francisco Cruceta
26.8 Bobby Seay
27.6 Denny Bautista

Twins
20.2 Glen Perkins
21.3 Kevin Slowey
21.4 Nick Blackburn
21.8 Francisco Liriano
22.1 Livan Hernandez
23.2 Boof Bonser
24.3 Scott Baker

22.1 Brian Bass
23.2 Matt Guerrier
23.4 Pat Neshek
24.4 Juan Rincon
24.9 Dennys Reyes
26.5 Jesse Crain
26.8 Joe Nathan

White Sox
17.2 Mark Buehrle
20.5 John Danks
21.6 Gavin Floyd
22.8 Jose Contreras
22.9 Javier Vazquez

20.3 Scott Linebrink
20.7 Matt Thornton
20.8 Nick Masset
21.4 Boone Logan
22.7 Octavio Dotel
23.9 Bobby Jenks

Yankees
21.3 Darrell Rasner
22.1 Andy Pettitte
24.0 Ian Kennedy
25.1 Phil Hughes
25.7 Mike Mussina
26.6 Chien-Ming Wang

22.7 Mariano Rivera
22.8 Kyle Farnsworth
24.4 Jose Veras
25.0 Edwar Ramirez
25.1 Joba Chamberlain
25.5 Jonathan Albaladejo
25.8 Brian Bruney
26.3 Ross Ohlendorf
26.7 LaTroy Hawkins

Note: links to updated versions of scripts and database structure are found at the end of this post. Also, I highly recommend using XAMPP (or Mac equivalent) to install all this software with one easy installer rather than installing Perl, MySQL, etc., piecemeal.

One of my hopes for this project is to share the pitch data in a way that facilitates the analysis of others. There are many others out there whose knowledge of statistics, data analysis, graphics, physics, and/or baseball far exceeds mine.

My pitch database itself is too large a file to share easily, so I’ll do the next best thing, which I hope may turn out to be a better thing after all: document the process I used to create it so that others can create a pitch database for themselves. This post may be an ongoing work as I both recreate what I have already done and lay out what work remains for me to do.

First of all, if you just want to get your feet wet using Microsoft Excel to analyze a single game’s worth of pitch-by-pitch XML data, Dr. Alan Nathan has laid out the steps for you at his Physics of Baseball site.

If you desire a whole season’s worth of pitch data (over half a million pitches) stored in a relational database, with visions of all sorts of wonderful analysis that would enable, follow me!

Downloading the Data

The first place to start is with downloading the XML data from Major League Baseball’s Gameday website. But if we’re going to download thousands of games, each with hundreds of pitches, that’s not something we want to do manually. Fortunately, we can leverage a very useful book by Joseph Adler, Baseball Hacks, published by O’Reilly in January 2006. Some parts of his hacks are outdated or nonfunctional, but other parts I found very useful for this project.

Whether or not you want to buy the book, the Perl scripts he wrote for the book can be downloaded from the examples section of O’Reilly’s website. Download and uncompress the baseball_hacks_code.zip file. It contains all the scripts from the book, divided by chapter and hack number. The first script of interest for XML downloading is hack_28_spider.pl, found in Chapter 3. This script can be used after only a few minor modifications, as follows:

To download the 2007 season starting with April 2, change Line 35 from
$start = timelocal(0,0,0,20,6,105);
to
$start = timelocal(0,0,0,2,3,107);

Similarly, you need to change the end time. Uncomment Line 40 and comment out Line 41:
$now = timelocal(0,0,0,$mday - 1,$mon,$year);
#$now = timelocal(0,0,0,3,10,105);

These statements determine the first and last dates between which the XML game files will be downloaded. They use the function timelocal(seconds, minutes, hours, days, months, years), where days is the day of the month (1-31), months is the month numbered 0-11 (January=0, December=11), and years is the number of years since 1900 (2007 = 107).
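Because the zero-based month and the 1900-based year are easy to get wrong, a small wrapper can do the offsets for you. This is only a sketch: spider_date is a hypothetical helper name, not something in the Baseball Hacks script.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);
use POSIX qw(strftime);

# Build a midnight local-time stamp from an ordinary calendar date,
# applying the month and year offsets that timelocal expects.
sub spider_date {
    my ($year, $month, $day) = @_;    # e.g. (2007, 4, 2) for April 2, 2007
    return timelocal(0, 0, 0, $day, $month - 1, $year - 1900);
}

my $start = spider_date(2007, 4, 2);  # same as timelocal(0,0,0,2,3,107)
print strftime('%Y-%m-%d', localtime $start), "\n";    # prints 2007-04-02
```

Printing the date back out with strftime is a quick way to confirm that the start and end dates are what you intended before letting the spider run.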

Another change is to correct for the fact that MLB now puts player information in an XML file rather than a TXT file.

Change the players.txt in Lines 95, 96, 101, and 102 to players.xml:
from
if($gamehtml =~ m/<a href=\"players\.txt\"/ ) {
$plyrurl = "$dayurl/$game/players.txt";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.txt"
or die "could not open file $gamedir/players.txt: $|\n";
to
if($gamehtml =~ m/<a href=\"players\.xml\"/ ) {
$plyrurl = "$dayurl/$game/players.xml";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.xml"
or die "could not open file $gamedir/players.xml: $!\n";

Fortunately, you don’t need to change any code to download the pitch-by-pitch files since the filenames remain unchanged from the 2005 season, even though the data content has changed.

Before you can run it, though, you need to have Perl installed on your computer. You can download the Perl binaries and package manager for ActivePerl here.

Once you get Perl installed you should be ready to go. I made a separate directory to hold my game data. If you’re running from Windows, just open a command prompt window, cd to your game data directory, and then run the hack_28_spider.pl script.

I find that it works best to run the spider in off-peak times. Some people have reported connection problems when they try to run in peak times (during the business day or game time on the East Coast), but I haven’t seen any such problems off-peak. Plus, that’s a more respectful use of MLB’s bandwidth. If bandwidth cost is a factor for MLB, the less cost we impose on them, the more likely they are to continue to make the XML data available for free.

Creating the Database and Installing Software

The next steps are setting up a database and adapting the hack_28_parser.pl script to input the data into the database.

Here is a link to my PBP database structure. This is a MySQL database. If you are already familiar with MySQL, then this will give you a head start.

Also, here is a link to the code from my XML-to-database parser script. It’s still a work in progress, so please don’t mind the mess inside, but it works if you want to use it in its current state. I’ve fixed it up a bit from an earlier version I posted, cleaning up some of the subroutine calls. (Note: the copy now posted is yet a newer version that also processes umpires and fielding locations for balls in play.)

Running the XML-to-database parser script requires the Perl DBI and DBD::mysql packages to be installed. If you don’t have them already, open the Perl Package Manager (under the ActivePerl program group if you installed ActivePerl). Under the View menu, select All Packages, and look for DBI and DBD-mysql. If they aren’t listed as installed, click on them to select them, and then go to the Action menu and install them.
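If you would rather check from the command line than from the Package Manager, a few lines of Perl can report whether the two database modules load. This is a rough sketch; module_installed is a hypothetical helper name chosen for the example.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
sub module_installed {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # DBD::mysql -> DBD/mysql.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on the two modules the parser script depends on.
for my $module (qw(DBI DBD::mysql)) {
    printf "%-12s %s\n", $module,
        module_installed($module) ? 'installed' : 'MISSING';
}
```

If either line says MISSING, install that package before running the parser script.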

Now on to downloading and installing MySQL. Here is the MySQL 5.0 download site. If you are downloading for Windows, download the installer, and accept the default settings. You will need to set up a username and password for MySQL. Remember these since you will need to use them in the XML-to-database parser script.

MySQL has a command-line interface from which you can do everything you need. However, I don’t find it very easy to do things that way, and if you don’t either, you can install an administration interface. There are all sorts of them out there. I think MySQL even offers one on their site. The one I use is PHPMyAdmin, which is a web-based admin utility for MySQL.

Using PHPMyAdmin requires having a webserver (e.g., Apache or Microsoft IIS) running and having PHP installed. I’m going to assume for the moment that you either have an alternate administration interface for MySQL, or already have PHPMyAdmin installed. For me, the collective installation process for MySQL, Apache HTTP Server, PHP, and PHPMyAdmin was one of the biggest challenges of this project. Thus, I believe it deserves a full explanation. However, it’s something I’m just not getting around to fully documenting at this time, and I don’t want that to keep me from moving on to some other interesting aspects of the project since I know some people are already past this point. Although it isn’t the way I did it, XAMPP offers a packaged install of Perl, MySQL, Apache, PHP, and PHPMyAdmin, and I believe some people have had success with that installation path.

Once you have created your database and you’re updating it periodically by parsing the downloaded data, how do you use the data?

Analyzing the Data

The first step is running a query to get a data set from the database for analysis. Here is a link to an example query. Database queries are written in a language called SQL. The query in my example pulls all the fields from several tables for every pitch thrown this season by Jeremy Guthrie.
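To give a feel for the general shape of such a query, here is a sketch that builds one in Perl with placeholders for the pitcher's name. Every table and column name in it (pitches, atbats, games, players, ab_id, game_id, pitcher, player_id, first_name, last_name, game_date, pitch_id) is a hypothetical stand-in, so substitute the names from the actual database structure. It also applies no season filter, on the assumption that the database holds a single season. The subroutine name pitcher_query is likewise made up for the example.

```perl
use strict;
use warnings;

# Build a parameterized query for every pitch thrown by one pitcher.
# ALL table and column names below are hypothetical placeholders;
# replace them with the names from your own database structure.
sub pitcher_query {
    my ($first, $last) = @_;
    my $sql = <<'SQL';
SELECT *
FROM pitches
JOIN atbats  ON pitches.ab_id  = atbats.ab_id
JOIN games   ON atbats.game_id = games.game_id
JOIN players ON atbats.pitcher = players.player_id
WHERE players.first_name = ? AND players.last_name = ?
ORDER BY games.game_date, atbats.ab_id, pitches.pitch_id
SQL
    # The name is returned separately so it can be bound, not pasted into the SQL.
    return ($sql, $first, $last);
}

my ($sql, @bind) = pitcher_query('Jeremy', 'Guthrie');
print $sql;
```

With a connected DBI handle in $dbh, something like $dbh->selectall_arrayref($sql, { Slice => {} }, @bind) would run the query and return one hash reference per pitch. You can also paste the SQL into PHPMyAdmin, replacing each ? with a quoted name.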

You can export the data from a query to Microsoft Excel. PHPMyAdmin will save the query output to a text file with fields delimited by whatever character you choose. You can then import this file into Microsoft Excel for sorting and graphing.

If, however, you find the graphing capabilities of Excel a little limiting, as I do, you may want to explore the statistical package R. R is open-source software with much more advanced graphing capabilities than Excel. You can also run your SQL queries directly from R and easily manipulate and partition your data sets. I’m still learning quite a bit about graphing in R as I go. The pitch speed graphs for my Brandon Webb article were made in R. They are nothing special in appearance, but they were much easier to make in R than they would have been to make in Excel from the same data set. Based on appearances, Joe P. Sheehan and a few of the other PITCHf/x researchers are also using R for their work. Joseph Adler has a chapter on installing and using R in Baseball Hacks.

Here is an example analysis using R to make a couple of plots from games by Brandon Webb: one like the pitch speed graphs in my post about Webb, and another graphing pitch speed versus horizontal break.

I’m just in the beginning stages myself here, so hopefully I can add more about analysis as time goes on, or you can take a look at what other people have been doing. Dr. Nathan’s analysis of a Jon Lester start shows a lot of promise in terms of classifying pitches by speed and spin direction.

What else?

I want to add umpire data and hit location data to the pitch database. One of these days I will get around to that. Harry Pavlidis at Cubs F/X has added umpire data to the database, so take a look at his work if you are interested in that. He has a copy of his script posted for you to use. Thanks, Harry!

Edit: the new version of the XML-to-database parser script adds umpire data and hit location data to the database. I have updated the PBP database structure to add the hit location fields to the at_bats table.

Note: Some of you have noted connection problems with the spider script. Kris Gardham informs me, “For those people getting connection problems, I think it’s more of a DNS thing than a MLB thing. Hard coding the IP of gd2 or gd.mlb.com into the script really seemed to speed things up.”

Edit: There are new fields available in the 2008 data. The sv_id field is a date-time stamp of when the pitch was thrown, the pitch_type is the MLBAM algorithm’s best guess at the pitch type, and type_confidence is the confidence value associated with that guess. Here is my new database structure for 2008 with these fields added to the pitch table. You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch. You will also need to change the spider script to download the game.xml file if you want to take advantage of the additional game info that I am parsing from that file.

The spider needs to be changed to download the game.xml file. Since the game.xml file is in the same directory as the boxscore.xml file, you can duplicate that section of the code and change “boxscore” to “game”, like so:

if($gamehtml =~ m/<a href=\"game\.xml\"/ ) {
$gameurl = "$dayurl/$game/game.xml";
$response = $browser->get($gameurl);
die "Couldn't get $gameurl: ", $response->status_line, "\n"
unless $response->is_success;
$infohtml = $response->content;
open GAME, ">$gamedir/game.xml"
or die "could not open file $gamedir/game.xml: $!\n";
print GAME $infohtml;
close GAME;
} else {
print "warning: no xml game file for $game\n";
}

UPDATE:  See this tutorial for installing a PITCHf/x database on a Mac, based on my tutorial: http://www.beyondtheboxscore.com/2009/8/19/994666/saberizing-a-mac-4-pitch-f-x.  The comments are worth reading even if you are using a PC.  In the comments I included links to the latest versions of my code.

Spider script: http://codepaste.net/ppw1oo (updated in May 2011 – with new code to address timeout issues, courtesy of Matthew Bultitude)

Database structure: http://codepaste.net/aoaog9 (updated in May 2011)

New database structure with fielder info: http://codepaste.net/anur1y (updated in September 2011)

Database parser script: http://codepaste.net/hpdz3z

New database parser script that will handle fielder info: http://codepaste.net/nd7ggn  (updated in September 2011)

Add-on script for updating balls and strikes: http://codepaste.net/ha2vc2

Add-on script for updating fielder information by inning: http://codepaste.net/qmjv8k (added in September 2011)