Note: links to updated versions of scripts and database structure are found at the end of this post. Also, I highly recommend using XAMPP (or Mac equivalent) to install all this software with one easy installer rather than installing Perl, MySQL, etc., piecemeal.
One of my hopes for this project is to share the pitch data in a way that facilitates the analysis of others. There are many others out there whose knowledge of statistics, data analysis, graphics, physics, and/or baseball far exceeds mine.
My pitch database itself is too large a file to easily share, so I’ll do the next best thing, what I hope may be a better thing after all, to try to document the process I used to create it so that others can create a pitch database for themselves. This post may be an ongoing work as I both recreate what I have already done and lay out what work remains for me to do.
First of all, if you just want to get your feet wet using Microsoft Excel to analyze a single game’s worth of pitch-by-pitch XML data, Dr. Alan Nathan has laid out the steps for you at his Physics of Baseball site.
If you desire a whole season’s worth of pitch data (over half a million pitches) stored in a relational database, with visions of all sorts of wonderful analysis that would enable, follow me!
Downloading the Data
The first place to start is with downloading the XML data from Major League Baseball’s Gameday website. But if we’re going to download thousands of games, each with hundreds of pitches, that’s not something we want to do manually. Fortunately, we can leverage a very useful book by Joseph Adler, Baseball Hacks, published by O’Reilly in January 2006. Some parts of his hacks are outdated or nonfunctional, but other parts I found very useful for this project.
Whether or not you want to buy the book, the Perl scripts he wrote for the book can be downloaded from the examples section of O’Reilly’s website. Download and uncompress the baseball_hacks_code.zip file. It contains all the scripts from the book, divided by chapter and hack number. The first script of interest for XML downloading is hack_28_spider.pl, found in Chapter 3. This script can be used after only a few minor modifications, as follows:
To download the 2007 season starting with April 2, change Line 35 from
$start = timelocal(0,0,0,20,6,105);
to
$start = timelocal(0,0,0,2,3,107);
Similarly, you need to change the end time. Uncomment Line 40 and comment Line 41:
$now = timelocal(0,0,0,$mday - 1,$mon,$year);
#$now = timelocal(0,0,0,3,10,105);
These statements determines the first and last dates between which the XML game files will be downloaded. The statements use the function timelocal(seconds, minutes, hours, days, months, years) where days is the day of the month 1-31, months is the month 0-11 (January=0, December=11), and years are since 1900 (2007 = 107).
Another change is to correct for the fact that MLB now puts player information in an XML file rather than a TXT file.
Change the players.txt in Lines 95, 96, 101, and 102 to players.xml:
from
if($gamehtml =~ m/<a href=\"players\.txt\"/ ) {
to
$plyrurl = "$dayurl/$game/players.txt";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.txt"
or die "could not open file $gamedir/players.txt: $|\n";
if($gamehtml =~ m/<a href=\"players\.xml\"/ ) {
$plyrurl = "$dayurl/$game/players.xml";
$response = $browser->get($plyrurl);
die "Couldn't get $plyrurl: ", $response->status_line, "\n"
unless $response->is_success;
$plyrhtml = $response->content;
open PLYRS, ">$gamedir/players.xml"
or die "could not open file $gamedir/players.xml: $|\n";
Fortunately, you don’t need to change any code to download the pitch-by-pitch files since the filenames remain unchanged from the 2005 season, even though the data content has changed.
Before you can run it, though, you need to have Perl installed on your computer. You can download the Perl binaries and package manager for ActivePerl here.
Once you get Perl installed you should be ready to go. I made a separate directory to hold my game data. If you’re running from Windows, just open a command prompt window, cd to your game data directory, and then run the hack_28_spider.pl script.
I find that it works best to run the spider in off-peak times. Some people have reported connection problems when they try to run in peak times (during the business day or game time on the East Coast), but I haven’t seen any such problems off-peak. Plus, that’s a more respectful use of MLB’s bandwidth. If bandwidth cost is a factor for MLB, the less cost we impose on them, the more likely they are to continue to make the XML data available for free.
Creating the Database and Installing Software
The next steps are setting up a database and adapting the hack_28_parser.pl script to input the data into the database.
Here is a link to my PBP database structure. This is a MySQL database. If you are already familiar with MySQL, then this will give you a head start.
Also, here is a link to the code from my XML-to-database parser script. It’s still a work in progress, so please don’t mind the mess inside, but it works if you want to use it in its current state. I’ve fixed it up a bit from an earlier version I posted, cleaning up some of the subroutine calls. (Note: the copy now posted is yet a newer version that also processes umpires and fielding locations for balls in play.)
Running the XML-to-database parser script will require the Perl DBI::mysql package to be installed. If you don’t have that already, open the Perl Package Manager (under the Active Perl program group if you installed Active Perl). Under the View menu, select All Packages, and look for DBI and DBD-mysql. If they aren’t listed as installed, click on them to select them, and then go to the Action menu and Install them.
Now on to downloading and installing MySQL. Here is the MySQL 5.0 download site. If you are downloading for Windows, download the installer, and accept the default settings. You will need to set up a username and password for MySQL. Remember these since you will need to use them in the XML-to-database parser script.
MySQL has a command-line interface from which you can do everything you need. However, I don’t find it very easy to do things that way, and if you don’t either, you can install an administration interface. There are all sorts of them out there. I think MySQL even offers one on their site. The one I use is PHPMyAdmin, which is a web-based admin utility for MySQL.
Using PHPMyAdmin requires having a webserver (e.g., Apache or Microsoft IIS) running and having PHP installed. I’m going to assume for the moment that you either have an alternate administration interface for MySQL, or already have PHPMyAdmin installed. For me, the collective installation process for MySQL, Apache HTTP Server, PHP, and PHPMyAdmin was one of the biggest challenges of this project. Thus, I believe it deserves a full explanation. However, it’s something I’m just not getting around to fully documenting at this time, and I don’t want that to keep me from moving on to some other interesting aspects of the project since I know some people are already past this point. Although it isn’t the way I did it, XAMPP offers a packaged install of Perl, MySQL, Apache, PHP, and PHPMyAdmin, and I believe some people have had success with that installation path.
Once you have created your database and you’re updating it periodically by parsing the downloaded data, how do you use the data?
Analyzing the Data
The first step is running a query to get a data set from the database for analysis. Here is a link to an example query. Database queries are written in a language called SQL. The query in my example pulls all the fields from several tables for every pitch thrown this season by Jeremy Guthrie.
You can export the data from a query to Microsoft Excel. PHPMyAdmin will save the query output to a text file with fields delimited by whatever character you choose. You can then import this file into Microsoft Excel for sorting and graphing.
If, however, you find the graphing capabilities of Excel a little limiting, as I do, you may want to explore the statistical package R. R is an open-source software that has much more advanced graphing capabilities than Excel. You can also run your SQL queries directly from R and easily manipulate and partition your data sets. I’m still learning quite a bit about graphing in R as I go. The pitch speed graphs for my Brandon Webb article were made in R. They are nothing special in appearance, but they were much easier to make in R than they would have been to make in Excel from the same data set. Based on appearances, Joe P. Sheehan and a few of the other PITCHf/x researchers are also using R for their work. Joseph Adler has a chapter on installing and using R in Baseball Hacks.
Here is an example analysis using R to make a couple plots from games by Brandon Webb, one like the pitch speed graphs like I did in my post about Webb and another graphing the pitch speed versus horizontal break.
I’m just in the beginning stages myself here, so hopefully I can add more about analysis as time goes on, or you can take a look at what other people have been doing. Dr. Nathan’s analysis of a Jon Lester start shows a lot of promise in terms of classifying pitches by speed and spin direction.
What else?
I want to add umpire data and hit location data to the pitch database. One of these days I will get around to that. Harry Pavlidis at Cubs F/X has added umpire data to the database, so take a look at his work if you are interested in that. He has a copy of his script posted for you to use. Thanks, Harry!
Edit: the new version of the XML-to-database parser script adds umpire data and hit location data to the database. I have updated the PBP database structure to add the hit location fields to the at_bats table.
Note: Some of you have noted connection problems with the spider script. Kris Gardham informs me, “For those people getting connection problems, I think it’s more of a DNS thing than a MLB thing. Hard coding the IP of gd2 or gd.mlb.com into the script really seemed to speed things up.”
Edit: There are new fields available in the 2008 data. The sv_id field is a date-time stamp of when the pitch was thrown, the pitch_type is the MLBAM algorithm’s best guess at the pitch type, and type_confidence is the confidence value associated with that guess. Here is my new database structure for 2008 with these fields added to the pitch table. You can download the new database parser script to use these fields. I have an additional script to update the pitches table with the ball-strike count at each pitch. You will also need to change the spider script to download the game.xml file if you want to take advantage of the additional game info that I am parsing from that file.
The spider needs to be changed to download the game.xml file. Since the game.xml file is in the same directory as the boxscore.xml file, you can duplicate that section of the code and change “boxscore” to “game”, like so:
if($gamehtml =~ m/<a>get($gameurl);
die “Couldn’t get $gameurl: “, $response->status_line, “\n”
unless $response->is_success;
$infohtml = $response->content;
open GAME, “>$gamedir/game.xml”
or die “could not open file $gamedir/game.xml: $|\n”;
print GAME $infohtml;
close GAME;
} else {
print “warning: no xml game file for $game\n”;
}
UPDATE: See this tutorial for installing a PITCHf/x database on a Mac, based on my tutorial: http://www.beyondtheboxscore.com/2009/8/19/994666/saberizing-a-mac-4-pitch-f-x. The comments are worth reading even if you are using a PC. In the comments I included links to the latest versions of my code.
Spider script: http://codepaste.net/ppw1oo (updated in May 2011 – with new code to address timeout issues, courtesy of Matthew Bultitude)
Database structure: http://codepaste.net/aoaog9 (updated in May 2011)
New database structure with fielder info: http://codepaste.net/anur1y (updated in September 2011)
Database parser script: http://codepaste.net/hpdz3z
New database parser script that will handle fielder info: http://codepaste.net/nd7ggn (updated in September 2011)
Add-on script for updating balls and strikes: http://codepaste.net/ha2vc2
Add-on script for updating fielder information by inning: http://codepaste.net/qmjv8k (added in September 2011)
August 25, 2007 at 10:58 am
M – I downloaded the files last night, and the code for the pbp directories was already included. My run’s almost done …..
August 25, 2007 at 3:40 pm
This is great – but a couple fields didn’t match in your parser, so I added them to the db, and one primary key had to be changed. Now the parser is running like mad. Thanks so much for all your effort.
August 25, 2007 at 5:38 pm
Yep, I just realized the MySQL table structure I posted was old, so I’ll update it with the new one. I’m also working on a code update that I hope to post later tonight. Functionally it’s the same, but it looks prettier and is organized better.
August 26, 2007 at 10:41 am
I’m cooking with gas, or something like that. http://cubsfx.blogspot.com/2007/08/just-example.html
August 27, 2007 at 11:35 am
Mike, you are THE MAN. The next few weeks may elevate you to HERO. Keep up all the great work.
August 27, 2007 at 5:23 pm
I’ve posted a couple scripts to load the umpires @ http://cubsfx.blogspot.com/
Thanks again, Mike. I’m working on an analysis at pitching/hitting out of the zone, some prelim stuff already posted, and I wouldn’t be doing that now without your work.
August 28, 2007 at 3:11 pm
>> $start = timelocal(0,0,0,2,3,107);
Wouldn’t April 2nd really be (0,0,0,2,4,107)? Or does the numbering happen to start at 0 for months?
And for any other Mac users out there, I believe all the necessary software either comes with the OS (perl) or can be downloaded as a a free bundle in the form of MAMP (Mac Apache MySQL PHP) from mamp.info
August 28, 2007 at 4:31 pm
Months are numbered 0-11 in Perl so April = 3.
August 31, 2007 at 1:40 pm
If anyone reading this is Mac proficient and can offer some perl assistance (I’m getting an error that says “Can’t locate DBI.pm in @INC…” when I try to run the perl script for hack 28), I’d really appreciate it.
skyking162 — gmail
Thanks!
August 31, 2007 at 3:12 pm
Sky, it sounds like you don’t have the DBI::mysql package for Perl and need to download it. If you can find your Perl Package Manager, run it, select View->All Packages, and look for the DBI and DBD-mysql packages. If they aren’t installed yet, click on them and select Action->Install to install them. That should take care of your problem.
I’m pretty sure PPM comes with the Mac OS X Perl distribution. If you have trouble finding it, maybe one of the other Mac users could chime in.
August 31, 2007 at 4:37 pm
Mike – Thanks for the tip on R, I’m going to try it out – doing the sql within R is a good thing, or so it sounds.
September 2, 2007 at 12:49 am
Yeah, R’s pretty powerful, though it’s got a bit of a learning curve. Once you get the hang of it, though, you’ll find it’s a nice environment, especially for stats.
Mysql, on the other hand.. I have nothing good to say about it. That said, you might as well keep this stuff in it. You could also store it in Berkeley DB tables, or flat text files, or whatever. Still, if you like the sql interface, I’d really really really recommend something that’s more standards-compliant and more portable. Postgresql is my poison, but there is plenty out there from which to choose.
Thanks for posting the scripts, though. Perhaps if I have some downtime this weekend I can try ’em out.
September 2, 2007 at 6:59 am
What if I want a database of the a particular teams pitchers and not the whole thing? Is there a way to manipulate the script so that I could get one team?
September 2, 2007 at 7:19 am
John, if you want just one team, you can modify the download script to just download XML files for that team’s games.
Just modify the regular expression in Line 63 of the download script so that it only matches on your team of interest.
while($html =~ m/<a href=\”(gid_\w+\/)\”/g ) {
push @games, $1;
}
to
while($html =~ m/<a href=\”(gid_\w+_seamlb_\w+\/)\”/g ) {
push @games, $1;
}
if you want only Seattle games, for example.
September 2, 2007 at 8:30 am
Thanks 🙂
September 2, 2007 at 9:42 am
I’ve consolidated my version Mike’s parser with the umpire stuff into a single file.
http://cubsfx.blogspot.com/2007/09/data-loader.html
As always, make a back-up of your data….
September 2, 2007 at 10:59 am
😦 I tried running it but keep getting connection errors…im going to have to try at night.
September 2, 2007 at 10:33 pm
Mike, love the work. I did the script for a spefic team and I so got dates and no data inside. When I used the regular script it worked fine. Help please, THANKS!
September 2, 2007 at 10:39 pm
Benji, my comment was missing a slash before the second \w+. (The \w indicates a “word” in Perl regular expressions, w without the slash is just the letter “w”.) I fixed my comment above to reflect the correct syntax.
September 2, 2007 at 11:00 pm
It works!
The thing is when you do just one team you still get the other players so I am going to have to filter that some how. Any ideas?
Thanks
September 3, 2007 at 4:57 am
I tried running it at 3am and still got connection issues. Not sure whats wrong.
September 3, 2007 at 7:19 am
[…] How to build a pitch database « Fast Balls Information on automating the process of getting PITCHf/x files. Beats the hell out of downloading the XML files by hand. (tags: baseball pitchfx statistics sports analysis xml perl database) […]
September 3, 2007 at 7:56 am
Mike I have the spider running through the 11th month but it stops at 7-11
September 3, 2007 at 8:10 am
Attnblog, there should be a way to modify the code to exclude the visiting players, but it takes diving into the code a little more, and that’s outside the scope of my project. If you or someone else decides to tackle that, you’re welcome to post the result here.
The end date for downloading by the spider is set by the $now variable.
September 3, 2007 at 8:34 am
I put the now variable to this $now = timelocal(0,0,0,11,10,107) but the spider is only going to month 7 when there is data all the way up to september
September 3, 2007 at 9:14 am
Also, i tried running this code on different teams and whenever it gets to a date it can’t find it stops. For example, I had the spider go get data from april and it returns Can’t get … 404 not found
September 3, 2007 at 10:33 am
Ok, I worked through some issues and definitely installed the proper packages via PPM. However, the script still can’t find DBI. PPM installed DBI in this directory:
/Library/ActivePerl/etc
But here’s the error I get when I run the hack:
Can’t locate DBI.pm in @INC (@INC contains: /System/Library/Perl/5.8.1/darwin-thread-multi-2level /System/Library/Perl/5.8.1 /Library/Perl/5.8.1/darwin-thread-multi-2level /Library/Perl/5.8.1 /Library/Perl /Network/Library/Perl/5.8.1/darwin-thread-multi-2level /Network/Library/Perl/5.8.1 /Network/Library/Perl .) at hack_28_SKY.pl line 21.
What is @INC? Do I need to somehow include the path to ActivePerl as part of @INC? I tried to copy the ActivePerl folder into the /Library/Perl folder, but that didn’t work.
Anybody have any tips?
September 3, 2007 at 4:16 pm
Sky, you either need to change your path to include /Library/ActivePerl/etc in the path (or whatever directory contains the DBI and DBD directories), or you need to put the DBI and DBD directories into a directory already in your path, e.g., /Library/Perl/5.8.1 /Library/Perl.
Perl will look in the path for a file named DBI.pm or a directory named DBI with *.pm files inside it. When it doesn’t find either, it simply reports that it can’t find DBI.pm in the path.
September 4, 2007 at 7:27 am
Attnblog, I remember now that the spider will die at the All Star break because there is no data for a couple days there. Change the start date to the first day of games after the All Star break and it should start up again and do the rest of the season.
Probably the same issue is going on if you’re just trying to get a specific team and they have an off day.
You could probably fix the code to be more intelligent about missing days, but that’s where it’s at right now.
September 4, 2007 at 1:15 pm
Yes, but HOW do I change the path to include /Library/ActivePerl? And will just adding that one path allow my computer to look inside any directories within /Library/ActivePerl or do I need to add the entire, specific path of the DBI and DBD files?
I already tried the opposite method — dragging and dropping the DBI and DBD files from their current folder to the /Library/Perl folder, but that didn’t work.
September 5, 2007 at 4:24 pm
I am trying to use perl for more.
I am trying to use a spider to download all the minor league data. Any help with the spider would be appericated
September 9, 2007 at 7:13 pm
Mike,
Thanks for the great how-to. I’m wondering how you deal with inconsistencies in the data.
For example, on 7/14 in the Colorado/Milwaukee game, there’s a pitching change in the 4th inning during the middle of Helton’s at-bat which causes all the pitches to be assigned to Gallardo rather than Sheets – and it causes the parser to fail because there was an error during the at-bat.
I’m not sure how to get the pitches assigned to the proper pitcher beyond manually splitting it into multiple at-bats, which seems wrong. If you don’t mind me asking, how did you handle that situation?
Thanks again,
Dan Turkenkopf
September 9, 2007 at 8:17 pm
Dan, as it stands now, the parser assigns all the pitches to whichever pitcher’s id is associated with that at bat in the XML file, in this case Gallardo. The parser does not look at the “action” fields.
Adler has code in his version of the parser that looks at the action fields in order to parse everything into Retrosheet event format, so maybe I can leverage that in some fashion. I know when I first looked through it that Adler’s original parser code does not work as it is since there have been so many changes to the Gameday XML files from 2005 to now.
It’s something that needs to be fixed, and I appreciate you bringing it to my attention. Depending on how many times it happens in the season, I may elect to do what you mentioned, which is to split the at bat. Even though that’s not an ideal solution, my guess is that it happens only a handful of times, and I hate to add a batter and pitcher field to each pitch just for a few exceptions.
September 10, 2007 at 9:26 am
Mike,
Thanks for the info. I don’t think the problem is the parser code. I think the bigger problem is actually data quality – MLB assigns all pitches to the reliever rather than who actually threw them. It probably doesn’t happen all too frequently (in this case, I’ll probably just remove the error I think), but it’s still a item of concern.
September 11, 2007 at 1:48 pm
Dan, here is a list of all the instances of mid-at-bat batter or pitcher substitution this year, through yesterday (September 10). There are 27 occurrences for pitchers and 18 occurrences for batters.
I searched all the action fields for event eq “Pitching Substitution” or “Offensive sub” and either ball or strike count greater than zero.
Date Home Away Inn Ball Strike Bat/Pit EliasID-of-Sub
‘2007-04-05’ ‘was’ ‘ari’ 1 2 2 batter 425631
‘2007-04-05’ ‘sfn’ ‘sdn’ 9 1 0 batter 425468
‘2007-05-03’ ‘sfn’ ‘phi’ 5 0 1 batter 456422
‘2007-05-30’ ‘chn’ ‘flo’ 5 0 2 batter 425286
‘2007-05-31’ ‘tor’ ‘cha’ 6 2 2 batter 407880
‘2007-06-11’ ‘chn’ ‘hou’ 8 0 1 batter 348563
‘2007-06-16’ ‘lan’ ‘ana’ 7 2 0 batter 425477
‘2007-07-01’ ‘flo’ ‘atl’ 9 1 2 batter 434681
‘2007-07-06’ ‘lan’ ‘flo’ 10 0 1 batter 430971
‘2007-07-07’ ‘sln’ ‘sfn’ 7 1 0 batter 276534
‘2007-07-20’ ‘lan’ ‘nyn’ 4 2 2 batter 430905
‘2007-07-24’ ‘ana’ ‘oak’ 8 0 1 batter 136767
‘2007-07-28’ ‘kca’ ‘tex’ 9 1 0 batter 121058
‘2007-08-18’ ‘was’ ‘nyn’ 6 1 2 batter 115094
‘2007-08-22’ ‘cin’ ‘atl’ 1 1 0 batter 488862
‘2007-09-07’ ‘col’ ‘sdn’ 7 0 2 batter 276054
‘2007-09-10’ ‘det’ ‘tor’ 2 0 1 batter 425659
‘2007-09-10’ ‘flo’ ‘was’ 7 2 0 batter 425508
‘2007-04-10’ ‘flo’ ‘mil’ 5 2 0 pitcher 421104
‘2007-04-13’ ‘atl’ ‘flo’ 7 1 0 pitcher 434680
‘2007-04-18’ ‘sea’ ‘min’ 1 3 1 pitcher 430646
‘2007-04-22’ ‘tex’ ‘oak’ 9 3 2 pitcher 430788
‘2007-05-02’ ‘cle’ ‘tor’ 2 2 2 pitcher 425507
‘2007-05-02’ ‘sdn’ ‘was’ 3 1 2 pitcher 111492
‘2007-05-06’ ‘nya’ ‘sea’ 7 1 0 pitcher 430675
‘2007-05-12’ ‘was’ ‘flo’ 5 2 2 pitcher 446453
‘2007-05-25’ ‘flo’ ‘nyn’ 8 3 2 pitcher 434637
‘2007-06-01’ ‘phi’ ‘sfn’ 4 1 1 pitcher 489448
‘2007-06-13’ ‘kca’ ‘sln’ 4 0 2 pitcher 407908
‘2007-06-23’ ‘atl’ ‘det’ 7 1 2 pitcher 407884
‘2007-07-14’ ‘mil’ ‘col’ 4 3 2 pitcher 451596
‘2007-07-29’ ‘phi’ ‘pit’ 8 1 2 pitcher 110134
‘2007-08-01’ ‘chn’ ‘phi’ 5 1 0 pitcher 113961
‘2007-08-01’ ‘tba’ ‘tor’ 4 2 1 pitcher 466838
‘2007-08-02’ ‘flo’ ‘col’ 5 0 1 pitcher 434637
‘2007-08-18’ ‘chn’ ‘sln’ 8 1 2 pitcher 450320
‘2007-08-30’ ‘nya’ ‘bos’ 9 3 1 pitcher 469735
‘2007-09-01’ ‘sln’ ‘cin’ 9 1 2 pitcher 459376
‘2007-09-05’ ‘ari’ ‘sdn’ 6 1 1 pitcher 430963
‘2007-09-06’ ‘bal’ ‘bos’ 4 0 2 pitcher 457122
‘2007-09-07’ ‘bal’ ‘bos’ 4 2 0 pitcher 434638
‘2007-09-07’ ‘col’ ‘sdn’ 3 2 0 pitcher 150230
‘2007-09-09’ ‘tex’ ‘oak’ 2 2 1 pitcher 407911
‘2007-09-10’ ‘det’ ‘tor’ 8 1 1 pitcher 435261
‘2007-09-10’ ‘flo’ ‘was’ 7 2 0 pitcher 448693
September 19, 2007 at 1:02 pm
Sky,
Try this:
$ export PATH=/usr/local/ActivePerl-5.8/bin:$PATH
That should get you to the SQL problem I’m having at least. 😉
September 20, 2007 at 3:32 am
[…] also like to take a moment express my undying admiration for people who go out of their way to share their work. It sets a precedent for an open research community, and that can start great things in motion. If […]
September 21, 2007 at 3:00 pm
I take it that this post was made because it is basically the only way to do it, correct? I downloaded Perl and extracted all this seasons data…
But I don’t feel I need a whole database, is there any way I could set up a macro in Excel so that if i put in a player’s ID such as “276371” that it would pull all of Santana’s pitches out of the folder and import them into Excel?
September 21, 2007 at 4:30 pm
It turns out all OSX users trying to use this method are SOL. There is a bug in DBD::mysql for OSX that won’t let Perl connect to the database. Sigh.
And unless I’m wrong, Excel won’t open the raw XML files either. It’s like the dark ages all over again…
October 6, 2007 at 11:52 am
I know you’ve probably forgotten about this by now, but I think that would be impossible to do in Excel. If you really wanted to avoid SQL, you could load all of Santana’s XML files into Access, but that would take a while too and it’s just another database.
And for all you Mac users out there, it looks bleak. I keep getting an error with the parsing script that can be tracked down to this bug in the DBD:mysql bundle that ActivePerl installs. If anyone knows how to get around this or manages to install a more recent version of it manually, pleeeeze let me know.
November 7, 2007 at 4:27 pm
Great stuff here, Mike, and a tremendous service to the community. I have a few questions about your database — would you be able to contact me, please? Thanks.
November 27, 2007 at 11:15 pm
Mike, you mention that the hardest part of this is getting Apache, phpAdmin, mySQL, PHP running. There are a couple free servers that bundle them, Wamp being the one I use. One install and all 4 items are set up and ready to go on Windows. A poster above mentions Mamp which I think is the same thing for Macs.
I’ll have to try the fixed script that Cubsfx guy posted but thanks for posting all this very much.
Why does Perl have to be so rediculously aweful to read and thus make people not organize it very well? Oh well. I hate Perl.
Sincerely,
Cured 5 year Perl vet
December 8, 2007 at 11:37 am
Mike
Quick question on your parser to convert the files to sql. Running the parser I get stuck on hit_record.txt file.
It executes then after a while it says cannot open hit_record.txt. In the root directory there is a file called hit_record.txt that has data in.
When I delete the hit_record.txt part of the code it all works swimmingly. Any thoughts?
It is a great public service that you’ve made this code available. It is 100% better than my more manual system!
Thanks
John
December 8, 2007 at 11:54 pm
John,
I have a comment in the code preceding the hit_record section that explains the purpose hit_record.txt file (for manual review of some situations where a ball in play and an error are recorded on the same play and the XML files are not consistent in how they record this info).
If you don’t care about reviewing those situations for accuracy, you can disable the printing of the debug information to the text file. To do this, simply comment out all the lines with “open (HITRECORD)”, “print HITRECORD”, or “close HITRECORD”.
I’m not sure why you’re getting the error where it can’t open the hit_record.txt file. If you still want to figure that out, email me at mikefast@gmail.com, and maybe we can resolve it.
December 9, 2007 at 1:08 am
Mike
Thanks for the reply. I’ll do without the hitrecords at the moment and may come back to it later
One more thing. Is there a quick way to run the parser so that it only updates files that you have recently downloaded or do you need to delete the all the xml files that you have already loaded into the database to do that?
Thanks
John
December 9, 2007 at 8:26 am
John,
There’s probably a way to handle it in the code, but the way I do it instead is to move the XML files that I’m finished parsing into a different directory.
The parser will check if it has already added games/atbats/pitches to the database and won’t add in duplicates, but the parser will still take time crawling through the files to find that out, so I move that data to a different directory once I’m done with it.
December 9, 2007 at 11:17 am
Yeah, I figured that was the easiest solution.
That HitRecord thing is odd though … I started running the parser with that code commented out and it was fine. Then I ran the next chunk of data with it uncommented and it was fine for about 75 days worth of data before becoming a problem. I deleted the file and it seemed fine. Wierd.
Anyway one final question (I’m mostly through). On July 14 I got this error:
numrow=0, no matching at bat found for hit 115732 – 282656 – 4.
I thought this was the issue that Dan referred to above but it looks like a straight forward pitching change. Any ideas?
Thanks
John
December 9, 2007 at 11:50 am
Mike — sorry, it is the same issue as Dan’s.
Have MLBAM fixed this issue?
December 9, 2007 at 4:21 pm
John,
I doubt if MLBAM considers it a problem if it works with their Gameday application but doesn’t work with my database construction.
I have not fixed these instances in my database yet.
December 10, 2007 at 11:07 pm
Mike
This is a great resource you have made public here. I cannot thank you enough for all your hard work.
I have a question on the adjustment you make. I notice that the y0 values are sometimes at 55 feet, and other times are at 50 feet and 45 feet. Presumably this affects the break and pfx_x/z numbers?
How do you ensure that you make the correct adjustments. Is there a public algortihm that you used? If you do you mind pointing me in the right direction?
Thanks so much
Al.
December 10, 2007 at 11:54 pm
Alex,
The pfx_x and pfx_z numbers are always calculated between y=40 feet and y=1.417 feet (front of home plate) regardless of the y0 value.
It does affect the start_speed and vy0, vx0, vz0 numbers, however, which are measured at the y0 point, whatever it may have been. So I adjust these numbers to y0=50 feet since that’s what Sportvision standardized on in the second half of the year.
The basic formula to make these adjustments is the distance formula x(t) = x0 + v0*t + 0.5*a*t^2. If you want to see how I applied that formula to calculate various quantities and you have Microsoft Excel, take a look at the spreadsheets I supplied with the Joba Chamberlain or Mariano Rivera analyses.
December 11, 2007 at 7:07 am
Mike
That is most helpful. Thank you very much.
Al.
February 18, 2008 at 7:17 pm
Mike, are you aware of anything published by MLB that defines the XML schema?
I’m working on trying to make these records line up with Retrosheet and I’ve already determined I’m going to need to parse ACTION elements. I’d like to try to design this upfront rather than as I go, but I’m guessing it’s not going to be possible.
BTW.. I know I promised months ago that I’d have a post on how to make this run nightly. I can put that together, but I’m waiting for a friend to buy some server space and I think I might try to get this new parser going first. If anyone’s interested sooner, let me know and I’ll pull together what I have.
February 20, 2008 at 7:02 pm
Alright.
I setup Apache, MYSQL, phpmyadmin, and anything else on the list I may have forgotten at this point. You were right, it wasn’t easy:)
So I created my tables in MYSQL, and I’m ready to start gathering data. Is there anything I need to modify in your xml/database parser except for the database connection string? What does the hack_28_spider.pl script do?
My database isn’t populating, and I’m trying to narrow down the possibilities of what’s causing the problem — whether it’s an error on my part (likely) or if I need to be using some of the additional scripts on this page like the spider.
Thanks a lot for your help on my noobish quandry.
February 21, 2008 at 9:58 am
Dan, I am not aware of any published definitions from MLB.
If you’re parsing the action elements and using Perl scripts to do it, you may already know this, but Joseph Adler’s scripts from Baseball Hacks are a big step toward that. They need to be updated to make them work, though.
February 21, 2008 at 10:04 am
Kevin, the hack_28_spider.pl downloads the data from the MLB Gameday website. If you haven’t run that, you won’t have any data to populate your database with.
I created that script using Adler’s script as the starting point. I’ve since been told that using wget is much more efficient and less prone to dropping the connection than the Perl LWP user agent. I’ll probably look into using wget for the upcoming 2008 season.
Once you have downloaded the data files from MLB, the only thing you should need to modify in the database parser is the connection string.
February 29, 2008 at 11:34 pm
Mike,
I have always used Microsoft Access and ColdFusion (from what I hear, not the best choice). I’m new to MySQL and Perl, but I have choose to use them.
I have the hack_28_parser.pl working correctly and it downloads the data to my computer. All i do to run the file is double click it within Dreamweaver. I copied your parser script and have made only a change to the username and password, but it does not populate the database.
I have DBD-mysql-informationschema, DBD-mysql-SimpleMySQL, and DBI installed. Is there an error on my part?
March 1, 2008 at 9:42 pm
Wade, what messages do you get (in the command window if you’re running Windows) when you run the parser script?
March 2, 2008 at 1:24 am
Mike,
I have used XAMPP to install Apache and MySQL. I then installed ActivePerl and installed that to the same root folder. I installed the DBI and DBD-mysql-SimpleMySQL because DBD-mysql was not on the list of “All packages.” Here is what it says:
install_driver(mysql) failed: can’t loacate DBD/mysql.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at (eval 4) line 3.
Perhaps the DBD::mysql perl module hasn’t been fully installed, or perhaps the capitalization of ‘mysql’ isn’t right.
Available drivers: DBM, ExampleP, File, Gofer, Proxy, SQLite, Sponge. at C:\perl\htdocs\baseball\parser.pl line 22
I have only 24 hours of experience with all of these programs and languages, so it is probably something very simple that I am missing. It sounds to me like I still need to find and install the DBD-mysql, but I have searched for 3 hours now and can’t find where to download it, or how to install it once I get it downloaded. Any help would be appreciated!
March 2, 2008 at 3:04 pm
Mike,
After searching, I found the reason why my parser script wasn’t working was due to the fact that I have ActivePerl 5.10, which doesn’t have a DBD-mysql driver available, so I unistalled and installed ActivePerl 5.6 and worked the first time.
I greatly appreciate your efforts, and thank you for your very informative website. I will keep in touch with my creations. Again, thank you!
March 3, 2008 at 12:10 am
Wade, I’m glad you figured it out, and I hope you have as much fun with all the data as I do!
March 4, 2008 at 9:45 pm
Mike,
First of all, thanks so much for writing this up and sharing it. I bought baseball hacks a while ago and was never able to put it together until now.
I used the perl script to spider the info (most of it, i think). I had some trouble with it only downloading a month at a time (I might have made a mistake with the start and end dates but I think i got all the data).
When it came time to install the activeperl I, like Wade, didnt find the right plugin in version 5.10, so I had to go all the way back to 5.6. I guess unlike Wade, my version ten was a command prompt but It had the right plugins.
I created the database and am running the parser now. Any idea of how long it should take?
I’ll have more to report when I get into the data, but I couldn’t have done it without your help.
March 4, 2008 at 11:48 pm
The parsing time is probably dependent on the horsepower of your computer. My three-year old computer did it in about 8-10 hours, if I remember correctly.
March 5, 2008 at 4:32 pm
Parsing game number 2699.
ab#207501,numrows=0, no matching at bat found for hit 115732 – 282656 – 4.
During the parsing i received this error. What do you think my best course of action, delete the game from my data and parse again?
March 6, 2008 at 3:39 pm
This is the July 14 game between the Rockies and Brewers, 4th inning, Helton vs. Sheets, yes? (I’m going by the player ID’s in the error message, and that’s the only game they faced each other.)
If so, Sheets was replaced in the middle of the at bat by Gallardo.
However, there shouldn’t be any hit information associated with that play since it was a strikeout.
I don’t remember having this problem with my data, but I’ll have to try re-parsing just this game to see if there is an issue with the code handling the mid at-bat pitching change.
March 6, 2008 at 3:46 pm
Looking at the inning_hit.xml file, I see what the hit information is–an error on a dropped foul popup. Kind of a fluke that it crossed paths with a mid at-bat pitching change.
Improperly accounting for mid at-bat pitching changes is a known bug in my code/database.
However, here it causes a different issue, which is that the hit information parser code believes that it can’t find the correct at bat to which it should assign the hit (in this case a dropped foul popup) because the result of the at bat is assigned to Gallardo and the error occurred while Sheets was still pitching.
If you want the script to run without dying due to this error, either comment out line 287
die “numrows=$numRows, no matching at bat found for hit $hit_batter – $hit_pitcher – $hit_inning.\n”;
or change it to print the error to the file:
print HITRECORD “numrows=$numRows, no matching at bat found for hit $hit_batter – $hit_pitcher – $hit_inning.\n”;
March 7, 2008 at 8:41 pm
It’s been months in the making, but I’ve finally written up how to automatically update the database nightly. The instructions to do so are embedded within this post on how to create a PITCHf/x database that can be linked to Retrosheet data.
March 24, 2008 at 10:18 pm
Hey Mike, thanks for all the info, it was very help while i was reading through the ORielly Baseball Hacks E-book.
Anyways, when i’m running your latest parser script, it seems to be starting at some point during july even though the first month i have is the 4th, with fine directory structure.
Anyways, i’m figuring its a variable error somewhere but i cant seem to pick it out. I’m not particularly great with perl, but something seems off.
I’m not sure if somewhere the 7th year is getting confused with the seventh month.
I’ve made a few debugging modifications in the code but cannot for the life of me find out why its starting at the 7th month.
March 26, 2008 at 9:50 pm
Kris,
The database parser should start with whatever the first month is. It doesn’t have a variable to set the date, other than just looking for all the month directories within the base year directory:
# Get the list of months from the base year directory
opendir MDIR, $basedir;
@monthdirs = readdir MDIR;
closedir MDIR;
foreach $mondir (@monthdirs) {
if ($mondir =~ /month/) {
opendir DDIR, “$basedir/$mondir”;
If you give me a copy of the output messages you’re getting from the script, I might be able to figure out more. Either post here or email me at Gmail address (mikefast).
March 29, 2008 at 3:45 pm
[…] year, Mike also provided his method for capturing the PITCHf/x data and storing it in a relational database. He details the steps […]
April 3, 2008 at 8:41 pm
Mike – Excellent work here. Really appreciate all the effort to pull together such a complete guide. Thanks again
April 8, 2008 at 3:41 am
[…] ones Mike Fast has in this article about Johnny Cueto’s first start. I’ve also read his tutorial on building a database for Pitch F/X data, and while my mind went numb almost immediately, it does sound like a cool idea, […]
April 12, 2008 at 10:51 pm
Thanks for this! Got a database up and running. =)
April 21, 2008 at 2:23 pm
Anybody know of any work done on getting the info into an MS SQL 2000 or 2005 db?
cheers
April 21, 2008 at 2:26 pm
Andrew, I have not heard of anyone doing that, but I would think it would be remarkably similar to putting it into MySQL. The only thing that would need to be changed in my Perl scripts is the connection string.
April 21, 2008 at 5:11 pm
Thanks for the swift response, Mr Fast
when I make the couple of amendments to the file I get an error message saying “probably
DBD:mssql perl module has not been installed.” I cannot locate on the Perl package Manager what exactly I should be adding. (I could not even see the exact mysql ones you mentioned either) Can you assist
April 21, 2008 at 8:11 pm
This link looks promising, although I have not tried any of it myself:
http://www.sommarskog.se/mssqlperl/index.html
Also, you can feel free to call me Mike.
April 22, 2008 at 9:43 pm
Hey Mike.
Thanks for all the great work.
I have already used hack_28_spider.pl and have all the xml data, but I am having trouble to get the xml-sql script to work. I know you said the only thing we really have to change is the connection string.
I am such a newbie that I can’t even locate where the connection string is and what parts we have to change. Could you explain maybe what line #’s of the script needs to be manually changed. I’ve been racking my brain for a day now trying to figure this out. Thanks!
April 23, 2008 at 9:06 am
Jason, do you have MySQL installed and have created a database already? I need more details about where you are at in the process in order to help you.
April 23, 2008 at 10:31 am
Mike,
I have MySql installed and already created the database. I have installed all of the packages that have been talked about in the previous comments. I have Baseball Hacks so I have been trying to learn through that also. Does perl and mysql have to be installed in the same folder?
Thanks
April 23, 2008 at 12:00 pm
Jason,
No, Perl and MySQL do not need to be in the same folder.
The database connection statement is this part near the very beginning of the parser script:
# MySQL database connection statement
use DBI;
$dbh = DBI->connect(“DBI:mysql:database=pbp;host=localhost”, ‘username’, ‘password’ )
or die $DBI::errstr;
You need to change the database name, the username, and the password to whatever you set those values for MySQL and your database.
In my example script, they are set as follows:
database=pbp
username
password
April 23, 2008 at 2:28 pm
Ok I finally got it to work. Basically I uninstalled everything and set it all back up. I think one of my problems was I was using perl 5.10, which other people have had problems with. So if anyone else needs to set this database up, don’t use that version.
Thank you so much for you help Mike. Your patience and diligence is very much appreciated.
Thank you again. If I do anything relevant, I’ll be sure to share.
April 23, 2008 at 9:37 pm
I’ve been trying like crazy to get this to work. I have all of the xml files, but can’t get the database script to work. I have php,mysql,perl, etc installed (bundled in apache2triad). I have installed/uninstalled/installed all of the modules discussed above, but still cant get it to dump to the database.
Maybe it’s something small I’m missing. What directory do I put the parser script in? Do I have the path set correctly? $basedir = “./MLBEXTENDED/games/year_$year”;
Any ideas would be appreciated.
Thanks
April 23, 2008 at 11:42 pm
Tyle,
What messages do you get back when you run the parser script from the command prompt or command line?
April 24, 2008 at 7:57 am
Hi Mike,
At first I was getting the same error as others:
install_driver(mysql) failed: can’t loacate DBD/mysql.pm in @INC (@INC………
Like I said, I uninstalled and reinstalled all of the modules.
Now it looks like it’s running (no messages at all), but nothing happens.
April 24, 2008 at 10:03 am
You want to run the parser from the directory that contains the XML game data, in your case, the directory that contains the MLBEXTENDED subdirectory.
If you still don’t get any output messages, try putting some debug print statements into the parser script. Something like this:
print “The script is at stage 1.\n”;
April 24, 2008 at 7:16 pm
Hi Mike,
It looks the XML::Simple module had a corrupted file. I replaced the Simple.pm file and it ran like a champ. Thanks!
April 24, 2008 at 8:22 pm
Sorry, Mike. One last question. After I ran the parser I checked the database and most of the fields in the pitches table were NULL (from start_speed on). Any ideas?
April 24, 2008 at 8:28 pm
The pitch-by-pitch data from 2007 has detailed PITCHf/x data for only about 40% of the pitches thrown during the year. Some parks didn’t get their system until after the All-Star break.
In 2008, all 30 parks have a PITCHf/x system and nearly every pitch has detailed data.
May 6, 2008 at 6:51 pm
Mike,
have you had a chance to update your parser to include pitch_type and type_confidence…or are you even interested in those fields?
Do you know of any place online that show some more info on using Gameday data to generate graphs…ie which fields are used in each type of graph? I’m still trying to understand all of the fields.
May 7, 2008 at 8:25 am
I am in the process of updating my parser to get the new fields as well as some other changes.
My PITCHf/x primer over at MVN is a good place to start.
http://mvn.com/mlb-stats/2008/01/14/a-pitchfx-primer/
May 28, 2008 at 12:41 am
Mike,
I’m having trouble with the parser. It gets stuck on the 08-02-2007 game between St. Louis and Pittsburgh, as the data from that game doesn’t include the final, walk-off single by Jose Castillo. Any ideas on how to fix or work around this?
The error message I’m getting during parsing is the following:
End tag mismatch (game_info != text_data) [Ln: 2, Col: 733]
Thanks!
shawn
May 28, 2008 at 7:22 am
By the way, I also had problems with the spider connecting to the MLB website, as the repeated hits on the same site made my router think I was part of a DNS attack. If others are having trouble with their connection to the MLB site, they can try logging into their router and temporarily disabling its DNS firewall.
May 30, 2008 at 12:41 am
I figured out the problem i posted in comment 92. The error wasn’t with either of Mike’s scripts, nor was it a problem with the data from MLB. It was a bug in the module my version of Perl was using to parse the XML. If you’re getting the same error I was getting, check out the following page for a description and a fix:
http://channel9.msdn.com/ShowPost.aspx?PostID=190722
If you’re having trouble with this particular bug, and can’t figure out the fix from that website, feel free to contact me by emailing the handle above without the “d” at gmail.
May 30, 2008 at 12:42 am
(By the way, the module that was giving me trouble was XML::SAX:PurePerl).
June 10, 2008 at 2:49 pm
When I try to run the parser script, I get this error:
dyld: lazy symbol binding failed: Symbol not found: _mysql_init
Referenced from: /Library/Perl/5.8.8/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
Expected in: dynamic lookup
dyld: Symbol not found: _mysql_init
Referenced from: /Library/Perl/5.8.8/darwin-thread-multi-2level/auto/DBD/mysql/mysql.bundle
Expected in: dynamic lookup
Trace/BPT trap
Does anyone know how to fix this? I’m new at this so I’m not too sure what to do. Thanks
June 15, 2008 at 8:59 am
If you want to add pitch_type field to your DB, see this.
This code is just a raw version of my xml parsing pearl file.
fix line 363 from end of line to,
foreach $pitch (@{$atbat->{pitch}}) {
# these fields are common to pitch-f/x and non-pfx data
$pitch_des = $dbh->quote($pitch->{des});
$pitch_id = $pitch->{id};
$pitch_type = $dbh->quote($pitch->{type});
$pitch_x = $pitch->{x};
$pitch_y = $pitch->{y};
$start_speed = $pitch->{start_speed};
# determine if the data for this pitch includes pitch-f/x fields
$pitchfx = 0;
if (0 {end_speed};
$sz_top = $pitch->{sz_top};
$sz_bot = $pitch->{sz_bot};
$pfx_x = $pitch->{pfx_x};
$pfx_z = $pitch->{pfx_z};
$px = $pitch->{px};
$pz = $pitch->{pz};
$x0 = $pitch->{x0};
$y0 = $pitch->{y0};
$z0 = $pitch->{z0};
$vx0 = $pitch->{vx0};
$vy0 = $pitch->{vy0};
$vz0 = $pitch->{vz0};
$ax = $pitch->{ax};
$ay = $pitch->{ay};
$az = $pitch->{az};
$break_y = $pitch->{break_y};
$break_angle = $pitch->{break_angle};
$break_length = $pitch->{break_length};
$p_type = $dbh->quote($pitch->{pitch_type});
}
$no_duplicate_query = ‘SELECT pitch_id FROM pitches WHERE (ab_id = ‘ . $select_ab_id
. ‘ AND id = ‘ . $pitch_id . ‘)’;
$sth= $dbh->prepare($no_duplicate_query) or die $DBI::errstr;
$sth->execute();
my $numRows = $sth->rows;
$sth->finish();
if ($numRows) {
# don’t insert duplicate pitch entry into pitches table
print “$select_ab_id, $pitch_id: That’s a duplicate pitch to one in the database already.\n”;
} else {
# insert a new record in the database for this pitch
if ($pitchfx) {
$pitch_query = ‘INSERT INTO pitches (ab_id, des, type, id, x, y, start_speed,’
. ‘ end_speed, sz_top, sz_bot, pfx_x, pfx_z, px, pz, x0, y0, z0, vx0, vy0,’
. ‘ vz0, ax, ay, az, break_y, break_angle, break_length, pitch_type) VALUES (‘
. join(‘, ‘, ($select_ab_id, $pitch_des, $pitch_type, $pitch_id, $pitch_x,
$pitch_y, $start_speed, $end_speed, $sz_top, $sz_bot, $pfx_x, $pfx_z, $px,
$pz, $x0, $y0, $z0, $vx0, $vy0, $vz0, $ax, $ay, $az, $break_y,
$break_angle, $break_length, $p_type)) . ‘)’;
} else {
$pitch_query = ‘INSERT INTO pitches (ab_id, des, type, id, x, y) VALUES (‘
. join(‘, ‘, ($select_ab_id, $pitch_des, $pitch_type, $pitch_id, $pitch_x,
$pitch_y)) . ‘)’;
}
# print “SQL: $pitch_query\n”;
$sth= $dbh->prepare($pitch_query) or die $DBI::errstr;
$sth->execute();
}
}
}
like this.
And before using this, add a field named pitch_type in pitches table.
July 15, 2008 at 9:16 pm
[…] I downloaded MLB’s pitch f/x data, and put it all in a database as described in Mike Fast’s excellent post. The data is hit and miss for the 2007 season, but every pitch for every game is recorded for 2008. […]
July 17, 2008 at 6:18 pm
Ok, I’ve got the MySQL/Perl/Phpmyadmin, and the spider program worked wonderfully.
Here’s my problem. The parser script works very fast at the start, which was doing around 1 game per 15 seconds. But, after it gets around 800 games, the parser slows down to about 4 minutes per game.
Is this usual? Does the size of the database make that much of a difference?
July 18, 2008 at 8:52 am
Francois, that is normal. The size of the database does make that much of a difference, and it will get even slower with even more games.
It probably would have made more sense to make the database use MyISAM tables rather than InnoDB since MyISAM is supposed to be faster. I believe there is a way to convert from InnoDB to MyISAM, but I haven’t personally tried that yet.
July 18, 2008 at 9:00 am
Francois, another thing that has been suggested to me to improve the update speed is to disable keys, then do the update, then enable keys again.
http://dev.mysql.com/doc/refman/5.0/en/alter-table.html
“ALTER TABLE … DISABLE KEYS tells MySQL to stop updating non-unique indexes. ALTER TABLE … ENABLE KEYS then should be used to re-create missing indexes. MySQL does this with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup.”
Again, I haven’t tried this but it’s probably a worthwhile thing to do and a better first step than converting the table type to MyISAM.
July 18, 2008 at 10:11 am
Thanks.
I had actually tried converting to MyISAM after I posted that, and it is considerably faster.
I’ve thought about trying the keys, but I haven’t tried that yet.
To be honest, I think the fastest way to get the data would do it one month at a time, and then move it to a different database. (If that makes sense.) The problem with that is the game_id would reset to 1, but who says the game_id should be 1,2,3,4…. Why not use the game_id that mlb uses, such as “gid_2008_06_09_tbamlb_anamlb_1” You could do the pitch/AB data like “gid_2008_06_09_tbamlb_anamlb_1_pitch_100” or “gid_2008_06_09_tbamlb_anamlb_1_ab_15”
I don’t know, it’s an idea.
July 19, 2008 at 4:01 pm
When I run the .pl spider script, I get the following error:
501 Errno architecture (MSWin32-x86-multi-thread-4.0) does not match executable architecture (MSWin32-x86-multi-thread-5.00)
I am thinking that my version of perl doesn’t match the version that the script is written for – but I’ve got no idea how to address it. Any ideas? I’m using XAMPP 1.6.7 on a Windows XP machine with the Perl 5.10.0-2.2.9 add-on.
Thanks!
July 27, 2008 at 4:32 pm
Hi Mike,
Thanks so much for these instructions, this is a fantastic resource.
I was wondering if you might be to help me out w/ an early-stage problem.
I’ve used xampp to install mysql/phpmyadmin/apache. I downloaded activeperl version 5.8 (I don’t think 5.6 works with Vista). The downloading script works fine, but when I try to run the parser script I get this error message:
DBD::mysql::st execute failed: You have an error in your SQL syntax; check the m
anual that corresponds to your MySQL server version for the right syntax to use
near ”)’ at line 1 at hack_28_parser_mikefast_test.pl line 106.
DBD::mysql::st execute failed: You have an error in your SQL syntax; check the m
anual that corresponds to your MySQL server version for the right syntax to use
near ”)’ at line 1 at hack_28_parser_mikefast_test.pl line 149.
duplicate game entry in database or game not found.
Any idea how to fix this? Thanks very much.
August 2, 2008 at 10:01 pm
Hi,
Whenever I run the hack_28_spider.pl it dies after a couple of days worth of games. I can never get it to go longer than two or three days.
Thanks
August 3, 2008 at 11:08 am
Hi Mike,
Great post, and many thanks for sharing your code and responding to comments. I just populated my mySQL database with over 500k pitches with your scripts. It took about a week of hacking, getting this to work on OS X Leopard. Really the installations should be simple. You just need (as you say) DBI and DBD:mysql. I was having trouble with make (I needed to install XCode) and sudo (my password was blank, but this screws up with sudo, so I needed to change my password to something). Oh, and I was using MAMP for mySQL, and there are some additional files needed for that (just Google). But beyond that, it was pretty easy. Moving on…two questions:
1. Should the game_types and pitch_types tables be populated? Mine aren’t.
2. Did you ever find a solution for the issue raised above by Dan and John re: pitching changes. The error stops the parser, so I just removed that last else statement lines 298-300. I assume this means I’m not storing pitches after a pitching change.
August 8, 2008 at 12:57 am
I’m having a problem with the parser. First I don’t know if I”m supposed to run it through the phpmyadmin or running it through perl. If I do it with php i get an error DBI doesn’t exist message, if i run it with activeperl nothing happens, the command window pops up and just disappears.
Any help would be greatly appreciated, thanks.
August 8, 2008 at 5:40 pm
fixed it, had the wrong base directory.
Thanks so much for this info Mike I’ll be sure to post back when I get it up and fully running.
December 20, 2008 at 2:19 pm
First of all, a huge thank you to everyone that has contributed to this — this is an incredible resource!
I’m using the spider script to get 2008 PBP data, but am encountering an intermittent “500 Internal Server Error”. Sometimes it will run for 10+ minutes, and sometimes for only a minute or less. Any ideas as to what the issue is and if I’m possibly missing data because of the error? Thanks in advance!
December 24, 2008 at 10:04 am
Mr. Fast,
I’m having some issues downloading the data. I want the data to download into a separate directory that I created, but it defaults into windows/system32. Is that okay, or will that be a problem?
Thanks
January 7, 2009 at 1:42 pm
David, I open a command window, cd to the directory where I want the data to download, and then execute the spider.pl script from there.
January 9, 2009 at 12:19 pm
Is there a way to download JUST game.xml data. I used the old spider and have all the other data and do not really wanna run everything again.
Also if this isnt possible, is there someway I could modify the update-db-count.pl so that I could use it with the old database structure?
January 9, 2009 at 7:06 pm
nm I figured out how to get the just game.xml files but now im running into another problem. When I run the parser im getting:
Unknown column “sv_id” in “field list” line 440.
Sometimes I get unknown column “on_1b” in “field list” line 440.
Also it seems like my pitches table is completely empty.
January 19, 2009 at 10:26 pm
Hey Mike,
Any advice on how to keep the spider from timing out? I keep running into that error, getting no more than a day’s worth of data.
April 19, 2009 at 1:59 pm
[…] have imported all the pitch f x data from mlb.com for 2007, and 2008 using Fastballs wordpress site. It is available in a mysql export. Please let me me know if you have any questions, or issues […]
August 19, 2009 at 7:23 am
[…] most comprehensive tutorial I found on the matter was written by Mike Fast. However, he wrote it just under 2 years ago, and some of the information may be outdated. […]
August 21, 2009 at 11:02 am
I have downloaded and appropriately modified your database structure and your parser script. I keep getting the following error when I attempt to parse the XML data:
syntax error at line 1, column 54, byte 54 at C:/Perl/lib/XML/Parser.pm line 187
Any thoughts?
November 11, 2011 at 4:50 pm
Aaron
— me too; did you ever get a resolution?
September 29, 2009 at 3:36 pm
Dear everyone on this site who has successfully managed to get this database up and running. I’m an analyst who lacks the programming skills (after hours, and hours, and hours of trying to figure it out) to get this working.
Would anyone be willing to take mercy on me and either provide the past data that is available (if you’re in NY, I can come to you to pick it up with a large external hard drive), or even show me how to get this sucker working on my computer? I would love to compensate you for your time.
I can be reached by email: mollerw(at)gmail.com. Please, please, please be in touch!
Will
October 15, 2009 at 4:07 pm
[…] to know how to make your own pitch f/x database, here are a couple links that should help you out (#1 and #2). And for the rest you lazy and/or overworked folks, here is the first part of the […]
December 5, 2009 at 10:14 pm
[…] Mike Fast’s Method: A well done step by step plan on how to create a database using MySQL and Perl. […]
December 13, 2009 at 7:49 am
[…] Stick is a newer Python and MySQL based system you can use to create your own personal database. Mike Fast has also written a thorough step-by-step method to create your database based off the XML serialization of MLB’s gameday data and a group of […]
March 4, 2010 at 4:55 pm
What would one have to change in the XML-to-DB parser script to utilize an Microsoft SQL Server connection instead of a MySQL connection?
Thanks!
April 2, 2010 at 10:13 pm
For those who are familiar with the Ruby language, I developed a library in Ruby for working with the Gameday data. You can use the Gameday API for getting and processing any Gameday data. The library is open source and freely available from: http://gameday.timothyfisher.com
May 1, 2010 at 4:50 pm
Help
When I add this line (or replace the boxscore function with it) – the perl script breaks – based on color coding there’s some extra text in the first line espcially
if($gamehtml =~ m/get($gameurl);
if($gamehtml =~ m/get($gameurl);
die “Couldn’t get $gameurl: “, $response->status_line, “\n”
unless $response->is_success;
$infohtml = $response->content;
open GAME, “>$gamedir/game.xml”
or die “could not open file $gamedir/game.xml: $|\n”;
print GAME $infohtml;
close GAME;
} else {
print “warning: no xml game file for $game\n”;
}
May 6, 2010 at 2:06 pm
If anyone is interested, I’ve written a Windows program to download, parse and store PitchFX data into an Access or SQL database. You can find it here: http://www.operationsports.com/Bahnzo/blog/7738-make-your-own-pitchfx-database/
February 24, 2012 at 8:41 am
I put together a database for 2007 through 2011 if people have trouble getting this to work and just want to jump right into the analysis
http://pitchyfx.blogspot.com/2012/02/downloadable-pitchfx.html
June 27, 2012 at 8:49 am
Is anyone else having trouble with the latest version of ActivePerl? I am getting an error that it can’t find Switch.pm, which apparently has been deprecated since this was written. I’m not super knowledgeable in Perl so I’m having trouble rewriting… suggestions?
September 4, 2012 at 12:10 pm
go to http://search.cpan.org/ and search for Switch. Download it and put the Switch.pm file in the your Perl lib folder (e.g. C:\Perl\lib). Had to do the same thing myself …
August 6, 2012 at 7:29 pm
Anyone still loading this data into MYSQL databases?
January 10, 2013 at 1:48 pm
[…] League Baseball PITCHf/x data has been available for years. However, even the <a href=”https://fastballs.wordpress.com/2007/08/23/how-to-build-a-pitch-database/”>best methods</a> are hard to implement if you’re not a web developer/computer geek. It […]
August 30, 2013 at 4:05 pm
THIS IS SO AWESOME.
Can’t wait to get my hands on this data and run with it! I’ve also made some tweaks to the code which hopefully I will get around to posting online some day. Thanks so much!
September 19, 2013 at 10:34 am
[…] thanks to MLB for making the pitch f/x data freely available (seriously, how awesome is that?), Mike Fast for teaching me how to make a pitch f/x database, and Brooks Baseball for making the images in this post. Also, thanks to you for reading this […]
May 2, 2016 at 7:09 pm
I’m having trouble accessing the files :(. How do I sign up?
May 2, 2016 at 11:08 pm
OK, found the files… Sorry about that. Hadn’t read down all the way to the later edits.
Is this still an active blog? Had a few questions.
March 7, 2017 at 7:25 am
[…] There are tutorials all around the internet on how to scrape data — for example, here’s one authored by Houston Astros employee Mike Fast. The gist is this: with a little technical training […]
March 7, 2017 at 7:32 am
[…] There are tutorials all around the internet on how to scrape data — for example, here’s one authored by Houston Astros employee Mike Fast. The gist is this: with a little technical training […]
March 7, 2017 at 7:50 am
[…] There are tutorials all around the internet on how to scrape data — for example, here’s one authored by Houston Astros employee Mike Fast. The gist is this: with a little technical training […]
March 7, 2017 at 8:06 am
[…] There are tutorials all around the internet on how to scrape data — for example, here’s one authored by Houston Astros employee Mike Fast. The gist is this: with a little technical training […]
March 7, 2017 at 9:00 am
[…] There are tutorials all around the internet on how to scrape data — for example, here’s one authored by Houston Astros employee Mike Fast. The gist is this: with a little technical training […]
December 5, 2018 at 3:55 am
[…] describe how to get the data for your own research: • webusers.npl.uiuc.edu • fastballs.wordpress.com Or, of course, you can send your questions to dsandora@tribweb.com and we’ll do the work for […]
April 3, 2021 at 3:38 am
[…] original PITCHf/x data was stored in public files, which early pioneers of pitch analysis quickly learned to use. Bloggers and hobbyists did such good work that teams began hiring them; Mike Fast, Josh Kalk and […]