Let me start by saying that I am very grateful to Sportvision and MLBAM for making the PITCHf/x and Gameday data freely available. It’s a great tool for research and for fans of baseball in general. It’s amazing that this data can be recorded and modeled so accurately.
The PITCHf/x research community has exploded in 2008, and many eyes are poring over the data. I think it would be great if we could work together to document any errors we find. The purpose is twofold. The first is to help MLBAM and Sportvision improve their already amazing product. While the talented cat is playing the piano, he might miss a note here or there, and perhaps he could play better next time. 🙂
The first purpose could be accomplished just as easily by communicating privately with the MLBAM and Sportvision folks as by posting the errors here. They have a long track record of being responsive and helpful. However, communicating privately does nothing to help researchers who downloaded the bad data the day after the game and are still using it for their research and analysis. So the second purpose is to track the PITCHf/x and Gameday XML data problems publicly here, so that other researchers can consult this list and correct their data.
Please, if you’re aware of errors in the data that are not listed here, even errors that have been subsequently corrected on the Gameday website, drop me a note or post in the comments. At this point I am not tracking pitches that are missing PITCHf/x data unless the data are missing for more than a batter or two.
Most recent games are listed first.
Colorado at Chicago White Sox, June 13, 2008, 9th inning
The final at bat by Chris Iannetta is missing from the inning_9.xml file. You can find the missing data in the pbp/batters/455104.xml file for Iannetta.
Houston at Texas, May 17, 2008, 4th inning
The 3-1 pitch to Ramon Vazquez (sv_id=080517_193837) has incorrect PITCHf/x data. Video shows the pitch as a fastball clocked at 92 mph by the radar gun. The PITCHf/x data has a start_speed of 60.8 mph and an unrealistic x0 value of 5.232 feet.
Houston at San Francisco, May 12, 2008, 8th inning
The first three pitches to Fred Lewis are extraneous. They are a duplicate of the last three pitches to the previous batter (starting at sv_id=080512_213850).
Anaheim at Boston, April 23, 2008, 2nd inning
The first 26 pitches to Erick Aybar are extraneous. They are a duplicate of all the pitches thrown in the first inning (correct data for Aybar starts at sv_id=080423_192149).
Minnesota at Chicago White Sox, April 7, 2008, 3rd inning
There are four strikes to Paul Konerko. The fourth pitch appears to be extraneous (pitch id=225).
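Bad tracking values like the ones in the Houston at Texas entry above can often be caught with a simple range check. The following is a minimal Python sketch, not a definitive test. It assumes the inning file nests pitch elements that carry the start_speed, x0, and sv_id attributes, and both thresholds are hypothetical values chosen for illustration. A slow curveball or a sidearm release could trip them legitimately, so treat each flagged pitch as a candidate to check against video.

```python
import xml.etree.ElementTree as ET

# Hypothetical plausibility limits, not official values; tune them to your data.
MIN_SPEED_MPH = 65.0   # assumed lower bound on start_speed
MAX_ABS_X0_FT = 4.5    # assumed limit on horizontal release position

def suspicious_pitches(inning_xml):
    """Return (sv_id, start_speed, x0) for pitches outside the assumed limits."""
    root = ET.fromstring(inning_xml)
    flagged = []
    for pitch in root.iter("pitch"):
        speed = pitch.get("start_speed")
        x0 = pitch.get("x0")
        if speed is None or x0 is None:
            continue  # no PITCHf/x data recorded for this pitch
        speed, x0 = float(speed), float(x0)
        if speed < MIN_SPEED_MPH or abs(x0) > MAX_ABS_X0_FT:
            flagged.append((pitch.get("sv_id"), speed, x0))
    return flagged
```

With these limits, a pitch carrying start_speed=60.8 and x0=5.232 is flagged on both conditions.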
2007 Regular Season
Prior to the 2007 postseason, PITCHf/x data are missing for pitches which hit the batter.
Baltimore at Tampa Bay, September 4, 2007, 9th inning
The first three pitches to B.J. Upton are extraneous. They are a duplicate of the last three pitches to the previous batter (starting at pitch id=647). (h/t Matthew Carruth)
Houston at Milwaukee, September 3, 2007, 9th inning
The first two pitches to Bill Hall are extraneous. They are a duplicate of the last two pitches to the previous batter (starting at pitch id=796). (h/t Matthew Carruth)
Arizona at Florida, August 14, 2007, 9th inning
There is an extra pitch recorded to Mark Reynolds (a foul, three strikes, and a ball in play). I will see if I can determine which is the incorrect pitch (at bat starts at pitch id=735). (h/t Matthew Carruth)
LA Dodgers at Toronto, June 20, 2007, 6th inning
The first five pitches to Frank Thomas are extraneous. They are a duplicate of the last five pitches to the previous batter (starting at pitch id=409). (h/t Matthew Carruth)
Houston at Arizona, May 25, 2007, 1st inning
There are four balls to batter Conor Jackson. The fourth ball appears to be an extraneous pitch (pitch id=65). (h/t Matthew Carruth)
Texas at Anaheim, May 12, 2007, 1st inning
The fourth ball to Vladimir Guerrero should be labeled intentional, and the fifth ball is an extraneous pitch (pitch id=22). (h/t Matthew Carruth)
Oakland at Seattle, April 4, 2007, 9th inning
The first four pitches to Mark Ellis are extraneous. They are a duplicate of the last four pitches to the previous batter (starting at pitch id=556). (h/t Matthew Carruth)
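Most of the entries above are pitches duplicated from an earlier batter, and those can be searched for in downloaded data. Here is a rough Python sketch under several assumptions: that pitch elements are nested inside atbat elements, that atbat carries a batter attribute, and that a duplicated record repeats the original tracking values exactly. The IGNORED set and the function names are my own hypothetical choices, not part of the Gameday format.

```python
import xml.etree.ElementTree as ET

# Attributes left out of the comparison (assumed to be bookkeeping, not tracking data).
IGNORED = {"id", "des", "type"}

def fingerprint(pitch):
    """Build a comparable key from a pitch's attributes, or None if it has no tracking data."""
    if pitch.get("start_speed") is None:
        return None  # no PITCHf/x data, so there is nothing to compare
    return tuple(sorted((k, v) for k, v in pitch.attrib.items() if k not in IGNORED))

def find_duplicates(inning_xmls):
    """Yield (batter, pitch id) for each pitch whose data repeats an earlier pitch."""
    seen = set()
    for text in inning_xmls:  # inning files, passed in game order
        root = ET.fromstring(text)
        for atbat in root.iter("atbat"):
            for pitch in atbat.iter("pitch"):
                key = fingerprint(pitch)
                if key is None:
                    continue
                if key in seen:
                    yield atbat.get("batter"), pitch.get("id")
                seen.add(key)
```

Passing all of a game's inning files in order matters, because a duplicate can come from a previous inning rather than from the previous batter, as in the Anaheim at Boston entry.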
Note: the following errors have been corrected by MLB in the data currently available on the Gameday site; however, check your downloaded data to see if it contains the fixes.
LA Dodgers at Florida, April 29, 2008, 5th inning
The original XML file listed five pitches for batter Doug Waechter (starting with sv_id=080429_203308). There should only be four pitches (the first called strike was extra), and each set of PITCHf/x data belongs to the following pitch. MLB has corrected this error in the XML file currently on the Gameday site.
NY Mets at Washington, September 19, 2007, 1st inning
The original XML file listed eight pitches for batter Luis Castillo (starting with pitch id=4). There should only be two pitches (the first six pitches were duplicated from the previous batter). MLB has corrected this error in the XML file currently on the Gameday site.
Note: the following are correct in the Gameday data, but game advisories note that the umpire lost track of the ball-strike count.
Cleveland at Cincinnati, May 16, 2008, 4th inning
David Dellucci saw four balls and two strikes before putting the ball in play. (h/t Matthew Carruth)
Baltimore at Oakland, May 5, 2008, 2nd inning
Jack Cust was called out on strikes after one ball and four strikes. (h/t Matthew Carruth)
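Impossible counts, whether from an extraneous pitch or from an umpire losing track, can be found by walking the count through each at bat. This is a minimal Python sketch, assuming each pitch element has a type attribute ("B" for ball, "S" for strike, "X" for in play) and a des attribute, and that an ordinary foul is described by one of the strings in PLAIN_FOULS. Those description strings and the function name are assumptions on my part, so check them against your own files.

```python
import xml.etree.ElementTree as ET

# Assumed descriptions of fouls that cannot be strike three (foul tips and bunts can).
PLAIN_FOULS = {"Foul", "Foul (Runner Going)"}

def count_anomalies(inning_xml):
    """Return (batter, pitch id) for pitches thrown after the count was already full."""
    root = ET.fromstring(inning_xml)
    flagged = []
    for atbat in root.iter("atbat"):
        balls = strikes = 0
        for pitch in atbat.iter("pitch"):
            if balls >= 4 or strikes >= 3:
                # A walk or strikeout should already have ended this at bat.
                flagged.append((atbat.get("batter"), pitch.get("id")))
            kind = pitch.get("type")
            des = pitch.get("des", "")
            if kind == "B":
                balls += 1
            elif kind == "S":
                # A plain foul with two strikes leaves the count unchanged.
                if not (des in PLAIN_FOULS and strikes == 2):
                    strikes += 1
    return flagged
```

The flagged pitch is the first one thrown after the count filled, which is not necessarily the extraneous record, and the check cannot tell a data error from an umpire's miscount. Either way, the at bat still needs to be reviewed by hand.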