|
APBRmetrics The statistical revolution will not be televised.
|
mateo82
Joined: 06 Aug 2005 Posts: 211
|
Posted: Mon May 21, 2007 11:14 pm Post subject: |
|
|
I tried it on my home computer using ocrad and gocr, and the output was indeed too noisy. I have Adobe Professional at work, and it has a fairly good OCR if I recall; I'll try it there tomorrow. |
|
|
|
HeatherA
Joined: 03 Aug 2006 Posts: 55
|
Posted: Tue May 22, 2007 10:51 am Post subject: |
|
|
PaulG and I would both be happy to contribute our time to this effort if you decide to launch it. |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Tue May 22, 2007 11:32 am Post subject: |
|
|
More notes: the PoR website caps how many PDF downloads one user can make in a 24-hour period. This means you should try to go straight to the page with the box scores rather than scrolling through an entire issue of Sporting News. Unfortunately, TSN doesn't have a table of contents, so there's a lot of flipping through pages to find the boxes.
Here are the 3 steps for inputting the box scores:
1. Track down the issue/page number of the boxscores in TSN.
2. Download that particular page.
3. Input the data into some spreadsheet.
#1 is a pain. If someone can find a search term that narrows down the hits, that would be good. Because the PDF sources are large (~500 KB), hunting for the right pages is a challenge. What I propose is to split up the effort a little: I am in the process of downloading onto my hard drive the pages with the boxscores from each issue. Later, I or someone else can input the data from these pages -- I can email them to anyone who wants to contribute an hour or two.
But I think that because of the download restrictions, anyone who downloads a TSN page with boxscores should take care to save a copy to their HD (I've been saving each file with an issue date-page number format to keep track). _________________ ed |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 11:47 am Post subject: |
|
|
Ed Küpfer wrote: | Unfortunately, TSN doesn't have a table of contents, so there's a lot of flipping through pages to find the boxes. |
There's usually a table of contents somewhere in the first 4-5 pages of each issue (at least for the ones I have seen).
Ed Küpfer wrote: | #1 is a pain. If someone can find a search term that narrows down the hits, that would be good. Because the PDF sources are large (~500 KB), hunting for the right pages is a challenge. What I propose is to split up the effort a little: I am in the process of downloading onto my hard drive the pages with the boxscores from each issue. Later, I or someone else can input the data from these pages -- I can email them to anyone who wants to contribute an hour or two. |
First, what season are you working on? I just gave that 1979-80 box score as an example, not intending for that to be our test season. I actually started entering the 1969-70 NBA season yesterday and I am about 20 games in. I chose this as a test season for two reasons: (1) fewer teams, which means fewer games to enter and (2) the ABA box scores are also there, so I'll have all the files I need when I want to enter those.
My process for finding the box scores has been to set the date to the issue date and leave the search field blank. The first result should be the first page of that issue. I then look through the first few pages for the table of contents, and that helps me find the box scores.
Ed Küpfer wrote: | But I think that because of the download restrictions, anyone who downloads a TSN page with boxscores should take care to save a copy to their HD (I've been saving each file with an issue date-page number format to keep track). |
Yes, saving copies of the PDFs is a great idea. Ed, can you give me an example of a file name that you are using? I just want to understand exactly how you are naming the files. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Tue May 22, 2007 11:59 am Post subject: |
|
|
jkubatko wrote: | First, what season are you working on? I just gave that 1979-80 box score as an example, not intending for that to be our test season. I actually started entering the 1969-70 NBA season yesterday and I am about 20 games in. I chose this as a test season for two reasons: (1) fewer teams, which means fewer games to enter and (2) the ABA box scores are also there, so I'll have all the files I need when I want to enter those. |
I'm open. I spent the day yesterday exploring the possibilities of the PoR site. I can see that navigating it will be an obstacle to many people, which is why I thought that having some people dedicated to finding and saving PDFs means the people who just want to enter data can do it with less fuss.
Quote: | Ed, can you give me an example of a file name that you are using? I just want to understand exactly how you are naming the files. |
yyyy-mm-dd.PageNum.pdf
1979-10-27.57.pdf = Oct 27, 1979, page 57. _________________ ed |
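For anyone scripting around the saved files, the naming scheme above can be sketched in Python as below. This is only an illustration: the helper names `make_name` and `parse_name` are made up for this example and are not part of anyone's actual tooling.

```python
import re
from datetime import date

# Matches the "yyyy-mm-dd.PageNum.pdf" scheme, e.g. 1979-10-27.57.pdf
_NAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.(\d+)\.pdf$")


def make_name(issue_date: date, page: int) -> str:
    """Build a file name from an issue date and a page number."""
    return f"{issue_date.isoformat()}.{page}.pdf"


def parse_name(name: str) -> tuple[date, int]:
    """Split a file name back into (issue date, page number)."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not in yyyy-mm-dd.PageNum.pdf form: {name!r}")
    year, month, day, page = (int(g) for g in m.groups())
    return date(year, month, day), page


print(make_name(date(1979, 10, 27), 57))
print(parse_name("1979-10-27.57.pdf"))
```

One thing to keep in mind: because the date comes first in year-month-day order, a plain alphabetical sort groups files by issue in date order, but page numbers are not zero-padded, so within one issue it is safer to sort on the parsed page number than on the raw name.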
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 12:02 pm Post subject: |
|
|
Right now we have Ed, Ryan, Heather, Paul, Mateo (yes?), and myself as people who are interested in trying this out on a test season. In the 1969-70 season there were 7*82 = 574 regular season games, so that would mean about 100 games each. It takes me about 3 minutes to enter a game by hand, so that's about 5-6 hours of work. Here is the first game in my 1969-70 file:
Code: |
"date","lgID","teamID","oppID","name","FG","FT","FTA","DQ"
10/14/1969,"NBA","SEA","NYK","Allen",3,0,0,
10/14/1969,"NBA","SEA","NYK","Boozer",3,4,4,
10/14/1969,"NBA","SEA","NYK","Clemens",2,2,2,
10/14/1969,"NBA","SEA","NYK","Harris",4,1,1,
10/14/1969,"NBA","SEA","NYK","Meschery",0,2,3,
10/14/1969,"NBA","SEA","NYK","Mueller",1,0,2,
10/14/1969,"NBA","SEA","NYK","Murrey",0,0,0,
10/14/1969,"NBA","SEA","NYK","Rule",11,5,7,
10/14/1969,"NBA","SEA","NYK","Thorn",1,1,2,
10/14/1969,"NBA","SEA","NYK","Tresvant",4,10,12,
10/14/1969,"NBA","SEA","NYK","Wilkens",3,4,7,
10/14/1969,"NBA","SEA","NYK","Winfield",3,2,2,
10/14/1969,"NBA","NYK","SEA","Barnett",10,2,4,
10/14/1969,"NBA","NYK","SEA","Bowman",0,0,0,
10/14/1969,"NBA","NYK","SEA","Bradley",5,2,2,
10/14/1969,"NBA","NYK","SEA","DeBusschere",6,1,3,
10/14/1969,"NBA","NYK","SEA","Frazier",5,6,9,
10/14/1969,"NBA","NYK","SEA","Hosket",0,0,0,
10/14/1969,"NBA","NYK","SEA","May",0,3,3,
10/14/1969,"NBA","NYK","SEA","Reed",14,0,0,
10/14/1969,"NBA","NYK","SEA","Riordan",5,2,2,
10/14/1969,"NBA","NYK","SEA","Russell",4,0,0,
10/14/1969,"NBA","NYK","SEA","Stallworth",4,0,0,
10/14/1969,"NBA","NYK","SEA","Warren",1,2,2,
|
No need to include fields for rebounds, etc., at the moment because they are not listed in the box scores; at the end I'll just add them in as null fields. After we get the season done, I'll put all of the files together and then match things to my database. That way we can link the names to my player ID system. I'll also do various QC checks, like making sure that player points equal team points for all games.
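The "player points equal team points" check could be sketched roughly as follows. This is a minimal example and not the actual QC code: it assumes every field goal is worth 2 points (reasonable for the 1969-70 NBA, but ABA files would need three-pointers tracked separately), and the `reported` dictionary of printed team totals is a hypothetical input, since the file format above has no team-total column.

```python
import csv
import io
from collections import defaultdict

# The 10/14/1969 SEA-NYK game, in the column layout shown in the post.
GAME = '''"date","lgID","teamID","oppID","name","FG","FT","FTA","DQ"
10/14/1969,"NBA","SEA","NYK","Allen",3,0,0,
10/14/1969,"NBA","SEA","NYK","Boozer",3,4,4,
10/14/1969,"NBA","SEA","NYK","Clemens",2,2,2,
10/14/1969,"NBA","SEA","NYK","Harris",4,1,1,
10/14/1969,"NBA","SEA","NYK","Meschery",0,2,3,
10/14/1969,"NBA","SEA","NYK","Mueller",1,0,2,
10/14/1969,"NBA","SEA","NYK","Murrey",0,0,0,
10/14/1969,"NBA","SEA","NYK","Rule",11,5,7,
10/14/1969,"NBA","SEA","NYK","Thorn",1,1,2,
10/14/1969,"NBA","SEA","NYK","Tresvant",4,10,12,
10/14/1969,"NBA","SEA","NYK","Wilkens",3,4,7,
10/14/1969,"NBA","SEA","NYK","Winfield",3,2,2,
10/14/1969,"NBA","NYK","SEA","Barnett",10,2,4,
10/14/1969,"NBA","NYK","SEA","Bowman",0,0,0,
10/14/1969,"NBA","NYK","SEA","Bradley",5,2,2,
10/14/1969,"NBA","NYK","SEA","DeBusschere",6,1,3,
10/14/1969,"NBA","NYK","SEA","Frazier",5,6,9,
10/14/1969,"NBA","NYK","SEA","Hosket",0,0,0,
10/14/1969,"NBA","NYK","SEA","May",0,3,3,
10/14/1969,"NBA","NYK","SEA","Reed",14,0,0,
10/14/1969,"NBA","NYK","SEA","Riordan",5,2,2,
10/14/1969,"NBA","NYK","SEA","Russell",4,0,0,
10/14/1969,"NBA","NYK","SEA","Stallworth",4,0,0,
10/14/1969,"NBA","NYK","SEA","Warren",1,2,2,
'''


def team_points(csv_text):
    """Sum player points per (date, teamID, oppID): 2 per FG plus 1 per FT."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["teamID"], row["oppID"])
        totals[key] += 2 * int(row["FG"]) + int(row["FT"])
    return dict(totals)


def mismatches(computed, reported):
    """Return the keys whose computed total differs from the reported total."""
    return sorted(k for k, pts in reported.items() if computed.get(k) != pts)


totals = team_points(GAME)
for key, pts in sorted(totals.items()):
    print(key, pts)
```

Run on the game above, the player rows add up to 101 for SEA and 126 for NYK. Passing the printed team totals to `mismatches` would then flag any game where a typo crept in during entry.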
Should we start to split up the issues so we can get to work, or are there other things we need to discuss? _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
|
|
|
ziller
Joined: 30 Jun 2005 Posts: 126 Location: Sac Metro
|
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 12:05 pm Post subject: |
|
|
Ed Küpfer wrote: | yyyy-mm-dd.PageNum.pdf
1979-10-27.57.pdf = Oct 27, 1979, page 57. |
Perfect. Let's use that naming scheme. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 12:11 pm Post subject: |
|
|
FYI, please don't start grabbing PDF files of the box scores yet, as we don't want people to be doing the same work. That would just be a waste of time and of PoR's bandwidth.
So we'll do the 1969-70 season. Items to address:
1) Get volunteers. If you're not sure you can do it for at least one test season, then don't volunteer. Let's wait another day or two and see if anyone else is interested.
2) Divvy up the issues. The 1969-70 issues have both NBA and ABA box scores, so we might as well grab the pages that have the ABA box scores, even though we'll just start with the NBA season. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 12:32 pm Post subject: |
|
|
Update: I just grabbed all of the NBA and ABA box score pages from the 1969 part of the 1969-70 season. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
|
|
|
mateo82
Joined: 06 Aug 2005 Posts: 211
|
Posted: Tue May 22, 2007 2:14 pm Post subject: |
|
|
Yes, I'm in.
I'm not sure I understand how you want this formatted though. I'm assuming B-R uses a MySQL database, right? So, do you want just one big file with each line representing one player's stat line for a particular night, or do you want one file for each day or one file for each game? |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Tue May 22, 2007 2:20 pm Post subject: |
|
|
mateo82 wrote: | Yes, I'm in.
I'm not sure I understand how you want this formatted though. I'm assuming B-R uses a MySQL database, right? So, do you want just one big file with each line representing one player's stat line for a particular night, or do you want one file for each day or one file for each game? |
At the end it will be one big file with one stat line per player, per game. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
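If each volunteer ends up sending in a separate CSV file, combining them into the "one big file" might look something like the sketch below. It assumes every file carries the same nine-column header as the sample game posted earlier in the thread; the function name `merge_files` is hypothetical and not an existing script.

```python
import csv

# Column layout taken from the sample game posted earlier in the thread.
FIELDS = ["date", "lgID", "teamID", "oppID", "name", "FG", "FT", "FTA", "DQ"]


def merge_files(parts, out_path):
    """Concatenate CSV files into one file with a single header row.

    Returns the number of player stat lines written.
    """
    count = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for part in parts:
            with open(part, newline="") as f:
                reader = csv.DictReader(f)
                # Refuse files whose header differs, rather than misalign columns.
                if reader.fieldnames != FIELDS:
                    raise ValueError(f"unexpected header in {part}: {reader.fieldnames}")
                for row in reader:
                    writer.writerow(row)
                    count += 1
    return count
```

Note that the `csv` module's default quoting only quotes fields when needed, so the merged file will not wrap every text field in double quotes the way the sample does; the data itself is unchanged.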
|
|
|
Ryan J. Parker
Joined: 23 Mar 2007 Posts: 711 Location: Raleigh, NC
|
Posted: Tue May 22, 2007 8:56 pm Post subject: |
|
|
We definitely don't want to duplicate efforts, so let's divide up the seasons so that everyone knows where to focus.
Oh, and the naming convention sounds good. |
|
|
|
94by50
Joined: 01 Jan 2006 Posts: 499 Location: Phoenix
|
Posted: Wed May 23, 2007 4:08 am Post subject: |
|
|
I've done plenty of data entry in the past. I'd love to help. |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 702 Location: Columbus, OH
|
Posted: Wed May 23, 2007 8:11 am Post subject: |
|
|
94by50 wrote: | I've done plenty of data entry in the past. I'd love to help. |
Great. By my count that brings us up to 8:
Justin
Ed
Ryan
Heather
Paul
Mateo
Ziller
94by50
Let's wait until tomorrow to see if anyone else is interested in helping out. I have already entered the box scores from the first two issues of the 1969-70 season, and I have downloaded the pages with the box scores through January. Once again, so that we don't repeat the same work, you don't need to do anything until we have made assignments for the test season. _________________ Regards,
Justin Kubatko
Basketball-Reference.com |
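With eight volunteers the per-person load drops from the earlier estimate. A quick back-of-the-envelope sketch, using the 574 games and roughly 3 minutes per game quoted earlier in the thread (the function name `workload` is just for illustration, and an even split is an assumption):

```python
def workload(games, volunteers, minutes_per_game=3.0):
    """Return (games per volunteer, hours per volunteer), rounding games up."""
    per_person = -(-games // volunteers)  # ceiling division
    hours = per_person * minutes_per_game / 60
    return per_person, hours


games = 7 * 82  # 574 regular season games in 1969-70, as counted earlier

for n in (6, 8):
    per_person, hours = workload(games, n)
    print(f"{n} volunteers: {per_person} games each, about {hours:.1f} hours")
```

That works out to 96 games (about 4.8 hours) each with six people and 72 games (about 3.6 hours) each with eight, before counting the time spent finding and downloading pages.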
|