gabefarkas
Joined: 31 Dec 2004 Posts: 879 Location: Durham, NC
|
Posted: Mon Jul 16, 2007 8:11 am Post subject: |
|
|
Along these lines, is there any possibility of doing a career similarity, not just player-season? |
|
|
|
Ben
Joined: 13 Jan 2005 Posts: 202 Location: Iowa City
|
Posted: Mon Jul 16, 2007 2:10 pm Post subject: |
|
|
Just came across this thread - it looks great, Justin. |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 508 Location: Columbus, OH
|
Posted: Mon Jul 16, 2007 2:54 pm Post subject: |
|
|
Re: the similarity scores, it's not as easy as it may sound. Calculating the sim scores on the fly is too time-intensive, and a table with every possible player/season combination would be ginormous (hey, Merriam-Webster says it's a word now). I'll have to think about this. _________________ Regards,
Justin Kubatko
Basketball Stats! |
|
|
|
THWilson
Joined: 19 Jul 2005 Posts: 126 Location: phoenix
|
Posted: Mon Jul 16, 2007 4:51 pm Post subject: |
|
|
jkubatko wrote: | Re: the similarity scores, it's not as easy as it may sound. Calculating the sim scores on the fly is too time-intensive, and a table with every possible player/season combination would be ginormous (hey, Merriam-Webster says it's a word now). I'll have to think about this. |
I don't know if this is feasible, but just a thought. Since sim scores are based (primarily) on standard deviations away from the mean for that season, couldn't you create a table which doesn't have any of the actual player figures, but just their standard deviations above or below the mean? You could add in height and then there would be all the necessary fields on that table. Then to calculate on the fly you'd just have to reference this one similarity-table. Would that solve the time intensity issue? |
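A minimal sketch of that pre-calculated table, assuming each season is standardised on its own. The stat names (`pts`, `reb`), the toy numbers, and the function name `season_z_scores` are made up for illustration; a real table would carry the site's own fields plus height.

```python
from statistics import mean, pstdev

def season_z_scores(rows, stat_names):
    """Return {player: {stat: z}} for rows that all belong to one season."""
    table = {}
    for stat in stat_names:
        values = [r[stat] for r in rows]
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation of the season
        for r in rows:
            # Guard against a stat that is identical for every player.
            z = 0.0 if sigma == 0 else (r[stat] - mu) / sigma
            table.setdefault(r["player"], {})[stat] = z
    return table

# Hypothetical three-player season, not real data.
rows = [
    {"player": "A", "pts": 10.0, "reb": 4.0},
    {"player": "B", "pts": 20.0, "reb": 8.0},
    {"player": "C", "pts": 30.0, "reb": 6.0},
]
z_table = season_z_scores(rows, ["pts", "reb"])
print(z_table["C"]["pts"])  # about 1.22 standard deviations above the mean
```

The table only has to be rebuilt when a season's numbers change, so the standardising step never runs at query time.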
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 616 Location: Toronto
|
Posted: Mon Jul 16, 2007 5:03 pm Post subject: |
|
|
THWilson wrote: | Since sim scores are based (primarily) on standard deviations away from the mean for that season, couldn't you create a table which doesn't have any of the actual player figures, but just their standard deviations above or below the mean? You could add in height and then there would be all the necessary fields on that table. |
That would be 11 variables for each player season.
Quote: | Then to calculate on the fly you'd just have to reference this one similarity-table. Would that solve the time intensity issue? |
Since there are about 500 players/season, the similarity distance matrix would be 15,000 x 15,000. That is pretty big. _________________ ed |
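For a rough sense of scale, here is the arithmetic behind those figures. The 15,000 player-seasons and the 11 variables come from the posts above; the 4 bytes per stored value is an assumption, not anything the site is known to use.

```python
n_player_seasons = 15_000   # figure quoted in the thread
n_variables = 11            # variables per player-season, per the thread
bytes_per_value = 4         # assumption: one 32-bit float per stored number

full_matrix_cells = n_player_seasons * n_player_seasons
# Distances are symmetric, so only the pairs above the diagonal are distinct.
unique_pairs = n_player_seasons * (n_player_seasons - 1) // 2
z_table_cells = n_player_seasons * n_variables

print(full_matrix_cells)                          # 225000000
print(unique_pairs)                               # 112492500
print(z_table_cells)                              # 165000
print(full_matrix_cells * bytes_per_value / 1e9)  # 0.9 (GB, under the 4-byte assumption)
```

So the full distance matrix is on the order of a hundred million distinct values, while a table of 11 standardised variables per player-season is well under a million.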
|
|
|
THWilson
Joined: 19 Jul 2005 Posts: 126 Location: phoenix
|
Posted: Tue Jul 17, 2007 10:41 am Post subject: |
|
|
Ed Küpfer wrote: |
Quote: | Then to calculate on the fly you'd just have to reference this one similarity-table. Would that solve the time intensity issue? |
Since there are about 500 players/season, the similarity distance matrix would be 15,000 x 15,000. That is pretty big. |
I'm not suggesting a matrix.
My suggestion was a z-score database on which you could use the exact same steps that Justin is already using to pull up the per-game or total counts for the two players, plus one more step to calculate the differences, like he does here: http://www.basketball-reference.com/about/similar.html
Rather than doing the whole thing on the fly, or pre-calculating the whole thing, he would have one step pre-calculated (the z-scores), and the second step done at the time of the query (the difference summation). I don't think that would be especially large or especially long running. Make sense? |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 616 Location: Toronto
|
Posted: Tue Jul 17, 2007 3:10 pm Post subject: |
|
|
THWilson wrote: | Rather than doing the whole thing on the fly, or pre-calculating the whole thing, he would have one step pre-calculated (the z-scores), and the second step done at the time of the query (the difference summation). I don't think that would be especially large or especially long running. Make sense? |
I don't know if I understand you. You can have all the differences pre-calculated, but you'd still have to store them in that big ass matrix. If you want to compare 2 or more players, you'd need to either calculate the differences on the fly or look up the calculated differences from that player x player matrix.
One way to make it quicker is to have a list of the x most similar players for each player (as shown on each player page), but that wouldn't allow you to compare the differences between two arbitrary players. _________________ ed |
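The "x most similar" shortcut can be sketched as follows, assuming plain Euclidean distance on whatever vector describes each player-season. The labels and 2-D vectors are toy values and `top_k_neighbors` is a hypothetical helper name.

```python
import heapq
import math

def top_k_neighbors(vectors, k):
    """Keep only the k nearest labels for each label: n * k entries, not n * n."""
    result = {}
    for name, vec in vectors.items():
        candidates = (
            (math.dist(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != name
        )
        result[name] = [other for _, other in heapq.nsmallest(k, candidates)]
    return result

# Toy vectors, not real player data.
vectors = {"P1": (0.0, 0.0), "P2": (1.0, 0.0), "P3": (0.0, 3.0), "P4": (5.0, 5.0)}
neighbors = top_k_neighbors(vectors, k=2)
print(neighbors["P1"])          # ['P2', 'P3']
print("P4" in neighbors["P1"])  # False: the P1 vs P4 distance was never stored
```

The last line shows the limitation: a pair that falls outside the stored list cannot be looked up and has to be computed some other way.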
|
|
|
THWilson
Joined: 19 Jul 2005 Posts: 126 Location: phoenix
|
Posted: Tue Jul 17, 2007 5:50 pm Post subject: |
|
|
Ed Küpfer wrote: | I don't know if I understand you. You can have all the differences pre-calculated, but you'd still have to store them in that big ass matrix. |
Don't pre-calculate the differences, pre-calculate the z-scores. Calculate (and sum) the differences between z-scores at the time that the players are selected, at the time of the query. |
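As a sketch of that two-step idea: the stored table holds one z-score vector per player-season, and the only work done at query time is differencing two rows. The keys, the three-value vectors, and the choice to combine the gaps as a root of summed squares are all assumptions for illustration; the exact formula used on the site is not spelled out in this post.

```python
import math

# Hypothetical pre-calculated table: (player, season) -> z-scores.
Z_TABLE = {
    ("Player A", 2006): (1.0, -0.5, 0.25),
    ("Player B", 2007): (0.0, 0.5, 0.25),
    ("Player C", 1999): (1.0, -0.5, 0.25),
}

def z_distance(key_a, key_b, table=Z_TABLE):
    """Two row lookups plus one pass over the stored z-scores."""
    za, zb = table[key_a], table[key_b]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(za, zb)))

print(z_distance(("Player A", 2006), ("Player C", 1999)))  # 0.0, identical profiles
print(z_distance(("Player A", 2006), ("Player B", 2007)))  # sqrt(2), about 1.414
```

The cost per query grows with the number of variables, not with the number of player-seasons in the table.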
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 616 Location: Toronto
|
Posted: Tue Jul 17, 2007 7:59 pm Post subject: |
|
|
THWilson wrote: | Don't pre-calculate the differences, pre-calculate the z-scores. Calculate (and sum) the differences between z-scores at the time that the players are selected, at the time of the query. |
I get the feeling we're talking past each other, because I can't really make sense of that. I want to work it out because similarity stuff is really interesting to me.
Consider a list of 30 North American cities. Say you want to calculate the 10 closest neighbours for all 30 cities. To do that you take their latitude and longitude and find the distance between each pair of cities. You'll end up with a 30x30 matrix of distances. The 10 closest neighbours are then the 10 smallest distances for each city.
But wait -- distance between cities is 3-dimensional because of the curvature of the earth and because of each city's altitude (not a big difference in this example, but just go with me on this). To calculate the 3-dimensional distance, you do it the same way as the 2-D version: euclidean distance is sqrt((Lat_city1 - Lat_city2)^2 + (Long_city1 - Long_city2)^2 + (Alt_city1 - Alt_city2)^2). The result is a distance metric -- close neighbours will have smaller distances than far neighbours.
That is how Justin calculates similarity. The z-score transformation precedes the distance calculations (and in my experience, is not a necessary step). Standardising numbers into z-scores is not a computationally intensive act -- the hard work is coming up with the distance matrix, which is the last step before finding similar players. _________________ ed |
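The city example can be written out as a small sketch. The four "cities" and their (lat, long, alt) triples are invented so the distances come out as round numbers; they are not real coordinates, and the helper names are hypothetical.

```python
import math

# Toy (lat, long, alt) triples, chosen for easy arithmetic.
cities = {
    "A": (0.0, 0.0, 0.0),
    "B": (3.0, 4.0, 0.0),
    "C": (0.0, 1.0, 0.0),
    "D": (6.0, 8.0, 0.0),
}

def euclidean(p, q):
    """sqrt of the summed squared gaps, one term per dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = sorted(cities)
# The full n x n distance matrix: this is the expensive part as n grows.
matrix = {a: {b: euclidean(cities[a], cities[b]) for b in names} for a in names}

def closest(city, k):
    """The k nearest neighbours are the k smallest entries in the city's row."""
    row = matrix[city]
    ranked = sorted((b for b in names if b != city), key=lambda b: row[b])
    return ranked[:k]

print(matrix["A"]["B"])  # 5.0
print(closest("A", 2))   # ['C', 'B']
```

Adding a dimension only adds one more squared term inside `euclidean`; the size of the matrix depends solely on how many items are being compared.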
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 508 Location: Columbus, OH
|
Posted: Tue Jul 17, 2007 9:08 pm Post subject: |
|
|
Ed, I think you guys are talking past each other. I believe that Tom is suggesting that I just display the sim score between the two seasons and not the rank. I think it was Neil that suggested I display the sim score and the rank, which would be a much more difficult proposition (as you have demonstrated). _________________ Regards,
Justin Kubatko
Basketball Stats! |
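The difference between showing the score and showing the rank can be illustrated with a short sketch, assuming Euclidean distance on stored vectors. The season labels, the 2-D vectors, and the function names are hypothetical.

```python
import math

# Toy season vectors, not real data.
table = {"S1": (0.0, 0.0), "S2": (1.0, 0.0), "S3": (2.0, 0.0), "S4": (5.0, 0.0)}

def sim_distance(target, other):
    """Score between two seasons: touches only the two rows involved."""
    return math.dist(table[target], table[other])

def rank_of(target, other):
    """Rank of `other` among target's neighbours: needs a pass over every row."""
    d = sim_distance(target, other)
    closer = sum(
        1
        for key in table
        if key not in (target, other) and sim_distance(target, key) < d
    )
    return closer + 1

print(sim_distance("S1", "S3"))  # 2.0
print(rank_of("S1", "S3"))       # 2, because S2 is closer to S1 than S3 is
```

The score needs two lookups, while the rank has to compare the target against every other season in the table, which is where the extra cost comes from.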
|
|
|
davis21wylie2121
Joined: 13 Oct 2005 Posts: 373 Location: Atlanta, GA
|
Posted: Tue Jul 17, 2007 9:19 pm Post subject: |
|
|
jkubatko wrote: | Ed, I think you guys are talking past each other. I believe that Tom is suggesting that I just display the sim score between the two seasons and not the rank. I think it was Neil that suggested I display the sim score and the rank, which would be a much more difficult proposition (as you have demonstrated). |
So is it feasible as long as you don't show the rank? |
|
|
|
jkubatko
Joined: 05 Jan 2005 Posts: 508 Location: Columbus, OH
|
Posted: Wed Jul 18, 2007 8:57 am Post subject: |
|
|
davis21wylie2121 wrote: | So is it feasible as long as you don't show the rank? |
Feasible, yeah, but it's going to be some time before I can think about adding it, as I have some other projects that need my attention right now. I'll add it to my novel-length list of things to add to the site. :-) _________________ Regards,
Justin Kubatko
Basketball Stats! |
|
|
|
|