Thanks a lot, Aaron. This is extremely valuable, both the PbP's and the matchup data. Having said that, I second the motion to split offensive and defensive possessions.
Anyway, I couldn't help but run a quick regression on the aggregate matchups data to roughly compute adjusted +/- ratings. It lacks all the tunings that Dan Rosenbaum applies (offensive vs defensive possessions, reference players, garbage time weighting, positional and team adjustments, ...) but still produces interesting results, in particular to compare players of the same team. Case in point, the Hornets:
Code:
Player        Estimate  Std. Error  t val  Pr(>|t|)      Net on/off
Paul              14.7         5.6    2.6       0.0  **        -1.0
Claxton           12.7         5.3    2.4       0.0  *          7.8
Snyder            11.8         5.5    2.1       0.0  *          5.6
Vroman             5.3         6.8    0.8       0.4             6.4
J.R. Smith         5.0         5.6    0.9       0.4            -4.0
Macijauskas        4.0        10.1    0.4       0.7             0.3
P. Brown           2.4         5.7    0.4       0.7            -1.2
M. Jackson         2.2         5.7    0.4       0.7            -6.0
Nachbar            1.5         6.5    0.2       0.8             3.6
R. Butler          0.9         5.2    0.2       0.9             2.0
David West         0.4         5.6    0.1       0.9            -0.9
M. Norris          0.3         6.7    0.0       1.0            -9.7
L. Johnson        -0.1         6.5    0.0       1.0            -1.9
Andersen          -0.3         6.4    0.0       1.0             1.6
A. Williams       -1.7         6.0   -0.3       0.8             2.6
Mason             -5.2         5.5   -0.9       0.3            -6.5
Bass              -6.4         7.7   -0.8       0.4            -6.8
M. Lampe         -15.2        19.3   -0.8       0.4           -13.1
Fizer            -20.2        17.3   -1.2       0.2            -8.1
This clearly suggests that the main reason for Chris Paul's bad net on/off numbers is the players he shared the court with, not that he had a negative impact on his teammates. Paul has the 15th best rating in the league, just ahead of Wade, Artest, Diaw, Iguodala and Duncan. Take that with a grain of salt because, as I said, in this raw form the ratings are not good enough to compare across teams.
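For readers who want to see the idea behind such a regression, here is a minimal Python sketch. It is not the R code used for the table above: the toy data, the one-on-one "matchups" (real ones have five players per side), the weights and the function name `fit_ratings` are all assumptions made for illustration. Each row has an intercept (home edge), +1 for a home player and -1 for an away player; the reference player has no column, so his rating is zero by construction.

```python
def fit_ratings(rows, margins, weights):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    n = len(rows[0])
    # Augmented matrix [X'WX | X'Wy].
    a = [[0.0] * (n + 1) for _ in range(n)]
    for x, y, w in zip(rows, margins, weights):
        for i in range(n):
            for j in range(n):
                a[i][j] += w * x[i] * x[j]
            a[i][n] += w * x[i] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    return [a[i][n] / a[i][i] for i in range(n)]


# Columns: intercept (home edge), then hypothetical players A, B, C.
# Player D is the reference player and has no column (rating fixed at 0).
rows = [
    [1, 1, -1, 0],   # A (home) vs B (away)
    [1, 1, 0, -1],   # A vs C
    [1, 0, 1, -1],   # B vs C
    [1, 0, 0, 1],    # C vs D
    [1, -1, 0, 0],   # D vs A
    [1, 0, 1, 0],    # B vs D
]
margins = [6.0, 3.0, -2.0, 2.0, -2.0, -1.0]  # made-up home margins
weights = [10, 20, 15, 5, 10, 5]             # made-up minutes per matchup

home_edge, a_rating, b_rating, c_rating = fit_ratings(rows, margins, weights)
print(home_edge, a_rating, b_rating, c_rating)
```

The made-up margins were generated from a home edge of 1 and ratings A = 3, B = -2, C = 1, D = 0, so the fit should recover those values up to floating-point error. With real data the fit is not exact, which is where the standard errors come from.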
I'm looking forward to seeing the more polished adjusted +/- ratings at basketballvalue, as I believe they are one of the most telling stats. By the way, do you plan to use data from previous seasons to compute them?
Last edited by cherokee_ACB on Mon Oct 02, 2006 4:44 pm; edited 4 times in total
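# 2. Let's now convert the matchups into +1 / 0 / -1 coefficients
# aggs.txt is aggregatematchups.txt minus initial and final lines
open(MATCHUPS, "<aggs.txt") or die "Cannot open aggs.txt: $!";
# (The lines that read each matchup and fill @home and @away are not shown here.)
    # Write it down: one column per player, in sorted key order
    foreach $key (sort(keys(%players))) {
        $playing = 0;    # 0 = not on the court
        for ($i = 0; $i < 5; $i++) {
            # NOTE: == compares numerically; use eq if the keys are names
            if ($key == $home[$i]) {
                $playing = 1;     # on the court for the home side
            }
            elsif ($key == $away[$i]) {
                $playing = -1;    # on the court for the away side
            }
        }
        print "," . $playing;
    }
    print "\n";
}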
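The encoding step in the fragment above can be sketched in Python as follows. This is an illustrative rewrite, not the original script: the function name `encode_matchup` and the player labels are made up, and it only mirrors the +1 / 0 / -1 logic shown.

```python
def encode_matchup(players, home, away):
    """Return one coefficient per player: +1 home, -1 away, 0 off the court."""
    home_set, away_set = set(home), set(away)
    return [1 if p in home_set else -1 if p in away_set else 0
            for p in players]


# Hypothetical player keys, sorted as in the Perl loop.
players = sorted(["P01", "P02", "P03", "P04", "P05", "P06",
                  "P07", "P08", "P09", "P10", "P11", "P12"])
home = ["P01", "P02", "P03", "P04", "P05"]
away = ["P08", "P09", "P10", "P11", "P12"]

row = encode_matchup(players, home, away)
print(",".join(str(v) for v in row))
```

Using sets keeps each lookup cheap, which matters when the output has one column per player in the league and one row per matchup.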
Make sure you create db.txt (see above) and aggs.txt from the basketballvalue.net files, and redirect the output to another file, like this
$ perl nba_adj.pl > obser.csv
This will take some time. obser.csv will be a huge (63 MB) text file full of ones and zeros. Once the script has finished, go to names.txt and replace whitespace and other odd characters (like - and ') with dots (.), or the regression won't work.
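If you would rather not edit names.txt by hand, the replacement can be sketched in a few lines of Python. The function name `r_safe_name` is hypothetical, and the rule used here (anything that is not an ASCII letter, digit or dot becomes a dot) is an assumption about what counts as an "odd character".

```python
import re


def r_safe_name(name):
    """Replace every character other than ASCII letters, digits and dots with a dot."""
    return re.sub(r"[^A-Za-z0-9.]", ".", name.strip())


for raw in ["S. O'Neal", "J.R. Smith", "Macijauskas"]:
    print(raw, "->", r_safe_name(raw))
```

Note that this would also replace accented letters, so check the result if any names contain them.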
Now, I use R for the regression. The commands to enter are
Unfortunately, I couldn't get this to work in Windows because of memory issues. So I switched to Linux, where the constant paging made it awfully slow, but eventually I got the results. By my estimation, a minimum of 2 GB of virtual memory (physical memory + swap space) is required. Needless to say, the more RAM you have, the better.
And that's all. That was my quick regression.
P.S.: if someone wants to try anyway in Windows, enter "memory.limit(2000)" as the first command or you'll surely run out of memory.
Last edited by cherokee_ACB on Thu Nov 23, 2006 1:30 pm; edited 4 times in total
I'm sorry, but there were a couple of mistakes in the above explanation. The first line of "names.txt" must be "Name". You can enter it manually or use the updated Perl script above (which now also does the required conversion of spaces into dots).
The second mistake is that there are duplicated names in the players.txt file (like D. Jones, M. Williams or D. West), and I wasn't handling them (this may have affected the Hornets players' ratings; I'll rerun the regression and report back). The only solution here is to manually edit db.txt so that the PlayerTrueName becomes a unique identifier. There are something like 10-12 duplicated names. If possible, I'd ask Aaron and the people at basketballvalue to define unique names for all players, as they are easier to work with than numerical ids.
I'd be interested in your revised Paul/Claxton adjusted +/- estimates, and in those determined by basketballvalue and Dan as well. If a higher adjusted +/- for Paul stands or is generally agreed on, that would be a strong example of the need for this information. But a breakout and interpretation are still in order.
Paul's negative unadjusted +/- and lower win % on the court than the team as a whole could indeed be affected by weaker than average teammates and/or stronger opponents. I accept the stronger opponent argument a la the data deepak presented about Kobe; both probably play heavily against starters. (I'd be interested in a similar run on Paul if you are willing.)
But was Paul burdened with weaker than normal teammates compared to Hornet totals? Comparing your results to just the defensive difference would indicate how much this is a factor. Mason last season, I grant, was a liability to play beside a lot. The other three biggest pairings on the court for Paul, though, were West, Brown and Claxton. Were they a burden? Doesn't seem like it: Claxton was 2nd best in adjusted +/- in your reported numbers, and Brown and West combined were close to neutral (if I am reading it right). Or is time with and without other teammates somehow decisive?
This information may have been superseded, but prior to receiving the adjusted report, it seemed noteworthy that the unadjusted player pair data showed Mason, West and Brown all played better with Claxton overall than with Paul overall (and overall 7 of 10 main Hornets were better with Claxton than with Paul); and the team played equal in one case and better in 2 cases for these players without Paul than with Paul. Lots of data contributed to my impression that Paul wasn't a simple story of major positive impact. But all other things are not equal in the unadjusted dataset, and appearances can be deceiving and mask individual impact, so I welcome the new information.
Claxton and Snyder played the same positions as Paul and are fairly close to as strong on your adjusted +/-. Are they all to be considered really strong players? Or merely the nicest players on that team?
Complicated story to try to unravel but adjusted +/- can push the search along.
Could the 4 factors on/off numbers for players perhaps be adjusted as well? I've been looking at the data reported at 82games knowing it is unadjusted and deserves caution but still have given what it shows some weight. Adjusted data would make a stronger case exactly where players are making their team +/- impacts.
Paul's unadjusted on/off comparison showed the team was weaker on own eFG% with him on, weaker on opponent eFG% with him on, weaker on own offensive rebounding, stronger on defensive rebounding, neutral for net FTAs, a little lower on own turnovers with him on, and neutral for opponent turnovers. Which of these move in a more positive direction for him under adjusted analysis (to account for it looking better than unadjusted +/-), and how much from each factor?
Quote:
I'd be interested in your revised Paul/Claxton adjusted +/- estimates ...
I've rerun the regression with all players in it, except Bonzi Wells, whom I've chosen as the reference (zero rating) player. No significant changes in the Hornets data. Ratings are on average 2 points higher, but that's because the reference has changed (the first time it was all players with duplicated names). With Bonzi as reference, 70% of players have a positive rating. Subtract 3 points from all values if you prefer a 50%-50% distribution.
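Changing the reference player only shifts every rating by the same constant, so the gaps between players stay put. A small Python sketch, using three ratings from the first run posted earlier in the thread; picking David West as the new reference is purely for illustration, and `rereference` is a made-up helper name.

```python
def rereference(ratings, reference):
    """Shift all ratings so the chosen reference player sits at zero."""
    base = ratings[reference]
    return {player: round(value - base, 1) for player, value in ratings.items()}


first_run = {"Paul": 14.7, "Claxton": 12.7, "David West": 0.4}
shifted = rereference(first_run, "David West")
print(shifted)
```

Subtracting 3 points from every value is the same operation with a constant instead of a player's rating.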
Quote:
But was Paul burdened with weaker than normal teammates compared to Hornet totals? Comparing your results to just the defensive difference would indicate how much this is a factor. Mason last season, I grant, was a liability to play beside a lot. The other three biggest pairings on the court for Paul, though, were West, Brown and Claxton. Were they a burden? Doesn't seem like it: Claxton was 2nd best in adjusted +/- in your reported numbers, and Brown and West combined were close to neutral (if I am reading it right). Or is time with and without other teammates somehow decisive?
Claxton played 55% of the time with Paul, and 50% with Mason. Paul only 40% with Claxton but 60% with Mason. The difference probably lies there.
Quote:
Claxton and Snyder played the same positions as Paul and are fairly close to as strong on your adjusted +/-. Are they all to be considered really strong players? Or merely the nicest players on that team?
I already said that these data are not good enough for comparisons across teams. I should have also said that, even within a team, comparisons across positions should be done with care. There can be odd results if there isn't enough mixture of players. I'm sure Dan Rosenbaum can explain it much better than me. I'm far from an expert in this stuff, and don't feel that comfortable with my interpretation. I'll give it anyway.
In the Hornets' case, there's a good mixture among the guards and at the PF-C spots. The SG-SF relationship is probably based on the 9% of minutes that Mason played at SG; the Hornets sucked in that scenario (-20 ppg), so the regression 'concludes' that Mason is much worse than all other guards. For SF-PF, there are around 250 minutes of Butler playing inside, with good results for the team. The intuitive (and simplistic) conclusion would be that Butler is better than West, Brown and co. All in all, the ratings probably overestimate the Hornets' guards and underestimate their big men. By how much? I can't tell, but Vroman's rating tells me it's not that bad. Anyway, I'm confident about what I was looking for: the Paul-Claxton comparison and how misleading unadjusted +/- can be.
Quote:
Could the 4 factors on/off numbers for players perhaps be adjusted as well?
Sure they can. You just need to define a reasonable regression and (the hardest part) get the input data. Not something I can do.
I find the way you're counting possessions sorta confusing. We're all accustomed to "per 100 possessions" stats. It would make more sense to me to see the numbers broken down by offensive performance (pts per 100 offensive possessions), defensive performance (pts per 100 defensive possessions), and then have a net +/- per 100 possessions.
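To make the suggested breakdown concrete, here is a hedged Python sketch. The totals are invented round numbers, not data from any of the sites discussed, and the function names are hypothetical.

```python
def per_100(points, possessions):
    """Points per 100 possessions."""
    return 100.0 * points / possessions


def split_ratings(pts_for, off_poss, pts_against, def_poss):
    """Return (offensive, defensive, net) points per 100 possessions."""
    offense = per_100(pts_for, off_poss)
    defense = per_100(pts_against, def_poss)
    return offense, defense, offense - defense


# Invented totals for a player's time on the court.
offense, defense, net = split_ratings(1100, 1000, 1050, 1000)
print(offense, defense, net)
```

Keeping offensive and defensive possessions as separate denominators is what allows the two halves to be reported independently before they are netted out.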
Sorry for the delay in responding to all the posts, I'll try and address all the various points.
The reason I approached the problem as I have described is that I'm ultimately planning on producing adjusted +/- stats based on the results of all the "mini-games" that a matchup represents. I have to think this through a little more, but I fear that breaking it up into 100 offensive possessions disconnected from 100 defensive possessions might make that task harder. I'm having a hard enough time as it is. At the same time, I see the point of breaking it up as you and deepak_e have described. I can't promise I'll do this right away, but I'll look into it.
Here's a comparison of the numbers from your site and some of the other frequently used sites here:
Code:
Lakers
              min    PF    PA   Poss
BV.com       3971  8154  7949  15656
82games.com  3964  8161  7950  14905
B-R.com      3971  8154  7949     --
Code:
Rockets
              min    PF    PA   Poss
BV.com       3966  7387  7517  15215
82games.com  3958  7388  7521  14347
B-R.com      3966  7387  7517     --
I think B-R.com's tallies of minutes, points for, and points allowed come precisely from the box scores, so your answers on those are clearly correct. However, there is a discrepancy between the total possessions from your site and the possessions from 82games.com (which I calculated using the player On/Off data).
This indicates to me that it's not a discrepancy in how matchups are being calculated and tracked, but rather in how possessions are being counted. Either your site is overcounting them, or 82games is undercounting them.
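For scale, the size of the possession gap can be worked out directly from the two tables above. A short Python sketch (the dictionary layout and the `gap_pct` helper are just for illustration):

```python
totals = {
    "Lakers": {"BV.com": 15656, "82games.com": 14905},
    "Rockets": {"BV.com": 15215, "82games.com": 14347},
}


def gap_pct(bv, other):
    """Percentage by which the BV.com count exceeds the other count."""
    return 100.0 * (bv - other) / other


gaps = {team: gap_pct(t["BV.com"], t["82games.com"])
        for team, t in totals.items()}
for team, gap in gaps.items():
    print(f"{team}: BV.com counts {gap:.2f}% more possessions")
```

That works out to roughly 5% for the Lakers and 6% for the Rockets, while minutes and points differ by well under 1%.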
deepak_e,
Thanks for looking this up for me. It's reassuring to see the total for BV.com match BR.com exactly. I want to be clear, though, that there are still some smaller errors inside the play-by-play (thanks to Justin for pointing that out).
As far as possessions, with the automated code I've written, I simply look for a switch in the game log from citing the home team to the away team or vice versa (every line starts with "[TTT" where TTT is the three letter abbreviation of the team). The exceptions I ignore are fouls, timeouts, substitutions, and violations. As a result, I certainly include those 0.1 second possessions at the ends of quarters that Dean mentioned. I think this captures most of the situations that have been mentioned in this thread.
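A rough Python sketch of that counting rule, for anyone who wants to experiment with it. This is not the actual basketballvalue code: the sample log lines are invented, the keyword matching is a guess at how fouls, timeouts, substitutions and violations appear in the log, and counting the first event as possession number one is an assumption.

```python
IGNORED = ("foul", "timeout", "substitution", "violation")


def count_possessions(log_lines):
    """Count runs of consecutive events credited to the same team."""
    possessions = 0
    current_team = None
    for line in log_lines:
        if not line.startswith("[") or len(line) < 4:
            continue  # not an event line
        if any(word in line.lower() for word in IGNORED):
            continue  # these events never switch possession
        team = line[1:4]  # three-letter code right after "["
        if team != current_team:
            possessions += 1
            current_team = team
    return possessions


# Invented lines in a "[TTT ...] description" shape.
sample = [
    "[LAL 2-0] Player A Jump Shot: Made",
    "[HOU] Player B Foul: Personal",
    "[LAL 3-0] Player A Free Throw 1 of 1",
    "[HOU 3-2] Player C Layup: Made",
    "[LAL] Timeout: Regular",
    "[LAL 3-2] Player D 3pt Shot: Missed",
    "[HOU] Player C Rebound",
]
n_possessions = count_possessions(sample)
print(n_possessions)
```

A rule like this counts every change of the leading team code, including very short possessions at the ends of quarters, which is one plausible reason for totals running higher than a box-score estimate.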
Last edited by basketballvalue on Mon Oct 02, 2006 10:43 pm; edited 1 time in total
Quote:
The second mistake is that there are duplicated names in the players.txt file (like D. Jones, M. Williams or D. West), and I wasn't handling it (this may have affected the Hornets players' ratings; I'll rerun the regression and report about it). The only solution here is to manually edit db.txt so that the PlayerTrueName becomes a unique identifier. There's something like 10-12 duplicated names. If possible, I'd request Aaron and the people at basketballvalue to define unique names for all players, as it's easier to work with them than with numerical ids.
We've actually worked to create unique ids. The challenge is that the names aren't unique across the season. When Miami plays a typical team, Shaq is referred to as "O'Neal". However, when they play Indiana, with Jermaine O'Neal, suddenly Shaq is "S. O'Neal". So, what the database shows is that the same player ID can be either of these names on Miami, but the true name is "S. O'Neal" since that's more specific. Sorry if that makes some of the subsequent analysis harder; I thought it would be more valuable to let people see all the names that the game logs use for a player.
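The alias scheme described here can be pictured with a small Python lookup. Everything concrete in it is made up for illustration: the numeric ids, the team codes, the "J. O'Neal" alias and the `resolve` helper are not taken from the basketballvalue database.

```python
# (team, name as it appears in the game log) -> player id
ALIASES = {
    ("MIA", "O'Neal"): 101,
    ("MIA", "S. O'Neal"): 101,
    ("IND", "O'Neal"): 202,
    ("IND", "J. O'Neal"): 202,
}

# player id -> most specific ("true") name
TRUE_NAMES = {101: "S. O'Neal", 202: "J. O'Neal"}


def resolve(team, log_name):
    """Map a game-log name to (player id, true name)."""
    player_id = ALIASES[(team, log_name)]
    return player_id, TRUE_NAMES[player_id]


print(resolve("MIA", "O'Neal"))
print(resolve("IND", "O'Neal"))
```

Keying on the team as well as the name is what lets the same log name point to different players.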
Ugh, cherokee why use R? Do you have SPSS? Menu driven is always better, IMO.
As for your calculations, can you explain what are the values that you are using for the regression? Are they +/-? If so, where do the SE's come from?
Posted: Tue Oct 03, 2006 8:58 am
gabefarkas wrote:
Ugh, cherokee why use R? Do you have SPSS? Menu driven is always better, IMO.
Menu-driven = less flexibility. I much prefer packages that have their own programming languages (e.g., SAS and R).
Regards,
Justin Kubatko
where margin in my case is the point differential per 48 minutes, instead of the efficiency differential per 100 possessions. The source of the SE is the same as in Dan's regression, isn't it?
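As a quick illustration of a per-48-minute margin, here is a Python sketch applied to the season totals quoted earlier in the thread (BV.com / B-R.com rows). Treating the "min" column as team game minutes is an assumption, and `margin_per_48` is a hypothetical helper, not part of the regression code.

```python
def margin_per_48(points_for, points_against, minutes):
    """Point differential scaled to a 48-minute game."""
    return 48.0 * (points_for - points_against) / minutes


lakers = margin_per_48(8154, 7949, 3971)
rockets = margin_per_48(7387, 7517, 3966)
print(f"Lakers: {lakers:+.2f} per 48 min, Rockets: {rockets:+.2f} per 48 min")
```

In the regression the same scaling would be applied to each matchup's own points and minutes rather than to season totals.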
Quote:
So, what the database shows is that the same player ID can be either of these names on Miami, but the true name is "S. O'Neal" since that's more specific.
I noticed and appreciate that. It's good to have a single name for each player, and to be able to match it with the different names used in the pbps. What I was asking is for that PlayerTrueName to be unique. As it is now, your database uses J. Jones for both James Jones and Jumaine Jones, so you are forced to work with numerical ids to differentiate them. I'm just saying this would be helpful, but it isn't really that important (82games is worse in this respect). I can live with what you have now.