APBRmetrics Forum Index APBRmetrics
The statistical revolution will not be televised.
 

New website: BasketballValue.com

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Sun Oct 01, 2006 2:31 pm

Thanks a lot, Aaron. This is extremely valuable, both the PbP's and the matchup data. Having said that, I second the motion to split offensive and defensive possessions.

Anyway, I couldn't help but run a quick regression on the aggregate matchups data to roughly compute adjusted +/- ratings. It lacks all the tunings that Dan Rosenbaum applies (offensive vs defensive possessions, reference players, garbage time weighting, positional and team adjustments, ...) but still produces interesting results, in particular to compare players of the same team. Case in point, the Hornets:

Code:

Player        Estimate  Std. Error t val Pr(>|t|)     Net on/off
Paul             14.7       5.6     2.6    0.0 **      -1.0
Claxton          12.7       5.3     2.4    0.0  *       7.8
Snyder           11.8       5.5     2.1    0.0  *       5.6
Vroman            5.3       6.8     0.8    0.4          6.4
J.R. Smith        5.0       5.6     0.9    0.4         -4.0
Macijauskas       4.0      10.1     0.4    0.7          0.3
P. Brown          2.4       5.7     0.4    0.7         -1.2
M. Jackson        2.2       5.7     0.4    0.7         -6.0
Nachbar           1.5       6.5     0.2    0.8          3.6
R. Butler         0.9       5.2     0.2    0.9          2.0
David West        0.4       5.6     0.1    0.9         -0.9
M. Norris         0.3       6.7     0.0    1.0         -9.7
L. Johnson       -0.1       6.5     0.0    1.0         -1.9
Andersen         -0.3       6.4     0.0    1.0          1.6
A. Williams      -1.7       6.0    -0.3    0.8          2.6
Mason            -5.2       5.5    -0.9    0.3         -6.5
Bass             -6.4       7.7    -0.8    0.4         -6.8
M. Lampe        -15.2      19.3    -0.8    0.4        -13.1
Fizer           -20.2      17.3    -1.2    0.2         -8.1


This clearly suggests that the main reason for Chris Paul's bad net on/off numbers is the players with whom he shared the court, and not that he had a negative impact on his teammates. Paul has the 15th best rating in the league, just ahead of Wade, Artest, Diaw, Iguodala and Duncan. Take that with a grain of salt because, as I said, in this raw format the ratings are not good enough to compare across teams.

I'm looking forward to seeing the more polished adjusted +/- ratings at basketballvalue, as I believe they are one of the most telling stats. By the way, do you plan to use data from previous seasons to compute them?


Last edited by cherokee_ACB on Mon Oct 02, 2006 4:44 pm; edited 4 times in total

deepak_e
Joined: 26 Apr 2006
Posts: 200

Posted: Sun Oct 01, 2006 3:27 pm

cherokee_ACB wrote:
Anyway, I couldn't help but run a quick regression on the aggregate matchups data to roughly compute adjusted +/- ratings.


Interesting stuff. Could you explain in some more detail how you went about doing this?

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Sun Oct 01, 2006 4:02 pm

deepak_e wrote:

Interesting stuff. Could you explain in some more detail how you went about doing this?


Sure. It's easier said than done, but let's go with it. A Perl script produces the Xi coefficients (+1 for a home player on court, -1 for an away player on court, 0 otherwise).

Code:

my %players;
my @onCourt;

# 1. Let's load player names. db.txt is players.txt with the header lines and the last line removed
open(DB, "<db.txt");

print "\"Seconds\",\"gDif\"";
while (<DB>) {
    chomp $_;
    $id = substr($_, 7, 3);
    $name = substr($_, 31, 16);
    $name =~ s/\s+$//;  # Right trim
    $players{$id}=$name;
}
close DB;

# Write names to header and to second file
open(NAMES, ">names.txt");
print NAMES "Name\n";
foreach $key (sort(keys(%players))){
    $name = $players{$key};
    print ",\"" . $name . "\"";
    # Replace every space, hyphen and apostrophe with a dot (not just the first one)
    $name =~ s/[ '-]/./g;
    print NAMES $name . "\n";
}
close(NAMES);
print "\n";

# 2. Let's now convert the matchups into +1 / 0 / -1 coefficients
# aggs.txt is aggregatematchups.txt minus initial and final lines
open(MATCHUPS, "<aggs.txt");   

while ($matchup = <MATCHUPS>) {
    $seconds = substr($matchup, 425, 4);
    $dif = substr($matchup, 507, 4);

    # Home and away units
    for ($i=0; $i<5; $i++) {
        $home[$i] = substr($matchup, 33 + $i*16, 3);   # scalar element, not a slice
        $away[$i] = substr($matchup, 117 + $i*20, 3);
    }

    $gDif = 48*60*$dif/$seconds;
    print $seconds;
    print "," . $gDif;

    # Write it down
     foreach $key (sort(keys(%players))){
        $playing=0;
        for ($i=0; $i<5; $i++) {
            if ($key == $home[$i])  {
               $playing=1;
            }
            elsif ($key == $away[$i])  {
               $playing=-1;
            }
        }
        print "," . $playing;
    }
    print "\n";           
}



Make sure you create db.txt (see above) and aggs.txt from the basketballvalue.net files, and redirect the output to another file, like this
$ perl nba_adj.pl > obser.csv

This will take some time. obser.csv will be a huge (63 MB) text file full of ones and zeros. Once the script has finished, go to names.txt, and replace whitespaces and other odd characters (like - and ') with dots (.), or the regression won't work.
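
For readers who don't use Perl, here is a minimal Python sketch of what the script does to a single matchup: it scales the point differential to 48 minutes and codes each player as +1 (home, on court), -1 (away, on court) or 0. The function name `matchup_row` and the player ids and numbers are made up for illustration; the real script reads them from fixed-width columns in aggs.txt.

```python
def matchup_row(seconds, diff, home_ids, away_ids, all_ids):
    """Return [seconds, per-48 margin, x_1, ..., x_K] for one matchup."""
    # Same scaling as the Perl script: gDif = 48*60*dif/seconds
    g_dif = 48 * 60 * diff / seconds
    home, away = set(home_ids), set(away_ids)
    # +1 if the player is in the home unit, -1 if in the away unit, else 0
    coded = [1 if p in home else -1 if p in away else 0 for p in all_ids]
    return [seconds, g_dif] + coded


# Hypothetical ids and numbers, for illustration only.
all_ids = sorted(["101", "102", "103", "104", "105", "106",
                  "201", "202", "203", "204", "205", "206"])
row = matchup_row(
    seconds=120,
    diff=3,
    home_ids=["101", "102", "103", "104", "105"],
    away_ids=["201", "202", "203", "204", "205"],
    all_ids=all_ids,
)
print(",".join(str(v) for v in row))
```

Each line of obser.csv is one such row, so the file has one column per player in the league plus the two leading columns.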

Now, I use R for the regression. The commands to enter are
Code:

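setwd("/your/working/directory")
# One row per matchup: Seconds, gDif, then one +1/0/-1 column per player
obser <- read.table("obser.csv", header=TRUE, sep=",")
# Player names with dots instead of spaces, matching the column names R generates
names <- read.table("names.txt", header=TRUE)
# Build "gDif ~ Player.1 + Player.2 + ..." from the list of names
fmla <- as.formula(paste("gDif ~", paste(names$Name, collapse="+")))
# Weighted least squares: each matchup is weighted by the seconds it lasted
fm <- lm(fmla, data=obser, weights=Seconds)
summary(fm)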


Unfortunately, I couldn't get this to work in Windows because of memory issues. So I switched to Linux, where the constant paging made it awfully slow, but eventually I got the results. By my estimation, a minimum of 2 GB of virtual memory (physical memory + swap space) is required. Needless to say, the more RAM you have, the better.

And that's all. That was my quick regression.

P.S.: if someone wants to try anyway in Windows, enter "memory.limit(2000)" as the first command or you'll surely run out of memory.
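
To make the regression step less of a black box, here is a small pure-Python sketch of weighted least squares on a toy problem. It is not the R code above, only the same idea: minimise the seconds-weighted squared error of margin = b0 + sum of player terms. The three "players", the 1-on-1 matchups and all the numbers are invented so that the exact answer is known (intercept 3, A = +4, C = -2, with B left out as the zero-rated reference).

```python
def weighted_least_squares(X, y, w):
    """Solve the normal equations (X'WX) b = X'Wy by Gauss-Jordan elimination."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'WX | X'Wy]
    A = [
        [sum(w[r] * X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
        + [sum(w[r] * X[r][i] * y[r] for r in range(n))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system: leave out a reference player")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Columns: intercept, player A, player C (player B is the reference, so no column).
X = [
    [1, 1, 0],    # A at home vs B away
    [1, 0, -1],   # B at home vs C away
    [1, -1, 1],   # C at home vs A away
    [1, 1, -1],   # A at home vs C away
]
y = [7, 5, -3, 9]          # toy per-48 margins, generated from 3, +4, -2
w = [300, 120, 60, 240]    # toy matchup lengths in seconds, used as weights
coefs = weighted_least_squares(X, y, w)
print([round(c, 6) for c in coefs])
```

With thousands of matchups and several hundred player columns the same computation becomes large, which is consistent with the memory trouble described above.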


Last edited by cherokee_ACB on Thu Nov 23, 2006 1:30 pm; edited 4 times in total

deepak_e
Joined: 26 Apr 2006
Posts: 200

Posted: Sun Oct 01, 2006 4:10 pm

Thanks for taking the time to explain it.

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Sun Oct 01, 2006 5:34 pm

I'm sorry, but there were a couple of mistakes in the above explanation. The first line of "names.txt" must be "Name". You can enter it manually or use the updated Perl script above (which now also does the required conversion of spaces into dots).

The second mistake is that there are duplicated names in the players.txt file (like D. Jones, M. Williams or D. West), and I wasn't handling it (this may have affected the Hornets players' ratings; I'll rerun the regression and report about it). The only solution here is to manually edit db.txt so that the PlayerTrueName becomes a unique identifier. There's something like 10-12 duplicated names. If possible, I'd request Aaron and the people at basketballvalue to define unique names for all players, as it's easier to work with them than with numerical ids.

Mark
Joined: 20 Aug 2005
Posts: 670

Posted: Mon Oct 02, 2006 12:10 am

I'd be interested in your revised Paul/Claxton adjusted +/- estimates and those determined by basketballvalue and Dan as well. If a higher adjusted +/- for Paul stands and is generally agreed to, that would be a strong example of the need for this information. But breakout and interpretation are still in order.

Paul's negative unadjusted +/- and lower win % on the court than the team as a whole could indeed be affected by weaker than average teammates and/or stronger opponents. I accept the stronger opponent argument a la the data deepak presented about Kobe; both probably play heavily against starters. (I'd be interested in a similar run on Paul if you are willing.)

But was Paul burdened with weaker than normal teammates compared to Hornet totals? Comparing your results to just the defensive difference would indicate how much this is a factor. Mason last season, I grant, was a liability to play beside a lot. The other three biggest pairings on the court for Paul, though, were West, Brown and Claxton. Were they a burden? Doesn't seem like it: Claxton was 2nd best in adjusted +/- in your reported numbers, and Brown and West combined were close to neutral (if I am reading it right). Or is time with and without other teammates somehow decisive?

This information may have been superseded, but prior to receiving the adjusted report, it seemed noteworthy that the unadjusted player pair data showed Mason, West and Brown all played better with Claxton overall than with Paul overall (and overall 7 of 10 main Hornets were better with Claxton than with Paul); and the team played equal in one case and better in 2 cases for these players without Paul than with Paul. Lots of data contributed to my impression that Paul wasn't a simple story of major positive impact. But all other things are not equal in the unadjusted dataset, and appearances can be deceiving and mask individual impact, so I welcome the new information.

Claxton and Snyder played the same positions as Paul and are fairly close to as strong on your adjusted +/-. Are they all to be considered really strong players? Or merely the nicest players on that team?

Complicated story to try to unravel but adjusted +/- can push the search along.

Could the 4 factors on/off numbers for players perhaps be adjusted as well? I've been looking at the data reported at 82games knowing it is unadjusted and deserves caution but still have given what it shows some weight. Adjusted data would make a stronger case exactly where players are making their team +/- impacts.

Paul's unadjusted on/off comparison showed the team was weaker on own eFG% with him on, weaker on opponent eFG% with him on, weaker on own offensive rebounding, stronger on defensive rebounding, neutral for net FTAs, a little lower on own turnovers with him on, and neutral for opponent turnovers. Which of these move in a more positive direction for him under adjusted analysis (to account for it looking better than unadjusted +/-), and how much from each factor?

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Mon Oct 02, 2006 5:10 pm

Mark wrote:
I'd be interested in your revised Paul/Claxton adjusted +/- estimates ...


I've rerun the regression with all players in it, except Bonzi Wells, whom I've chosen as the reference (zero rating) player. No significant changes in the Hornets data. Ratings are on average 2 points higher, but that's because the reference has changed (the first time it was all players with duplicated names). With Bonzi as reference, 70% of players have a positive rating. Subtract 3 points from all values if you prefer a 50%-50% distribution.
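
A note on why changing the reference only shifts the numbers: as long as every matchup row has five +1 entries and five -1 entries, adding the same constant to every player's rating leaves the predicted margins unchanged, so the regression can only pin down differences between players. The sketch below (Python, with a hypothetical helper `rereference`) re-expresses three estimates from the first-run Hornets table relative to a different zero point; the choice of Mason as reference is purely illustrative.

```python
def rereference(ratings, reference):
    """Shift all ratings so that `reference` sits at exactly zero."""
    shift = ratings[reference]
    return {name: round(value - shift, 1) for name, value in ratings.items()}


# Three estimates taken from the first-run Hornets table earlier in the thread.
first_run = {"Paul": 14.7, "Claxton": 12.7, "Mason": -5.2}

# Illustrative only: pretend Mason had been the zero-rated reference player.
vs_mason = rereference(first_run, "Mason")
print(vs_mason)

# The gap between any two players is the same under either reference.
gap_before = round(first_run["Paul"] - first_run["Claxton"], 1)
gap_after = round(vs_mason["Paul"] - vs_mason["Claxton"], 1)
print(gap_before, gap_after)
```

This is why within-team comparisons such as Paul vs. Claxton are unaffected by the choice of reference player, while the share of players with a positive rating is not.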

Quote:
But was Paul burdened with weaker than normal teammates compared to Hornet totals? Comparing your results to just the defensive difference would indicate how much this is a factor. Mason last season, I grant, was a liability to play beside a lot. The other three biggest pairings on the court for Paul, though, were West, Brown and Claxton. Were they a burden? Doesn't seem like it: Claxton was 2nd best in adjusted +/- in your reported numbers, and Brown and West combined were close to neutral (if I am reading it right). Or is time with and without other teammates somehow decisive?


Claxton played 55% of the time with Paul, and 50% with Mason. Paul only 40% with Claxton but 60% with Mason. The difference probably lies there.

Quote:
Claxton and Snyder played the same positions as Paul and are fairly close to as strong on your adjusted +/-. Are they all to be considered really strong players? Or merely the nicest players on that team?


I already said that these data are not good enough for comparisons across teams. I should have also said that, even within a team, comparisons across positions should be done with care. There can be odd results if there isn't enough mixture of players. I'm sure Dan Rosenbaum can explain it much better than me. I'm far from an expert in this stuff, and don't feel that comfortable with my interpretation. I'll give it anyway. In the Hornets' case, there's a good mixture of guards and at the PF-C spots. The SG-SF relationship is probably based on the 9% of minutes that Mason played at SG; the Hornets sucked in that scenario (-20 ppg), so the regression 'concludes' that Mason is much worse than all other guards. For SF-PF, there are around 250 minutes of Butler playing inside, with good results for the team. The intuitive (and simplistic) conclusion would be that Butler is better than West, Brown and co. All in all, the ratings probably overestimate the Hornets' guards and underestimate their big men. How much? I can't tell, but Vroman's rating tells me it's not that bad. Anyway, I'm confident about what I was looking for: the Paul-Claxton comparison and how misleading unadjusted +/- can be.

Quote:
Could the 4 factors on/off numbers for players perhaps be adjusted as well?


Sure they can. You just need to define a reasonable regression and (the hardest part) get the input data. Not something I can do.

Mark
Joined: 20 Aug 2005
Posts: 670

Posted: Mon Oct 02, 2006 7:49 pm

Thanks for the original post and followup. Gave a glimpse at the prospects for the data and a taste of the challenges interpreting it.

basketballvalue
Joined: 07 Mar 2006
Posts: 13

Posted: Mon Oct 02, 2006 10:22 pm

WizardsKev wrote:
I find the way you're counting possessions sorta confusing. We're all accustomed to "per 100 possessions" stats. It would make more sense to me to see the numbers broken down by offensive performance (pts per 100 offensive possessions), defensive performance (pts per 100 defensive possessions), and then have a net +/- per 100 possessions.


Sorry for the delay in responding to all the posts, I'll try and address all the various points.

The reason I approached the problem as I have described is that I'm ultimately planning on producing adjusted +/- stats based on the results of all the "mini-games" that a matchup represents. I have to think this through a little more, but I fear that breaking it up into 100 offensive possessions disconnected from 100 defensive possessions might make that task harder. I'm having a hard enough time as it is. :) At the same time, I see the point of breaking it up as you and deepak_e have described. I can't promise I'll do this right away, but I'll look into it.

Thanks,
Aaron

basketballvalue
Joined: 07 Mar 2006
Posts: 13

Posted: Mon Oct 02, 2006 10:36 pm

deepak_e wrote:
Here's a comparison of the numbers from your site and some of the other frequently used sites here:

Code:

Lakers
              min   PF    PA    Poss   
BV.com        3971  8154  7949  15656 
82games.com   3964  8161  7950  14905
B-R.com       3971  8154  7949   --


Code:

Rockets
              min   PF    PA    Poss   
BV.com        3966  7387  7517  15215
82games.com   3958  7388  7521  14347
B-R.com       3966  7387  7517   --


I think B-R.com's tally of minutes, points forced, and points allowed are precisely from the box scores, so your answers on those are clearly correct. However, there is a discrepancy between the total possessions from your site, and the possessions from 82games.com (which I calculated using the player On/Off data).

This indicates to me that it's not a discrepancy in how matchups are being calculated and tracked, but rather in how possessions are being counted. Either your site is overcounting them, or 82games is undercounting them.


deepak_e,

Thanks for looking this up for me. It's reassuring to see the total for BV.com match BR.com exactly. I want to be clear, though, that there are still some smaller errors inside the play-by-play (thanks to Justin for pointing that out).

As far as possessions, with the automated code I've written, I simply look for a switch in the game log from citing the home team to the away team or vice versa (every line starts with "[TTT" where TTT is the three letter abbreviation of the team). The exceptions I ignore are fouls, timeouts, substitutions, and violations. As a result, I certainly include those 0.1 second possessions at the ends of quarters that Dean mentioned. I think this captures most of the situations that have been mentioned in this thread.
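
A rough Python sketch of the counting rule just described, for anyone who wants to experiment with it. Only two things come from the description above: lines start with "[TTT", and fouls, timeouts, substitutions and violations are skipped. Everything else is an assumption made for illustration: the sample log lines, the keyword matching used to detect the ignored events, and the decision to count a possession each time the acting team changes (including the first one).

```python
# Event types that do not count as a change of possession (from the post above).
IGNORED = ("foul", "timeout", "substitution", "violation")


def count_possessions(log_lines):
    """Count runs of consecutive non-ignored events credited to the same team."""
    possessions = 0
    current_team = None
    for line in log_lines:
        if not line.startswith("["):
            continue  # not a play-by-play event line
        if any(word in line.lower() for word in IGNORED):
            continue  # fouls, timeouts, substitutions, violations
        team = line[1:4]  # three-letter team abbreviation after "["
        if team != current_team:
            possessions += 1
            current_team = team
    return possessions


# Hypothetical log lines; the real game-log wording is not shown in this thread.
sample = [
    "[LAL] Bryant jump shot: made",
    "[HOU] McGrady layup: missed",
    "[LAL] Odom rebound",
    "[HOU] Alston foul",
    "[LAL] Bryant jump shot: missed",
    "[HOU] Yao rebound",
    "[HOU] Substitution: Head for Alston",
    "[HOU] McGrady 3pt shot: made",
]
print(count_possessions(sample))
```

A rule like this counts very short end-of-quarter possessions as full ones, which is one plausible source of a higher total than a formula-based estimate would give.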


Last edited by basketballvalue on Mon Oct 02, 2006 10:43 pm; edited 1 time in total

basketballvalue
Joined: 07 Mar 2006
Posts: 13

Posted: Mon Oct 02, 2006 10:41 pm

cherokee_ACB wrote:
I'm sorry, but there were a couple of mistakes in the above explanation. The first line of "names.txt" must be "Name". You can enter it manually or use the updated Perl script above (which now also does the required conversion of spaces into dots).

The second mistake is that there are duplicated names in the players.txt file (like D. Jones, M. Williams or D. West), and I wasn't handling it (this may have affected the Hornets players' ratings; I'll rerun the regression and report about it). The only solution here is to manually edit db.txt so that the PlayerTrueName becomes a unique identifier. There's something like 10-12 duplicated names. If possible, I'd request Aaron and the people at basketballvalue to define unique names for all players, as it's easier to work with them than with numerical ids.


We've actually worked to create unique IDs. The challenge is that the names aren't unique across the season. When Miami plays a typical team, Shaq is referred to as "O'Neal". However, when they play Indiana, with Jermaine O'Neal, suddenly Shaq is "S. O'Neal". So, what the database shows is that the same player ID can be either of these names on Miami, but the true name is "S. O'Neal" since that's more specific. Sorry if that makes some of the subsequent analysis harder; I thought it would be more valuable to let people see all the names that the game logs use for a player.

Thanks,
Aaron

gabefarkas
Joined: 31 Dec 2004
Posts: 506
Location: NYC

Posted: Tue Oct 03, 2006 5:11 am

Ugh, cherokee why use R? Do you have SPSS? Menu driven is always better, IMO.

As for your calculations, can you explain what are the values that you are using for the regression? Are they +/-? If so, where do the SE's come from?
_________________
Statistics are like a woman's bikini. What it reveals can be fascinating, but what it conceals is ultimately critical!

jkubatko
Joined: 05 Jan 2005
Posts: 329
Location: Columbus, OH

Posted: Tue Oct 03, 2006 8:58 am

gabefarkas wrote:
Ugh, cherokee why use R? Do you have SPSS? Menu driven is always better, IMO.


Menu-driven = less flexibility. I much prefer packages that have their own programming languages (e.g., SAS and R).
_________________
Regards,
Justin Kubatko
Basketball Stats!

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Tue Oct 03, 2006 12:05 pm

@gabefarkas

I don't have SPSS, and R does the job well enough for me. The regression I use is the one described by Dan Rosenbaum
http://www.uncg.edu/bae/people/rosenbaum/NBA/winval2.htm

MARGIN = b0 + b1X1 + b2X2 + . . . + bKXK + e

where margin in my case is the point differential per 48 minutes, instead of the efficiency differential per 100 possessions. The source of the SE is the same as in Dan's regression, isn't it?
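
On the standard errors: assuming the fit is plain weighted least squares, as in the R `lm` call earlier in the thread, they come from the usual textbook formulas below. The notation is introduced here for illustration: X is the matrix of +1/0/-1 columns plus an intercept column, y the per-48 margins, W the diagonal matrix of matchup seconds, n the number of matchups and p the number of estimated coefficients.

```latex
\hat{b} = (X^\top W X)^{-1} X^\top W y,
\qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} w_i \,\bigl(y_i - x_i^\top \hat{b}\bigr)^2}{n - p},
\qquad
\mathrm{SE}(\hat{b}_k) = \sqrt{\hat{\sigma}^2 \,\bigl[(X^\top W X)^{-1}\bigr]_{kk}}
```

The "t val" column in the results table is then the estimate divided by its standard error, so a player who shares nearly all his minutes with the same teammates gets a large SE because his column is hard to separate from theirs.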

cherokee_ACB
Joined: 22 Mar 2006
Posts: 44

Posted: Tue Oct 03, 2006 1:15 pm

basketballvalue wrote:
So, what the database shows is that the same player ID can be either of these names on Miami, but the true name is "S. O'Neal" since that's more specific.


I noticed and appreciate that. It's good to have a single name for each player, and to be able to match it with the different names used in the pbps. What I was asking is for such a PlayerTrueName to be unique. As it is now, your database uses J. Jones for both James Jones and Jumaine Jones, so you are forced to work with numerical ids to differentiate them. I'm just saying this would be helpful, but it isn't really so important (82games is worse in this respect). I can live with what you have now.
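
In the meantime, a quick way to list the true names that would need manual editing is to group player ids by name and keep the groups with more than one id. A minimal Python sketch follows; the ids are invented, and only the J. Jones collision and the S. O'Neal name come from this thread.

```python
from collections import defaultdict


def duplicated_true_names(players):
    """Map each true name shared by several ids to the sorted list of those ids."""
    ids_by_name = defaultdict(list)
    for player_id, true_name in players.items():
        ids_by_name[true_name].append(player_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical ids; in practice this mapping would be read from db.txt.
players = {
    "011": "J. Jones",   # e.g. James Jones
    "012": "J. Jones",   # e.g. Jumaine Jones
    "013": "S. O'Neal",
}
dups = duplicated_true_names(players)
print(dups)
```

Any name this reports would collapse two players into one regression column unless it is made unique first.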

All times are GMT - 5 Hours
Page 2 of 4