
APBRmetrics: The statistical revolution will not be televised.

Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Sun Feb 12, 2006 2:53 am Post subject: Within Game Win Expectancy 


Okay, I assembled a ton of numbers, and posted the data:
http://ca.geocities.com/edkupfer/basketballstuff/PBPScoringData.txt
The file contains just the raw data: about 7,500 unique Time Remaining/Home Team Lead combinations, and about one-quarter million total observations.
I'll be looking at it more closely, and post any results in this thread, but if anyone thinks they have a generalised solution, or a way of getting a better fit to the data than any of my attempts, please give it a shot. _________________ ed 



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Tue Feb 14, 2006 12:40 am Post subject: 


Okay. I got a good result using a logit model with cubed (!) variables. It's ugly, but it's the best I've got so far.
Code:  Logistic Regression Table
Predictor Coef SE Coef Z P
Constant 0.0001040 0.0105958 0.01 0.992
Min^1 0.0238027 0.0021063 11.30 0.000
Min^2 0.0006059 0.0001038 5.84 0.000
Min^3 0.0000064 0.0000014 4.50 0.000
Lead^1 0.137276 0.0010688 128.44 0.000
Lead^2 0.0003527 0.0001139 3.10 0.002
Lead^3 0.0002829 0.0000125 22.60 0.000
(Lead^1)/Min 0.171210 0.0044060 38.86 0.000
(Lead^2)/Min 0.0066804 0.0009404 7.10 0.000
(Lead^3)/Min 0.0069239 0.0001444 47.96 0.000

Min = Minutes remaining = MINUTES + SECONDS/60
Lead = Home Team lead
Here's how it looks for a home team lead over the final three minutes.
Note that the home court advantage disappears near the end. _________________ ed 
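In case anyone wants to plug numbers into the model, here's a minimal Python sketch (mine, not Ed's code) that evaluates the linear predictor and applies the inverse logit. The coefficients are copied from the table as posted; any minus signs may have been lost in the plain-text dump, so treat the values and signs as illustrative rather than authoritative.

```python
import math

# Coefficients copied from the posted Minitab table.  The plain-text dump
# may have dropped minus signs, so these should be treated as illustrative.
COEF = {
    "const": 0.0001040,
    "min1": 0.0238027, "min2": 0.0006059, "min3": 0.0000064,
    "lead1": 0.137276, "lead2": 0.0003527, "lead3": 0.0002829,
    "lead1_min": 0.171210, "lead2_min": 0.0066804, "lead3_min": 0.0069239,
}

def home_win_prob(minutes_left, lead):
    """Inverse logit of the linear predictor.  Undefined at minutes_left == 0
    because of the Lead/Min interaction terms."""
    m, L = minutes_left, lead
    z = (COEF["const"]
         + COEF["min1"] * m + COEF["min2"] * m**2 + COEF["min3"] * m**3
         + COEF["lead1"] * L + COEF["lead2"] * L**2 + COEF["lead3"] * L**3
         + (COEF["lead1_min"] * L + COEF["lead2_min"] * L**2
            + COEF["lead3_min"] * L**3) / m)
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `home_win_prob(3.0, 5)` gives the model's estimate for a home team up 5 with three minutes left.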



mtamada
Joined: 28 Jan 2005 Posts: 377

Posted: Tue Feb 14, 2006 2:28 am Post subject: 


Fabulous. I haven't had time to play around with the formula that you derived, but the numbers look plausible.
Have you talked about this with DeanO? I know that one of his research areas, at least as of a year or two ago, was within-game probabilities of winning, although I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2-point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).
His approach and yours might complement or supplement each other really well. 



tenkev
Joined: 31 Jul 2005 Posts: 20 Location: Memphis,TN

Posted: Tue Feb 14, 2006 2:37 am Post subject: 


I think this is absolutely fantastic.
I've had an idea that relates to this for some time.
If you can calculate the expected winning % at any given time during the game based on point differential, time remaining and possession, then you can make a metric that would blow DanVal out of the water.
Dan's regression formula for deriving his player rating is
MARGIN=b0 + b1X1 + b2X2 + . . . + bKXK + e, where
MARGIN=100*(home team points per possession – away team points per possession)
Well, what if instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blowout, and a game-winning shot is more valuable than another shot, etc.
What do you think? It would take a lot of work, but if somebody did it, that would be the best possible player rating, IMO. 
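tenkev's idea can be sketched in a few lines. The `win_prob` function below is a toy stand-in for a fitted model like Ed's (its shape is my invention, not estimated from anything); the point is only that the same two points are worth more or less depending on the game state.

```python
import math

def win_prob(minutes_left, lead):
    """Toy stand-in for a fitted win-expectancy model: a given lead
    matters more as time runs out.  Not estimated from real data."""
    urgency = (48.0 / max(minutes_left, 0.1)) ** 0.5
    return 1.0 / (1.0 + math.exp(-0.14 * lead * urgency))

def possession_value(min_before, lead_before, min_after, lead_after):
    """Change in home-team win probability over one possession:
    tenkev's proposed replacement for raw point margin."""
    return win_prob(min_after, lead_after) - win_prob(min_before, lead_before)

# A basket that ties the game with a minute left swings the probability far
# more than the same two points scored up 18 in garbage time.
clutch = possession_value(1.5, -2, 1.0, 0)
garbage = possession_value(40.0, 18, 39.5, 20)
```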



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Tue Feb 14, 2006 3:47 am Post subject: 


tenkev wrote:  Well, what if instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blowout, and a game-winning shot is more valuable than another shot, etc. 
I seem to recall that DanR put a "clutch" modifier in his model somewhere. But, yes, I think that for a comprehensive rating system, using changes in win probability as the response variable is preferable to using points.
Quote:  What do you think? It would take a lot of work, but if somebody did it, that would be the best possible player rating, IMO. 
It would take much more work. The stuff I've done — to the extent that I've done anything at all — is coarse. Some problems:
1. I've only used one season of data. That can't be good. This can be addressed soon.
2. Possession isn't indicated, but is clearly an important variable towards the end of the game. This is harder to address, because I don't have an automatic way of digging possession out of the PBPs the way I did with score changes.
3. You'd still need the other data DanR used: the identity of the other players on the floor.
4. Credit needs to be given out. If the probability of a home win increases by 0.2 on a single possession, who gets what credit? Half should be deducted from the defense, obviously, but should it be shared equally among all defenders? Should a single defender be credited? Same thing for the offense, although there it's probably less problematic to assign credit.
I'm envisioning a smaller-scale usage. Maybe a game-level analysis, done one game at a time by any interested fan. This would eliminate most of the problems, since the fan could manually code the missing data. For example, tonight I watched the Raptors at the Wolves, and it seemed to me that KG was a terrifying defensive presence. Since I watched the game, I could print out a PBP and code his defensive assignments manually, along with most of the other players. This type of thing could be done on a larger scale for the playoffs.
I think I'm going to try doing a single game, just to see what kind of problems come up. The Raps are in NY on Wednesday, a game which promises to exhaust my supply of boredom, but maybe scoring the game by hand this way will perk things up. _________________ ed 



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Tue Feb 14, 2006 3:52 am Post subject: 


mtamada wrote:  I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2-point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?). 
I'll drop him a line, unless he wants to pipe up here...
I created a spreadsheet once which simulated the last few minutes of a game, focusing on 2- or 3-point strategies. I really enjoyed working through it, and although I left lots of variables out, I saw at the time how it could be modified to include more — given a model to base it on. I still think I need more data (as noted in my reply to tenkev), but it should be workable, if I get off my lazy butt and collect more data. _________________ ed 
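For anyone curious what such a spreadsheet boils down to, here is a crude Monte Carlo sketch of one slice of the 2-vs-3 question: down 2 with time for one shot. The shooting percentages are round-number assumptions of mine, not league data, and real strategy (fouls, rebounds, timeouts) is ignored.

```python
import random

def sim_last_shot(strategy, trials=100_000, seed=1):
    """Down 2 with one possession left: a made 2 forces overtime
    (counted as half a win), a made 3 wins outright.
    p_two and p_three are assumed FG%, not measured values."""
    random.seed(seed)
    p_two, p_three = 0.48, 0.34
    wins = 0.0
    for _ in range(trials):
        if strategy == "two":
            if random.random() < p_two:
                wins += 0.5     # overtime treated as a coin flip
        elif strategy == "three":
            if random.random() < p_three:
                wins += 1.0
    return wins / trials
```

Under these assumptions the 3 is the better shot (0.34 expected wins versus 0.48 * 0.5 = 0.24), which is the kind of result the simulation exists to make visible; change the percentages and the answer can flip.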



Tmon
Joined: 09 Oct 2005 Posts: 9 Location: Boston

Posted: Fri Feb 17, 2006 4:57 pm Post subject: 


Beautiful stuff, Ed! Thanks for making the data available as well. A few questions/comments:
1. Interesting to note that last year, 4 of 5 home teams won when time stopped with them down by 1 point and 1 second remaining! An anomaly, I'm sure, but never say die!
2. Conversely, only 6 of 11 home teams won when leading by 1 point with 2 seconds left when time stopped. Never relax!
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?
6. Finally, I am playing with this data using MATLAB, and the logistic code I have does not provide "p" values (or anything but the coefficients). Is there a chance anybody has more complete logistic code for MATLAB?
All that said, none of the regressions I've done so far give anything that looks as logical as your chart.
Tmon 



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Fri Feb 17, 2006 5:20 pm Post subject: 


Tmon wrote:  3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems. 
Hmm. Before, I included the proper sign even with squared variables (var^2 = var * |var|). I'm not sure why I didn't do it this time. Might be worth trying again.
Quote:  4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.

There's nothing theoretical about what I did. I tried a bunch of different variables and interaction variables until the results looked good. This was harder than I thought — I never thought I would have to cube anything. If you can think of a way to fit a curve to the data using a more theoretical approach, I would appreciate it. I'm not comfortable with what I have so far.
Quote:  5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?

I used Minitab for the regressions, being much quicker and easier than the more hardcore stats packages I have on my computer. Minitab allows me to use the number of games to weight the outcomes. Imagine my surprise when I found out that other, more complex packages don't allow this as an option on their regressions.
So to answer your question, each game was used as an observation in the regression. I don't know how you'd "unstack" the observations from my data — the way I presented them was pretty much the way I collected them. I suppose you could run a macro to copy each observation g times, where g is the number of games. _________________ ed 
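The macro Ed mentions is easy to sketch: expand each collapsed (time, lead, games, wins) row into one binary-outcome row per game. The column names here are my own labels, not the file's.

```python
def unstack(rows):
    """Expand collapsed (minutes, lead, games, wins) rows into one
    observation per game with a 0/1 home-win outcome."""
    out = []
    for minutes, lead, games, wins in rows:
        out.extend([(minutes, lead, 1)] * wins)            # home wins
        out.extend([(minutes, lead, 0)] * (games - wins))  # home losses
    return out

# e.g. 110 games with a 5-point home lead at the 1:00 mark, 55 of them won:
observations = unstack([(1.0, 5, 110, 55)])
```

Fitting an ordinary logistic regression to the expanded rows gives the same answer as a frequency-weighted fit on the collapsed table.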



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Sat Feb 18, 2006 11:54 am Post subject: 


Update:
I've doubled the number of observations in the data set. It's now about 650,000. I've also uploaded a zip file containing the same data in "unstacked" format, so that every game observation is on its own row, with a binary win/loss outcome in the final column. Any stats package should now be able to handle this without a problem — as long as it can handle 650,000 rows.
http://ca.geocities.com/edkupfer/basketballstuff/DiffUnstacked.zip _________________ ed 



Tmon
Joined: 09 Oct 2005 Posts: 9 Location: Boston

Posted: Wed Feb 22, 2006 4:17 pm Post subject: 


Whoa nelly! Thanks for unstacking all that. I'm playing with as much of the data as I can at a time. I definitely can't load all 650,000 rows; it doesn't even let me try. I tried the final 5 minutes on their own, and it let me get that in, then crashed. Maybe I can look at different leads at specific time points, one at a time or something.
Tmon 
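One way around the memory limit is to stream the file and re-collapse it on the fly, rather than loading all 650,000 rows at once. The sketch below assumes a comma-separated layout with columns for minutes remaining, home lead, and a 0/1 home-win flag; the actual file layout may differ.

```python
import csv
from collections import defaultdict

def stream_counts(path):
    """Aggregate the unstacked file one row at a time, so the full
    data set never has to sit in memory at once.  Assumed columns:
    minutes_remaining, home_lead, home_win (0/1)."""
    games = defaultdict(int)
    wins = defaultdict(int)
    with open(path, newline="") as f:
        for minutes, lead, win in csv.reader(f):
            key = (float(minutes), int(lead))
            games[key] += 1
            wins[key] += int(win)
    return games, wins
```

From the two dictionaries you can pull the empirical win% for any specific time/lead combination without ever holding the whole file in memory.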



farbror
Joined: 13 Oct 2005 Posts: 15 Location: Sweden

Posted: Thu Mar 09, 2006 4:08 am Post subject: 


This is really interesting stuff! How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games? 



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Thu Mar 09, 2006 11:43 am Post subject: 


farbror wrote:  How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games? 
I have every game from 2004-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is: given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons. _________________ ed 



Tmon
Joined: 09 Oct 2005 Posts: 9 Location: Boston

Posted: Thu Mar 09, 2006 6:38 pm Post subject: 


Ed,
I'm still liking this stuff a lot, and I think you are essentially there for end-of-game situations. However, I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and the end of the third quarter. My gut says people often put far too much emphasis on the score at these landmark times. Your current formula is probably just as valid at these times, but doing this would also reduce the number of variables, of course, so I can play too.
You could just use every game, instead of randomly sampling. I think there are too few observations at these times in the massive data file you posted to really get a good picture if I pull them out selectively. Any chance you're interested? Gotta figure this out so we can yell at the TV when they say "xx holds a commanding lead at the half"!
Tmon 



gabefarkas
Joined: 31 Dec 2004 Posts: 1313 Location: Durham, NC

Posted: Thu Mar 09, 2006 6:53 pm Post subject: 


Ed Küpfer wrote:  farbror wrote:  How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games? 
I have every game from 2004-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is: given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons. 
So, correct me if I'm wrong, but essentially you could have the following:
Probability(HomeTeamWin) = ( ( x * (L^a) ) + ( y * (T^b) ) ) * E
where x, y, a and b are integers that establish the output probability to be within a certain range and/or threshold, and E is a normalizing component or "fudge factor" to bring the bounds to {0, 1}, making it a true probability.
Perhaps you might even need the natural log of the above to smooth it out.
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level of error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct? 
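One note on the bounding: a logit model like the one Ed fitted doesn't need a separate normalizing factor E, because the inverse-logit link maps any real-valued combination of the lead and time terms into (0, 1) by construction. A short illustration:

```python
import math

def inv_logit(z):
    """The logistic link: squashes any real-valued linear predictor
    into (0, 1), so the output is always a valid probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Even extreme scores stay strictly inside the (0, 1) bounds.
assert 0.0 < inv_logit(-30.0) < inv_logit(0.0) < inv_logit(30.0) < 1.0
```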



Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto

Posted: Thu Mar 09, 2006 7:17 pm Post subject: 


Tmon wrote:  I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and endofthree. 
Like this?
Code:  TIME REMAINING
EndQ1 Half EndQ3 10:00 5:00 3:00 2:00 1:00 0:40 0:30 0:20 0:10
20 .92 .96 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
15 .90 .92 .97 .98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
12 .87 .88 .93 .94 .99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
10 .84 .85 .88 .90 .95 .99 1.00 1.00 1.00 1.00 1.00 1.00
H 9 .83 .83 .86 .87 .93 .97 .99 1.00 1.00 1.00 1.00 1.00
O 8 .81 .81 .83 .84 .89 .94 .97 1.00 1.00 1.00 1.00 1.00
M 7 .79 .79 .80 .81 .85 .90 .94 .99 1.00 1.00 1.00 1.00
E 6 .77 .76 .77 .77 .80 .85 .90 .97 .99 1.00 1.00 1.00
5 .74 .74 .73 .73 .75 .79 .84 .93 .97 .99 1.00 1.00
T 4 .72 .71 .70 .70 .71 .73 .77 .86 .92 .95 .99 1.00
E 3 .69 .68 .66 .66 .66 .67 .70 .77 .83 .87 .94 .99
A 2 .66 .65 .63 .62 .61 .62 .63 .67 .72 .76 .83 .94
M 1 .63 .61 .59 .58 .57 .57 .57 .59 .61 .63 .67 .78
0 .59 .58 .55 .55 .53 .52 .51 .51 .50 .50 .50 .50
L -1 .56 .54 .51 .51 .48 .47 .46 .43 .41 .39 .34 .24
E -2 .52 .51 .48 .47 .44 .42 .40 .35 .31 .27 .20 .08
A -3 .49 .47 .44 .43 .39 .36 .34 .26 .21 .16 .09 .01
D -4 .45 .43 .40 .39 .35 .31 .27 .18 .11 .07 .03 .00
-5 .42 .40 .36 .35 .30 .25 .20 .10 .05 .02 .00 .00
-6 .39 .36 .32 .31 .25 .19 .13 .05 .02 .00 .00 .00
-7 .35 .33 .28 .27 .19 .13 .08 .02 .00 .00 .00 .00
-8 .33 .30 .24 .22 .15 .08 .04 .00 .00 .00 .00 .00
-9 .30 .27 .21 .19 .10 .05 .02 .00 .00 .00 .00 .00
-10 .27 .24 .17 .15 .07 .02 .01 .00 .00 .00 .00 .00
-12 .23 .19 .11 .09 .02 .00 .00 .00 .00 .00 .00 .00
-15 .18 .13 .05 .03 .00 .00 .00 .00 .00 .00 .00 .00
-20 .14 .07 .01 .00 .00 .00 .00 .00 .00 .00 .00 .00 
Since all that represents the probability of an average home team beating an average away team, it's more interesting to use the numbers above to modify the log5 formula, like this:
Probability of Home Team Win = (HomeWin * (1 - AwayWin) * W) / (HomeWin * (1 - AwayWin) * W + (1 - HomeWin) * AwayWin * (1 - W))
where HomeWin and AwayWin represent some estimate of the Home and Away teams' win ability (like their Win% or Pythagorean or something), and
W = some HCA weight. Normally, we use simple HCA, which is about 0.6, but the win expectancy equation returns a more precise weight, given the game circumstances.
For example, if two average teams are playing, and the home team has a 5 point lead at halftime, they have a 0.74 probability of a win. But if the home team is the Lakers (Win% = 0.5) and the away team is the Hawks (Win% = 0.3), then the home team win probability is
Code:  p(HomeWin) = (0.5 * (1 - 0.3) * 0.74) / (0.5 * (1 - 0.3) * 0.74 + (1 - 0.5) * 0.3 * (1 - 0.74))
= 0.87 
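Ed's worked example, transcribed directly into code (the team strengths and the 0.74 weight are taken from the post above):

```python
def log5_home_win(home_win, away_win, w):
    """Ed's HCA-weighted log5: home_win and away_win are the teams'
    win abilities, w the win-expectancy weight for the game state
    (0.74 for an average home team up 5 at halftime)."""
    num = home_win * (1 - away_win) * w
    return num / (num + (1 - home_win) * away_win * (1 - w))

# Lakers (0.5) hosting the Hawks (0.3), up 5 at the half:
p = log5_home_win(0.5, 0.3, 0.74)   # ~0.87
```

With two average teams (0.5 each) and the standard HCA weight of 0.6, the formula collapses back to 0.6, as it should.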
_________________ ed 





