APBRmetrics Forum Index APBRmetrics
The statistical revolution will not be televised.
 

Within Game Win Expectancy
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Sun Feb 12, 2006 2:53 am    Post subject: Within Game Win Expectancy

Okay, I assembled a ton of numbers, and posted the data:
http://ca.geocities.com/edkupfer/basketballstuff/PBPScoringData.txt

The file contains just the raw data, about 7500 unique Time Remaining/Home Team Lead combinations, and about one-quarter million total observations.

I'll be looking at it more closely, and will post any results in this thread, but if anyone thinks they have a generalised solution, or a way of getting a better fit to the data than any of my attempts, please give it a shot.
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Tue Feb 14, 2006 12:40 am

Okay. I got a good result using a logit model with cubed (!) variables. It's ugly, but it's the best I got so far.

Code:
Logistic Regression Table

Predictor           Coef    SE Coef       Z      P
Constant       0.0001040  0.0105958    0.01  0.992
Min^1          0.0238027  0.0021063   11.30  0.000
Min^2         -0.0006059  0.0001038   -5.84  0.000
Min^3          0.0000064  0.0000014    4.50  0.000
Lead^1          0.137276  0.0010688  128.44  0.000
Lead^2        -0.0003527  0.0001139   -3.10  0.002
Lead^3        -0.0002829  0.0000125  -22.60  0.000
(Lead^1)/Min    0.171210  0.0044060   38.86  0.000
(Lead^2)/Min   0.0066804  0.0009404    7.10  0.000
(Lead^3)/Min   0.0069239  0.0001444   47.96  0.000


Min = Minutes remaining = MINUTES + SECONDS/60
Lead = Home Team lead

Here's how it looks for a home team lead over the final three minutes.



Note that the home court advantage disappears near the end.
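For readers who want to evaluate the fitted model directly, here is a minimal Python sketch. It assumes the table rows labelled "(Lead^k)/Min" mean Lead^k divided by Min, that the squared terms are plain squares (not sign-preserving), and that the fitted value is the log-odds of a home win. The function and variable names are illustrative and are not taken from the Minitab output.

```python
import math

# Coefficients copied from the logistic regression table above.
CONSTANT = 0.0001040
MIN_COEFS = (0.0238027, -0.0006059, 0.0000064)        # Min^1, Min^2, Min^3
LEAD_COEFS = (0.137276, -0.0003527, -0.0002829)       # Lead^1, Lead^2, Lead^3
LEAD_PER_MIN_COEFS = (0.171210, 0.0066804, 0.0069239)  # (Lead^k)/Min, k = 1..3


def home_win_probability(minutes_remaining, lead):
    """Estimated probability of a home win from the fitted logit model.

    minutes_remaining: MINUTES + SECONDS/60, must be greater than zero.
    lead: home team lead in points (negative when the home team trails).
    """
    if minutes_remaining <= 0:
        # The Lead/Min interaction terms divide by Min.
        raise ValueError("minutes_remaining must be positive")

    z = CONSTANT
    for k in (1, 2, 3):
        z += MIN_COEFS[k - 1] * minutes_remaining ** k
        z += LEAD_COEFS[k - 1] * lead ** k
        z += LEAD_PER_MIN_COEFS[k - 1] * lead ** k / minutes_remaining

    # Logistic link, written so that large |z| does not overflow math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    for lead in (0, 2, 5):
        print(lead, round(home_win_probability(3.0, lead), 2))
```

Under these assumptions, a 5-point home lead with three minutes left comes out at roughly 0.79, and a tie at the same point at roughly 0.52. The polynomial terms can misbehave for very large leads early in the game, so the sketch is best read as an end-of-game approximation.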
_________________
ed
mtamada



Joined: 28 Jan 2005
Posts: 186

Posted: Tue Feb 14, 2006 2:28 am

Fabulous. I haven't had time to play around with the formula that you derived, but the numbers look plausible.

Have you talked about this with DeanO? I know that one of his research areas, at least as of a year or two ago, was within-game probabilities-of-winning, although I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).

His and your approach might complement or supplement each other real well.
tenkev



Joined: 31 Jul 2005
Posts: 13
Location: Memphis,TN

Posted: Tue Feb 14, 2006 2:37 am

I think this is absolutely fantastic.

I've had an idea that relates to this for some time.

If you can calculate the expected winning % at any given time during the game based on point differential, time remaining and possession, then you can make a metric that would blow DanVal out of the water.

Dan's regression formula for deriving his player rating is
MARGIN=b0 + b1X1 + b2X2 + . . . + bKXK + e, where
MARGIN=100*(home team points per possession – away team points per possession)

Well, instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?

This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.

What do you think? It would take a lot of work, but if somebody did it that would be the best possible player rating, IMO.
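To illustrate the idea of using the change in expected winning % as the response, here is a rough Python sketch built on the coefficients from Ed's logistic regression table earlier in the thread. It assumes "(Lead^k)/Min" means Lead^k divided by Min, and it holds the clock fixed while the score changes, which is a simplification. The helper names `win_prob` and `win_prob_added` are hypothetical.

```python
import math


def win_prob(minutes, lead):
    """Home win probability from the fitted logit model (minutes > 0)."""
    b_min = (0.0238027, -0.0006059, 0.0000064)
    b_lead = (0.137276, -0.0003527, -0.0002829)
    b_ratio = (0.171210, 0.0066804, 0.0069239)
    z = 0.0001040
    for k in (1, 2, 3):
        z += b_min[k - 1] * minutes ** k
        z += b_lead[k - 1] * lead ** k
        z += b_ratio[k - 1] * lead ** k / minutes
    # Overflow-safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def win_prob_added(minutes, lead_before, home_points):
    """Change in home win probability when the home team scores.

    The clock is held fixed, so this isolates the effect of the score change.
    """
    return win_prob(minutes, lead_before + home_points) - win_prob(minutes, lead_before)


# The same 2-point basket with one minute left, in a tie game and in a blowout.
close_game = win_prob_added(1.0, 0, 2)
blowout = win_prob_added(1.0, 15, 2)

if __name__ == "__main__":
    print(f"tie game: {close_game:+.3f}   up 15: {blowout:+.3f}")
```

With these coefficients the basket in the tie game moves the win probability by well over ten percentage points, while the same basket with a 15-point lead moves it by essentially nothing.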
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Tue Feb 14, 2006 3:47 am

tenkev wrote:
Well, instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?

This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.


I seem to recall that DanR put a "clutch" modifier in his model somewhere. But, yes, I think that for a comprehensive rating system, using changes in win probability as the response variable is preferable to using points.

Quote:
What do you think? It would take a lot of work, but if somebody did it that would be the best possible player rating, IMO.


It would take much more work. The stuff I've done (to the extent that I've done anything at all) is coarse. Some problems:

1. I've only used one season of data. That can't be good. This can be addressed soon.

2. Possession isn't indicated, but is clearly an important variable towards the end of the game. This is harder to address, because I don't have an automatic way of digging possession out of the PBPs the way I did with score changes.

3. You'd still need the other data DanR used: the identity of the other players on the floor.

4. Credit needs to be given out. If the probability of a home win increases by 0.2 on a single possession, who gets what credit? Half should be deducted from the defense, obviously, but should it be shared equally among all defenders? Should a single defender be credited? Same thing for the offense, although there it's probably less problematic to assign credit.

I'm envisioning a smaller scale usage. Maybe a game-level analysis, done one game at a time by any interested fan. This would eliminate most of the problems, since the fan could manually code the missing data. For example, tonight I watched the Raptors at the Wolves, and it seemed to me that KG was a terrifying defensive presence. Since I watched the game, I could print out a PBP and code his defensive assignments manually, along with most of the other players. This type of thing could be done on a larger scale for the playoffs.

I think I'm going to try doing a single game, just to see what kind of problems come up. The Raps are in NY on Wednesday, a game which promises to exhaust my supply of boredom, but maybe scoring the game by hand this way will perk things up.
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Tue Feb 14, 2006 3:52 am

mtamada wrote:
I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).


I'll drop him a line, unless he wants to pipe up here...

I created a spreadsheet once which simulated the last few minutes of a game, focusing on 2- or 3-pt strategies. I really enjoyed working through it, and although I left lots of variables out, I saw at the time how it could be modified to include more — given a model to base it on. I still think I need more data (as noted in my reply to tenkev) but it should be workable, if I get off my lazy butt to collect more data.
_________________
ed
Tmon



Joined: 09 Oct 2005
Posts: 9
Location: Boston

Posted: Fri Feb 17, 2006 4:57 pm

Beautiful stuff, Ed! Thanks for making the data available as well. A few questions/comments:

1. Interesting to note that last year, 4/5 home teams won when time stopped, down by 1 pt with 1 sec remaining! An anomaly, I'm sure, but never say die!

2. Conversely, only 6/11 home teams won when winning by 1 pt with 2 secs left when time stopped. Never relax!

3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.

4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.

5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?

6. Finally, I am playing with this data using MATLAB, and the logistic code I have does not provide "p" values (or anything but the coefficients). Is there a chance anybody has more complete logistic code for MATLAB?

All that said, none of the regressions I've done so far give anything that looks as logical as your chart.

-Tmon
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Fri Feb 17, 2006 5:20 pm

Tmon wrote:
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.


Hmm. Before, I included the proper sign even with squared variables (var^2 = var * |var|). I'm not sure why I didn't do it this time. Might be worth trying again.
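The sign-preserving square mentioned here (var * |var|) can be sketched in a few lines of Python. The helper name `signed_power` is hypothetical; the point is that a plain square treats a 5-point lead and a 5-point deficit identically, while the signed version does not. Odd powers such as the cube already keep the sign, so only the even powers are affected.

```python
import math


def signed_power(value, exponent):
    """Raise |value| to `exponent`, then restore the sign of `value`.

    For exponent 2 this equals value * abs(value).
    """
    return math.copysign(abs(value) ** exponent, value)


if __name__ == "__main__":
    for lead in (-5, 0, 5):
        # Plain square loses the sign; the signed square keeps it.
        print(lead, lead ** 2, signed_power(lead, 2))
```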

Quote:
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.


There's nothing theoretical about what I did. I tried a bunch of different variables and interaction variables until the results looked good. This was harder than I thought; I never expected to have to cube anything. If you can think of a way to fit a curve to the data using a more theoretical approach, I would appreciate it. I'm not comfortable with what I have so far.

Quote:
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?

I used Minitab for the regressions, being much quicker and easier than the more hardcore stats packages I have on my computer. Minitab allows me to use the number of games to weight the results of the outcomes. Imagine my surprise when I found out that other, more complex packages don't allow this as an option on the regressions.

So to answer your question, each game was used as an observation in the regression. I don't know how you'd "unstack" the observations from my data; the way I presented them was pretty much the way I collected them. I suppose you could run a macro to copy each observation g times, where g is the number of games.
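One way the "copy each observation g times" idea could look in Python is sketched below. The column layout (minutes remaining, home lead, number of games, number of home wins) is an assumption made for illustration; the actual layout of the posted text file is not shown in this thread.

```python
def unstack(rows):
    """Expand aggregated rows into one row per game with a 0/1 outcome.

    rows: iterable of (minutes, lead, games, home_wins) tuples.
          This column layout is hypothetical.
    Yields (minutes, lead, outcome) with outcome 1 for a home win, else 0.
    """
    for minutes, lead, games, home_wins in rows:
        if not 0 <= home_wins <= games:
            raise ValueError("home_wins must be between 0 and games")
        for i in range(games):
            # The first `home_wins` copies are wins, the rest are losses.
            yield (minutes, lead, 1 if i < home_wins else 0)


# Tmon's example: 110 observations at 1 minute with a lead of 5, 55 of them wins,
# plus a single-observation combination (made-up values) for contrast.
stacked = [(1.0, 5, 110, 55), (1.0, 7, 1, 1)]
unstacked = list(unstack(stacked))

if __name__ == "__main__":
    print(len(unstacked), sum(outcome for _, _, outcome in unstacked))
```

Fitting a logistic regression on the expanded rows gives each time/lead combination influence in proportion to its number of games, which is the same effect as weighting the aggregated rows by games.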
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Sat Feb 18, 2006 11:54 am

Update:

I've doubled the number of observations in the data set. It's now about 650,000. I've also uploaded a zip file containing the same data in "unstacked" format, so that every game observation is on its own row, with a binary win/loss outcome in the final column. Any stats package should now be able to handle this without a problem, as long as it can handle 650,000 rows.

http://ca.geocities.com/edkupfer/basketballstuff/DiffUnstacked.zip
_________________
ed
Tmon



Joined: 09 Oct 2005
Posts: 9
Location: Boston

Posted: Wed Feb 22, 2006 4:17 pm

Whoa, Nelly! Thanks for unstacking all that. I'm taking and playing with as much data as I can at a time. I definitely can't load all 650,000 rows; it doesn't even let me try. I tried for 5 minutes on, and it let me get it in there, then crashed. Maybe I can look at different leads at specific time points one at a time or something.

-Tmon
farbror



Joined: 13 Oct 2005
Posts: 15
Location: Sweden

Posted: Thu Mar 09, 2006 4:08 am

This is really interesting stuff! How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Thu Mar 09, 2006 11:43 am

farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?


I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
_________________
ed
Tmon



Joined: 09 Oct 2005
Posts: 9
Location: Boston

Posted: Thu Mar 09, 2006 6:38 pm

Ed,

I'm still liking this stuff a lot, and I think you are essentially there for end-of-game situations. However, I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three. My gut says people often put far too much emphasis on the score at these landmark times. Your current formula is probably just as valid at these times, but doing this would also reduce the number of variables, of course, so I can play too :)

You could just use every game, instead of randomly sampling. I think there are too few observations at these times in the massive data file you posted to really get a good picture if I pull these out selectively. Any chance you're interested? Gotta figure this out so we can yell at the TV when they say "xx holds a commanding lead at the half"!

-Tmon
gabefarkas



Joined: 31 Dec 2004
Posts: 976
Location: Durham, NC

Posted: Thu Mar 09, 2006 6:53 pm

Ed Küpfer wrote:
farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?


I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.


So, correct me if I'm wrong, but essentially you could have the following:

Probability(HomeTeamWin) = ( ( x * (L^a) ) + ( y * (T^b) ) ) * E

where x, y, a and b are integers that establish the output probability to be within a certain range and/or threshold, and E is a normalizing component or "fudge factor" to bring the bounds to {0, 1}, making it a true probability.

Perhaps you might even need the natural log of the above to smooth it out.

In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level of error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Thu Mar 09, 2006 7:17 pm

Tmon wrote:
I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three.


Like this?

Code:
                                      TIME REMAINING
   
         EndQ1   Half  EndQ3  10:00   5:00   3:00   2:00   1:00   0:40   0:30   0:20   0:10
   
     20    .92    .96   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
     15    .90    .92    .97    .98   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
     12    .87    .88    .93    .94    .99   1.00   1.00   1.00   1.00   1.00   1.00   1.00
     10    .84    .85    .88    .90    .95    .99   1.00   1.00   1.00   1.00   1.00   1.00
H     9    .83    .83    .86    .87    .93    .97    .99   1.00   1.00   1.00   1.00   1.00
O     8    .81    .81    .83    .84    .89    .94    .97   1.00   1.00   1.00   1.00   1.00
M     7    .79    .79    .80    .81    .85    .90    .94    .99   1.00   1.00   1.00   1.00
E     6    .77    .76    .77    .77    .80    .85    .90    .97    .99   1.00   1.00   1.00
      5    .74    .74    .73    .73    .75    .79    .84    .93    .97    .99   1.00   1.00
T     4    .72    .71    .70    .70    .71    .73    .77    .86    .92    .95    .99   1.00
E     3    .69    .68    .66    .66    .66    .67    .70    .77    .83    .87    .94    .99
A     2    .66    .65    .63    .62    .61    .62    .63    .67    .72    .76    .83    .94
M     1    .63    .61    .59    .58    .57    .57    .57    .59    .61    .63    .67    .78
      0    .59    .58    .55    .55    .53    .52    .51    .51    .50    .50    .50    .50
L    -1    .56    .54    .51    .51    .48    .47    .46    .43    .41    .39    .34    .24
E    -2    .52    .51    .48    .47    .44    .42    .40    .35    .31    .27    .20    .08
A    -3    .49    .47    .44    .43    .39    .36    .34    .26    .21    .16    .09    .01
D    -4    .45    .43    .40    .39    .35    .31    .27    .18    .11    .07    .03    .00
     -5    .42    .40    .36    .35    .30    .25    .20    .10    .05    .02    .00    .00
     -6    .39    .36    .32    .31    .25    .19    .13    .05    .02    .00    .00    .00
     -7    .35    .33    .28    .27    .19    .13    .08    .02    .00    .00    .00    .00
     -8    .33    .30    .24    .22    .15    .08    .04    .00    .00    .00    .00    .00
     -9    .30    .27    .21    .19    .10    .05    .02    .00    .00    .00    .00    .00
    -10    .27    .24    .17    .15    .07    .02    .01    .00    .00    .00    .00    .00
    -12    .23    .19    .11    .09    .02    .00    .00    .00    .00    .00    .00    .00
    -15    .18    .13    .05    .03    .00    .00    .00    .00    .00    .00    .00    .00
    -20    .14    .07    .01    .00    .00    .00    .00    .00    .00    .00    .00    .00


Since all that represents the probability of an average home team beating an average away team, it's more interesting to use the numbers above to modify the log5 formula, like this:

Probability of Home Team Win = (HomeWin * (1 - AwayWin) * W) / (HomeWin * (1 - AwayWin) * W + (1 - HomeWin) * AwayWin * (1 - W))

where HomeWin and AwayWin represent some estimate of the Home and Away teams' win ability (like their Win% or Pythagorean or something), and
W = some HCA weight. Normally, we use simple HCA, which is about 0.6, but the win expectancy equation returns a more precise weight, given the game circumstances.

For example, if two average teams are playing, and the home team has a 5 point lead at halftime, they have a 0.74 probability of a win. But if the home team is the Lakers (Win% = 0.5) and the away team is the Hawks (Win% = 0.3), then the home team win probability is

Code:
p(HomeWin) = (0.5 * (1 - 0.3) * 0.74) / (0.5 * (1 - 0.3) * 0.74 + (1 - 0.5) * 0.3 * (1 - 0.74))
            = 0.87
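The weighted log5 calculation above can be written as a small Python function. This is only a sketch of the formula as stated in this post; the function name `log5_with_weight` and the zero-denominator guard are additions for illustration.

```python
def log5_with_weight(home_win, away_win, w):
    """Weighted log5 estimate of the home team's win probability.

    home_win, away_win: estimates of each team's win ability (e.g. Win%).
    w: home-court weight, e.g. a flat 0.6 or a within-game win expectancy
       for two average teams in the current game situation.
    """
    numerator = home_win * (1 - away_win) * w
    denominator = numerator + (1 - home_win) * away_win * (1 - w)
    if denominator == 0:
        raise ValueError("inputs leave the probability undefined")
    return numerator / denominator


if __name__ == "__main__":
    # The worked example above: Win% 0.5 at home vs. 0.3 away, W = 0.74.
    print(round(log5_with_weight(0.5, 0.3, 0.74), 2))
```

A quick sanity check on the formula: when both teams have the same win ability, the result reduces to W itself, so two average teams get back exactly the table value.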

_________________
ed
All times are GMT - 5 Hours