APBRmetrics Forum Index APBRmetrics
The statistical revolution will not be televised.
 

Within Game Win Expectancy
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Thu Mar 09, 2006 7:24 pm

gabefarkas wrote:
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level of error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?


Yeah. I used a logistic regression model (it's up there in the second post of this thread), which takes the form:

p = 1 / (1 + EXP(-b))

where b = the sum of all the variables (time, time^2, time^3, etc.), each weighted by its regression coefficient. I don't know how familiar you are with logistic regression, but it's used on events that have binary outcomes (e.g. win/loss), and returns a nice s-curve bounded at 0 and 1.
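For anyone who wants to see the formula in action, here is a minimal Python sketch. The choice of terms (linear, quadratic, and cubic in both time and lead) follows the question quoted above, but the function name, the ordering of the terms, and the coefficients are made up for illustration; they are not the fitted values from the model.

```python
import math

def win_probability(time_left, lead, coefs):
    """Logistic model p = 1 / (1 + exp(-b)), where b is the weighted sum of the terms.

    coefs is assumed to be ordered as:
    intercept, time, time^2, time^3, lead, lead^2, lead^3.
    """
    terms = [1.0, time_left, time_left**2, time_left**3, lead, lead**2, lead**3]
    b = sum(c * x for c, x in zip(coefs, terms))
    return 1.0 / (1.0 + math.exp(-b))

# Made-up coefficients, NOT the regression results from this thread.
example_coefs = [0.3, 0.0, 0.0, 0.0, 0.15, 0.0, 0.0]

p_ahead = win_probability(12.0, 5, example_coefs)    # home team up 5
p_behind = win_probability(12.0, -5, example_coefs)  # home team down 5
```

Whatever coefficients go in, the output stays strictly between 0 and 1, which is the s-curve property described above.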
_________________
ed
gabefarkas



Joined: 31 Dec 2004
Posts: 974
Location: Durham, NC

Posted: Thu Mar 09, 2006 10:22 pm

yeah, i know logistic regression stuff somewhat. there are also Poisson regression models, which are used for count data or rate data.

that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Thu Mar 09, 2006 11:01 pm

gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.


Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?
_________________
ed
gabefarkas



Joined: 31 Dec 2004
Posts: 974
Location: Durham, NC

Posted: Thu Mar 09, 2006 11:08 pm

well, the outcome would be modeled as a Poisson, rather than as a Binary.

let me give it some more thought and get back to you.
farbror



Joined: 13 Oct 2005
Posts: 15
Location: Sweden

Posted: Fri Mar 10, 2006 1:40 am

Ed Küpfer wrote:
I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue
(my bold)

My gut feeling is that correlation is a major issue! A standard logistic regression would be based on the assumption that the data points are independent. Data from the same game are not.

With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.

.....and then perhaps repeat the validation a few times?
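The validation idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`game_id`, `seconds_left`, `home_lead`, `home_win`) and the toy data are hypothetical, and the regression step itself is left out.

```python
import random
from collections import defaultdict

def sample_one_per_game(observations, rng):
    """Pick one random observation from each game, so no two sampled points share a game."""
    by_game = defaultdict(list)
    for obs in observations:
        by_game[obs["game_id"]].append(obs)
    return [rng.choice(rows) for rows in by_game.values()]

# Toy data (hypothetical field names): three games, three observations each.
observations = [
    {"game_id": g, "seconds_left": s, "home_lead": lead, "home_win": win}
    for g, win in [(1, 1), (2, 0), (3, 1)]
    for s, lead in [(2400, -3), (1200, 2), (60, 4)]
]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Repeat the draw a few times; each draw would feed its own logistic regression.
samples = [sample_one_per_game(observations, rng) for _ in range(10)]
```

Each element of `samples` has exactly one row per game, so the rows within a draw do not come from the same game.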

Do you in any way model the strengths of the involved teams? Falling a few points behind, say, today's Portland might be easier to overcome than trailing Detroit.

Poisson regression: Poisson regression is an excellent model for soccer scores and hockey scores. You might need to do some clever stuff with the dispersion parameter if you try to model hoops using Poisson regression.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Fri Mar 10, 2006 1:09 pm

farbror wrote:
My gut feeling is that correlation is a major issue! A standard logistic regression would be based on the assumption that the data points are independent. Data from the same game are not.


I understand what you're saying, but I still don't see how it is a big issue. Think of the question we're trying to answer: given a home team lead of T, and M minutes remaining in the game, what is the probability of a home team win? I can't see how sampling repeatedly from the same game, but at different points, affects the answer here.

farbror wrote:
With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.

.....and then perhaps repeat the validation a few times?


Okay, I did this. The problem with this approach is that there are not nearly enough data to give a significant regression result. For example, I repeated the process of sampling a single point from each game 10 times, and I still haven't sampled a single datapoint that has a home team lead with 5-10 minutes remaining. Think of all the possible Time/Lead combinations: say home team leads between -15 and 15 and 48 minutes in a game (actually, I recorded the time down to the second). That gives us about 1500 possibilities, which means that every sample will have an average of about a single datapoint per Time/Lead combination. This won't tell us anything.
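To spell out the arithmetic behind the 1500 figure, here is a rough Python sketch. It assumes one-point lead steps and whole-minute time buckets, and it takes the 1230-game count for 04-05 mentioned elsewhere in this thread.

```python
leads = range(-15, 16)   # home team lead from -15 to +15, inclusive: 31 values
minutes = 48             # whole minutes in a regulation game
cells = len(leads) * minutes

games = 1230             # one sampled datapoint per game
points_per_cell = games / cells

print(cells)                      # 1488, i.e. roughly 1500
print(round(points_per_cell, 2))  # 0.83, i.e. under one datapoint per cell
```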

I don't want to dismiss your objections out of hand. But I'm still not sure if a) the correlation issue really makes a difference (I'm not very familiar with the problems inherent in the resampling approach I used), and b) a practical alternative can be conceived. So far, the approach I used at least matches our intuitive feel of how the numbers should look, for whatever that's worth.
_________________
ed
farbror



Joined: 13 Oct 2005
Posts: 15
Location: Sweden

Posted: Mon Mar 13, 2006 4:24 am

Ed>>

Robust estimation of correlation structures for repeated measurements has been my field of research for some time. It is rather tricky (and I try to deal with simple stuff). The major quirk is that it is really hard to tell when the correlation structure has a major impact on the results.

If 1000+ data points are too few to get significant results, then that is a very interesting finding in itself. It might be an indication that factors other than "time remaining" and "score" are (even more) important predictors.

I appreciate your efforts to investigate this interesting topic. Also, I am very grateful that you share the results.

Cheers, farbror
Sweden
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Wed Mar 15, 2006 12:10 pm

farbror, I tried this again, this time focusing on a) the final minute of games, and b) the final 2 minutes of games. Neither pass gave me significant results. I think what I have to do is re-visit this issue when I have more data. Probably this summer I'll have added two more seasons' worth to work with.

For now, all I can say is that the results above seem to conform to my intuition. I prefer to look at it as a useful, pragmatic hack, rather than a reflection of reality. I promise not to put any more confidence in it than it deserves.
_________________
ed
Jon Cohodas



Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA

Posted: Fri Apr 28, 2006 3:44 pm

Ed,
This dataset is pure gold. I need to be brief right now (I promise to follow up after I am off the clock today), but here are some of the things I did with a very similar dataset of college football games. Some of these things you probably have already tried.

* Rather than parameterizing the model using the regression, I created an "empirical matrix" of time remaining versus margin. In other words, there would be an empirical cell that says that "empirically" (I'm making up numbers here), when the home team is up by 8 points with exactly 5 minutes left, since they won 10/16 times, that cell would have an "empirical" probability of .625.

Question: Are these recorded events whenever the score changed, or are other events included as well? I ask because, if it is just changes of score, then it should be easy to fill in the blanks for all of the times in between scores.

* One way to get the data to be a little smoother where you do not have many observations, without resorting to parameterization, is to set up a Markov transition matrix for each time/delta. This basically means that prob(W|t,d) is a function of the sum of the different prob(W|t-1).

* I love the game graphs! A simple but very telling statistic is what I called the gamescore, which is the integral of your graph. As you stated, this statistic "scores" the game on the change of probability over time and is useful for collapsing blowouts and getting at the "true" closeness of a game. I found that for college football this statistic was better than Margin Of Victory (MOV) at predicting future matchups.

Question 2: Would you be willing to provide a version of the data with the teams involved? That would make it possible to give each game a gamescore.

* One more thing I pursued: once I had gamescores for a season, I was able to estimate a MOV based on gamescores. This was helpful for those who might want to *ahem* predict a MOV for whatever reason. :)
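The "empirical matrix" from the first bullet can be sketched in Python as a table of win fractions per (time, margin) cell. This is an illustration only: the tuple layout and function name are assumptions, and the data reuses the made-up 10-of-16 example from the bullet.

```python
from collections import defaultdict

def empirical_matrix(observations):
    """Map each (minutes_left, home_margin) cell to the fraction of home wins seen there.

    observations: iterable of (minutes_left, home_margin, home_won) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [wins, total]
    for minutes_left, margin, home_won in observations:
        cell = counts[(minutes_left, margin)]
        cell[0] += int(home_won)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}

# Made-up numbers from the bullet above: up 8 with 5 minutes left, 10 wins in 16 tries.
observations = [(5, 8, True)] * 10 + [(5, 8, False)] * 6
matrix = empirical_matrix(observations)
print(matrix[(5, 8)])  # 0.625
```

Cells with no observations simply do not appear in the result, which is where the sparseness problem discussed earlier in the thread shows up.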
Jon Cohodas



Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA

Posted: Fri May 05, 2006 3:35 pm

Quote:
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.


Ed,
Excuse me for being dense, but are you saying that the 650,000+ observations are not each time/margin observation in the sample, picked once, but rather 650,000+ independent samples from the dataset, including oversampling?

Would it be unseemly for me to beg for even the reduced dataset that contains each gameid, time, home score and visitor score? I would like to take a crack at replicating the time/delta matrix and also try to generate the probability graphs for individual games.


Jon Cohodas



Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA

Posted: Fri May 05, 2006 4:15 pm    Post subject: How one might "smooth" the time/margin matrix

Since there was concern that there might not be enough observations for any one particular time/margin cell to get a good estimate of the probability, here's one way to smooth the data a bit. It is my attempt at a rewrite of something similar I did with college football data.

For notational purposes let p(T,d) be the probability of winning with T seconds remaining with a lead of d (delta). By definition, p(0,>0) = 1, and p(0,<0) = 0. (Overtime, as you noted, is tricky. I would just set p(0,0) to be whatever the empirical probability of the home team winning in overtime is.)

Suppose that there were 20 instances where a team was leading by one point with one second remaining. Now for the sake of simplicity, assume that there were only 3 possible outcomes for the final second. There was an 18/20=90% chance that the lead would not change, a 1/20=5% chance that the lead would go to 3 (the team with the lead scores another field goal), and a 1/20=5% chance that the other team would lead by 1 (the trailing team scores). In this example, the probability of winning given a one point lead with one second remaining is:

p(1,1)= p(0,1)*.9 + p(0,3)*.05 + p(0,-1)*.05
= 1 *.9 + 1 *.05 + 0 *.05 = .95.

Suppose one did this for every margin with one second left. Then the probability of winning with two seconds remaining, p(2,d), given different point differentials would be calculated using the p(1,d) from above. In other words, the probabilities are being modelled as a Markov process.

Another way of looking at this is that instead of comparing a T/D with the final result, you are just comparing it with the states one second later, at T-1 seconds remaining.

This method will give some smoothness for situations where, say, they were down 20 at the half, rallied back to within 5 with a minute to go and lost. Instead of just tabulating this as down 20 at the half therefore lost, the probability would be based on the probability that one could win down 5 with a minute to go.
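The one-second example above can be written as a single backward step in Python. This is a sketch under assumptions: the function names are made up, the transition table holds only the three outcomes from the example, and the 0.5 overtime value is a placeholder rather than an empirical figure.

```python
def terminal_probability(lead, p_overtime=0.5):
    """p(0, d): win probability when time expires. p_overtime is a placeholder value."""
    if lead > 0:
        return 1.0
    if lead < 0:
        return 0.0
    return p_overtime

def step_back(p_later, transitions):
    """Compute p(T, d) from p(T-1, d').

    p_later: dict mapping lead -> win probability with one second less remaining.
    transitions: dict mapping lead -> {new_lead: probability of moving there}.
    """
    return {
        lead: sum(prob * p_later[new_lead] for new_lead, prob in moves.items())
        for lead, moves in transitions.items()
    }

# Final-second transitions for a one point lead, using the counts from the example.
transitions_last_second = {1: {1: 18 / 20, 3: 1 / 20, -1: 1 / 20}}
p0 = {lead: terminal_probability(lead) for lead in (-1, 1, 3)}
p1 = step_back(p0, transitions_last_second)  # p1[1] is approximately 0.95
```

Calling `step_back` again with `p1` and a transition table for every margin would give p(2,d), and so on back to the opening tip.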
Ed Küpfer



Joined: 30 Dec 2004
Posts: 642
Location: Toronto

Posted: Sat May 06, 2006 2:05 pm

Jon: I just saw these posts now. I don't have time to read them closely right now, but it looks like a lot of good stuff. Expect a real reply within a couple of days.
_________________
ed
suburbanDad



Joined: 10 May 2006
Posts: 1

Posted: Thu May 11, 2006 8:27 am    Post subject: NBA within game different from NCAA?

This is brilliant work Ed.

I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?

Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?

sD
Jon Cohodas



Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA

Posted: Thu May 11, 2006 2:15 pm

Quote:
I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?

Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?

I am not Ed, but I hope he doesn't mind my answering.

I'm quite certain that Ed used NBA and not NCAA data.

Getting NBA PBPs is not difficult. They are found at nba.com, espn.com, and a few other places. The trick is parsing them. I started to take a crack at it myself a few months back, but my data was corrupted and I did not repursue it at the time.

I believe what Ed did was sample from the lines of the PBP where a score took place, so by definition, the possession was with the team that scored. If one were to look at the continuum of time in between scores, one would have to note each change of possession that did not involve a score in order to do the analysis of the ball possession.
gabefarkas



Joined: 31 Dec 2004
Posts: 974
Location: Durham, NC

Posted: Thu May 11, 2006 5:56 pm

Ed Küpfer wrote:
gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.


Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?


I realized I never got back to you about this. What you've done here is a binomial logit (logistic regression) model, with the form:

logit(Pi) = log (Pi / (1 - Pi) ) = a + b1x1 + b2x2 + ...

This model ensures that the response will be between 0 and 1.

A Poisson loglinear model predicts the expected value of "y" (the response variable), and takes the form:

log(E(y)) = a + b1x1 + b2x2 + ...

And it's used for counts of things, or rate data, or also when putting together a contingency table. So, you couldn't use it for a binary outcome, but if you have the total number of games, you could model the rate of success.

Loglinear and logit models have a lot of connections between them, and oftentimes there's an equivalent version of one that can be found in the other.
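The two model forms above differ only in how the same linear predictor is turned back into a mean. A small Python sketch of that contrast follows; the function names and the coefficient values are made up for illustration.

```python
import math

def linear_predictor(a, bs, xs):
    """a + b1*x1 + b2*x2 + ..."""
    return a + sum(b * x for b, x in zip(bs, xs))

def logit_mean(eta):
    """Invert logit(p) = eta: a probability, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))

def poisson_mean(eta):
    """Invert log(E(y)) = eta: an expected count or rate, always positive."""
    return math.exp(eta)

# Made-up intercept, coefficients, and predictor values.
eta = linear_predictor(0.2, [0.1, -0.05], [5.0, 4.0])

p = logit_mean(eta)      # probability of a win under the logit model
mu = poisson_mean(eta)   # expected count under the loglinear model
```

Applying log(p / (1 - p)) to `p`, or log to `mu`, recovers `eta`, which is the sense in which both models are linear on their own link scale.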
All times are GMT - 5 Hours
Page 2 of 3