APBRmetrics :: View topic - Predicting the future

Predicting the future
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Thu Jan 06, 2005 9:34 pm    Post subject: Predicting the future

I hope to add some thoughts to this thread at irregular intervals.

To predict the outcome of a game, Bill James's "log5" method has been used traditionally:
Code:
Probability of Team A beating Team B = [(Team A win%) * (1 - Team B Win%)] /
                                       [(Team A win%) * (1 - Team B Win%)  +
                                        (1 - Team A win%) * (Team B Win%)]

If Team A is the home team, an adjustment for home court can be made by using the following:
Code:
Probability of Team A beating Team B =
[(Team A win%) * (1 - Team B Win%) * HCA] /
[(Team A win%) * (1 - Team B Win%) * HCA +
 (1 - Team A win%) * (Team B Win%) * (1 - HCA)]

where HCA = home court advantage, the percentage of games won by home teams -- in the NBA, usually about 0.6.
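
For anyone who wants to try the two formulas above in code, here is a minimal Python sketch. The function and argument names are illustrative and not from the original post. Passing hca=0.5 makes the home-court factors cancel, which gives back the plain log5 formula.

```python
def log5(win_a, win_b, hca=0.5):
    """Probability that Team A (the home team) beats Team B.

    win_a, win_b: win percentages as fractions between 0 and 1.
    hca: fraction of games won by home teams; 0.5 means no adjustment.
    """
    numerator = win_a * (1 - win_b) * hca
    denominator = numerator + (1 - win_a) * win_b * (1 - hca)
    return numerator / denominator


# Two evenly matched teams: the home team wins at exactly the HCA rate.
even_matchup = log5(0.5, 0.5, hca=0.6)

# A .700 team against a .300 team with no home-court adjustment.
lopsided = log5(0.7, 0.3)
```

Note that the denominator is zero when both teams are at exactly 0% or both at exactly 100%, so real use would need a guard for those cases.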

I took a large sample of regular season games played between 1974 and 2004. I removed games in which either of the two teams had played fewer than 40 games in that season, or more than 70. This makes sure both teams had played enough games to establish their ability, and eliminates games towards the end of the season in which playoff-bound teams rested starters or already-eliminated teams experimented with style of play or lineups. I also eliminated games played on neutral turf. This gave me a sample of 10,892 games.

For each game I calculated the probability of a home team win using the log5 method outlined above, with HCA = 0.6284 (the HCA in my sample). Instead of a team's actual win%, I used their Pythagorean Win% (exponent 14) at the date of the game. That is, I calculated each team's Pyth% using their points scored and allowed in games prior to the game being predicted.
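
As a rough illustration of the "games prior to the game being predicted" idea, here is a short Python sketch. It assumes the usual Pythagorean form, PF^x / (PF^x + PA^x), with x = 14; the function names and the input layout (a chronological list of score pairs) are assumptions made for the example, not the method actually used for the numbers in this post.

```python
def pythagorean_pct(points_for, points_against, exponent=14):
    """Pythagorean win% from cumulative points scored and allowed."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


def pregame_pyth(game_scores, exponent=14):
    """Pyth% entering each game, using only the games played before it.

    game_scores: chronological list of (points_for, points_against).
    Returns one value per game; None for the first game (no prior data).
    """
    results = []
    total_for = total_against = 0
    for scored, allowed in game_scores:
        if total_for == 0 and total_against == 0:
            results.append(None)
        else:
            results.append(pythagorean_pct(total_for, total_against, exponent))
        # Only now fold this game's score into the running totals.
        total_for += scored
        total_against += allowed
    return results
```

The running totals are updated after each prediction is recorded, so a game's own score never leaks into its own Pyth%.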

I grouped these predictions into bins centered at 10% intervals, i.e. the "20%" bin shown below contains predictions between 15% and 25%. I then calculated the actual percentage of home team wins for each bin. Here's how it looks:
Code:
Predicted   Won    Lost       %
  0.0%        0       0      0.0%
 10.0%       13      49     21.0%
 20.0%       74     249     22.9%
 30.0%      247     484     33.8%
 40.0%      517     685     43.0%
 50.0%      842     777     52.0%
 60.0%     1295     790     62.1%
 70.0%     1542     592     72.3%
 80.0%     1488     343     81.3%
 90.0%      798      76     91.3%
100.0%       29       2     93.5%

A system which predicted perfectly would have identical 1st and 5th columns. This one does pretty well, except at the extremes, where it breaks down. Also, it underestimates slightly in every bin from 20% to 90% (the actual home win % runs one to four points above the predicted value), for reasons which elude me.

This is what the data above look like graphically:

The diagonal line represents perfect predictions. If anyone has any ideas why log5 consistently underestimates the probability of home wins, I'd like to hear them.
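
The binning step described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs (a list of predicted home-win probabilities and a parallel list of 1/0 outcomes); the names are hypothetical and exact bin-edge handling is a choice made for the example.

```python
import math
from collections import defaultdict


def calibration_table(predictions, home_won):
    """Group predictions into bins centered on 0%, 10%, ..., 100%.

    Returns rows of (bin center in %, wins, losses, actual win fraction).
    """
    bins = defaultdict(lambda: [0, 0])  # bin center -> [wins, losses]
    for p, won in zip(predictions, home_won):
        # Round to the nearest 10%, with exact halves rounding up.
        center = int(math.floor(p * 10 + 0.5)) * 10
        bins[center][0 if won else 1] += 1

    rows = []
    for center in sorted(bins):
        wins, losses = bins[center]
        rows.append((center, wins, losses, wins / (wins + losses)))
    return rows


rows = calibration_table([0.16, 0.24, 0.52, 0.61], [1, 0, 1, 1])
```

A well-calibrated model has an actual win fraction close to the bin center in every row.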

Coming soon will be a comparison of log5 to a prediction system built from binary logistic regression.
_________________
ed
kjb



Joined: 03 Jan 2005
Posts: 865
Location: Washington, DC

Posted: Fri Jan 07, 2005 11:06 am

How does it do with predicting victory margins? Twisted Evil
Dan Rosenbaum



Joined: 03 Jan 2005
Posts: 541
Location: Greensboro, North Carolina

Posted: Fri Jan 07, 2005 12:56 pm

This is very interesting work, ed. Thanks! I bet injuries and suspensions are the explanation for why the formula does poorly on the tails. In those cases the predictions are based upon players who are not playing in that given game, so it would not be surprising for the predictions to be off. For that reason I would not be too aggressive in pursuing other functional forms that would predict more wins in the tails. They may fit the data better but for the wrong reasons.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 07, 2005 1:10 pm

WizardsKev: I don't see why the variables can't be regressed against the win margin instead of simply the win/loss binary outcome. We've seen from the Pythagorean method that win margins tell us something real about the quality of the teams, so this might be something worth pursuing. I am in the middle of assembling more data to use as variables (rest days and distance traveled) so it will have to wait a few days.

Dan Rosenbaum: Thanks for the input. You're probably right that I shouldn't waste time on the results of extreme predictions. In the next day or two I'll post a logit model for game predictions, so maybe these extreme predictions will go away!
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 07, 2005 4:34 pm

Using the same game data as above, I used Minitab's binary logistic function to model the outcomes. Logistic regression is basically the same type of thing as linear regression, except that while linear regression is used on dependent variables that are continuous (like height, or points per game), logistic regression is used on outcomes that are of the win/loss, yes/no, hit/miss variety. My dependent variable was HomeWin (1=yes, 0=no). I'll append the Minitab output at the bottom of this post.

To show how well the logistic model predicted game outcomes in comparison to the log5 method mentioned previously, take a look at the following graph:

A perfect prediction model would lie right along the grey diagonal line. Both models work pretty well, but you can see that the logistic model is slightly better. If you'd like to use it to predict game outcomes, use the following equation:
Code:
 1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))
where HOME% and AWAY% are some measures of home and away team strength, respectively -- I used Pythagorean projected win percentage in my calculations.
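
The same equation as a small Python function, for those not working in a spreadsheet. The function name is made up for the example; the coefficients are the rounded ones quoted above.

```python
import math


def home_win_prob(home_pct, away_pct):
    """Probability of a home win from the two-variable logistic model.

    home_pct, away_pct: team strength as fractions between 0 and 1
    (e.g. Pythagorean win%).
    """
    z = 0.59 + 4.4 * home_pct - 4.3 * away_pct
    return 1 / (1 + math.exp(-z))


# Two .500 teams: the home team comes out at roughly 0.65.
even_matchup = home_win_prob(0.5, 0.5)
```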

Up next: refining the model by adding days rest and travel distance variables.

Code:
MINITAB OUTPUT:

Binary Logistic Regression: HomeW versus HPyth, APyth

Link Function: Logit

Response Information

Variable  Value  Count
HomeW     1       6845  (Event)
          0       4047
          Total  10892


Logistic Regression Table

                                               Odds      95% CI
Predictor      Coef   SE Coef       Z      P  Ratio  Lower   Upper
Constant   0.594422  0.108072    5.50  0.000
HPyth       4.37510  0.158557   27.59  0.000  79.45  58.23  108.40
APyth      -4.29649  0.160249  -26.81  0.000   0.01   0.01    0.02


Log-Likelihood = -6386.726
Test that all slopes are zero: G = 1599.178, DF = 2, P-Value = 0.000
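
A few of the numbers in this output can be checked by hand, which is a useful sanity test when reading logistic regression tables. The sketch below assumes the standard definitions: the odds ratio is e raised to the coefficient, Z is the coefficient divided by its standard error, and G is twice the gap between the model's log-likelihood and that of an intercept-only model. That this is exactly how Minitab computes them is an assumption, though the results agree closely.

```python
import math

# Values copied from the output above.
wins, losses = 6845, 4047
n = wins + losses
ll_model = -6386.726
coef_hpyth, se_hpyth = 4.37510, 0.158557

# Log-likelihood of a model that predicts the overall home win rate every time.
ll_null = wins * math.log(wins / n) + losses * math.log(losses / n)

g_statistic = 2 * (ll_model - ll_null)      # close to the reported 1599.178
odds_ratio_hpyth = math.exp(coef_hpyth)     # close to the reported 79.45
z_hpyth = coef_hpyth / se_hpyth             # close to the reported 27.59
```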

_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 14, 2005 4:19 am

Similar to the analysis above: binary logistic regression, using the following variables:

Code:
HPYTH          Home team Pythagorean win%
APYTH          Away team Pythagorean win%
HREST          Home team rest days before game
AREST          Away team rest days before game
HREST_2        (Home team rest days before game)^2
AREST_2        (Away team rest days before game)^2
HDIST          Home team distance traveled to game
ADIST          Away team distance traveled to game


(No graphs this time. Sorry, gang.)

A new data set was used, consisting of 11,206 games played between 1974 and 2004. I kept only games in which both teams had played a) 40 or more games in the season, and b) 70 or fewer games in the season. Additionally, games played at a neutral site were eliminated from the dataset.

For Distance Traveled: I assumed that if a home team did not have to leave the city for the next game, they still "traveled" 20 miles. I did this to eliminate some division-by-zero errors, and also to provide a slightly more realistic estimate of how much travel the players do before gametime. Furthermore, I assumed that if a team had three or more days off between games, they would be travelling to their next game from their home city, regardless of where they played their previous game. Distance between two cities was calculated using a simplified method.
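
The two travel rules above can be written down as a small Python function. This is a sketch only: the function name, its arguments, and the lookup table are hypothetical, and the city-to-city distance calculation itself is left as a caller-supplied function because the simplified method is not spelled out here. The two mileages in the table are the ones quoted in the worked example later in this thread.

```python
def travel_distance(previous_city, home_city, game_city, rest_days, distance_between):
    """Miles a team is assumed to travel to reach game_city.

    rest_days: days off since the team's previous game.
    distance_between: function (city_a, city_b) -> miles (caller-supplied).
    """
    # With three or more days off, assume the team went home first.
    origin = home_city if rest_days >= 3 else previous_city
    # Staying in the same city still counts as 20 miles.
    if origin == game_city:
        return 20.0
    return distance_between(origin, game_city)


# Hypothetical lookup table with two distances quoted in this thread.
MILES = {("Golden State", "Vancouver"): 948, ("Chicago", "Vancouver"): 2276}


def lookup(city_a, city_b):
    return MILES[(city_a, city_b)]


home_trip = travel_distance("Golden State", "Vancouver", "Vancouver", 0, lookup)
away_trip = travel_distance("Chicago", "Chicago", "Vancouver", 2, lookup)
```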

The regression results can be found below. They show that the Days Rest and Days Rest^2 variables were significant in predicting the probability of a home team win, while the Distance Traveled variables were insignificant. The improvement in accuracy over the model using only Home and Away Pyth% was slight.

I wonder how much more effect Days Rest has at the individual player level rather than team level. Maybe certain players are affected more than others.

Still to come: an analysis of games played on neutral sites.

Code:
 MINITAB OUTPUT
Link Function: Logit

Response Information

Variable  Value  Count
HWIN      1       7036  (Event)
          0       4169
          Total  11205


Logistic Regression Table

                                                                 95% CI
Predictor        Coef    SE Coef       Z      P  Odds Ratio   Lower   Upper
Constant     0.567239   0.126594    4.48  0.000
HPYTH         4.96667   0.174392   28.48  0.000      143.55  101.99  202.04
APYTH        -4.89012   0.176459  -27.71  0.000        0.01    0.01   0.01
HDIST      -0.0000033  0.0000327   -0.10  0.920        1.00    1.00   1.00
ADIST       0.0000056  0.0000349    0.16  0.872        1.00    1.00   1.00
HREST        0.178717  0.0485314    3.68  0.000        1.20    1.09    1.32
AREST       -0.185518  0.0472785   -3.92  0.000        0.83    0.76    0.91
HREST_2    -0.0312151  0.0101463   -3.08  0.002        0.97    0.95    0.99
AREST_2     0.0314525  0.0114201    2.75  0.006        1.03    1.01    1.06


Log-Likelihood = -6532.321
Test that all slopes are zero: G = 1726.992, DF = 8, P-Value = 0.000

_________________
ed
KnickerBlogger



Joined: 30 Dec 2004
Posts: 180

Posted: Fri Jan 14, 2005 10:12 am

Ed Kupfer wrote:
A perfect prediction model would lie right along the grey diagonal line. Both models work pretty well, but you can see that the logistic model is slightly better. If you'd like to use it to predict game outcomes, use the following equation:
Code:
 1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))
where HOME% and AWAY% are some measures of home and away team strength, respectively -- I used Pythagorean projected win percentage in my calculations.


I'm a bit lost. Exactly what would you use for Home% or Away%? I understand Pyth win%, but I'm curious how you would figure that out for Home/Away?

In that equation, could you just substitute their home/road record? However isn't this a smaller sample size than their actual record, and varies more? Would it be possible to take their actual (or pyth) record and use the information that teams win 60% of the time at home?

Thanks,
Mike
Sam O



Joined: 14 Jan 2005
Posts: 5
Location: New York, NY

Posted: Fri Jan 14, 2005 10:45 am

I'm new to this and really enjoying this thread.

My interpretation of HOME% was simply the calculated pyg win% of the home team in question.

My question is how did you expand this formula
Code:

1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))

to incorporate days rest and travel distance?

Thanks,
Sam
KnickerBlogger



Joined: 30 Dec 2004
Posts: 180

Posted: Fri Jan 14, 2005 1:19 pm

Sam O wrote:
I'm new to this and really enjoying this thread.

My interpretation of HOME% was simply the calculated pyg win% of the home team in question.


Now it makes sense. Thanks!
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 14, 2005 2:39 pm

KnickerBlogger: HOME% and AWAY% are merely estimates of the Home and Away team's strength. I used their Pythagorean win% up to that point in the season, but you can use anything: win-loss%, Gaussian%, whatever. Note that these numbers aren't adjusted for home court advantage, they just represent the strength of whatever team happens to be at home, and whatever team happens to be visiting. The equation attempts to answer the question, "What is the probability of a home team win, given team A at home and team B visiting?"

Sam O wrote:
My question is how did you expand this formula
Code:

1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))

to incorporate days rest and travel distance?

If you know linear regression, binary logistic regression is pretty much the same thing, with a difference at the end. Linear regression puts the final answer in the form
Code:
ANSWER = constant + (b1 * x1) + (b2 * x2) + (b3 * x3).... (bn * xn)

where the b's represent the coefficients, and the x's represent the different variables. The coefficients are listed in the Minitab output I posted.

Logistic regression gives the final answer like this:
Code:
ANSWER = 1 / (1 + EXP(-( constant + (b1 * x1) + (b2 * x2) + (b3 * x3).... (bn * xn))))

(The EXP function raises e to the whatever power -- it should be built in to your spreadsheet or calculator.)

An Example

Picture it: January 28, 1997. Chicago visiting the Grizz. Vancouver was coming off a loss in Golden State the day before. The Bulls were coming off a home win over the Raptors on the 25th. What is the probability of a home Grizz win?


HPYTH -- Vancouver had scored 3955 that season going into the game, and given up 4471. Using the PythagoPat calculation described elsewhere (similar to Pythagorean%), the Grizz had a PYTH of 0.197.
APYTH -- The Bulls had scored 4306, given up 3793, giving them a PYTH of 0.809.
HDIST -- Vancouver travelled from Golden State to Vancouver = 948 miles.
ADIST -- Chicago travelled from Chicago to Vancouver = 2276 miles.
HREST -- Vancouver played the day before = 0 days rest.
AREST -- The Bulls had played three days before = 2 days rest.
HREST_2 -- (Home team rest)^2 = 0.
AREST_2 -- (Away team rest)^2 = 4.

Code:
Probability of Home Team Win = 1 / (1 + EXP(-( 0.567 + (4.97 * 0.197) + (-4.89 * 0.809) + (-0.0000033 * 948) +(0.0000056 * 2276) + (0.179 * 0) + (-0.186 * 2) + (-0.031 * 0) + (0.031 * 4))))

          = 0.066
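
The same calculation as a short Python sketch, for anyone who would rather script it than type it into a spreadsheet. It uses the unrounded coefficients from the Minitab output; the dictionary layout and function name are just one way to organize it.

```python
import math

CONSTANT = 0.567239
COEFS = {
    "HPYTH": 4.96667, "APYTH": -4.89012,
    "HDIST": -0.0000033, "ADIST": 0.0000056,
    "HREST": 0.178717, "AREST": -0.185518,
    "HREST_2": -0.0312151, "AREST_2": 0.0314525,
}


def home_win_prob(game):
    """Logistic prediction: 1 / (1 + exp(-(constant + sum of b * x)))."""
    z = CONSTANT + sum(COEFS[name] * value for name, value in game.items())
    return 1 / (1 + math.exp(-z))


# Chicago at Vancouver, January 28, 1997 (inputs from the example above).
game = {
    "HPYTH": 0.197, "APYTH": 0.809,
    "HDIST": 948, "ADIST": 2276,
    "HREST": 0, "AREST": 2,
    "HREST_2": 0 ** 2, "AREST_2": 2 ** 2,
}
p_home = home_win_prob(game)  # about 0.066
```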


As I said, the Distance Travelled variable was not statistically significant, so don't worry about including it in any calculations. Hope that all made sense.
_________________
ed
Sam O



Joined: 14 Jan 2005
Posts: 5
Location: New York, NY

Posted: Fri Jan 14, 2005 2:49 pm

Wow. The example really helps. Thank you for a great answer.

Sam
Dan Rosenbaum



Joined: 03 Jan 2005
Posts: 541
Location: Greensboro, North Carolina

Posted: Fri Jan 14, 2005 3:49 pm

Very nice work, Ed.

I was trying to get a sense of the magnitude of the effects of rest days. Here is what I came up with using your regression results.

Holding all of the other variables constant at the mean, the effect of X days of rest for the home team relative to no days of rest is the given increase in the probability of the home team winning. Remember all of these are relative to no days of rest.

1 day: 3.4 percentage point increase
2 days: 5.4 percentage point increase
3 days: 6.0 percentage point increase
4 days: 5.0 percentage point increase
5 days: 2.6 percentage point increase
6 days: 1.2 percentage point decrease
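
These figures can be reproduced, at least approximately, with a few lines of Python. The sketch assumes a linear approximation: the change in the linear predictor from the HREST and HREST_2 coefficients, scaled by p * (1 - p) at the sample's home win rate (7036 of 11205 games). Whether that is exactly how the numbers above were computed is an assumption, but the results match to one decimal place.

```python
# Coefficients from the Minitab output earlier in the thread.
HREST, HREST_2 = 0.178717, -0.0312151

p_mean = 7036 / 11205                 # home win rate in the sample
multiplier = p_mean * (1 - p_mean)    # slope of the logistic curve at the mean


def rest_effect(days):
    """Approximate change in home win probability, in percentage points,
    for `days` of home rest relative to zero days of rest."""
    return multiplier * (HREST * days + HREST_2 * days ** 2) * 100


effects = [round(rest_effect(d), 1) for d in range(1, 7)]

# The quadratic peaks where its derivative is zero.
peak_days = -HREST / (2 * HREST_2)    # a little under 3 days
```

The peak just under three days is what drives the rise-then-fall pattern in the list.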

One thing you may want to consider is a non-quadratic functional form for the days of rest variables. Perhaps you could put in dummy variables for one day of rest, two days of rest, three days of rest, four or more days of rest. You could do the same for both home and away rest days. With all of the data that you have, you probably can get precise estimates for all of these parameters.

And again, great work.

Best wishes,
Dan
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 14, 2005 4:45 pm

Dan Rosenbaum wrote:
One thing you may want to consider is a non-quadratic functional form for the days of rest variables. Perhaps you could put in dummy variables for one day of rest, two days of rest, three days of rest, four or more days of rest. You could do the same for both home and away rest days. With all of the data that you have, you probably can get precise estimates for all of these parameters.


Good idea. I don't know how precise they are, but the days rest variables lost statistical significance when I used each day as a dummy. I'm not sure how to interpret that. Here are the results:

Code:
Predictor        Coef        P
Constant     0.515206    0.257
HPYTH         4.97242    0.000
APYTH        -4.87942    0.000
HREST0      0.0727343    0.772
HREST1       0.302463    0.223
HREST2       0.302125    0.230
HREST3       0.285185    0.282
HREST4       0.291116    0.297
HREST5       0.725900    0.018
AREST0     -0.0366027    0.925
AREST1      -0.278646    0.475
AREST2      -0.288372    0.463
AREST3      0.0287516    0.944
AREST4      -0.264716    0.522
AREST5      -0.742079    0.082


The last digit indicates the number of days rest, the first character indicates home/away, e.g. HREST4 shows that the home team is playing on 4 days rest, AREST0 means the visiting team played the day before.
_________________
ed
Dan Rosenbaum



Joined: 03 Jan 2005
Posts: 541
Location: Greensboro, North Carolina

Posted: Fri Jan 14, 2005 5:30 pm

The omitted group here must be six or more days of rest. Since you probably have so few of those observations, you are getting really high p-values (that is, low significance). I would suggest leaving out HREST0 and AREST0 and redefining HREST4 and AREST4 the following way.

HREST4 = 1 if home team has four or more days of rest, 0 otherwise
AREST4 = 1 if away team has four or more days of rest, 0 otherwise

With this setup, you also leave out HREST5 and AREST5.

This way you will be comparing everything to zero days of rest.
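
In code, this dummy coding might look like the following Python sketch. The function name and the dictionary output are illustrative choices, not something taken from Minitab.

```python
def rest_dummies(rest_days, prefix):
    """Dummy variables for days of rest, with zero days as the omitted group.

    prefix: "H" for the home team, "A" for the away team.
    Four or more days of rest are all folded into the REST4 dummy.
    """
    capped = min(rest_days, 4)
    return {f"{prefix}REST{k}": int(capped == k) for k in range(1, 5)}


no_rest = rest_dummies(0, "H")     # every dummy is 0: the baseline group
long_rest = rest_dummies(6, "A")   # AREST4 is 1, the rest are 0
```

Because zero days of rest sets every dummy to 0, each coefficient reads directly as the effect relative to playing on no rest.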

In this particular sample where the dependent variable is equal to one about 63 percent of the time, the marginal effect evaluated at the mean is given by the parameter estimate times 0.234.

So a coefficient of 0.24 for your HREST3 variable would imply that holding the other variables constant, a home team with three days of rest has a 0.234*0.24 = 0.056 or 5.6 percentage points better chance of winning than a home team on zero days of rest.

0.234 is the approximate value of the PDF (probability density function) evaluated at the mean in this particular sample. For the example given above where the predicted probability of the home team winning was far less than 63 percent, the marginal effect would be smaller - the parameter estimate times 0.062.

The multiplier is largest for a predicted probability of 0.5 and smaller for predicted probabilities close to zero or one.
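
For the logit link, the multiplier at any predicted probability p works out to p * (1 - p), which is the slope of the logistic curve at that point. A tiny Python sketch (the function name is made up for the example):

```python
def logistic_density(p):
    """Slope of the logistic curve where the predicted probability is p."""
    return p * (1 - p)


at_sample_mean = logistic_density(0.628)   # about 0.234
at_example = logistic_density(0.066)       # about 0.062
at_half = logistic_density(0.5)            # 0.25, the largest possible value
```

Multiplying a coefficient by this value gives the approximate change in win probability for a one-unit change in that variable.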
Ed Küpfer



Joined: 30 Dec 2004
Posts: 787
Location: Toronto

Posted: Fri Jan 14, 2005 5:48 pm

Dan Rosenbaum wrote:
I would suggest leaving out HREST0 and AREST0 and redefining HREST4 and AREST4 the following way.

HREST4 = 1 if home team has four or more days of rest, 0 otherwise
AREST4 = 1 if away team has four or more days of rest, 0 otherwise

Done.
Code:
Predictor       Coef       P
Constant    0.550152   0.000
HPYTH        4.97621   0.000
APYTH       -4.88288   0.000
HREST1      0.231784   0.000
HREST2      0.231688   0.001
HREST3      0.213844   0.054
HREST4      0.248314   0.071
AREST1     -0.242966   0.000
AREST2     -0.252525   0.000
AREST3     0.0633113   0.647
AREST4     -0.301481   0.042


Hope you don't mind a stupid question.
Dan Rosenbaum wrote:
0.234 is the approximate value of the PDF (probability density function) evaluated at the mean in this particular sample. For the example given above where the predicted probability of the home team winning was far less than 63 percent, the marginal effect would be smaller - the parameter estimate times 0.062.

How do you get the PDF for different values?
_________________
ed