|
APBRmetrics The statistical revolution will not be televised.
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Thu Jan 06, 2005 9:34 pm Post subject: Predicting the future |
|
|
I hope to add some thoughts to this thread at irregular intervals.
To predict the outcome of a game, Bill James's "log5" method has been used traditionally:
Code: | Probability of Team A beating Team B = [(Team A win%) * (1 - Team B Win%)] /
[(Team A win%) * (1 - Team B Win%) +
(1 - Team A win%) * (Team B Win%)] |
If Team A is the home team, an adjustment for home court can be made by using the following:
Code: | Probability of Team A beating Team B =
[(Team A win%) * (1 - Team B Win%) * HCA] /
[(Team A win%) * (1 - Team B Win%) * HCA +
(1 - Team A win%) * (Team B Win%) * (1 - HCA)] |
where HCA = home court advantage, the percentage of games won by home teams -- in the NBA, usually about 0.6.
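The two formulas above can be folded into one function. What follows is only a minimal Python sketch, not code from the original post: the name `log5` and its parameter names are my own, and passing `hca=0.5` reduces the home-court version to the plain one because the HCA terms cancel.

```python
def log5(home_win_pct, away_win_pct, hca=0.5):
    """Probability that Team A (home) beats Team B (away) under log5.

    hca is the share of games won by home teams; 0.5 means a neutral court.
    The result is undefined (division by zero) when both terms are zero,
    e.g. when both teams have a win% of exactly 0 or exactly 1.
    """
    a_term = home_win_pct * (1.0 - away_win_pct) * hca
    b_term = (1.0 - home_win_pct) * away_win_pct * (1.0 - hca)
    return a_term / (a_term + b_term)


# A .600 team against a .400 team, first on a neutral court, then at home.
neutral = log5(0.600, 0.400)
at_home = log5(0.600, 0.400, hca=0.6)
print(round(neutral, 3), round(at_home, 3))
```

Two evenly matched teams come out at exactly the HCA value, which is a quick sanity check on the adjustment.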
I took a large sample of regular-season games played between 1974 and 2004. I removed games in which either of the two teams had played fewer than 40 games in that season, or more than 70. This makes sure both teams had played enough games to establish their ability, and eliminates games towards the end of the season in which playoff-bound teams rested starters or already-eliminated teams experimented with style of play or lineups. I also eliminated games played on neutral turf. This gave me a sample of 10,892 games.
For each game I calculated the probability of a home team win using the log5 method outlined above, with HCA = 0.6284 (the HCA in my sample). Instead of a team's actual win%, I used their Pythagorean Win% (exponent 14) at the date of the game. That is, I calculated each team's Pyth% using their points scored and allowed in games prior to the game being predicted.
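For readers who have not seen it, the Pythagorean win% with exponent 14 can be sketched as below. This is an illustration in Python under my own naming (`pythagorean_win_pct`); the inputs would be a team's points scored and allowed in games played before the one being predicted.

```python
def pythagorean_win_pct(points_for, points_against, exponent=14.0):
    """Pythagorean expected win%: PF^x / (PF^x + PA^x)."""
    pf = float(points_for) ** exponent
    pa = float(points_against) ** exponent
    return pf / (pf + pa)


# A team that has outscored its opponents 4200 to 4000 so far (made-up totals).
print(round(pythagorean_win_pct(4200, 4000), 3))
```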
I grouped these predictions into bins centered at 10% intervals, i.e. the "20%" bin shown below contains predictions between 15% and 25%. I then calculated the actual percentage of home team wins for each bin. Here's how it looks:
Code: | Predicted    Won   Lost       %
     0.0%      0      0    0.0%
    10.0%     13     49   21.0%
    20.0%     74    249   22.9%
    30.0%    247    484   33.8%
    40.0%    517    685   43.0%
    50.0%    842    777   52.0%
    60.0%   1295    790   62.1%
    70.0%   1542    592   72.3%
    80.0%   1488    343   81.3%
    90.0%    798     76   91.3%
   100.0%     29      2   93.5% |
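The binning step can be sketched in a few lines of Python. The counts are copied from the table; the helper names are my own, and how predictions falling exactly on a bin edge (15%, 25%, ...) were assigned is not stated in the post, so the rounding rule here is an assumption.

```python
# (bin centre in %, home wins, home losses), copied from the table above.
BINS = [
    (0, 0, 0), (10, 13, 49), (20, 74, 249), (30, 247, 484),
    (40, 517, 685), (50, 842, 777), (60, 1295, 790), (70, 1542, 592),
    (80, 1488, 343), (90, 798, 76), (100, 29, 2),
]


def bin_centre(prob):
    """Nearest 10% bin centre for a predicted probability in [0, 1].

    Assumption: edge values round up to the next bin.
    """
    return int(prob * 10 + 0.5) * 10


def actual_home_win_pct(won, lost):
    """Observed home win% in a bin, or None if the bin is empty."""
    games = won + lost
    return 100.0 * won / games if games else None


for centre, won, lost in BINS:
    print(centre, won, lost, actual_home_win_pct(won, lost))

total_won = sum(won for _, won, _ in BINS)
total_lost = sum(lost for _, _, lost in BINS)
sample_hca = total_won / (total_won + total_lost)  # home win% over all bins
```

Summing the bins gives back the overall home win rate of the sample, which is the HCA value used in the log5 calculation.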
A system which predicted perfectly would have identical first and last columns (Predicted and %). This one does pretty well, except at the extremes, where it breaks down. Also, it underestimates slightly in every bin but the last (the home team won more often than predicted), for reasons which elude me.
This is what the data above look like graphically:
The diagonal line represents perfect predictions. If anyone has any ideas why log5 consistently underestimates the probability of home wins, I'd like to hear them.
Coming soon will be a comparison of log5 to a prediction system built from binary logistic regression. _________________ ed |
|
|
|
kjb
Joined: 03 Jan 2005 Posts: 865 Location: Washington, DC
|
Posted: Fri Jan 07, 2005 11:06 am Post subject: |
|
|
How does it do with predicting victory margins? |
|
|
|
Dan Rosenbaum
Joined: 03 Jan 2005 Posts: 541 Location: Greensboro, North Carolina
|
Posted: Fri Jan 07, 2005 12:56 pm Post subject: |
|
|
This is very interesting work, ed. Thanks! I bet injuries and suspensions are the explanation for why the formula does poorly on the tails. In those cases the predictions are based upon players who are not playing in that given game, so it would not be surprising for the predictions to be off. For that reason I would not be too aggressive in pursuing other functional forms that would predict more wins in the tails. They may fit the data better but for the wrong reasons. |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 07, 2005 1:10 pm Post subject: |
|
|
WizardsKev: I don't see why the variables can't be regressed against the win margin instead of simply the win/loss binary outcome. We've seen from the Pythagorean method that win margins tell us something real about the quality of the teams, so this might be something worth pursuing. I am in the middle of assembling more data to use as variables (rest days and distance traveled) so it will have to wait a few days.
Dan Rosenbaum: Thanks for the input. You're probably right that I shouldn't waste time on the results of extreme predictions. In the next day or two I'll post a logit model for game predictions, so maybe these extreme predictions will go away! _________________ ed |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 07, 2005 4:34 pm Post subject: |
|
|
Using the same game data as above, I used Minitab's binary logistic function to model the outcomes. Logistic regression is basically the same type of thing as linear regression, except that while linear regression is used on dependent variables that are continuous (like height, or points per game), logistic regression is used on outcomes that are of the win/loss, yes/no, hit/miss variety. My dependent variable was HomeWin (1=yes, 0=no). I'll append the Minitab output at the bottom of this post.
To show how well the logistic model predicted game outcomes in comparison to the log5 method mentioned previously, take a look at the following graph:
A perfect prediction model would lie right along the grey diagonal line. Both models work pretty well, but you can see that the logistic model is slightly better. If you'd like to use it to predict game outcomes, use the following equation:
Code: | 1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%))) | where HOME% and AWAY% are some measures of home and away team strength, respectively -- I used Pythagorean projected win percentage in my calculations.
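Written as code, the equation above looks like this. It is a minimal Python sketch using the rounded coefficients from the post; the function and argument names are my own, and the strength inputs are assumed to be on a 0 to 1 scale such as Pythagorean win%.

```python
import math


def home_win_probability(home_strength, away_strength):
    """Logistic estimate of a home win from the two-variable model."""
    z = 0.59 + 4.4 * home_strength - 4.3 * away_strength
    return 1.0 / (1.0 + math.exp(-z))


# Two average (.500) teams: the home side is favoured purely by the constant
# and the small difference between the two slopes.
print(round(home_win_probability(0.5, 0.5), 3))
```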
Up next: refining the model by adding days rest and travel distance variables.
Code: | MINITAB OUTPUT:
Binary Logistic Regression: HomeW versus HPyth, APyth
Link Function: Logit
Response Information
Variable Value Count
HomeW 1 6845 (Event)
0 4047
Total 10892
Logistic Regression Table
                                                  Odds      95% CI
Predictor       Coef    SE Coef       Z      P   Ratio   Lower    Upper
Constant    0.594422   0.108072    5.50  0.000
HPyth        4.37510   0.158557   27.59  0.000   79.45   58.23   108.40
APyth       -4.29649   0.160249  -26.81  0.000    0.01    0.01     0.02
Log-Likelihood = -6386.726
Test that all slopes are zero: G = 1599.178, DF = 2, P-Value = 0.000 |
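For anyone wondering where the other columns come from: Z is the coefficient divided by its standard error, the odds ratio is e raised to the coefficient, and the confidence limits are e raised to the coefficient plus or minus 1.96 standard errors. The Python sketch below recomputes the HPyth row on that basis; the function name is mine, and 1.96 is the usual normal critical value for a 95% interval, which I am assuming Minitab used here.

```python
import math


def wald_summary(coef, se, z_crit=1.96):
    """Return (Z, odds ratio, lower, upper) for one logistic coefficient."""
    z = coef / se
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return z, odds_ratio, lower, upper


# HPyth row from the output above.
z, odds, lo, hi = wald_summary(4.37510, 0.158557)
print(round(z, 2), round(odds, 2), round(lo, 2), round(hi, 2))
```

The odds ratio here is per one full unit of Pyth%, i.e. going from a team that never wins to one that always wins, which is why it looks so large.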
_________________ ed |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 14, 2005 4:19 am Post subject: |
|
|
Similar to the analysis above: binary logistic regression, using the following variables:
Code: | HPYTH Home team Pythagorean win%
APYTH Away team Pythagorean win%
HREST Home team rest days before game
AREST Away team rest days before game
HREST_2 (Home team rest days before game)^2
AREST_2 (Away team rest days before game)^2
HDIST Home team distance traveled to game
ADIST Away team distance traveled to game |
(No graphs this time. Sorry, gang.)
A new data set was used, consisting of 11,206 games played between 1974 and 2004. Only games in which both teams had played a) 40 or more games in the season, and b) 70 or fewer games in the season were included. Additionally, games played at a neutral site were eliminated from the dataset.
For Distance Traveled: I assumed that if a home team did not have to leave the city for the next game, they still "traveled" 20 miles. I did this to eliminate some division-by-zero errors, and also to provide a slightly more realistic estimate of how much travel the players do before gametime. Furthermore, I assumed that if a team had three or more days off between games, they would be travelling to their next game from their home city, regardless of where they played their previous game. Distance between two cities was calculated using a simplified method.
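The two travel assumptions can be sketched as follows. This is an illustration only: the post does not say how city-to-city distance was computed, so `city_distance` is a hypothetical callable supplied by the caller, and all names are my own. I have also applied the 20-mile rule to any team whose starting point is the game city, which is a slight generalisation of what the post states for home teams.

```python
STAY_IN_TOWN_MILES = 20.0  # nominal "travel" when the team never leaves the city
LONG_BREAK_DAYS = 3        # three or more days off: assume the team went home


def travel_origin(previous_game_city, team_home_city, days_off):
    """City the team is assumed to travel from."""
    if days_off >= LONG_BREAK_DAYS:
        return team_home_city
    return previous_game_city


def distance_traveled(previous_game_city, team_home_city, game_city,
                      days_off, city_distance):
    """Miles traveled to the game under the two assumptions above.

    city_distance(a, b) is a hypothetical lookup returning miles between cities.
    """
    origin = travel_origin(previous_game_city, team_home_city, days_off)
    if origin == game_city:
        return STAY_IN_TOWN_MILES
    return city_distance(origin, game_city)


# Toy lookup with a single made-up entry, purely to exercise the rules.
toy_miles = {("City A", "City B"): 500.0}
lookup = lambda a, b: toy_miles[(a, b)]
print(distance_traveled("City A", "City B", "City B", 0, lookup))  # on the road yesterday
print(distance_traveled("City A", "City B", "City B", 3, lookup))  # long break at home
```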
The regression results can be found below. They show that the Days Rest and Days Rest^2 variables were significant in predicting the probability of a home team win, while the Distance Traveled variables were insignificant. The improvement in accuracy over the model using only Home and Away Pyth% was slight.
I wonder how much more effect Days Rest has at the individual player level rather than team level. Maybe certain players are affected more than others.
Still to come: an analysis of games played on neutral sites.
Code: | MINITAB OUTPUT
Link Function: Logit
Response Information
Variable Value Count
HWIN 1 7036 (Event)
0 4169
Total 11205
Logistic Regression Table
                                                                 95% CI
Predictor        Coef      SE Coef       Z      P  Odds Ratio   Lower    Upper
Constant     0.567239     0.126594    4.48  0.000
HPYTH         4.96667     0.174392   28.48  0.000      143.55  101.99   202.04
APYTH        -4.89012     0.176459  -27.71  0.000        0.01    0.01     0.01
HDIST      -0.0000033    0.0000327   -0.10  0.920        1.00    1.00     1.00
ADIST       0.0000056    0.0000349    0.16  0.872        1.00    1.00     1.00
HREST        0.178717    0.0485314    3.68  0.000        1.20    1.09     1.32
AREST       -0.185518    0.0472785   -3.92  0.000        0.83    0.76     0.91
HREST_2    -0.0312151    0.0101463   -3.08  0.002        0.97    0.95     0.99
AREST_2     0.0314525    0.0114201    2.75  0.006        1.03    1.01     1.06
Log-Likelihood = -6532.321
Test that all slopes are zero: G = 1726.992, DF = 8, P-Value = 0.000 |
_________________ ed |
|
|
|
KnickerBlogger
Joined: 30 Dec 2004 Posts: 180
|
Posted: Fri Jan 14, 2005 10:12 am Post subject: |
|
|
Ed Kupfer wrote: | A perfect prediction model would lie right along the grey diagonal line. Both models work pretty well, but you can see that the logistic model is slightly better. If you'd like to use it to predict game outcomes, use the following equation:
Code: | 1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%))) | where HOME% and AWAY% are some measures of home and away team strength, respectively -- I used Pythagorean projected win percentage in my calculations.
|
I'm a bit lost. Exactly what would you use for Home% or Away%? I understand Pyth win%, but I'm curious how you would figure that out for Home/Away?
In that equation, could you just substitute their home/road record? However isn't this a smaller sample size than their actual record, and varies more? Would it be possible to take their actual (or pyth) record and use the information that teams win 60% of the time at home?
Thanks,
Mike |
|
|
|
Sam O
Joined: 14 Jan 2005 Posts: 5 Location: New York, NY
|
Posted: Fri Jan 14, 2005 10:45 am Post subject: |
|
|
I'm new to this and really enjoying this thread.
My interpretation of HOME% was simply the calculated pyg win% of the home team in question.
My question is how did you expand this formula
Code: |
1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))
|
to incorporate days rest and travel distance?
Thanks,
Sam |
|
|
|
KnickerBlogger
Joined: 30 Dec 2004 Posts: 180
|
Posted: Fri Jan 14, 2005 1:19 pm Post subject: |
|
|
Sam O wrote: | I'm new to this and really enjoying this thread.
My interpretation of HOME% was simply the calculated pyg win% of the home team in question.
|
Now it makes sense. Thanks! |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 14, 2005 2:39 pm Post subject: |
|
|
KnickerBlogger: HOME% and AWAY% are merely estimates of the Home and Away team's strength. I used their Pythagorean win% up to that point in the season, but you can use anything: win-loss%, Gaussian%, whatever. Note that these numbers aren't adjusted for home court advantage, they just represent the strength of whatever team happens to be at home, and whatever team happens to be visiting. The equation attempts to answer the question, "What is the probability of a home team win, given team A at home and team B visiting?"
Sam O wrote: | My question is how did you expand this formula
Code: |
1 / (1 + EXP(-(0.59 + 4.4 * HOME% - 4.3 * AWAY%)))
|
to incorporate days rest and travel distance? |
If you know linear regression, binary logistic regression is pretty much the same thing, with a difference at the end. Linear regression puts the final answer in the form
Code: | ANSWER = constant + (b1 * x1) + (b2 * x2) + (b3 * x3) + ... + (bn * xn) |
where the b's represent the coefficients, and the x's represent the different variables. The coefficients are listed in the Minitab output I posted.
Logistic regression gives the final answer like this:
Code: | ANSWER = 1 / (1 + EXP(-( constant + (b1 * x1) + (b2 * x2) + (b3 * x3) + ... + (bn * xn)))) |
(The EXP function raises e to the given power -- it should be built in to your spreadsheet or calculator.)
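The same two forms in Python, for anyone working outside a spreadsheet. This is a generic sketch with names of my own choosing; `coefs` and `xs` are parallel lists of coefficients and variable values.

```python
import math


def linear_prediction(constant, coefs, xs):
    """constant + b1*x1 + b2*x2 + ... + bn*xn"""
    return constant + sum(b * x for b, x in zip(coefs, xs))


def logistic_prediction(constant, coefs, xs):
    """The linear form pushed through 1 / (1 + e^-z), giving a value in (0, 1)."""
    z = linear_prediction(constant, coefs, xs)
    return 1.0 / (1.0 + math.exp(-z))


print(linear_prediction(1.0, [2.0, 3.0], [1.0, 1.0]))
print(logistic_prediction(0.0, [1.0], [0.0]))
```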
An Example
Picture it: January 28, 1997. Chicago visiting the Grizz. Vancouver was coming off a loss in Golden State the day before. The Bulls were coming off a home win over the Raptors on the 25th. What is the probability of a home Grizz win?
HPYTH -- Vancouver had scored 3955 that season going into the game, and given up 4471. Using the PythagoPat calculation described elsewhere (similar to Pythagorean%), the Grizz had a PYTH of 0.197.
APYTH -- The Bulls had scored 4306, given up 3793, giving them a PYTH of 0.809.
HDIST -- Vancouver travelled from Golden State to Vancouver = 948 miles.
ADIST -- Chicago travelled from Chicago to Vancouver = 2276 miles.
HREST -- Vancouver played the day before = 0 days rest.
AREST -- The Bulls had played three days before = 2 days rest.
HREST_2 -- (Home team rest)^2 = 0.
AREST_2 -- (Away team rest)^2 = 4.
Code: | Probability of Home Team Win
= 1 / (1 + EXP(-( 0.567 + (4.97 * 0.197) + (-4.89 * 0.809) + (-0.0000033 * 948) + (0.0000056 * 2276) + (0.179 * 0) + (-0.186 * 2) + (-0.031 * 0) + (0.031 * 4))))
= 0.066 |
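The worked example can be checked with a short Python sketch. The coefficients are the rounded values used in the calculation above; the dictionary layout and the `rest_days` helper (days between games minus one, so back-to-back games give 0) are my own way of organising it.

```python
import math
from datetime import date


def rest_days(previous_game, this_game):
    """Full days off between two games; back-to-back games give 0."""
    return (this_game - previous_game).days - 1


# Rounded coefficients as used in the worked example.
COEF = {
    "HPYTH": 4.97, "APYTH": -4.89,
    "HDIST": -0.0000033, "ADIST": 0.0000056,
    "HREST": 0.179, "AREST": -0.186,
    "HREST_2": -0.031, "AREST_2": 0.031,
}
CONSTANT = 0.567

game_day = date(1997, 1, 28)
hrest = rest_days(date(1997, 1, 27), game_day)  # Vancouver played the day before
arest = rest_days(date(1997, 1, 25), game_day)  # Chicago last played on the 25th

x = {
    "HPYTH": 0.197, "APYTH": 0.809,
    "HDIST": 948, "ADIST": 2276,
    "HREST": hrest, "AREST": arest,
    "HREST_2": hrest ** 2, "AREST_2": arest ** 2,
}

z = CONSTANT + sum(COEF[name] * x[name] for name in COEF)
p_home_win = 1.0 / (1.0 + math.exp(-z))
print(round(p_home_win, 3))
```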
As I said, the Distance Travelled variable was not statistically significant, so don't worry about including it in any calculations. Hope that all made sense. _________________ ed |
|
|
|
Sam O
Joined: 14 Jan 2005 Posts: 5 Location: New York, NY
|
Posted: Fri Jan 14, 2005 2:49 pm Post subject: |
|
|
Wow. The example really helps. Thank you for a great answer.
Sam |
|
|
|
Dan Rosenbaum
Joined: 03 Jan 2005 Posts: 541 Location: Greensboro, North Carolina
|
Posted: Fri Jan 14, 2005 3:49 pm Post subject: |
|
|
Very nice work, Ed.
I was trying to get a sense of the magnitude of the effects of rest days. Here is what I came up with using your regression results.
Holding all of the other variables constant at the mean, the effect of X days of rest for the home team relative to no days of rest is the given increase in the probability of the home team winning. Remember all of these are relative to no days of rest.
1 day: 3.4 percentage point increase
2 days: 5.4 percentage point increase
3 days: 6.0 percentage point increase
4 days: 5.0 percentage point increase
5 days: 2.6 percentage point increase
6 days: 1.2 percentage point decrease
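One way to get close to these figures, sketched in Python, is to take the change in the model's log-odds from the HREST and HREST_2 coefficients and scale it by p * (1 - p), where p is the home win rate in the sample (7036 of 11205 games). This is my reconstruction under that linear approximation, not necessarily the exact calculation behind the list; the names are mine.

```python
B_HREST = 0.178717     # coefficient on home rest days (Minitab output above)
B_HREST_2 = -0.0312151 # coefficient on home rest days squared

P_HOME = 7036 / 11205              # share of home wins in the sample
SCALE = P_HOME * (1.0 - P_HOME)    # logit slope at the sample mean, about 0.234


def rest_effect_pct_points(days):
    """Approximate change in home win probability vs. zero days of rest."""
    delta_log_odds = B_HREST * days + B_HREST_2 * days ** 2
    return 100.0 * SCALE * delta_log_odds


for days in range(1, 7):
    print(days, round(rest_effect_pct_points(days), 1))
```

Because the quadratic term is negative, the estimated benefit peaks at around three days and turns slightly negative by six.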
One thing you may want to consider is a non-quadratic functional form for the days of rest variables. Perhaps you could put in dummy variables for one day of rest, two days of rest, three days of rest, four or more days of rest. You could do the same for both home and away rest days. With all of the data that you have, you probably can get precise estimates for all of these parameters.
And again, great work.
Best wishes,
Dan |
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 14, 2005 4:45 pm Post subject: |
|
|
Dan Rosenbaum wrote: | One thing you may want to consider is a non-quadratic functional form for the days of rest variables. Perhaps you could put in dummy variables for one day of rest, two days of rest, three days of rest, four or more days of rest. You could do the same for both home and away rest days. With all of the data that you have, you probably can get precise estimates for all of these parameters. |
Good idea. I don't know how precise they are, but the days-rest variables lost statistical significance when I used each day as a dummy. I'm not sure how to interpret that. Here are the results:
Code: | Predictor Coef P
Constant 0.515206 0.257
HPYTH 4.97242 0.000
APYTH -4.87942 0.000
HREST0 0.0727343 0.772
HREST1 0.302463 0.223
HREST2 0.302125 0.230
HREST3 0.285185 0.282
HREST4 0.291116 0.297
HREST5 0.725900 0.018
AREST0 -0.0366027 0.925
AREST1 -0.278646 0.475
AREST2 -0.288372 0.463
AREST3 0.0287516 0.944
AREST4 -0.264716 0.522
AREST5 -0.742079 0.082 |
The last digit indicates the number of days rest, the first character indicates home/away, e.g. HREST4 shows that the home team is playing on 4 days rest, AREST0 means the visiting team played the day before. _________________ ed |
|
|
|
Dan Rosenbaum
Joined: 03 Jan 2005 Posts: 541 Location: Greensboro, North Carolina
|
Posted: Fri Jan 14, 2005 5:30 pm Post subject: |
|
|
The omitted group here must be six or more days of rest. Since you probably have so few of those observations, you are getting really high p-values. I would suggest leaving out HREST0 and AREST0 and redefining HREST4 and AREST4 the following way.
HREST4 = 1 if home team has four or more days of rest, 0 otherwise
AREST4 = 1 if away team has four or more days of rest, 0 otherwise
With this setup, you also leave out HREST5 and AREST5.
This way you will be comparing everything to zero days of rest.
In this particular sample where the dependent variable is equal to one about 63 percent of the time, the marginal effect evaluated at the mean is given by the parameter estimate times 0.234.
So a coefficient of 0.24 for your HREST3 variable would imply that, holding the other variables constant, a home team with three days of rest has a 0.234*0.24 = 0.056, or 5.6 percentage points, better chance of winning than a home team on zero days of rest.
0.234 is the approximate value of the PDF (probability density function) evaluated at the mean in this particular sample. For the example given above where the predicted probability of the home team winning was far less than 63 percent, the marginal effect would be smaller - the parameter estimate times 0.062.
The multiplier is largest for predicted probability of 0.5 and smaller for predicted probabilities close to zero or one. |
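For a logit model, the multiplier described here has a simple closed form: at a predicted probability p it equals p * (1 - p), which is the logistic density evaluated at the corresponding log-odds. A small Python sketch (function names are my own) that lands close to both figures quoted in the post:

```python
def logit_multiplier(p):
    """Slope of the logistic curve at predicted probability p."""
    return p * (1.0 - p)


def marginal_effect(coef, p):
    """Approximate change in probability per unit change in a variable."""
    return coef * logit_multiplier(p)


print(round(logit_multiplier(0.63), 3))   # near the sample mean
print(round(logit_multiplier(0.066), 3))  # the Vancouver example
print(logit_multiplier(0.5))              # the largest possible value
```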
|
|
|
Ed Küpfer
Joined: 30 Dec 2004 Posts: 787 Location: Toronto
|
Posted: Fri Jan 14, 2005 5:48 pm Post subject: |
|
|
Dan Rosenbaum wrote: | I would suggest leaving out HREST0 and AREST0 and redefining HREST4 and AREST4 the following way.
HREST4 = 1 if home team has four or more days of rest, 0 otherwise
AREST4 = 1 if away team has four or more days of rest, 0 otherwise |
Done.
Code: | Predictor Coef P
Constant 0.550152 0.000
HPYTH 4.97621 0.000
APYTH -4.88288 0.000
HREST1 0.231784 0.000
HREST2 0.231688 0.001
HREST3 0.213844 0.054
HREST4 0.248314 0.071
AREST1 -0.242966 0.000
AREST2 -0.252525 0.000
AREST3 0.0633113 0.647
AREST4 -0.301481 0.042 |
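To use this version for a prediction, each team's rest days are turned into at most one active dummy, with zero days as the baseline and four or more days lumped together. The Python below is a sketch under those definitions; the helper names are mine, and the inputs reuse the Vancouver and Chicago Pyth% values from the earlier example.

```python
import math

# Coefficients from the output above.
COEF = {
    "CONST": 0.550152, "HPYTH": 4.97621, "APYTH": -4.88288,
    "HREST1": 0.231784, "HREST2": 0.231688, "HREST3": 0.213844, "HREST4": 0.248314,
    "AREST1": -0.242966, "AREST2": -0.252525, "AREST3": 0.0633113, "AREST4": -0.301481,
}


def rest_dummy(prefix, rest_days):
    """Name of the active dummy, or None for zero days of rest (the baseline)."""
    if rest_days <= 0:
        return None
    return f"{prefix}{min(rest_days, 4)}"  # four or more days share one dummy


def home_win_probability(hpyth, apyth, home_rest, away_rest):
    """Home win probability from the dummy-variable model."""
    z = COEF["CONST"] + COEF["HPYTH"] * hpyth + COEF["APYTH"] * apyth
    for name in (rest_dummy("HREST", home_rest), rest_dummy("AREST", away_rest)):
        if name is not None:
            z += COEF[name]
    return 1.0 / (1.0 + math.exp(-z))


# Vancouver (0 days rest) hosting Chicago (2 days rest).
print(round(home_win_probability(0.197, 0.809, 0, 2), 3))
```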
Hope you don't mind a stupid question.
Dan Rosenbaum wrote: | 0.234 is the approximate value of the PDF (probability density function) evaluated at the mean in this particular sample. For the example given above where the predicted probability of the home team winning was far less than 63 percent, the marginal effect would be smaller - the parameter estimate times 0.062. |
How do you get the PDF for different values? _________________ ed |
|
|
|