APBRmetrics: The statistical revolution will not be televised.
DSMok1
Joined: 05 Aug 2009 Posts: 611 Location: Where the wind comes sweeping down the plains
Posted: Thu Aug 13, 2009 9:52 am Post subject: Toward an Adjusted Plus/Minus Projection System
Toward an Adjusted Plus/Minus Projection System
I have not seen a comprehensive discussion of the issues involved in projecting adjusted plus/minus (APM), so I want to begin one here.
There are several things to consider:
1) What current measure/past measures of APM should be used?
2) To what mean should the APM measure be regressed to approximate true talent level?
3) How should the regression to the mean be conducted?
4) What aging curve should be used?
5) How do you account for rookies and players who played few minutes?
6) How do you best project minutes?
7) Assembling roster/minute projections for next season
8) Combining team APM to provide a Pythag Win%
-------------------------------------
1) After looking around at various measures, it seemed best to me to use a 1-year stabilized APM, the one provided by Ilardi here: http://sonicscentral.com/apbrmetrics/viewtopic.php?t=2294. The results are recent and well stabilized against past results.
2) Does one regress to the total NBA population? That seems rather simplistic--rather like regressing BABIP to the league mean in MLB, which is wrong. So how should the population be estimated? Comparable players? Or how about players with similar minutes played? (If a player is given a large number of minutes, a certain expectation on the part of the coach is implied--and he knows more about the situation.) This is my initial system: regress toward the mean of players with similar minutes played...
To do this, I binned the players in small clusters and regressed an APM mean curve and a standard deviation curve onto minutes played. APM versus minutes played was linear, with r^2 of nearly 0.7. The standard deviation versus minutes curve was parabolic, with r^2 of about 0.27 (standard deviation isn't so stable...). The latter curve indicates that players with moderate minutes are relatively well understood, while those on either end are less predictable (a reasonable conclusion). What this allows me to do is apply Bayesian inference.
The exact relations, based on players with over 400 minutes played:
Code: |
Standard deviation as a function of minutes:
  y = 0.00000103904x^2 - 0.00354066717x + 6.11524654976
  (r^2 = 0.27472271151)

Adjusted plus/minus as a function of minutes:
  y = 0.00240591x - 4.73140989
  (r^2 = 0.67449244)
|
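To make these curves concrete, here is a minimal Python sketch (my addition, not part of the original post) evaluating the two fits:
Code: |
# Python sketch (illustrative only) of the fitted prior curves above.
# Valid only for players with 400+ minutes, per the fit's stated domain.

def prior_mean_apm(minutes):
    """Expected APM for a player with this many minutes (linear fit)."""
    return 0.00240591 * minutes - 4.73140989

def prior_sd_apm(minutes):
    """Spread of APM among players with similar minutes (parabolic fit)."""
    return (0.00000103904 * minutes**2
            - 0.00354066717 * minutes
            + 6.11524654976)

# Example: a 2000-minute player
# prior_mean_apm(2000) -> ~+0.08, prior_sd_apm(2000) -> ~3.19
|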
3) See http://sonicscentral.com/apbrmetrics/viewtopic.php?p=27536#27536 for my initial work... It appears Bayesian statistics (see http://www.scholarpedia.org/article/Bayesian_statistics) is by far the best way to model "regression to the mean," for it applies rigorous statistical theory to the problem. Thus, an appropriate estimate of true talent level may be obtained using the relation:
Code: |
Creating our best estimate for the true level of a statistic
(using our initial estimate and adding our likelihood function based
on observation):

              APM*(1/StdErr^2) + MeanPop*(1/StDev^2)
TrueTalent = ----------------------------------------
                   (1/StdErr^2) + (1/StDev^2)

Where
 TrueTalent = our best estimate of the actual APM
 APM        = our observed adjusted plus/minus
 StdErr     = the standard error of our observed APM
 MeanPop    = the mean APM of the population of which our player is a member
 StDev      = the standard deviation of APM in that population

If MeanPop = 0 (as when using the total NBA population's mean APM),
the equation simplifies to:

                    APM*(1/StdErr^2)
TrueTalent = ------------------------------
              (1/StdErr^2) + (1/StDev^2)
|
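As a hedged illustration (the function name and example numbers are mine, not from the post), the shrinkage formula in code:
Code: |
# Python sketch (my illustration) of the precision-weighted
# "true talent" estimate defined above.

def true_talent(apm, std_err, mean_pop, st_dev):
    """Shrink an observed APM toward the population mean, weighting
    each term by its inverse variance (precision)."""
    w_obs = 1.0 / std_err**2    # precision of the observation
    w_pop = 1.0 / st_dev**2     # precision of the population prior
    return (apm * w_obs + mean_pop * w_pop) / (w_obs + w_pop)

# Example: observed APM +6.0 with StdErr 4.0; prior mean +0.1, SD 3.2
# true_talent(6.0, 4.0, 0.1, 3.2) -> ~+2.4 (heavily regressed)
|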
4) There is an aging curve for APM at http://sonicscentral.com/apbrmetrics/viewtopic.php?t=137. This is simply added onto the True Talent Estimate.
5) How does one account for rookies? This is the hardest part of the projection--I suspect a projection based on history, a curve relating performance to draft pick number, may be about the best that can be done easily. Perhaps modified by age (actually, I'm sure it should be modified by age). For players who did not play recently, perhaps the most recent year with data (regressed and adjusted for aging) could suffice... or the projection from 2) for a player with 250 minutes or so.
6) This is always interesting... and very tricky. I suspect another Bayesian likelihood analysis is in order. We can calculate the expected minutes for a player of a given APM just as I did in 2). However, how does one estimate a "standard error" for our minutes-played number? I think the best approach would be to look at minutes per game over the season, calculate the standard error, and translate that to minutes per year. That would take a long time to do for each NBA player--but it would properly account for injuries! Then a True Minute Level estimate could be calculated the same way as I outlined for APM.
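If I understand the idea correctly, the calculation would look something like this (a sketch under my own assumptions; the post gives no code):
Code: |
# Python sketch (my assumption of the method described above):
# estimate a standard error for season minutes from per-game minutes.

import math

def season_minutes_std_err(minutes_per_game):
    """Standard error of total season minutes, treating each game's
    minutes as an independent draw. Injury games enter as zeros and
    inflate the spread, which is the desired behavior."""
    g = len(minutes_per_game)
    mean = sum(minutes_per_game) / g
    var = sum((m - mean) ** 2 for m in minutes_per_game) / (g - 1)
    # SE of a total over g games = per-game SD * sqrt(g)
    return math.sqrt(var) * math.sqrt(g)
|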
7) The True Minute Level estimates must be scaled so that total team minutes played = 82*48*5 = 19,680... On teams with players who played few minutes, some sort of modification is needed so that no player is projected over about 3,000 minutes. In other words, the scaling is not linear. I have not solved how to scale properly yet... has anyone else?
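One possible answer (my sketch, not a method from the thread): scale iteratively, capping players as they hit the ceiling and redistributing the excess:
Code: |
# Python sketch of iterative capped scaling (one possible approach;
# the post explicitly leaves this question open).

def scale_minutes(projected, team_total=19680, cap=3000.0):
    minutes = dict(projected)          # player -> projected minutes
    capped = set()
    while True:
        free = [p for p in minutes if p not in capped]
        if not free:                   # everyone capped; give up
            return minutes
        remaining = team_total - len(capped) * cap
        factor = remaining / sum(minutes[p] for p in free)
        done = True
        for p in free:
            minutes[p] *= factor
            if minutes[p] > cap:       # cap, then redistribute next pass
                minutes[p] = cap
                capped.add(p)
                done = False
        if done:
            return minutes
|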
8) Finally, something easy--use the % playing time for each player, add up the adjustments (per 100 possessions), and use your favorite Pythagorean formula to create an expected winning percentage.
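A hedged sketch of this last step (the exponent and league-average rating are my placeholder assumptions; the post says to use your favorite formula):
Code: |
# Python sketch (illustrative). Exponent 14 is one common choice for
# NBA Pythagorean records; 107.0 is a placeholder league-average
# rating per 100 possessions.

def pythag_win_pct(off_rtg, def_rtg, exponent=14.0):
    return off_rtg**exponent / (off_rtg**exponent + def_rtg**exponent)

def team_win_pct(players, league_avg_per100=107.0):
    """players: list of (apm_per100, share_of_team_player_minutes),
    where shares sum to 1 over the team's 19,680 player-minutes."""
    # With 5 players on the court, the time-averaged lineup net rating
    # is 5x the minute-weighted average APM.
    margin = 5 * sum(apm * share for apm, share in players)
    # Split the margin evenly around league average (a simplification).
    return pythag_win_pct(league_avg_per100 + margin / 2,
                          league_avg_per100 - margin / 2)
|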
Thoughts?
Neil Paine
Joined: 13 Oct 2005 Posts: 774 Location: Atlanta, GA
Posted: Thu Aug 13, 2009 10:33 am
We've actually done some work like this at BBR, except with Statistical +/- rather than pure Adjusted +/-. The same basic tenets should hold true for both, though, since both attempt to measure the same thing.
Ryan J. Parker
Joined: 23 Mar 2007 Posts: 711 Location: Raleigh, NC
Posted: Thu Aug 13, 2009 11:04 am
I would like to see an emphasis on predicting team efficiency where we know the following information for each "shift":
- Players on the court
- Which lineup is at home
- The quarter and time remaining
- The starting margin of victory of the "shift"
- Other pertinent info we might know ahead of time
The general idea is that players get injured or don't play for whatever reason. I think a more interesting analysis comes not from trying to predict that sort of thing, but from using previous-season data to forecast what would happen if we knew which players would be on the court and what situations those players would be in.
I'm not suggesting future-season predictions aren't important, but there will be a lot of extra noise in those predictions when what we really want is a model strictly for predicting points scored by competing lineups.
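Concretely, a minimal sketch of what one "shift" record might contain (Python; the field names are my assumptions, not Parker's):
Code: |
# Hypothetical Python sketch of one "shift" observation, per the
# list above. Field names are assumptions, not from the post.

from dataclasses import dataclass

@dataclass
class Shift:
    home_lineup: tuple       # 5 player ids on court for the home team
    away_lineup: tuple       # 5 player ids for the road team
    quarter: int             # 1-4, with 5+ for overtime
    seconds_remaining: int   # time left in the quarter at shift start
    start_margin: int        # home score minus away score at shift start
    home_points: int = 0     # target: points scored during the shift
    away_points: int = 0
|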
In any event, I believe this would allow us to properly gauge how well a given model performs in predicting team efficiency. Even though we go in having measured (guessed?) what we think are the best predictors of future lineup efficiency, we can then actually test where these models do and do not perform well. We can examine low-minute players, situations where a player was given a new home, etc.
_________________
I am a basketball geek.
Crow
Joined: 20 Jan 2009 Posts: 825
Posted: Thu Aug 13, 2009 12:07 pm
Instead of just comparable players by minutes, what about by minutes and whether a player is a PG, wing, or big? Or one of the existing similarity systems based on discrete stats and demographics, or statistical +/-, or a hybrid of all of these?
DSMok1
Joined: 05 Aug 2009 Posts: 611 Location: Where the wind comes sweeping down the plains
Posted: Sat Sep 12, 2009 12:07 pm
I have delved further into the world of Bayesian statistics.
The basic Bayesian model is here:
Code: |
P(A|B) = [P(B|A) * P(A)] / P(B)

P(A|B) is the probability of A given B--the probability desired.
P(A)   is the prior probability of A. This is the standard distribution
       curve of NBA athletes.
P(B|A) is the conditional probability of B given A, or (and this is where
       I was wrong) the standard distribution curve of the player (using
       standard error for the standard deviation of the curve).
P(B)   NORMALIZES the distribution curve so the cumulative value under
       the curve equals 1. (I was wrong in saying it could be neglected.)
|
And here are the equations that give the new curve created by that relation:
Code: |
Creating our best estimate for the true level of a statistic
(using our initial estimate and adding our likelihood function based
on observation):

              APM*(1/StdErr^2) + MeanPop*(1/StDev^2)
TrueTalent = ----------------------------------------
                   (1/StdErr^2) + (1/StDev^2)

Where
 TrueTalent = our best estimate of the actual APM
 APM        = our observed adjusted plus/minus
 StdErr     = the standard error of our observed APM
 MeanPop    = the mean APM of the population of which our player is a member
 StDev      = the standard deviation of APM in that population

And the new standard error (for this best estimate) is:

 StdErrTT = 1 / sqrt( 1/StdErr^2 + 1/StDev^2 )
|
I might add that these can be expanded: when multiplying a whole series of normal distributions together, just add the corresponding extra terms in the equations above.
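A short Python sketch of the multi-distribution version (my illustration):
Code: |
# Python sketch (my illustration): combine any number of independent
# normal estimates by summing precisions (1/variance).

import math

def combine_normals(means, sds):
    """Returns (combined mean, combined standard error)."""
    weights = [1.0 / sd**2 for sd in sds]    # precisions
    total = sum(weights)
    mean = sum(m * w for m, w in zip(means, weights)) / total
    return mean, 1.0 / math.sqrt(total)

# Example: a prior (0.0, 3.2) combined with two observed seasons:
# combine_normals([0.0, 4.0, 5.5], [3.2, 4.0, 3.5])
|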
So this is the comprehensive model.
How can it be implemented, however? How does one get the standard error terms?
Well, for percentages, the standard error may be determined using a binomial distribution equation. So, if one has, say, 500 trials and the estimated likelihood of the event happening is 0.3, the standard error is:
Code: |
Binomial standard error:

 BinomialStdErr = sqrt( p*(1-p) / N )

Where
 p = measured probability of the event happening
 N = number of trials

For p = 0.3 and N = 500:  BinomialStdErr = 0.0205
|
This binomial standard error works only when measuring events with two outcomes (success or failure), though perhaps with modification it can be extended to slightly more complex cases (see Ed Küpfer's discussion of TS% and this type of method).
Otherwise, more complex analyses are in order. If one doesn't have access to the original data (the rebounds in each game for each person, for instance) then an approximation may have to be used. I did an analysis of adjusted yards per attempt in football--but all I had was the year-end stats for each player. To estimate the standard error associated with each player, I looked at all passing plays in 2005 (from the free database) and calculated the standard deviation of the play results. Then I used the definition of standard error to estimate the standard error associated with each QB's stats. Example:
Code: |
 StdErr = StDev / sqrt(N)

For Brees:
 StDev = 15    (std. dev. of yards per attempt, counting -60 for INTs)
 N     = 635   (attempts)

 StdErr = 15 / sqrt(635) = 0.595
|
Okay, so we have a distribution of similar players to regress to and we have the player's stats and standard error. What's left? Aging and year-to-year effects.
To translate from one year to the next, we take the curve for one year and multiply it by (or add to it, I suppose) a transformation distribution. That distribution could be 1 +/- some standard deviation, and its mean could be more or less than 1, depending on the aging curve being used. To generate this error distribution, there are several methods; comparing year-over-year values of the same players with some minimum playing time is a common one. In generating that distribution, it is best to use the already-regressed values for each year, to account for likely error in the measurement. This will yield a distribution with a mean very near 1 and a standard deviation that could be, say, 0.15.
To apply this year-to-year transformation, a simple multiplication of the distribution curves is used--see Propagation of Uncertainty:
Code: |
 FinalMean = ValMean * TransMean

The mean is simply the product of the mean of Value and the mean of
Transformation. There is no covariance.

 FinalStdErr = ValMean * TransMean
               * sqrt( (ValDev/ValMean)^2 + (TransDev/TransMean)^2 )
|
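In code (a sketch; note the relative-error form assumes the value's mean is not near zero):
Code: |
# Python sketch (my illustration) of applying a year-to-year or aging
# transformation via the propagation-of-uncertainty product rule.
# Note: the relative-error form breaks down if val_mean is exactly zero.

import math

def apply_transform(val_mean, val_dev, trans_mean=1.0, trans_dev=0.15):
    """Multiply an estimate by a transformation distribution. Defaults
    echo the post's example: mean near 1, standard deviation ~0.15."""
    final_mean = val_mean * trans_mean
    final_dev = abs(final_mean) * math.sqrt(
        (val_dev / val_mean) ** 2 + (trans_dev / trans_mean) ** 2)
    return final_mean, final_dev
|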
So to generate a projection, one would create a number of normal distribution curves:
1. The "prior" or similar-population distribution
2. The previous year's distribution, untransformed (not regressed to the mean)
3. Each earlier year available (not regressed), multiplied by the transformation as many times as needed to reach the previous year
Then these would all be combined at once with the Bayesian formulas above, effectively regressing to the mean and yielding the best estimate of "true talent" as of the end of the previous year. Finally, the transformation to this year (aging and expected variance) could be applied.
There are alternatives to this order of assembly--one could regress each season first and not use the "prior" explicitly here, but the errors don't quite work right in that order: players with very little data end up with the league-wide distribution as their own.
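Assembled, the order described above might look like this (a sketch reusing combine_normals and apply_transform from earlier; the example inputs are hypothetical):
Code: |
# Python sketch (my illustration) assembling the projection in the
# order described above, reusing combine_normals and apply_transform.

def project_player(prior_mean, prior_sd, seasons):
    """seasons: list of (apm, std_err, years_back), newest first.
    Returns (projected APM, standard error) for the coming season."""
    means, sds = [prior_mean], [prior_sd]
    for apm, se, years_back in seasons:
        m, s = apm, se
        for _ in range(years_back):     # carry older seasons forward
            m, s = apply_transform(m, s)
        means.append(m)
        sds.append(s)
    talent, err = combine_normals(means, sds)   # regress to the mean
    return apply_transform(talent, err)         # step into next season

# Hypothetical example: prior from ~2000 projected minutes plus two
# observed seasons (last year, then the year before):
# project_player(0.08, 3.19, [(4.0, 3.5, 0), (2.5, 4.2, 1)])
|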
It looks like I am now ready to create projections of adjusted +/- for next year--what's more, each projection will carry an appropriate error distribution around the expectation, so updates throughout the year can be applied (provided they have accurate error intervals).
Thanks to all of the sites whose info I used in compiling this--this forum, Count the Basket, Tango and MGL over at THE BOOK, The Hardball Times (though they were often wrong), and any others. I used MathCAD and Excel for deriving and checking everything.
Please, find any errors and correct my formulation before I set off on the projections! (Of course, to actually get team projections I must project the minutes--which is even trickier.)
Crow
Joined: 20 Jan 2009 Posts: 825
Posted: Sat Sep 12, 2009 1:56 pm
I hope you get even more feedback from those experienced with the calculations. But best wishes / full speed ahead either way.
"1) What current measure/past measures of APM should be used?
This is important and tough question. 1 year stabilized is what I am mainly using but 6 year average in good too. I don't know... what about something like 2/3 thirds 1 year stabilized, 1/3rd 6 year average (or just do the math to find the resulting year weights)? That might be "better".
2) To what mean should the APM measure be regressed to approximate true talent level?
I think this should be position-specific, at least to the level of PG, wing, and big. Whether it should also be split between starters/big-minute players and subs I don't know, but I'd at least consider it for a moment.
4) What aging curve should be used?
I think it should be specific to at least the 3 size/role groups I just suggested above.
5) How do you account for rookies and players who played few minutes?
For rookies I think some combination of draft position and college stats might be better than a one-assumption-fits-all approach. For players who played few minutes, I might adjust the values by the quality of the player's team and its system to some degree, rather than treating everyone across the league exactly the same.
6) How do you best project minutes?
I think subjective assessment, directly or after an experience-based algorithm, could get closest, but I haven't explicitly tried it.
7) Assembling a roster/minute projections for next season
Ideally I'd think you'd want to try to get at playing time for 5-man lineups, though this will be very difficult given how little obvious logic is applied to those decisions.
8) Combining team APM to provide a Pythag Win%
I haven't seen anyone report much detail about the covariances. Is it time? If you did, what would you do next? Can you usefully adjust further, or just offer a blanket qualifier about the meaning of what you have?
I think the step from the list of players on the court to unit production is key. If you had player-pair (with and without) and perimeter and interior sub-unit adjusted ratings from last season, maybe you could better estimate how player APM will actually combine into lineup APM and roll up into team APM--at least better than essentially treating every context as the same for players when lots agree they are not. I don't know if you want to do this or share it, but it would be great if you did. Maybe in an ultimate implementation you run all the possibilities through a Monte Carlo simulator to guide the final choice.