APBRmetrics Forum Index APBRmetrics
The statistical revolution will not be televised.
 

Regressing to the mean

 
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Fri Mar 17, 2006 12:21 am    Post subject: Regressing to the mean

Over the first three games of the 2002 season, Shaq hit 34 of 48 from the free throw line (71%), including a game where he went 16 for 18. What did the coaching staff think? Did they rejoice, believing that Shaq had finally mastered the art of free throw shooting after years of indifference? Or did they think "fluke" and wait for the inevitable return to 50% form? What would you have thought?

At that point in his career, Shaq had been 3393-6390 (53%), having shot a high of 59% as a rookie and a low of 48% in his fifth season.

Code:
YEAR   FTM   FTA     FT%
1993   427   721   59.2%
1994   471   850   55.4%
1995   455   854   53.3%
1996   249   511   48.7%
1997   232   479   48.4%
1998   359   681   52.7%
1999   269   498   54.0%
2000   432   824   52.4%
2001   499   972   51.3%

TOTAL 3393  6390   53.1%


What I want to do is show you a method of forecasting how well Shaq will shoot for the following season, called regression to the mean. This method is in the appendix to The Book, by Tango, Lichtman, and Dolphin. I cannot tell you how much I learned from reading it. I'll try to present their method correctly, although I've had to modify a few things, since they were writing about baseball. Any mistakes are completely mine.

First of all, we recognize that the 48 FTAs in 2002 is really a small sample, especially in comparison to the 6000+ FTAs Shaq had taken until that point in his career. Those 6000+ shots include real information about Shaq's ability to hit a FT, but how much information? And how should we merge the information gathered over his first 48 shots in 2002 with what we know about Shaq's overall FT% ability? (Also note that there's no trend to Shaq's FT%, so we can be reasonably sure that his overall numbers are still useful to us.)

I'll give you the formula for regressing first, and explain later:

Code:
Regressed FT% = [ObservedFT%/Variance(ObservedFT%) + MeanFT%/Variance(MeanFT%)]
                 /   [ 1/Variance(ObservedFT%) + 1/Variance(MeanFT%)]

ObservedFT% is the 71% that Shaq shot at the beginning of the season. From now on, I'll refer to Shaq's 34-48, 71% ObservedFT% simply as FT%.

The variance of FT% is calculated by

Code:
VAR = FT%*(1 - FT%)/FTA
    = 71%*(1 - 71%)/48
    =  0.43%
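
As a quick check of the arithmetic, the same variance can be computed in a few lines of Python. This is only a sketch: the function name `binomial_rate_variance` is made up for illustration and does not come from The Book.

```python
def binomial_rate_variance(made, attempts):
    """Sampling variance of an observed success rate: p*(1 - p)/n."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = made / attempts
    return p * (1.0 - p) / attempts


# Shaq's first three games of 2002: 34 of 48 from the line
var_obs = binomial_rate_variance(34, 48)
print(f"{var_obs:.2%}")  # about 0.43%
```

Using the exact 34-48 instead of the rounded 71% changes the answer only in the third significant digit.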


MeanFT% is a reference number that we believe better represents the true FT% ability of the player. It can be the league-wide average FT%, or it can be Shaq's most recent season, but in this case, we'll use his career FT%. If we were talking about a player who hadn't been in the NBA long, we'd probably regress to the league-wide mean, but with Shaq, we have enough history to know that 53% probably represents his ability pretty well. I called it here MeanFT% because of the "regression to the mean" title of the post, but really, if you simply think of "mean" as a reference number to which we're regressing, you'll save yourself some confusion.

The Variance(MeanFT%) term is not, as you may think, the counterpart to Variance(ObservedFT%), ie FT%*(1-FT%)/FTA. What this term is intended to represent is the distribution of true MeanFT% FT shooting ability. If we were using league-wide FT% as the MeanFT%, we'd be looking for the variance of true FT shooting ability among all players. We want their actual shooting ability, not their observed ability. A player who shoots 1-1, or 10-10, or even 100-100 is not a true 100% free throw shooter. He still has a nonzero chance of missing.

[A word on "true" ability. It is an ideal, abstract concept. It assumes that the ability does not change at all during the time period in question. It assumes that each trial is identical, and neutral with respect to the probability of the player succeeding: for example, it ignores quality of opponent, injuries, and other pressures. These fictions are extremely useful, however. And I believe they are more sound than the alternative, that a player's ability to hit a shot is changing constantly, indeed that it is different from shot attempt to shot attempt. In any case, the latter gives us nothing to do, nowhere to go, stats-wise, and so we go along with the idea that a player's ability is a constant, at least for a while.]

So how do we calculate the variance of the true FT% ability among our reference category (in this case, Shaq's career FT%)? The easiest way is to simulate every FTA. I won't describe the simulations, because there is a better, more theoretically sound method, but I always simulate these things, just to make sure my numbers are correct. Here's how to do it mathematically. I'm warning you, it's complicated.

Code:
{A} = (FT% - MeanFT%)^2 - FT%*(1-FT%)/FTA


The first part, as you can see, is the squared deviation of Shaq's FT% from the reference. The second part is the variance of the FT%.

Code:
{B} = 2*(FT%*(1-FT%)/FTA + FUDGE)^2


There you see the FT% variance again, but what the hell is FUDGE? Tell you the truth, I'm not really sure. But for now, give it an initial value of the variance of observed year-to-year FT%. Or zero. It doesn't really matter for now, you'll be changing it later.

For every row, calculate {A} and {B}. Next to that, label two columns {1} and {2}, and enter the following formulas:

Code:
{1} = {A}/{B}

{2} =  1/{B}


I did this for each year in Shaq's career from 1993 to 2001.

The true ability variance is calculated as sum({1})/sum({2}). Fiddle around with the FUDGE value until VAR converges on a stable solution (ie doesn't change much). I ended up settling on FUDGE=0, and that gave me VAR = 0.05%. Now I can plug it into the regression formula above.
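
The column arithmetic above can be sketched in Python. This is a minimal sketch under stated assumptions, not the spreadsheet actually used: the names `SEASONS` and `true_variance` are made up for illustration, the season lines are copied from the table at the top of this post, and the career 53.1% is used as MeanFT%. With FUDGE = 0 it should land in the neighbourhood of the 0.05% quoted above.

```python
# (FTM, FTA) per season, 1993 to 2001, from the table at the top of the post
SEASONS = [(427, 721), (471, 850), (455, 854), (249, 511), (232, 479),
           (359, 681), (269, 498), (432, 824), (499, 972)]


def true_variance(seasons, mean_rate, fudge=0.0):
    """Estimate the variance of true ability as sum({1}) / sum({2})."""
    sum_1 = 0.0
    sum_2 = 0.0
    for made, attempts in seasons:
        p = made / attempts
        sampling_var = p * (1.0 - p) / attempts
        a = (p - mean_rate) ** 2 - sampling_var    # column {A}
        b = 2.0 * (sampling_var + fudge) ** 2      # column {B}
        sum_1 += a / b                             # column {1}
        sum_2 += 1.0 / b                           # column {2}
    return sum_1 / sum_2


career = sum(m for m, _ in SEASONS) / sum(a for _, a in SEASONS)
var_true = true_variance(SEASONS, career)
print(f"career FT% = {career:.1%}, true-ability variance = {var_true:.3%}")
```

Note that with FUDGE = 0 a row with a 0% or 100% season would divide by zero, so a small positive FUDGE is the safer starting value on other data.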

Code:
Regressed FT% = [ObservedFT%/Variance(ObservedFT%) + MeanFT%/Variance(MeanFT%)]
                 /   [ 1/Variance(ObservedFT%) + 1/Variance(MeanFT%)]

              = (71% / (71%*(1-71%)/48) + 53%/0.05%)/(1/(71%*(1-71%)/48) + 1/0.05%)

              = 53.5%


What that all means is that our best guess for Shaq's true (rather than observed) FT shooting during those games during which he shot 71% is actually 53.5%. If you look closely at the regression formula, you'll see that the observed FT% is regressed heavily when the variance of true FT% ability is relatively small, or when the variance of observed FT% is large—as in the case above, when the true variance was 0.05% and the observed variance was 0.43%. When the true variance is large and the observed variance is small, the FT% is regressed very little. The take home message here is that 48 FTAs is a very, very small sample—you saw that even Shaq's 71% was regressed 98% to the mean.
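
The regression formula is just an inverse-variance weighted average, which can be sketched as below. The function name `regress_to_mean` is made up for illustration, and the inputs are the rounded figures quoted in this post (71% on 48 attempts, a 53% reference, a true-ability variance of 0.05%). With those rounded inputs the sketch evaluates to roughly 54.9%, a little above the 53.5% quoted, so the posted figure presumably came from unrounded or slightly different inputs.

```python
def regress_to_mean(obs, var_obs, mean, var_mean):
    """Inverse-variance weighted average of an observed rate and a reference rate."""
    numerator = obs / var_obs + mean / var_mean
    denominator = 1.0 / var_obs + 1.0 / var_mean
    return numerator / denominator


obs = 0.71                          # observed FT% over 48 attempts
var_obs = obs * (1.0 - obs) / 48    # sampling variance of the observed FT%
mean = 0.53                         # reference (career) FT%
var_mean = 0.0005                   # variance of true ability (0.05%)

regressed = regress_to_mean(obs, var_obs, mean, var_mean)
shrink = (obs - regressed) / (obs - mean)  # share of the gap to the mean that is removed
print(f"regressed FT% = {regressed:.1%}, regressed {shrink:.0%} of the way to the mean")
```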

* * * * * *

The method for regressing to the mean outlined above can be used for any stat which has a binary outcome on each trial (ie binomial)—win/lose, hit/miss, offensive rebound/defensive rebound, etc. It cannot be used for any stat that has more than two possible outcomes on each trial (multinomial)—eg EFG%, TS%, etc. In my next post I'll show how to regress multinomials.

* * * * * *

Oh, by the way, Shaq went on that season to shoot 55.5%.
_________________
ed


Last edited by Ed Küpfer on Sun Mar 19, 2006 7:25 pm; edited 1 time in total
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Sun Mar 19, 2006 7:24 pm

In my earlier post, I described a method for regressing binomial stats, like FT% or win%. Here I will describe how to do the same thing for multinomials, which are stats that can have more than two outcomes, like EFG% (miss, two points, or three points).

The first thing you need to do is to break the formula down into its constituent parts. EFG% is normally calculated as (2ptM + 3ptM*1.5)/FGA, but we need to express it as the probability weighted mean of the likelihood of each event. For EFG%, this is

Code:
2EFG% = p(0 pts)*0 + p(2 pts)*2 + p(3 pts)*3

EFG% = [p(0 pts)*0 + p(2 pts)*2 + p(3 pts)*3]/2


The whole thing is divided by two because each real point in the EFG formula has a weight of 0.5. You can also write the EFG equation by using weights with half the value:

Code:
EFG% = p(0 pts)*0/2 + p(2 pts)*2/2 + p(3 pts)*3/2
     = p(0 pts)*0 + p(2 pts)*1 + p(3 pts)*1.5


p(0 pts) is the probability of a player getting 0 points on an attempt. This is the sum of all missed shots divided by the sum of all attempts.

Code:
p(0 pts) = (2PtsAtt - 2PtsMade + 3PtsAtt - 3PtsMade)/(2PtsAtt + 3PtsAtt)
p(2 pts) = (2PtsMade)/(2PtsAtt + 3PtsAtt)
p(3 pts) = (3PtsMade)/(2PtsAtt + 3PtsAtt)

EFG% can also be formulated like this:

Code:
EFG% = w0*p0 + w2*p2 + w3*p3

where the w prefix stands for "points weight" and p stands for probability. The weight is simply the amount of points that event is worth (ie 0, 2, or 3 points, each divided by two). The p value is the probability of that event (eg for 0 points, the probability is the proportion of all attempts that ended as missed shots).

To calculate the variance of EFG%, you need another formula (call it x2), similar to the EFG% one above, but with the points weights squared:

Code:
x2 = (w0)^2*p0 + (w2)^2*p2 + (w3)^2*p3
   = 0*p0 + 1*p2 + 2.25*p3

The variance of EFG is then calculated as

Code:
VAR(EFG) = [x2 - (EFG%)^2]/(2PtsAtt + 3PtsAtt)
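
Here is a minimal Python sketch of the EFG% variance, using the halved points weights (0, 1, 1.5) defined above so that x2 and EFG% are on the same scale. The function name `efg_and_variance` and the shooting line in the example are made up for illustration.

```python
def efg_and_variance(fg2_made, fg2_att, fg3_made, fg3_att):
    """Return (EFG%, variance of EFG%) from two- and three-point shooting totals."""
    attempts = fg2_att + fg3_att
    p2 = fg2_made / attempts          # probability an attempt is a made two
    p3 = fg3_made / attempts          # probability an attempt is a made three
    w2, w3 = 1.0, 1.5                 # points weights, each divided by two
    # Misses carry weight 0, so p0 drops out of both sums.
    efg = w2 * p2 + w3 * p3
    x2 = w2 ** 2 * p2 + w3 ** 2 * p3  # same sum with the weights squared
    return efg, (x2 - efg ** 2) / attempts


# Hypothetical line: 300-600 on twos, 100-250 on threes
efg, var_efg = efg_and_variance(300, 600, 100, 250)
print(f"EFG% = {efg:.1%}, SD = {var_efg ** 0.5:.1%}")
```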

Now, let's say you want to do the same thing for TS%. Unfortunately, it gets even more complicated.

Remember that TS% is calculated by

Code:
TS% = PTS / (2 * (FGA + 0.44*FTA))

You can also write that as

Code:
TS% = (w0*p0 + w1*p1 + w2*p2 + w3*p3)

Where w is the points weight of each event, and p is the probability of that event taking place.

Because in TS% there are different ways to end up with some of the events (eg 2pts = 2 FTM or 2ptM) we have to calculate the probability of each event taking place, keeping in mind the different ways this can happen. Even for something as simple as TS%, it gets a little complicated. To make things easier on myself, I'm going to make two assumptions: that no player gets an "and 1" on a 3-point attempt, and no player is awarded 3 FTAs on a shooting foul. Because of the relative scarcity of these events, I don't think my assumptions change the values significantly.

Now, let's go through every conceivable outcome:

Code:
TYPE                      OUTCOME                                   PTS AWARDED     CODE

                      +-- Misses 2 FTA --------------------------------- 0           A
                      |
Player fouled, 2 FTA -+-- Makes 1 FTA, Misses 1 FTA  ------------------- 1           B
                      |
                      +-- Makes 2 FTA -----------------------------------2           C

                      +-- Makes, no And 1 ------------------------------ 2           D
                      |
                      |                          +-- Makes And1 -------- 3           E
Two Point attempt   --+-- Makes, And 1 awarded --+
                      |                          +-- Misses And 1 ------ 2           F
                      |
                      +-- Misses --------------------------------------- 0           G

                      +-- Makes ---------------------------------------- 3           H
Three Point Attempt --+
                      +-- Misses --------------------------------------- 0           I


I'll refer to each of these events by the code from the last column. There's one constant that I will use: COEF = 0.44. This is the free throw coefficient which represents the proportion of all free throw attempts that act as the back end of a 2 shot foul. It can also be seen to represent the number of times a player is fouled, by multiplying it by FTA.

A = COEF * FTA * (1 - FT%)^2
The probability of taking and missing 2 FTA is calculated by the probability of a player being fouled times the squared probability of a missed FTA

B = COEF * FTA * FT% * (1 - FT%) * 2


C = COEF * FTA * (FT%)^2


D = 2M - FTA * (1 - 2COEF)
The 2M term includes made shots that were awarded "and 1s". The second part subtracts the and 1ed made shots. If 0.44 is the proportion of FTA that are the back end of a 2-shot foul, then 2 * COEF is the proportion of FTAs that are both the front and back ends. (1 - 2COEF) is the proportion of FTAs that are *not* 2-shot shooting fouls, ie and 1s.

E = FTA * (1 - 2COEF) * FT%
These are the and 1s that we subtracted from D, multiplied by the probability of converting on the FTA.

F = FTA * (1 - 2COEF) * (1 - FT%)

G = 2A - 2M

H = 3M

I = 3A - 3M

The number of zero point events is then the sum of all events that result in zero points: A + G + I. The number of one point events is equal to B. The number of two point events is C + D + F. And the number of three point events is E + H. The probability of each of these is the number of times each one took place, divided by the sum of all events. For example, the probability of a two point event is

Code:
p2 = (C + D + F)/(A + B + C + D + E + F + G + H + I)
   = (C + D + F)/(FGA + 0.44 * FTA)

TS% can then be reformulated as the probability weighted sum of all events, similar to what we did with EFG%:

Code:
TS% = w0*p0 + w1*p1 + w2*p2 + w3*p3

where the weight is simply the points value of the event divided by two.

Code:
TS% = 0*p(A + G + I) + 0.5*p(B) + 1*p(C + D + F) + 1.5*p(E + H)

And as above, by squaring the weights you'll get x2:

Code:
x2 = 0*p(A + G + I) + 0.25*p(B) + 1*p(C + D + F) + 2.25*p(E + H)

The variance of TS% is x2 minus the square of TS%, all divided by the sum of all events:

Code:
VAR(TS%) = (x2 - (TS%)^2)/(A + B + C + D + E + F + G + H + I)
          = (x2 - (TS%)^2)/(FGA + 0.44 * FTA)
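
The A to I bookkeeping can be sketched in Python as follows. It is only a sketch under the two simplifying assumptions stated above (no and-1s on threes, no three-shot fouls); the function name `ts_and_variance` and the example box-score line are made up for illustration. The events add up to FGA + 0.44*FTA, which is what makes the weighted sum agree with the usual TS% formula.

```python
COEF = 0.44  # share of FTA that are the back end of a two-shot foul


def ts_and_variance(fg2_made, fg2_att, fg3_made, fg3_att, ft_made, ft_att):
    """Return (TS%, variance of TS%, number of scoring events)."""
    ft_pct = ft_made / ft_att if ft_att else 0.0
    trips = COEF * ft_att                  # two-shot trips to the line
    and_ones = ft_att * (1 - 2 * COEF)     # single "and 1" free throws

    a = trips * (1 - ft_pct) ** 2          # A: misses both FTs (0 pts)
    b = trips * 2 * ft_pct * (1 - ft_pct)  # B: makes one of two (1 pt)
    c = trips * ft_pct ** 2                # C: makes both (2 pts)
    d = fg2_made - and_ones                # D: made two, no and-1 (2 pts)
    e = and_ones * ft_pct                  # E: made two plus the and-1 (3 pts)
    f = and_ones * (1 - ft_pct)            # F: made two, missed the and-1 (2 pts)
    g = fg2_att - fg2_made                 # G: missed two (0 pts)
    h = fg3_made                           # H: made three (3 pts)
    i = fg3_att - fg3_made                 # I: missed three (0 pts)

    total = a + b + c + d + e + f + g + h + i
    p1 = b / total
    p2 = (c + d + f) / total
    p3 = (e + h) / total
    ts = 0.5 * p1 + 1.0 * p2 + 1.5 * p3    # weights are points / 2
    x2 = 0.25 * p1 + 1.0 * p2 + 2.25 * p3  # weights squared
    return ts, (x2 - ts ** 2) / total, total


# Hypothetical line: 300-600 twos, 100-250 threes, 160-200 free throws
ts, var_ts, events = ts_and_variance(300, 600, 100, 250, 160, 200)
print(f"TS% = {ts:.1%}, SD = {var_ts ** 0.5:.1%}, events = {events:.0f}")
```

For a player with very few made twos relative to FTA, D can come out negative under these assumptions, so the sketch is best kept to reasonably large samples.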


To regress a multinomial, you use the same equation from the first post:

Code:
Regressed % = (Obs%/VAR(Obs%) + Mean%/VAR(Mean%))/(1/VAR(Obs%) + 1/VAR(Mean%))


where Obs% is the player multinomial in question, VAR is the variance of that multinomial as calculated above, Mean% is the reference mean of the multinomial, and VAR(Mean%) is the variance of the mean reference multinomial, which is calculated in the same manner as described for binomials in the post above.
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 20, 2006 12:35 pm

Using the methods I described in the previous posts, I calculated the variance of the distribution for five shooting stats among the population of NBA players. My dataset included all player-seasons since 1977-78 in which the players had more than 50 attempts. To calculate the variances, I first subtracted that season's league-wide mean from each player's %, and then added 50% (except for 3-pt%, where I added 33%, and FT%, where I added 75%). This centered each season around the same mean, making for useful comparisons.

Distribution of true shooting ability:

Code:
        SD      VAR
P2%     0.034   0.0012
P3%     0.150   0.0226
EFG%    0.037   0.0014
TS%     0.145   0.0209
FT%     0.085   0.0073

You can see that the 2-pt and EFG% shooting ability is pretty well concentrated, while TS% and 3-pt ability is distributed much more widely. FT% is somewhere in between. What this means is that, all else being equal, a player's 2-pt% and EFG% should be heavily regressed to the league-wide average, his FT% moderately regressed, and his TS% and 3-pt% lightly regressed, ie retaining much of the original value.

Here's a hack to calculate how much to regress, without going through a lot of trouble:

Code:
Regressed% = Player% - (Player% - League%)*(Constant/sqrt(Attempts))


Code:
%     Constant     Attempts

P2%     5          2att
P3%     1          3att
EFG%    5          2att + 3att
TS%     1          2att + 3att + .44*FTAtt
FT%     2.5        FTAtt
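
The hack translates directly into Python. In this sketch the constants are the ones from the table above; the function name `quick_regress` and the league FT% of 75 in the example are assumptions made for illustration (75% is the value the seasons were centered on, not a measured league average).

```python
import math

# Constants from the table above, keyed by stat
CONSTANTS = {"P2%": 5.0, "P3%": 1.0, "EFG%": 5.0, "TS%": 1.0, "FT%": 2.5}


def quick_regress(stat, player_pct, league_pct, attempts):
    """Shortcut regression: pull player_pct toward league_pct by Constant/sqrt(Attempts)."""
    shrink = CONSTANTS[stat] / math.sqrt(attempts)
    return player_pct - (player_pct - league_pct) * shrink


# A 48% free throw shooter on 103 attempts, regressed toward an assumed 75% league mean
print(round(quick_regress("FT%", 48.0, 75.0, 103), 1))
```

One thing to watch: when Attempts is smaller than the square of the constant (under 25 attempts for P2% and EFG%), the shrink factor exceeds 1 and the result overshoots the league mean, so the shortcut is only sensible above that threshold.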

Here are some players from this season with their actual and regressed %ages, just to give you a feel for it.

Code:
PLAYER          GP    Min   2a  3A  FTA                     P2%   P3%    EFG%   TS%    FT%

abdur-rahim,sha 55   1558  473  19  228     Actual          53     26     52     59     80
                                            Regressed       51     29     51     59     80

allen,ray       61   2397  667 512  280     Actual          49     41     54     59     90
                                            Regressed       48     41     53     59     89

andersen,chris  32    568   98   0  103     Actual          57            57     56     48
                                            Regressed       51            52     56     54

anderson,shando 32    355   71   8   13     Actual          42            42     44     69
                                            Regressed       46            47     45     73

araujo,rafael   40    452   93   1   14     Actual          40            39     41     57
                                            Regressed       45            45     42     69

armstrong,darre 47    413   56  27   29     Actual          39     19     36     42     76
                                            Regressed       46     22     45     44     75

atkins,chucky   54   1251  226 191  104     Actual          42     37     48     52     78
                                            Regressed       45     37     48     52     77

banks,marcus    41    912  249  49   90     Actual          51     31     50     54     78
                                            Regressed       49     31     49     54     77

_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 27, 2006 2:35 am

Why don't I just go ahead and use this thread to dump numbers? Yes, ed, why don't you? You have my permission.

Using the methods outlined above, I looked at individual player rebounding percentages. I wanted to see what the distribution of rebounding ability looked like, within each position.

I used the data from 82games, showing the number of offensive/defensive rebounds and number of offensive/defensive rebounding opportunities for three seasons (02-03 to 04-05). I limited my search to the top 1000 player-team-seasons in rebounding opportunities. My database has every player listed under one of the following positions: PG, G, SG, GF, SF, F, PF, FC, and C. I don't really trust these positions too well, but it's a start. Here's how each position rebounded (average reb%, weighted by opportunities):



That's fairly obvious, and not very interesting. What I was getting at is the distribution of rebounding ability, which I will summarise with the standard deviation (derived by calculating the variance using that long complicated method above):

Code:
POS     OR%     sdOR%   DR%     sdDR%
PG      2.1%    0.9%    8.0%    1.7%
G       2.3%    1.1%    8.6%    1.8%
SG      3.0%    1.5%    9.9%    2.0%
GF      3.5%    1.7%    10.8%   2.4%
SF      4.8%    2.1%    12.5%   3.1%
F       6.0%    2.4%    14.8%   3.6%
PF      7.0%    2.5%    16.2%   3.7%
FC      7.8%    2.3%    17.0%   3.4%
C       8.3%    2.0%    17.1%   3.1%


Note two things from the table above:

1. The bigger the player, the larger the variation in rebounding ability. That is, point guards' rebounding ability is homogeneous, while the big men have more varied rebounding skills.

2. Offensive rebounding has less variation than defensive rebounding.

Remember the fundamental rule about regressing to the mean: if the variance of the population is small, you have to regress much of your player's numbers. If the population variance is large, you don't have to regress much.

Applying this rule to the two observations, we'll note first that a point guard's rebounding numbers do not have to be regressed all that much. In other words, our best guess for any point guard's rebounding ability is something close to his actual rebounding stats. Second, because defensive rebounding has such a large variance, a player will not have to have his DR% regressed as much as his OR%.

Once again, these won't mean all that much for a season's worth of numbers. But the amount of regression makes a real difference in small samples, like early season stuff.
_________________
ed
cherokee_ACB



Joined: 22 Mar 2006
Posts: 111

Posted: Mon Mar 27, 2006 11:16 am

Ed Küpfer wrote:

1. The bigger the player, the larger the variation in rebounding ability. That is, point guards rebounding ability is homogenous, while the big men have more varied rebounding skills.

2. Offensive rebounding has less variation than defensive rebounding.


Is that absolute or relative variation?


Quote:
Applying this rule to the two observations, we'll note first that a point guard's rebounding numbers do not have to be regressed all that much.


I believe you meant the contrary here, that PG rebounds will be more regressed than C numbers, since PG variance is smaller. However, according to your formulas, the observed variance of guards will be noticeably smaller as well. The net result would actually be a heavier regression for centers vs guards, and for DR% vs OR%.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 27, 2006 11:49 am

cherokee_ACB wrote:
Ed Küpfer wrote:
2. Offensive rebounding has less variation than defensive rebounding.


Is that absolute or relative variation?


I'm not sure what you mean. The variation is distributed around the average of each position, not the global average. Is that what you mean by relative?


Quote:
Quote:
Applying this rule to the two observations, we'll note first that a point guard's rebounding numbers do not have to be regressed all that much.


I believe you meant the contrary here, that PG rebounds will be more regressed than C numbers, since PG variance is smaller.


Yes, of course you're right here.

Quote:
However, according to your formulas, the observed variance of guards will be noticeably smaller as well. The net result would actually be a heavier regression for centers vs guards, and for DR% vs OR%.


In practise, it doesn't always work out that way. Even though the relative individual variance for guards is smaller than for big men (because the reb% are further from 0.5) the differences are not always large enough to overwhelm the difference of the variance of the population. Here's a sample:

Code:
PLAYER                  POS  |  oOR%    rOR%    mOR%   %regOR |  oDR%    rDR%    mDR%  %regDR
----------------------------------------------------------------------------------------------
Nick Van Exel           PG   |   1.3     1.3     2.1    10.1  |   7.2     7.3     8.0    16.2
Carlos Arroyo           G    |   2.1     2.1     2.3    19.9  |   6.4     6.9     8.6    22.4
Ray Allen               SG   |   3.8     3.7     3.0    11.7  |  10.2    10.2     9.9    16.4
Luke Walton             SF   |   5.4     5.3     4.8    14.3  |  12.9    12.8    12.5    14.2
Richard Jefferson       F    |   3.9     4.0     6.0     4.7  |  14.6    14.6    14.8     6.8
Michael Ruffin          PF   |  10.7    10.0     7.0    18.1  |  20.8    20.1    16.2    16.2
Nene                    FC   |   8.2     8.2     7.8     9.9  |  16.9    16.9    17.0     8.7
Zaza Pachulia           C    |   9.9     9.5     8.3    23.9  |  16.1    16.3    17.1    17.9


The "o" prefix stands for the observed player reb%, "r" for regressed, "m" for the position mean, "%reg" is the percentage regressed. The amount of regression is heavily dependent on playing time.
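
One way to read the %reg columns, assuming %reg is the weight that the regression formula from the first post puts on the position mean (an assumption about how the column was built, not something stated above), is VAR(Obs%)/(VAR(Obs%) + VAR(Mean%)). The sketch below uses the position means and SDs for OR% from the earlier table; the function name `fraction_regressed` and the 500 rebounding opportunities are hypothetical.

```python
def fraction_regressed(var_obs, var_pop):
    """Weight placed on the reference mean by the inverse-variance formula."""
    return var_obs / (var_obs + var_pop)


def binomial_var(p, n):
    """Sampling variance of an observed rate p over n trials."""
    return p * (1.0 - p) / n


opportunities = 500  # hypothetical sample size, same for both players

# Position mean OR% and SD of true OR% ability, from the table in the earlier post
pg_mean, pg_sd = 0.021, 0.009
c_mean, c_sd = 0.083, 0.020

pg_reg = fraction_regressed(binomial_var(pg_mean, opportunities), pg_sd ** 2)
c_reg = fraction_regressed(binomial_var(c_mean, opportunities), c_sd ** 2)
print(f"PG regressed {pg_reg:.0%}, C regressed {c_reg:.0%}")
```

With equal opportunities, the average point guard ends up regressed somewhat more than the average center here: the guard's smaller observed variance does not fully offset the much smaller population variance.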
_________________
ed
cherokee_ACB



Joined: 22 Mar 2006
Posts: 111

Posted: Mon Mar 27, 2006 12:39 pm

By absolute vs relative I meant whether

POS OR% sdOR%
PG 2.1% 0.9%

means 2.1 +- 0.9 (absolute) or 2.1 +- 0.009*2.1 (relative to the mean). I assume the former, but the % symbol can be confusing. If that's the case, I'm not sure I'd say that the rebounding ability of small guys is more homogeneous. Rebounding numbers of guards seem to be more spread around their mean (they have a higher stdev/mean ratio) than those of inside players. Anyway, it's just semantics.
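
The stdev/mean ratio can be checked directly from the table Ed posted. A quick Python sketch (the names `REB_TABLE` and `cv` are made up for illustration; the numbers are copied from that table, in percent):

```python
# POS: (OR%, sdOR%, DR%, sdDR%), copied from Ed's table
REB_TABLE = {
    "PG": (2.1, 0.9, 8.0, 1.7),
    "G":  (2.3, 1.1, 8.6, 1.8),
    "SG": (3.0, 1.5, 9.9, 2.0),
    "GF": (3.5, 1.7, 10.8, 2.4),
    "SF": (4.8, 2.1, 12.5, 3.1),
    "F":  (6.0, 2.4, 14.8, 3.6),
    "PF": (7.0, 2.5, 16.2, 3.7),
    "FC": (7.8, 2.3, 17.0, 3.4),
    "C":  (8.3, 2.0, 17.1, 3.1),
}


def cv(mean, sd):
    """Coefficient of variation: standard deviation relative to the mean."""
    return sd / mean


for pos, (or_mean, or_sd, dr_mean, dr_sd) in REB_TABLE.items():
    print(f"{pos:<3} OR% sd/mean = {cv(or_mean, or_sd):.2f}   "
          f"DR% sd/mean = {cv(dr_mean, dr_sd):.2f}")
```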

Quote:
Even though the relative individual variance for guards is smaller than for big men (because the reb% are further from 0.5) the differences are not always large enough to overwhelm the difference of the variance of the population.


Yeah, you are right in most cases (look at SF vs FC for an exception). I had made the intuitive estimations with the stdev numbers you presented, but of course I should have used the variance, whose differences are much higher.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 27, 2006 12:50 pm

cherokee_ACB wrote:
By absolute vs relative I meant whether

POS OR% sdOR%
PG 2.1% 0.9%

means 2.1 +- 0.9 (absolute) or 2.1 +- 0.009*2.1 (relative to the mean). I assume the former, but the % symbol can be confusing.


Ah, I see. Yes, I used absolute variation. What matters to me is the variance itself, not the means, but it's easy enough to multiply the two if needed. The rest of your comment regards my sloppy writing, and, yes, you have a point WRT the homogeneity.
_________________
ed
mgl



Joined: 27 Mar 2006
Posts: 2

Posted: Mon Mar 27, 2006 3:08 pm

Ed, great stuff! I am one of the authors (Lichtman) of the aforementioned book (The Book), although Dolphin, who is a brilliant statistician as well as sabermetrician, wrote the appendix.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 27, 2006 5:00 pm

mgl wrote:
Ed, great stuff! I am one of the authors (Lichtman) of the aforementioned book (The Book), although Dolphin, who is a brilliant statistician as well as sabermetrician, wrote the appendix.


Hey. I've been following your posts/rants on regressing (among other things) for years now. Good to have the whole thing in a mathematically rigorous form. I'm currently pimping the book out to anyone who'll listen, so maybe you'll have a good segment of hoopheads in among your audience.
_________________
ed
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon Mar 27, 2006 6:54 pm

Using the same A B C method as for TS% from above, I calculated the variances of team Offensive and Defensive Ratings. (Slight change: I added a category J, equivalent to the number of team turnovers, and from the p0 category I subtracted the number of team Offensive Rebounds. This made sure that p0 + p1 + p2 + p3 = Total Possessions = FGA + 0.44 * FTA - OR + TO.) I found something odd: the variation in ORTG among teams is about 20% higher than the variation of DRTG. This difference is consistent no matter what era you look at. Go figure. Anyway, the standard deviation of ORTG is 3.0, and for DRTG is 2.6.
_________________
ed
TexasEx



Joined: 12 May 2006
Posts: 28
Location: Houston, TX

Posted: Mon May 26, 2008 11:35 am

Ed - In your first post, you base your regression off Shaq's first 3 games of the 2002 season. What if you don't have any data for the first few games of the season you want to forecast? Do you just take the info from the previous year and then do the remainder of the steps you lay out?
Eli W



Joined: 01 Feb 2005
Posts: 369

Posted: Mon May 26, 2008 1:05 pm

TexasEx wrote:
Ed - In your first post, you base your regression off Shaq's first 3 games of the 2002 season. What if you don't have any data for the first few games of the season you want to forecast? Do you just take the info from the previous year and then do the remainder of the steps you lay out?


Ed disagrees with me, but I don't think it makes sense to regress a player's current season rate to his past season rates, like in the Shaq example. If I wanted to predict Shaq's FT% for the season (or estimate his true FT% skill), I would just use his career FT% (including whatever part of the current season has been played) as my prediction/estimation and be done with it (if you wanted, some sophistication could be added by weighting recent seasons more heavily than older seasons). If Shaq had only been in the league for a year or so and not taken many free throws, then I would regress toward the average FT% for centers.

I think regression to the mean should be used to regress a player toward a population of other players, rather than toward a population of his past selves.
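
The "weight recent seasons more heavily" idea could be sketched like this. Everything about the weighting scheme is an assumption made for illustration: the geometric `decay` of 0.8 per season and the function name `weighted_career_rate` are hypothetical, and the season lines are Shaq's 1993 to 2001 totals from Ed's first post.

```python
# (FTM, FTA) per season, oldest first, from the table in the first post
SHAQ_SEASONS = [(427, 721), (471, 850), (455, 854), (249, 511), (232, 479),
                (359, 681), (269, 498), (432, 824), (499, 972)]


def weighted_career_rate(seasons, decay=0.8):
    """Career rate with each older season's makes and attempts down-weighted by `decay`."""
    made_sum = 0.0
    att_sum = 0.0
    for age, (made, attempts) in enumerate(reversed(seasons)):
        weight = decay ** age  # the most recent season has age 0 and weight 1
        made_sum += weight * made
        att_sum += weight * attempts
    return made_sum / att_sum


print(f"unweighted: {weighted_career_rate(SHAQ_SEASONS, decay=1.0):.1%}")
print(f"decay 0.8:  {weighted_career_rate(SHAQ_SEASONS, decay=0.8):.1%}")
```

With decay set to 1.0 this reduces to the plain career percentage, so the two approaches can be compared on the same footing.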
_________________
Eli W. (formerly John Quincy)
CountTheBasket.com
Ed Küpfer



Joined: 30 Dec 2004
Posts: 647
Location: Toronto

Posted: Mon May 26, 2008 1:17 pm

At this point Eli knows more about this stuff than I do, so you should probably be listening to what he has to say. But it's worth keeping in mind what RTM is all about: you are combining two numbers (based on the relative confidence you have in each number) to forecast future performance. Eli is trying to get at the nuts and bolts of the combination method, and it's lucky that someone is – I don't have the patience for that anymore. I take a more results-oriented approach, and use whatever works. "Whatever works" depends on how you define "works" – I'm looking forward to seeing what Eli comes up with.
_________________
ed
Harold Almonte



Joined: 04 Aug 2006
Posts: 430

Posted: Mon May 26, 2008 7:58 pm

An off-topic observation:
Quote:
Distribution of true shooting ability:

Code:
SD VAR
P2% 0.034 0.0012
P3% 0.150 0.0226
EFG% 0.037 0.0014
TS% 0.145 0.0209
FT% 0.085 0.0073

You can see that the 2-pt and EFG% shooting ability is pretty well concentrated, while TS% and 3-pt ability is distributed much more widely. FT% is somewhere in between. What this means is that, all else being equal, a player's 2-pt% and EFG% should be heavily regressed to the league-wide average, his FT% moderately regressed, and his TS% and 3-pt% lightly regressed, ie retaining much of the original value.


The 3-point and FT shooting are constrained by a fixed distance and a particular kind of defense; 2-point shooting has a large variation in distances and ways of scoring (assisted or not, driving or not, post-ups, layups, dunks, tips, jumpshots, etc). I think if we split out every kind of 2-pt shooting, the shooting skill deviations won't be so concentrated.

I think the right way to put it would be: "setting aside the usage-efficiency tradeoff issue, the 2-pt SCORING ability appears to be pretty well concentrated".
All times are GMT - 5 Hours