APBRmetrics

Ed Küpfer · Joined: 30 Dec 2004 Posts: 785 Location: Toronto

I wanted to see how well we could predict games missed due to injury from the past. Seemed simple enough: regress games missed in year Y onto games missed in year Y-1. Maybe toss in age and position as well.

A big problem became apparent almost immediately: players miss games for all kinds of reasons, not just injuries. In fact, we can separate the causes for missed games into three categories:

1. Player - injuries, personal reasons (funeral, wedding, etc.), salary holdout, drug-fueled bender.
2. Coach - strategic decision not to use player, disciplinary reasons, etc.
3. League - suspensions.

(Are there more? I can't think of any.)

I'll deal with these in reverse order. From Patricia Bender's site, I downloaded her daily reports, and searched for the keyword 'suspen*'. I went through these manually and assigned the final suspensions (sometimes they're reduced after the fact) to the players. Her dailies only go back to the 94-95 season, so that's my sample. I then calculated an "effective games played" for every player, which is the sum of their actual games played and the number of games missed due to suspension. I calculated an "effective minutes played" stat in a similar way.

(BTW I scaled up all games played, games missed, and minutes played stats from 1999 by multiplying them by 82/50. Also, due to trades, some players end up playing more than 82 games in a season. I maxed out all games played at 82.)
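A rough sketch of that bookkeeping in Python. The function names, the order of operations (add suspensions, then scale the 1999 numbers, then cap), and the definition of games missed as 82 minus effective games are assumptions for illustration; the post does not spell them out.

```python
FULL_SEASON = 82     # games in a normal season
LOCKOUT_GAMES = 50   # games in the shortened 1999 season


def effective_games(games_played, games_suspended, season):
    """Actual games played plus games lost to suspension, scaled and capped."""
    total = games_played + games_suspended
    if season == 1999:
        # Scale the 50-game season up to an 82-game equivalent.
        total *= FULL_SEASON / LOCKOUT_GAMES
    # Traded players can appear in more than 82 games; cap at 82.
    return min(total, FULL_SEASON)


def games_missed(games_played, games_suspended, season):
    """Assumed definition: whatever is left of an 82-game season."""
    return FULL_SEASON - effective_games(games_played, games_suspended, season)
```

For example, a player with 40 games and no suspensions in 1999 is treated as having played about 65.6 games and missed about 16.4.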

For an analysis of this scale, it's hard to see how coaching decisions affect the number of games a player misses. To get around this problem, I decided to focus on players who are least affected by strategic coaching decisions: starters. I calculated a weighted games started-percentage stat, like this:

WGS% = ((GS_1 / GP_1)*3 + (GS_2 / GP_2)*2 + (GS_3 / GP_3)*1) / 6

GS_n = games started n seasons ago, ie n = 2 means two seasons previously.
GP_n = games actually played n seasons ago, not effective games played.
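The weighted start percentage can be sketched like this. The function name and the input layout (a list of games started / games played pairs, most recent season first) are made up for the example, and it assumes the player appeared in at least one game in each of the three seasons, since the post does not say how missing seasons were handled.

```python
def weighted_start_pct(seasons):
    """seasons: [(started, played)] for 1, 2 and 3 seasons ago, in that order."""
    weights = (3, 2, 1)  # most recent season counts the most
    if len(seasons) != len(weights):
        raise ValueError("expected exactly three seasons")
    total = 0.0
    for weight, (started, played) in zip(weights, seasons):
        if played <= 0:
            raise ValueError("games played must be positive in every season")
        total += weight * started / played
    return total / sum(weights)
```

A player who started every game last season, half his games the season before, and none three seasons ago comes out at (3*1 + 2*0.5 + 1*0)/6, or about 0.67.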

For this study, I included only players who, in the previous three seasons, had started 50% or more of the games in which they played. I'm pretty sure that these types of players are the ones who wouldn't miss a game because the coach felt he didn't need them. And although they might miss a game because of disciplinary reasons, Patricia would've mentioned an offense serious enough for the player to miss more than one game. (She lists plenty of 1-game suspensions as it is.)

With regard to player reasons for missing a game, there's not much I can do here. Rony Seikaly was suspended for one game for his 1997 holdout -- I can't find any mention of others. I know players have missed games here and there for funerals and other personal reasons, but I don't think they should affect this study too much. The drug-fueled benders will also be coded in my sample as injuries.

One more thing: I restricted my sample to players who were between 24 and 30 years old. This gave me a total of 848 player-seasons.
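Put together, the sample filter looks something like the sketch below. The PlayerSeason record and its field names are hypothetical, and it assumes the 50% starter cutoff is applied to the weighted start percentage.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    # Hypothetical record layout, for illustration only.
    name: str
    age: int
    wgs_pct: float  # weighted games-started percentage, 0.0 to 1.0


def in_sample(ps, min_age=24, max_age=30, min_wgs=0.5):
    """Keep established starters between 24 and 30 years old, inclusive."""
    return min_age <= ps.age <= max_age and ps.wgs_pct >= min_wgs
```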

Now then. Here are some diagnostic plots showing the relationship between games missed in one season and the previous two seasons, along with their distribution.

As you can see, a linear model won't work very well at all. I decided to use two robust fitting methods, quantile regression (QR) and robust linear regression (RLM). Also, I decided not to use age as a predictor, since I'm only looking at players in a 7-year window. "Wear and tear" is accounted for by "minutes played in previous season" variables. To sum up, I'm regressing Games Missed in one season against Games Missed In Previous Season and Games Missed In Season Before That. Glad we got that cleared up.
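For anyone who hasn't met robust linear regression: one common version downweights observations with large residuals and refits until things settle. The post doesn't say which software or loss function was used, so the sketch below is only an illustration of the general idea, assuming Huber weights with the conventional tuning constant 1.345, a MAD-based scale estimate, and a single predictor to keep it short. All names are made up for the example.

```python
import statistics


def weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_line(xs, ys, k=1.345, iterations=50):
    """Iteratively reweighted least squares with Huber weights."""
    ws = [1.0] * len(xs)  # first pass is ordinary least squares
    for _ in range(iterations):
        a, b = weighted_line(xs, ys, ws)
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break  # at least half the points are fit exactly
        cutoff = k * scale
        # Full weight inside the cutoff, shrinking weight outside it.
        ws = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in resid]
    return a, b
```

Calling weighted_line with all weights equal to 1 gives the ordinary least-squares line, which makes it easy to compare the two fits on data with a few extreme seasons.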

You all want to see more graphs. I can tell these things. Well, get a load of this one:

That shows the number of games missed (the vertical axis) and the number of games missed in the previous season (horizontal axis), similar to one of the plots above. Those grey ticks along the top and right show the distribution of observations -- you can see that they are concentrated at the lower values, showing that most players in my sample missed few games.

The lines show the result of a robust regression model. Robust linear regression corresponds closely to the bread and butter linear regression we all know and love, and we can interpret the results (shown below) the same way.

The dashed line on the bottom shows the function of the regression assuming the player had missed 10 games in the season 2 years previous. The solid line shows the same regression line, but assuming the player had missed 20 games 2 years before. The topmost dashed line is the same, except assuming 30 games missed.

The intercept is equal to six games missed, the coefficient for games missed in the previous season is 0.17, and the coefficient for games missed two seasons previously is 0.07. That means the expected number of games missed is 6, plus the number of games missed one season before divided by six, plus a small contribution from two seasons before. For every six games missed one season ago, a player expects to miss only one additional game. A player needs to miss about 14 games (1/0.07) in the season two years before to expect to miss a single additional game. In other words, games missed last year are not a strong predictor of future games missed, and games missed two seasons previously can effectively be ignored. This effect is displayed graphically in the plot above, where the marginal difference between missing 30 games 2 years previously and missing 10 games 2 years previously is only about one and a half games.
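Plugging the rounded coefficients quoted above into a function makes the arithmetic easy to check. The function and constant names are made up for the example, and since the coefficients are rounded, the outputs are approximate.

```python
INTERCEPT = 6.0    # expected games missed with a clean two-year record
COEF_PREV = 0.17   # per game missed one season ago
COEF_PREV2 = 0.07  # per game missed two seasons ago


def expected_games_missed(missed_prev, missed_prev2):
    """Expected games missed from the fitted robust regression line."""
    return INTERCEPT + COEF_PREV * missed_prev + COEF_PREV2 * missed_prev2


if __name__ == "__main__":
    # Gap between the top and bottom lines in the plot (30 vs 10 games
    # missed two seasons ago), holding last season fixed at 20.
    gap = expected_games_missed(20, 30) - expected_games_missed(20, 10)
    print(round(gap, 2))
```

The gap works out to 20 * 0.07 = 1.4 games, which is the "about one and a half games" mentioned above.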

One of the problems with dealing with skewed distributions like this one is that regressing against the mean doesn't tell us anything interesting. The mean number of games missed is 15, but is there any real difference between missing 15 games and missing 10 games, as far as injuries are concerned? I mean, of course there is, but are we going to see it using my data? I'll bet that a lot of those are games missed due to other reasons: DNP-CDs, personal reasons, tweaked an ankle in practice and the game isn't important enough to risk real injury, etc. However, a player missing 30 games isn't likely to have missed most of those for non-injury reasons. To deal with those, I'll have to use my favourite robust technique, quantile regression, which I'll go through in a later post.
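As a preview of why quantiles help with skewed data: quantile regression minimizes the "pinball" (check) loss rather than squared error. The sketch below fits only a constant, not a regression line, and the games-missed numbers are invented for illustration, not taken from the sample.

```python
def pinball_loss(residual, tau):
    """Check loss: under-predictions cost tau, over-predictions cost 1 - tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def best_constant(values, tau):
    """Constant minimizing total pinball loss, searched over the data points.

    A minimizer always exists among the data points, so this brute-force
    search is enough for a small example.
    """
    candidates = sorted(set(values))
    return min(
        candidates,
        key=lambda c: sum(pinball_loss(v - c, tau) for v in values),
    )


if __name__ == "__main__":
    # Invented, right-skewed games-missed totals for nine player-seasons.
    missed = [0, 0, 1, 2, 3, 5, 8, 30, 60]
    print("mean:", sum(missed) / len(missed))
    print("tau=0.5:", best_constant(missed, 0.5))
    print("tau=0.9:", best_constant(missed, 0.9))
```

With these made-up numbers the mean sits around 12, the tau = 0.5 fit lands on the median of 3, and the tau = 0.9 fit lands out in the tail, which is where the real injuries live.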

tmansback · Joined: 12 Aug 2005 Posts: 129

A lot to read. Just pencil in Larry Hughes next year for 55-65 games and Gary Payton to not miss a game unless it's due to suspension.

THWilson · Joined: 19 Jul 2005 Posts: 164 Location: phoenix

Very interesting as always, Ed.

It's not clear from the presentation (unless I just missed it) whether your previous season minutes include playoff games. I've heard this mentioned before as the reason that players like Tim Duncan and Shaq miss extra games, because they always go deep into the playoffs and have an extra ~800 min. wear/tear.

Ed Küpfer · Joined: 30 Dec 2004 Posts: 785 Location: Toronto

I didn't include playoff games. I've still got another regression to run, so maybe I'll include them. I'll make a prediction now: playoff games won't change the slope significantly.
_________________
ed