We’ve shown time and time again how helpful a team’s shot rates are in projecting how well that team is likely to do going forward. To this point, however, data has always been contained in-season, ignoring what teams did in past seasons. Since most teams keep large percentages of their personnel, it’s worth looking into the predictive power of last season.
We don’t currently have shot locations for previous seasons, but we do have general shot data going back to 2011. This means that I can look at all the 2012 and 2013 teams, and how important their 2011 and 2012 seasons were, respectively. Here goes.
First, I split each season into two halves of 17 games, calculating stats from each half. Let’s start by leaving out the previous season’s data. Here is the predictive power of shot rates and finishing rates, where the response variable is second-half goal differential.
|Variable||Coefficient||P-value|
|Attempt Diff (first 17)||0.14244||0.00%|
|Finishing Diff (first 17)||77.06047||1.18%|
To summarize, I used total shot attempt differential and finishing rate differential from the first 17 games to predict the goal differential for each team in the final 17 games. Also, I controlled for how many home games each team had remaining. The sample size here is the 56 team-seasons from 2011 through 2013. All three variables are significant in the model, though the individual slopes should be interpreted carefully.*
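For anyone who wants to try this kind of model at home, here is a minimal sketch of the setup described above. It is not my actual code or data: the arrays are synthetic stand-ins generated with made-up slopes, and the names (`fit_ols`, `attempt_diff`, `finishing_diff`, `home_remaining`) are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, residual standard error, R-squared).
    """
    X1 = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares solution
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]                      # observations minus fitted parameters
    rse = np.sqrt(resid @ resid / dof)
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, rse, r2

# Synthetic stand-ins for the first-half stats (NOT real MLS data).
rng = np.random.default_rng(0)
n = 56                                        # team-seasons, as in the text
attempt_diff = rng.normal(0, 40, n)           # shot attempt differential, first 17 games
finishing_diff = rng.normal(0, 0.03, n)       # finishing rate differential, first 17 games
home_remaining = rng.integers(7, 11, n)       # home games left in the final 17

# Fake response: made-up slopes plus noise, purely for illustration.
goal_diff_second_half = (0.14 * attempt_diff + 77 * finishing_diff
                         + 3.4 * (home_remaining - 8.5) + rng.normal(0, 6.4, n))

X = np.column_stack([attempt_diff, finishing_diff, home_remaining])
beta, rse, r2 = fit_ols(X, goal_diff_second_half)
print("intercept and slopes:", beta)
print("residual standard error:", rse, "R-squared:", r2)
```

Swapping the synthetic arrays for real first-half and second-half team stats would give a table like the one above, though the p-values would need a stats package rather than bare least squares.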
The residual standard error for this model is high at 6.4 goals of differential. Soccer is random, and predicting exact goal differentials is impossible, but that doesn’t mean this regression is worthless. The R-squared value is 0.574, though as James Grayson has pointed out to me, the square root of that figure (0.757) makes more intuitive sense. One might say that we are capable of explaining 57.4 percent of the variance in second-half goal differentials, or 75.7 percent of the standard deviation (sort of). Either way, we’re explaining something, and that’s cool.
But we’re here to talk about the effects of last season, so without further mumbo jumbo, the results of a more-involved linear regression:
|Variable||Coefficient||P-value|
|Attempt Diff (first 17)||0.12426||0.03%|
|Attempt Diff (last season)||0.02144||28.03%|
|Finishing Diff (first 17)||93.27359||1.14%|
|Finishing Diff (last season)||72.69412||12.09%|
Now we’ve added teams’ shot and finishing differentials from the previous season. Obviously, I had to cut out the 2011 data (since 2010 is not available to me currently), as well as Montreal’s 2012 season (since they made no Impact in 2011**). This left me with a sample size of 37 team-seasons. Though the residual standard error was a little higher at 6.6 goals, the regression now explained 65.2 percent of the variance in second-half goal differential. Neither previous-season variable is statistically significant on its own, and larger sample sizes would be nice (I’ll work on that), but for now it seems that, even halfway through a season, the previous season’s data may improve the projection, especially when it comes to finishing rates.
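Since the bigger model has more predictors and fewer observations, one rough sanity check is adjusted R-squared, which penalizes extra variables. The sketch below uses the R-squared values and sample sizes quoted in this post. The predictor counts are an assumption: three for the first model, and five for the second (the four listed plus home games remaining, assuming that control stayed in). The two models were also fit on different samples, so treat the comparison as suggestive at best.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared and sample sizes are from the text; predictor counts are assumed.
first_model = adjusted_r2(0.574, n=56, p=3)
second_model = adjusted_r2(0.652, n=37, p=5)

print(round(first_model, 3), round(second_model, 3))
```

Under those assumptions the second model still comes out ahead after the penalty, which fits the hedged conclusion that last season’s data may be adding something.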
But what about projecting outcomes for, say, a team’s fourth game of the season? Using its rates from just three games of the current season would lead to shaky projections at best. I theorize that, as a season progresses, the current season’s data get more and more important for the prediction, while the previous season’s data become relatively less important.
My results were most assuredly inconclusive, but leaned in a rather strange direction. The previous season’s shot data was seemingly more helpful in predicting outcomes during the second half of the season than it was in the first half, except, of course, for the first few weeks of the season. Specifically, the previous season’s shot data was more helpful for predicting games from weeks 21 to 35 than it was from weeks 6 to 20. This was true for finishing rates, as well, and led me to recheck my data. I found no errors, and now I’m left to explain why information from a team’s previous season helps project game outcomes in the second half of the current season better than the first half.
Anybody want to take a look? Here are the results of some logistic regression models. Note that the coefficients represent the estimated change in (natural) log odds of a home victory.
|Weeks 6 – 20||Coefficient||P-value|
|Home Shot Diff||0.139||0.35%|
|H Shot Diff (previous)||-0.073||29.30%|
|Away Shot Diff||-0.079||7.61%|
|A Shot Diff (previous)||-0.052||47.09%|
|Weeks 21 – 35||Coefficient||P-value|
|Home Shot Diff||0.087||19.37%|
|H Shot Diff (previous)||0.181||6.01%|
|Away Shot Diff||-0.096||15.78%|
|A Shot Diff (previous)||-0.181||4.85%|
Later on in the season, during weeks 21 to 35, the previous season’s data actually appears to become more important to the prediction than the current season’s data, both in statistical significance and in the size of the coefficients. This is despite the current season’s shot data being based on an ample sample of at least 19 games (depending on the specific match in the data set). So I guess I’m comfortable saying that last season matters, but I’m still confused, a condition I face daily.
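To make the log-odds coefficients a little more concrete, here is a small sketch that converts them into odds ratios and into a shifted win probability. The coefficients are copied from the weeks 21 to 35 table; the 50 percent baseline home-win probability and the helper names (`odds_ratio`, `shift_probability`) are hypothetical choices for illustration only.

```python
import math

# Coefficients from the weeks 21-35 table (change in log odds of a home win).
coefs = {
    "Home Shot Diff": 0.087,
    "H Shot Diff (previous)": 0.181,
    "Away Shot Diff": -0.096,
    "A Shot Diff (previous)": -0.181,
}

def odds_ratio(coef):
    """Multiplicative change in the odds of a home win per one-unit increase."""
    return math.exp(coef)

def shift_probability(p, coef, delta=1.0):
    """Home-win probability after moving one predictor by delta from baseline p."""
    log_odds = math.log(p / (1 - p)) + coef * delta
    return 1 / (1 + math.exp(-log_odds))

baseline = 0.5  # hypothetical baseline, not an estimate from the data
for name, coef in coefs.items():
    print(name, round(odds_ratio(coef), 3), round(shift_probability(baseline, coef), 3))
```

A positive coefficient pushes the odds ratio above 1 and the home-win probability above the baseline, and a negative one does the opposite. How much a given shift in probability matters depends on the baseline you start from, which is why the table reports log odds rather than probabilities.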
*The model suggests that each additional home game remaining projects a three-goal improvement in differential (3.37, actually). In a vacuum, that makes no sense. However, we are not vacuuming. Teams that have more home games remaining have also played a tougher schedule. Thus the +3.37 coefficient for each additional home game remaining is also adjusting the projection for teams whose shot rates are suffering due to playing on the road more frequently.
**Drew hates me right now.