Happy Thanksgiving, all! We have a Conference Finals-centric podcast concerning the outcomes of this past weekend’s matches. Also, a little discussion came out of it concerning formations vs. personnel. It’s just us talking and coming up with some random thoughts. It’s not that long, but it’s us and we’re having a good chat… I think Matty at one point insults Sebastián Velásquez’s haircut, so there is that.
We’ve shown time and time again how helpful a team’s shot rates are in projecting how well that team is likely to do going forward. To this point, however, data has always been contained in-season, ignoring what teams did in past seasons. Since most teams keep large percentages of their personnel, it’s worth looking into the predictive power of last season.
We don’t currently have shot locations for previous seasons, but we do have general shot data going back to 2011. This means that I can look at all the 2012 and 2013 teams, and how important their 2011 and 2012 seasons were, respectively. Here goes.
First, I split each of the 2012 and 2013 seasons into two halves, calculating stats from each half. Let’s start by leaving out the previous season’s data. Here is the predictive power of shot rates and finishing rates, where the response variable is second-half goal differential.
|Variable||Coefficient||P-value|
|Attempt Diff (first 17)||0.14244||0.00%|
|Finishing Diff (first 17)||77.06047||1.18%|
To summarize, I used total shot attempt differential and finishing rate differential from the first 17 games to predict the goal differential for each team in the final 17 games. Also, I controlled for how many home games each team had remaining. The sample size here is the 56 team-seasons from 2011 through 2013. All three variables are significant in the model, though the individual slopes should be interpreted carefully.*
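To make those slopes a little more concrete, here is a minimal sketch that compares two made-up teams using the coefficients reported here (0.14244 for attempts, 77.06047 for finishing, and the 3.37 per home game remaining from the footnote). The intercept isn't listed, so the sketch only computes the gap between two projections, where the intercept cancels out. The team numbers, the function name, and the reading of finishing differential as a proportion are all assumptions for illustration.

```python
# Coefficients as reported in the post; the intercept is not given.
ATTEMPT_COEF = 0.14244      # per shot attempt of differential (first 17 games)
FINISHING_COEF = 77.06047   # per unit of finishing differential (assumed to be a proportion)
HOME_GAMES_COEF = 3.37      # per additional home game remaining (from the footnote)


def projection_gap(team_a, team_b):
    """Difference in projected second-half goal differential (A minus B)."""
    return (
        ATTEMPT_COEF * (team_a["attempt_diff"] - team_b["attempt_diff"])
        + FINISHING_COEF * (team_a["finishing_diff"] - team_b["finishing_diff"])
        + HOME_GAMES_COEF * (team_a["home_games_left"] - team_b["home_games_left"])
    )


# Hypothetical teams, invented for illustration only.
team_a = {"attempt_diff": 30, "finishing_diff": 0.02, "home_games_left": 9}
team_b = {"attempt_diff": -10, "finishing_diff": -0.01, "home_games_left": 8}

gap = projection_gap(team_a, team_b)
print(f"Team A projects {gap:.1f} goals better than Team B")
```

Because the slopes are correlated with each other, a gap like this is best read as a rough illustration rather than a clean per-variable effect.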
The residual standard error for this model is high at 6.4 goals of differential. Soccer is random, and predicting exact goal differentials is impossible, but that doesn’t mean this regression is worthless. The R-squared value is 0.574, though as James Grayson has pointed out to me, the square root of that figure (0.757) makes more intuitive sense. One might say that we are capable of explaining 57.4 percent of the variance in second-half goal differentials, or 75.7 percent of the standard deviation (sort of). Either way, we’re explaining something, and that’s cool.
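For anyone wondering about the "sort of," here is a small sketch of the arithmetic. It takes the square root of the reported R-squared, and also computes how much the residual standard deviation shrinks relative to the raw standard deviation, which is a different and smaller number. The variable names are mine, and the calculation ignores degrees-of-freedom adjustments.

```python
import math

r_squared = 0.574  # reported in the post

# Correlation between projected and actual second-half goal differential.
r = math.sqrt(r_squared)

# Share of the standard deviation left unexplained, and the implied reduction.
unexplained_sd_share = math.sqrt(1 - r_squared)
sd_reduction = 1 - unexplained_sd_share

print(f"R = {r:.2f}")
print(f"Residual SD is {unexplained_sd_share:.2f} of the raw SD "
      f"(a reduction of {sd_reduction:.2f})")
```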
But we’re here to talk about the effects of last season, so without further mumbo jumbo, the results of a more-involved linear regression:
|Variable||Coefficient||P-value|
|Attempt Diff (first 17)||0.12426||0.03%|
|Attempt Diff (last season)||0.02144||28.03%|
|Finishing Diff (first 17)||93.27359||1.14%|
|Finishing Diff (last season)||72.69412||12.09%|
Now we’ve added teams’ shot and finishing differentials from the previous season. Obviously, I had to cut out the 2011 data (since 2010 is not available to me currently), as well as Montreal’s 2012 season (since they made no Impact in 2011**). This left me with a sample size of 37 teams. Though the residual standard error was a little higher at 6.6 goals, the regression now explained 65.2 percent of the variance in second-half goal differential. Larger sample sizes would be nice, and I’ll work on that, but for now it seems that—even halfway through a season—the previous season’s data may improve the projection, especially when it comes to finishing rates.
But what about projecting outcomes for, say, a team’s fourth game of the season? Using its rates from just three games of the current season would lead to shaky projections at best. I theorize that, as a season progresses, the current season’s data get more and more important for the prediction, while the previous season’s data become relatively less important.
My results were most assuredly inconclusive, but leaned in a rather strange direction. The previous season’s shot data was seemingly more helpful in predicting outcomes during the second half of the season than it was in the first half—except, of course, the first few weeks of the season. Specifically, the previous season’s shot data was more helpful for predicting games from weeks 21 to 35 than it was from weeks 6 to 20. This was true for finishing rates, as well, and led me to recheck my data. The data was errorless, and now I’m left to explain why information from a team’s previous season helps project game outcomes in the second half of the current season better than the first half.
Anybody want to take a look? Here are the results of some logistic regression models. Note that the coefficients represent the estimated change in (natural) log odds of a home victory.
|Weeks 6 – 20||Coefficient||P-value|
|Home Shot Diff||0.139||0.35%|
|H Shot Diff (previous)||-0.073||29.30%|
|Away Shot Diff||-0.079||7.61%|
|A Shot Diff (previous)||-0.052||47.09%|
|Weeks 21 – 35||Coefficient||P-value|
|Home Shot Diff||0.087||19.37%|
|H Shot Diff (previous)||0.181||6.01%|
|Away Shot Diff||-0.096||15.78%|
|A Shot Diff (previous)||-0.181||4.85%|
Later on in the season, during weeks 21 to 35, the previous season’s data actually appears to become more important to the prediction than the current season’s data—both in statistical significance and actual significance. This despite the current season’s shot data being based on an ample sample of at least 19 games (depending on the specific match in the data set). So I guess I’m comfortable saying that last season matters, but I’m still confused—a condition I face daily.
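Since log odds aren't the friendliest units, here is a minimal sketch that converts the weeks 21 to 35 coefficients into odds multipliers by exponentiating them. The coefficients are copied from the table; the dictionary and variable names are mine, and the size of "one unit" of shot differential is whatever the underlying data used, which isn't spelled out here.

```python
import math

# Coefficients for weeks 21-35, copied from the table.
weeks_21_35 = {
    "Home Shot Diff": 0.087,
    "H Shot Diff (previous)": 0.181,
    "Away Shot Diff": -0.096,
    "A Shot Diff (previous)": -0.181,
}

# exp(coefficient) is the factor applied to the odds of a home win
# for a one-unit increase in that variable, holding the others fixed.
odds_multipliers = {name: math.exp(coef) for name, coef in weeks_21_35.items()}

for name, mult in odds_multipliers.items():
    print(f"{name}: odds of a home win multiply by {mult:.2f} per unit")
```

A multiplier above 1 raises the odds of a home win and a multiplier below 1 lowers them, so the two previous-season variables push in opposite directions by the same amount.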
*The model suggests that each additional home game remaining projects a three-goal improvement in differential (3.37, actually). In a vacuum, that makes no sense. However, we are not vacuuming. Teams that have more home games remaining have also played a tougher schedule. Thus the +3.37 coefficient for each additional home game remaining is also adjusting the projection for teams whose shot rates are suffering due to playing on the road more frequently.
**Drew hates me right now.
Hey, guys… we’re back with better audio quality this week. A big thanks to Drew, who put things together last week in my place. Despite technology falling apart around them, Drew and Matty were able to put together a great podcast.
This week on the show we tackle MLS playoffs, CONCACAF and USMNT dealings and then some Transfers/Loan rumors that are out there. It’s a longer podcast, but it’d been a few weeks since we all got together, and things just rolled. I hope you enjoy it.
In the wake of Major League Baseball awarding its MVP to Miguel Cabrera, debates over what “valuable” means have once again flared up. Though soccer and baseball are two incredibly different sports, I think we can apply some of the same logic to both MVP discussions. Major League Soccer has about two weeks remaining before its MVP award is handed out, and we will no doubt encounter many of the same controversies in the soccer blogosphere that appear in baseball every season.
The MVP controversy usually begins with what “valuable” means. I think there’s little doubt in most people’s minds that “valuable” and “skilled” are correlated. The main controversy is how correlated. To some, asking who was the best player in Major League Soccer in 2013 would be equivalent to asking who was the most valuable to his team. To others, there would be some key distinctions, the most common of which is that MVPs must come from teams that reach the postseason.
In retort to that thinking, some very astute commenters in a Fangraphs.com forum offered up these nuggets. Hendu for Kutch made the analogy:
“We each want to buy something that costs $1. I’ve got a quarter, 8 nickels, and 10 pennies. My ‘team’ of coins is worth 75 cents and falls short of being able to buy the item. You have one dime and 18 nickels. Your ‘team’ is worth $1, and you successfully buy the item. Is your dime more valuable than my quarter simply because it led to a successful item purchase?”
Mike Trout = quarter and Miguel Cabrera = dime, for those of you not so into baseball, and the question is a good one. Few would argue that the dime is more valuable than the quarter just because it found itself in a position to help buy that scrumptious Twix.
In reply to someone arguing that the quarter had no value because it didn’t lead to the purchase of a desired item, BIP and ndavis910 then chimed in:
“Except not everything costs $1, and at any rate, you would always choose the quarter over the dime when accumulating money for a purchase.”
“Especially when you don’t know the cost of the items until you get to the store. In baseball, a team cannot be sure how many wins it will take to reach the playoffs until the last day of the season. In your example, the quarter is the most valuable piece regardless of whether or not the item cost $1 or $0.75.”
When thinking about attributing value to players like Marco Di Vaio, Mike Magee, Camilo Sanvezzo, Robbie Keane and company, why should it matter where their teams finished? If one believes that Magee, for instance, is the best player in MLS, then does it matter if he took his team from 39 points to 49, versus from 40 points to 50? Either way, it’s still ten points of value in the standings. When Magee was traded to Chicago, neither Chicago nor Magee knew that the Fire was going to need 50 points to make the playoffs. The fact that they got just 49 points shouldn’t negate any of Magee’s value.
If you say that it matters because MLS clubs get real value from extra playoff games, then think about this. Playoff cutoff lines are quite arbitrary. If MLS allowed only the top two teams in from each conference—not completely unreasonable for a league of just 19 teams—then none of the players mentioned above would be considered under this playoffs requirement. Playoffs represent an arbitrary bar that the players competing for the award don’t get to set, and while reaching the playoffs does bring the team measurable revenue and value, basing an award on something outside an individual’s control would, in my opinion, strip the award of its intended meaning and purpose.
Now let’s anticipate the logical counterargument—that players pick up their games in playoff races and play well when it matters most.
For a moment, let’s ignore the fact that little evidence has ever been found in professional sports that players can turn it on and turn it off as needed. This past season, Magee scored seven goals in Chicago’s final nine games, a stretch in which the team averaged 1.56 points per match. That represents a pace that would have gotten the Fire into the playoffs if maintained for the entire season. Di Vaio scored five goals in his last 10 games—I even included that tenth-to-last game in which he scored two goals—in a stretch where Montreal tallied just 0.7 points per match, limping into the playoffs on a tie-breaker with Chicago. Just because one team makes the playoffs doesn’t mean its best player was at his peak when it mattered. Goals are, admittedly, a narrow-minded way to measure a striker’s value, but I think the point is still valid.
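Here is the back-of-the-envelope math behind those paces, as a minimal sketch. It stretches each points-per-match figure over a 34-game regular season, which was the MLS schedule length in 2013. The function and variable names are mine.

```python
SEASON_GAMES = 34  # MLS regular-season games per team in 2013


def season_pace(points_per_match, games=SEASON_GAMES):
    """Points a team would finish with if it held this pace all season."""
    return points_per_match * games


fire_pace = season_pace(1.56)   # Chicago over its final nine games
impact_pace = season_pace(0.7)  # Montreal over its final ten games

print(f"Chicago's late pace over a full season: {fire_pace:.0f} points")
print(f"Montreal's late pace over a full season: {impact_pace:.0f} points")
```

Chicago's late-season pace clears the roughly 50 points it turned out to need, while Montreal's falls well short of it.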
For me, the Magee-Di Vaio example above may have been no more than an exercise in confirmation bias. I chose to see what I already believed. However, the logic behind the belief that team standings shouldn’t matter to players’ MVP merits is still good stuff, and transcends any biased example I can come up with.
If we’re ready to agree that the MVP award should essentially be given to the best overall player, then we still have a tall task ahead of us. How do we measure skill on the soccer field? That is the 64,000-dollar question, and one we hope to help tackle here at ASA some day. But perhaps it’s not so crazy to think that a guy like Federico Higuain is deserving of the MVP award. If you scoff at that notion, you likely do so because you’ve been trained to think about MVP awards in a certain way.
We’re all about re-thinking things around here.
This week we talked about how cool and hip we are, followed by a discussion of the first legs of the MLS Cup semifinals. We continued with potential changes to MLS’ CONCACAF Champions League berths, Klinsmann’s 23-man roster for the upcoming friendlies versus Scotland and Austria, and the top 50 players in MLS by pass completion percentage. We concluded with a discussion of burritos and proper burrito folding practices.
A look at the 4-2 scoreline may give the appearance that Real Salt Lake shredded Portland’s defense in a wide-open free-for-all. On the contrary, two of RSL’s goals came directly from corner kicks, while a third was courtesy of the generosity and stone touch of Futty Danso (who was also marking Schuler on RSL’s first goal). Credit should of course go to Salt Lake for piling on the pressure, but what really characterized Real Salt Lake’s play on Sunday was not a free-flowing attack, but rather excellent team defense and a commitment to attacking via the flanks.
No Space for Portland
Throughout the match, Real Salt Lake’s defensive shape remained resolute, and never came close to being broken down by Portland’s 4-3-3. Kyle Beckerman was, as ever, the linchpin of RSL’s midfield, leading the team in aerial duels won with 6 (of 7) and tackles (4, tied with Tony Beltran), and contributing 6 clearances. However, the incessant pressure of Sebastian Velazquez and Luis Gil—who it should be noted are 19 and 22 years old, respectively—along with the fullback pairing of Beltran (who led RSL in touches with 76) and Chris Wingert/Lovel Palmer, never allowed any space for Diego Valeri or Darlington Nagbe to work their magic in the midfield. Many of Portland’s forays into the penalty area stemmed from Rodney Wallace collecting the ball in wide positions and sending in listless crosses (0-for-6) that were easily dealt with by Nat Borchers. Forward Ryan Johnson was kept in check all game, limited to a mere 18 touches in his 59 minutes on the field.
The entirety of Portland’s productive offensive output consisted of Will Johnson’s free kick goal, Piquionne’s soaring headed goal, and a 77th minute shot from Alhassan after a slick dribbling spell through the heart of RSL’s midfield. For the entire game, Portland had only two successful dribbles and three successful crosses in the attacking third (one of which was Jewsbury’s beautiful assist).
Defending from the Front
The only change in the starting lineup for Real Salt Lake to start the game was Devon Sandoval replacing an ailing Alvaro Saborio. While few would argue that Sandoval is the better player, his kinetic style, defensive workrate, and ability to get into wide spaces provided problems for the Great Wall of Gambia.
Chalkboards of Devon Sandoval vs. Portland (left) and Alvaro Saborio vs. Los Angeles (right)
As you can see, the defense starts from the front. Sandoval pressured wide all game long, trying to disrupt Portland’s rhythm in the defensive half of the field. Of Sandoval’s 43 actions against Portland, only 11 (25.6%) took place in the center third of the field, compared to 15 of 28 (53.6%) for Saborio against Los Angeles. Sandoval also pressured back more than Saborio did: 8 of 43 (18.6%) actions by Sandoval took place in RSL’s half of the field, compared to a meager 2 of 28 for Saborio (7.1%).
Stretching the Diamond
What really stuck out about the way that Real Salt Lake played, however, was the way that their midfield “diamond” stretched from touchline-to-touchline, with Velazquez manning the left, Gil hugging the right, and Morales drifting from side-to-side, looking for an inch of space wherever he could find it.
Here is a chalkboard of passes attempted by Real Salt Lake, along with the percentage of passes attempted from each section of the field:
And here are all of the passes attempted by Portland, along with the percentage breakdown:
Real Salt Lake attempted only 13.6% of their passes from the central attacking portions of the field, while 64.3% of their passes came from the wide attacking areas. Portland, by contrast, attempted 18.9% of their passes from the central attacking areas, and 58.6% from the wide attacking zones.
RSL ratio of wide-attacking passes to central-attacking passes: 4.73-to-1
POR ratio of wide-attacking passes to central-attacking passes: 3.10-to-1
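Those two ratios are just the wide-attacking share divided by the central-attacking share. A minimal sketch, using the percentages quoted here (the function name is mine):

```python
def wide_to_central(wide_pct, central_pct):
    """Ratio of wide-attacking passes to central-attacking passes."""
    return wide_pct / central_pct


rsl_ratio = wide_to_central(64.3, 13.6)
por_ratio = wide_to_central(58.6, 18.9)

print(f"RSL: {rsl_ratio:.2f}-to-1")
print(f"POR: {por_ratio:.2f}-to-1")
```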
Real Salt Lake took their chances against Portland’s flank defense rather than try to fight through Will Johnson and Diego Chara. The gambit worked well, as all eight of RSL’s key passes and assists came from wide positions.
Three questions for leg 2 in Portland:
1. Will Saborio be healthy? If so, Sandoval will likely see the bench again as Findley’s speed will serve as an outlet against a high-pressing, possibly desperate Timbers squad, unless…
2. Kreis opts for the 4-2-3-1? Beckerman and Yordany Alvarez were deployed in a double pivot at Los Angeles a few weeks ago, and while the results were not exactly convincing, it perhaps implies (or at least I’m inferring) that Kreis may want to take a more conservative approach on the road in the playoffs.
3. Ryan Johnson or Frederic Piquionne? Ryan Johnson has put in a workmanlike effort thus far in the playoffs, but with his playing time diminishing each game (83 min @ SEA, 69 min v SEA, 59 min @ RSL) and Piquionne finally healthy (and able to leap clear over Nat Borchers), it may be time for Piquionne to crack the starting lineup.
Though our game states data set doesn’t yet include all of 2013, it still includes 137 games. In those 137 games, only five home teams ever went down three goals, and all five teams lost. There were 24 games in which the home team went down two goals, with only one winner (4.2%) and five ties (20.8%). The sample of two-goal games perhaps gives a little hope to the Timbers, but these small sample sizes lend themselves to large margins of error.
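To put a rough number on "large margins of error," here is a minimal sketch using a Wilson score interval for the two-goal sample. This is my choice of interval for illustration, not something the data set itself produces, and it treats the 24 games as independent draws.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Home teams that went down two goals: 1 win and 5 ties in 24 games.
win_low, win_high = wilson_interval(1, 24)
tie_low, tie_high = wilson_interval(5, 24)

print(f"Win rate plausibly between {win_low:.0%} and {win_high:.0%}")
print(f"Tie rate plausibly between {tie_low:.0%} and {tie_high:.0%}")
```

Both ranges come out very wide, which is the whole problem with leaning on 24 games.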
It is also important to note that teams that go down two goals at home tend to be bad teams—like Chivas USA, which litters that particular data set. None of the five teams that ever went down three goals at home made the playoffs this year. Only seven of the 24 teams to go down two goals at home made it to the playoffs. Portland is a good team. Depending on your model of preference, the Timbers are somewhere in the top eight. So even if those probabilities up there hypothetically had small margins of error, they still wouldn’t necessarily apply to the Timbers.
Oh, and while we’re talking about extra variables, in those games the teams had less time to come back. To work around these confounding variables, I consulted a couple models, and I controlled for team ability using our expected goal differential. Here’s what I found.
A logistic model suggests that, for each goal of deficit early in a match, the odds of winning are reduced by a factor of about two or three. A tie, though, would also allow Portland to play on. A home team’s chances of winning or tying fall from about 75 percent in a typical game that begins zero-zero, to about 25 percent being down two goals. Down three goals, and that probability plummets to less than 10 percent. But using this particular logistic regression was dangerous, as I was forced to extrapolate for situations that never happen during the regular season: starting a game from behind.
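Here is a minimal sketch of how a per-goal odds factor turns into those probabilities. The factor of 3 is an assumption on my part, picked because it reproduces the drop from 75 percent to 25 percent; the fitted model's exact value isn't listed, and the function names are mine.

```python
def prob_to_odds(p):
    return p / (1 - p)


def odds_to_prob(odds):
    return odds / (1 + odds)


def prob_after_deficit(base_prob, goals_down, odds_factor):
    """Shrink the odds by odds_factor for each goal of deficit."""
    return odds_to_prob(prob_to_odds(base_prob) / odds_factor**goals_down)


BASE_PROB = 0.75   # home win-or-tie chance starting at zero-zero
ODDS_FACTOR = 3.0  # assumed per-goal reduction in odds

down_two = prob_after_deficit(BASE_PROB, 2, ODDS_FACTOR)
down_three = prob_after_deficit(BASE_PROB, 3, ODDS_FACTOR)

print(f"Down two goals: {down_two:.0%}")
print(f"Down three goals: {down_three:.0%}")
```

A slightly larger factor would push the three-goal figure under 10 percent, so the exact outputs depend on that assumed value.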
So I went to a linear model. The linear model expects Portland to win by about 0.4 goals. 15.5 percent of home teams in our model were able to perform at least 1.6 goals above expectation, what the Timbers would need to at least force a draw in regulation. Only 4.6 percent of teams performed 2.6 goals above expectation. If we just compromise between what the two models are telling us, then the Timbers probably have about a 20-percent chance to pull off a draw in regulation. That probability would have been closer to five percent had Piquionne not finished a beautiful header in stoppage time.
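The arithmetic in that paragraph, as a minimal sketch. The 1.6 and 2.6 figures are the aggregate hole minus the 0.4 goals Portland is already expected to win by, and the "compromise" here is a plain average of the two models' estimates, which is my assumption about how to split the difference. Variable names are mine.

```python
EXPECTED_MARGIN = 0.4  # linear model's expected Portland win margin

# Portland trails 4-2 on aggregate, so it needs to win the second leg by two.
needed_down_two = 2 - EXPECTED_MARGIN
# Without Piquionne's stoppage-time goal the hole would have been three.
needed_down_three = 3 - EXPECTED_MARGIN

logistic_estimate = 0.25   # win-or-tie chance down two, logistic model
linear_estimate = 0.155    # share of home teams at least 1.6 goals above expectation
compromise = (logistic_estimate + linear_estimate) / 2

print(f"Goals above expectation needed: {needed_down_two:.1f} (or {needed_down_three:.1f} down three)")
print(f"Compromise estimate: {compromise:.0%}")
```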
Two articles in particular inspired me this past week—one by Steve Fenn at the Shin Guardian, and the other by Mark Taylor at The Power of Goals. Steve showed us that, during the 2013 season, the expected goal differentials (xGD) derived from the shot locations data were better than any other statistics available at predicting outcomes in the second half of the season. It can be argued that statistics that are predictive are also stable, indicating underlying skill rather than luck or randomness. Mark came along and showed that the individual zones themselves behave differently. For example, Mark’s analysis suggested that conversion rates (goal scoring rates) are more skill-driven in zones one, two, and three, but more luck-driven or random in zones four, five, and six.
Piecing these fine analyses together, there is reason to believe that a partially regressed version of xGD may be the most predictive. The xGD currently presented on the site regresses all teams fully back to league-average finishing rates. However, one might guess that finishing rates in certain zones may be more skill-driven, and thus more predictive. Essentially, we may be losing important information by fully regressing finishing rates to league average within each zone.
I assessed the predictive power of finishing rates within each zone by splitting the season into two halves, and then looking at the correlation between finishing rates in each half for each team. The chart is below:
Wow. This surprised me when I saw it. There are no statistically significant correlations—especially when the issue of multiple testing is considered—and some of the suggested correlations are actually negative. Without more seasons of data (they’re coming, I promise), my best guess is that finishing rates within each zone are pretty much randomly driven in MLS over 17 games. Thus full regression might be the best way to go in the first half of the season. But just in case…
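For anyone who wants to try this split-half check on their own numbers, here is a minimal sketch of the correlation step for a single zone. The finishing rates below are invented purely to show the shape of the calculation; they are not the MLS data, and the function name is mine.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up finishing rates for five hypothetical teams in one zone.
first_half = [0.10, 0.12, 0.08, 0.15, 0.11]
second_half = [0.12, 0.09, 0.11, 0.10, 0.13]

zone_r = pearson_r(first_half, second_half)
print(f"Split-half correlation for this zone: {zone_r:.2f}")
```

Repeating this for each zone gives one correlation per zone, which is where the multiple testing issue comes from.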
I grouped zones one, two, and three into the “close-to-the-goal” group, and zones four, five, and six into the “far-from-the-goal” group. The results:
Okay, well this is interesting. Yes, the multiple testing problem still exists, but let’s assume for a second there actually is a moderate negative correlation for finishing rates in the “far zone.” Maybe the scouting report gets out by mid-season, and defenses close out faster on good shooters from distance? Or something else? Or this is all a type-I error—I’m still skeptical of that negative correlation.
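To show why I keep harping on multiple testing, here is a minimal sketch of how quickly false positives pile up. Counting eight tests (six individual zones plus the two groups) is my own tally, and the calculation assumes the tests are independent, which they certainly are not, so treat it as a rough guide.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that caps the family-wise rate at alpha."""
    return alpha / n_tests


ALPHA = 0.05
N_TESTS = 8  # assumed: six zones plus the "close" and "far" groups

fwer = familywise_error_rate(ALPHA, N_TESTS)
per_test_alpha = bonferroni_alpha(ALPHA, N_TESTS)

print(f"Chance of at least one spurious 'significant' result: {fwer:.0%}")
print(f"Bonferroni-adjusted threshold per test: {per_test_alpha:.4f}")
```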
Without doing that whole song and dance for finishing rates against, I will say that the results were similar. So full regression on finishing rates for now, more research with more data later!
But now, piggybacking onto what Mark found, there do seem to be skill-based differences in how many total goals are scored by zone. In other words, some teams are designed to thrive off of a few chances from higher-scoring zones, while others perhaps are more willing to go for quantity over quality. The last thing I want to check is whether or not the expected goal differentials separated by zone contain more predictive information than when lumped together.
Like some of Mark’s work implied, I found that our expected goal differentials inside the box are very predictive of a team’s actual second-half goal differentials inside the box—the correlation coefficient was 0.672, better than simple goal differential which registered a correlation of 0.546. This means that perhaps the expected goal differentials from zones one, two, and three should get more weight in a prediction formula. Additionally, having a better goal differential outside the box, specifically in zones five and six, is probably not a good thing. That would just mean that a team is taking too many shots from poor scoring zones. In the end, I went with a model that used attempt difference from each zone, and here’s the best model I found.*
|Zones 1, 3, 4||1.66||0.29|
|Zones 5, 6||-1.11||0.41|
*Extremely similar results to using expected goal differential, since xGD within each zone is a linear function of attempts.
The R-squared for this model was 0.708, beating out the model that just used overall expected goal differential (0.650). The zone that stabilized fastest was zone two, which makes sense since about a third of all attempts come from zone two. Bigger sample sizes help with stabilization. For those curious, the inputs here were attempt differences per game over the first seventeen games, and the response output is predicted total goal differential in the second half of the season.
Not that there is a closed-the-door conclusion to this research, but I would suggest that each zone contains unique information, and separating those zones out some could strengthen predictions by a measurable amount. I would also suggest that breaking shots down by angle and distance, and then kicked and headed, would be even better. We all have our fantasies.
Last night we talked about the eight teams still in the playoffs in a round-robin-style discussion, and then followed up the playoff talk with a general discussion about numbers. Specifically we talked about often-quoted and used statistics that don’t really hold any value. I also pretty much alienate all lawyers who listen to the podcast. Enjoy!
There was quite a popular tweet from a canine about New York’s improved play this season when Jamison Olave was playing.
There are obviously confounding factors at play here, not to mention small sample sizes. There were only seven matches this season in which Olave did not start, and eight in which he played 45 minutes or less. Any data obtained from these games is going to be subject to A) small sample sizes, B) lots of variance in the response variable (goals or wins), and C) no control for quality of opponent or location of the match.
To deal with the small sample size/variance problem, I’m going to use our now semi-famous data set on shot location origins. Steven Fenn kindly showed the world their predictive value, and to me that means that expected goals for and against are the most stable stat available for such an analysis. To control for New York’s opponents, both when Olave was in the starting XI and when he was not, I have included each of New York’s opponents’ expected goals data in the linear regression, while also accounting for whether or not the Red Bulls were at home. Blah, blah, blah, to the results!
Looking at the defensive side, New York allowed shots leading to 0.24 fewer expected goals against in games that Olave started. That seems to indicate New York’s need for Olave, but the p-value was a kind-of-high 26 percent. Overall, New York’s expected goal differential climbed 0.19 goals in those games that Olave started, though again, the p-value was quite high at 46 percent.*
Now for your shitty conclusion, courtesy of shitty p-values: Olave’s influence on New York’s level of play this season was questionable. There is some suggestion that he helped reduce goal-scoring against; however, there is a reasonable chance that that difference was due to other, not-measured-here variables. What I am more comfortable claiming is that he does not make a 0.86-goal difference on the defensive side.
The point is this. New York’s shot creation and goal scoring ability, for and against, are more a function of whether or not the Red Bulls are home, and against whom they are playing. Not as much whether Olave starts. Obviously putting an inferior player into the starting XI isn’t going to help New York out. But, as I always question, do we really know how to value soccer players at all? Maybe Olave just doesn’t make that much of a difference. After all, he’s only one of eleven players.
*For those curious, the number of minutes Olave played was a worse predictor variable than the simple binary variable of whether or not he started. Controlling for the strength of opponent was necessary since perhaps Mr. Petke was more likely to sit Olave against a worse opponent at home, or something like that.