Brazil is a better team than you think.

Admittedly, it hasn’t felt like Brazil has played all that well this World Cup. The referee seemingly made its two-goal victory over Croatia a more relaxed finish than it should have been; against Mexico, the fourth-place team from CONCACAF, it only managed a draw; and Cameroon was just low-hanging fruit. The host team then took a lot of flak for its play in the Round of 16 against Chile, especially for its performance after halftime. Indeed, Brazil conceded a silly goal on a defensive giveaway, and Chile had chances to win that game.

But I’m here to tell you that Brazil has played better than it has looked. Too often, it seems, the scorelines heavily influence our praise and criticism of what’s happening on the field.

Brazil dominated Group A in terms of Expected Goal Differential (xGD), and recorded the second-highest tally of any team during the group stage. Brazil’s 1.05 xGD during even (tied) gamestates ranked fifth among the 32 teams. You might have expected better from the hosts, but most teams only played about 130 minutes in such gamestates. That’s a big enough sample size to get a general idea of which are the best teams, but too small a sample to split hairs over the top five.

Croatia – June 12th

Against the Croats, a penalty awarded to Fred on what appeared to be a dive marred what was actually a solid performance by Brazil. Up to that controversial call, Brazil had earned 1.4 Expected Goals (xGoals) to Croatia’s 0.4, dominating in quantity and quality of shots. Even after taking the lead on the penalty, Brazil still edged Croatia in xGoals the rest of the way, 0.30 to 0.24—a differential that matches what we’d expect of teams that were leading in this tournament.

Mexico – June 17th

Mexico is a better team than their last-second World Cup qualification (and that commentator) would suggest. It led the CONCACAF Hexagonal (the Hex!) in shot ratios and is currently ranked 13th in the world in the Soccer Power Index (though some of that improved ranking is because of their tie against Brazil). Despite a disappointing 0 – 0 tie on the scoreboard, Brazil’s 1.4 xGoals again dwarfed that of its opponents. Mexico totaled just 0.5 xGoals.

Cameroon – June 23rd

There’s not much to say about this one. Brazil’s 1.9 xGD against Cameroon was the third highest discrepancy thus far in the tournament, trailing only France’s drubbing of Honduras and Germany’s handling of Portugal. It should be noted that both France and Germany enjoyed a man advantage for the majorities of those games.

Chile – June 28th

For Chile, the scoreboard and their well-developed rapport with the woodwork are clear indications that they could have won this game. However, the opportunity creation department informs us that Brazil probably should have won, as it did. 94 percent of this game’s shots were taken during an even gamestate, either 0 – 0 or 1 – 1, and Brazil outpaced Chile during that time by a full expected goal. Even after halftime, when Brazil looked disorganized and sloppy, it still edged Chile 1.1-to-0.7 in xGoals.


Perhaps Brazil has not “looked” the part of tournament favorites during its first four games, but its shot creation numbers suggest it is definitely playing like one of the best teams. Add that to their pre-tournament resume, throw in the home-field advantage that’s not going away anytime soon, and there is little doubt that Brazil is still the favorite to win this World Cup—maybe not with a majority of the probability, but definitely with a plurality.

World Cup Statistics

We have begun rolling out World Cup statistics in the same format as those we provide for MLS. Scroll over “World Cup 2014” along the top bar to check it out!

In the Team Stats Tables, one may observe that the recently-eliminated Spain outshot its opponents, and a much higher proportion of its possession occurred in the attacking third than that of its opponents.

Our team-by-team Expected Goals data shows that England played better than its results would suggest, earning more dangerous opportunities than its opponents. It was a matter of inches for Wayne Rooney a few times there…


Finishing data suggests that Lionel Messi has made the most of his opportunities—surprise, surprise—but did you know that none of Thomas Muller’s seven shots were assisted?

And despite giving up a tournament-high seven goals in the group stages, our Goalkeeping Data actually suggests that Honduran goalkeeper Noel Valladares performed admirably, especially considering the onslaught of shots he faced that were worth a tournament-most 0.4 goals per shot on target.

USA versus Ghana: Gamestates Analysis

In analyzing MLS shot data, I have learned that, with small sample sizes, how a team plays when the game is tied is a strong indication of how well it will do in future games. The US Men’s National Team spent just four-and-a-half minutes tied Monday evening, the epitome of small sample sizes. In case you were curious, the US generated two shots during that time worth about 0.13 goals. Ghana did not generate a shot over those 4.5 minutes.

The next most-important gamestate for a team is being ahead. With at least 17 games of data in MLS, knowing how well a team did when it was leading becomes an important piece of information for predicting that team’s future success. Almost 95 minutes were spent with the US in the lead, a time in which the USMNT took six shots worth 0.5 goals to Ghana’s 21 shots worth 1.7 goals.

Though MLS is definitely far below the level of even a USA-versus-Ghana match, I think a lot of the statistics from our MLS database still apply. I wrote a few weeks back about how away teams that were satisfied with the current gamestate went overboard with their conservative play. I think that could apply to the World Cup, as well. By most statistical accounts, USA versus Ghana was a fairly even matchup going in, yet the US played an annoyingly conservative style after going up a goal early. It gave up a majority of possession to Ghana in the attacking third, completing just 81 passes to Ghana’s 171 in that zone, not to mention the US being tripled up in Expected Goals when it was ahead.

Granted, Expected Goals likely overestimates the losing team’s chances of scoring. But not by much. In even gamestates in MLS, we see that teams are expected to score 1.29 goals per game, and they actually score 1.30 goals per game. Virtually no difference. However, when teams are behind they are expected to score 1.79 goals per game, yet they only score about 1.60, an 11-percent drop. This discrepancy is likely due in large part to defenses being more packed in and capable of blocking shots. Indeed, teams that are losing have their shots blocked 27 percent of the time, while teams that are winning only have their shots blocked 22 percent of the time.

All that was simply to say that Ghana’s 1.7 Expected Goals are still representative of a team that was in control—too much control for my comfort level. Even if we assume it was really about 1.5 Expected Goals against a defensive-minded American side, that still triples the USA’s shot potential. Either the US strategy was overly conservative, or Ghana is really that much better. I’d like to believe in the former, but it’s picking between the lesser of two evils.
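As a rough check on that discount, here is a small Python sketch. It uses only the MLS per-game figures and the match totals quoted above; carrying the MLS ratio over to a World Cup match is an assumption made for illustration, and the variable names are made up for this example.

```python
# MLS per-game figures quoted in the text (expected vs. actual goals).
expected_per_game = 1.79
actual_per_game = 1.60

# Fraction of expected goals that never materialize.
drop = 1 - actual_per_game / expected_per_game

# Assumption: the same MLS discount carries over to this World Cup match.
ghana_xg = 1.7  # Ghana's Expected Goals while the US led
usa_xg = 0.5    # US Expected Goals over the same stretch
ghana_xg_adjusted = ghana_xg * (1 - drop)

print(f"drop: {drop:.1%}")
print(f"adjusted Ghana xG: {ghana_xg_adjusted:.2f}")
print(f"ratio to US xG: {ghana_xg_adjusted / usa_xg:.1f}")
```

The adjusted figure comes out a little above 1.5, which is still roughly triple the US total.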

It just doesn’t make sense to me to play conservatively to maintain the status quo. It invariably leads to massive discrepancies in Expected Goals, and too often allows the opposition an easier way to come back.

Sporting KC still has edge in the capital

If you come in from a certain angle, you can hype this evening’s DC United-Sporting KC game as the Eastern Conference’s clash of the week. The two teams enter this game tied for the second seed with two of the best goal differentials in the conference. With DCU playing at home, and Sporting missing half its team, the edge would appear to go to United. But not so fast.

Despite being inseparable by points, DCU and Sporting are about as far apart as two teams can be by Expected Goal Differential. Sporting sits atop the league at +0.62 per game,* while DCU is ahead of only San Jose with -0.33. If we look to even gamestates—during only those times when the score was tied and the teams were playing 11-on-11—the chasm between them grows even wider. Sporting’s advantage over DCU in Even xGD is more than 1.5 goals per game.*

To this point, as early as it is in the season, I have found that winners are best predicted by Even xGD, rather than overall goal differential. Though the sample size of shots is smaller for each team in these scenarios, the information is less clouded by the various tactics that are employed when one team goes ahead, or when one team loses a player.

Of course, Sporting will be missing the likes of Graham Zusi, Matt Besler, and Lawrence Olum, as they have for the past three games. The loss of those key players has mostly coincided with their current four-game winless stretch, and it would be tempting to argue that they are not in form. However, over those last three games, Sporting’s overall xGD is +0.27 per game,* and its Even xGD is +0.68.*

Making predictions in sports is generally just setting oneself up for failure—especially in a sport where there are three outcomes—but I will say this. Sporting is likely better than the +180 betting line I’m seeing this morning.

*I use the phrase “per game” for simplicity, but xGD is actually calculated on a per-minute basis in our season charts. Per game implies per 96 minutes, which is the average length of an MLS game.
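For anyone who wants to reproduce the "per game" scaling, here is a minimal Python sketch. The 96-minute game length comes from the footnote; the function name and the sample totals are hypothetical and not taken from our season charts.

```python
MINUTES_PER_GAME = 96  # average MLS game length cited in the footnote


def xgd_per_game(total_xgd, minutes_played):
    """Scale a per-minute xGD rate up to a 96-minute game."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total_xgd / minutes_played * MINUTES_PER_GAME


# Hypothetical totals: +6.2 xGD accumulated over 960 minutes.
print(round(xgd_per_game(6.2, 960), 2))
```

The same function works for even-gamestate minutes only, which is why the per-minute basis is convenient: teams spend very different amounts of time tied.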

Calculating Expected Goals 2.0

I wrote a post similar to this a while back, outlining the process for calculating our first version of Expected Goals. This is going to be harder. Get out your TI-89 calculators, please. (Or you can just use my Expected Goals Cheatsheet.)

Expected Goals is founded on the idea that each shot has a certain probability of going in based on some important details about that shot. If we add up all the probabilities of a team’s shots, that gives us its Expected Goals. Our goal is that this metric conveys the quality of opportunities a team earns for itself. For shooters and goalkeepers, the details about the shot change a little bit, so pay attention.

The formulas are all based on a logistic regression, which allows us to sort out the influence of each shot’s many details all at once. The formula changes slightly each week because we base the regression on all the data we have, including each week’s new data, but it won’t change by much.

Expected Goals for a Team

  • Start with -0.19
  • Subtract 0.95 if the shot was headed (0.0 if it was kicked or othered).
  • Subtract 0.74 if the shot was taken from a corner kick (by Opta definition).
  • Subtract one of the following amounts for the shot’s location:
    Zone 1 – 0.0
    Zone 2 – 0.93
    Zone 3 – 2.37
    Zone 4 – 2.68
    Zone 5 – 3.55
    Zone 6 – 3.06

Now you have what are called log odds of that shot going in. To find the odds of that shot going in, put the log odds in an exponent over the number “e”. 

Finally, to find the estimated probability of that shot going in, take the odds and divide by 1 + odds.

Example: Shot from zone 3, header, taken off a corner kick:

-0.19 – 0.95 – 0.74 – 2.37 = -4.25

e^(-4.25) = .0143

.0143 / (1 + .0143) = 0.014 or a 1.4% chance of going in.

A team that took one of these shots would earn 0.014 expected goals.
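The team recipe above can be sketched in a few lines of Python. The coefficients are the ones listed in this section (and, as noted, they shift slightly week to week); the function and argument names are invented for illustration and are not part of any published tool.

```python
import math

# Coefficients from the team model above, on the log-odds scale.
TEAM_INTERCEPT = -0.19
TEAM_HEADED = -0.95
TEAM_CORNER = -0.74
TEAM_ZONE = {1: 0.0, 2: -0.93, 3: -2.37, 4: -2.68, 5: -3.55, 6: -3.06}


def team_shot_probability(zone, headed=False, from_corner=False):
    """Estimated probability that a single shot goes in."""
    log_odds = TEAM_INTERCEPT + TEAM_ZONE[zone]
    if headed:
        log_odds += TEAM_HEADED
    if from_corner:
        log_odds += TEAM_CORNER
    odds = math.exp(log_odds)   # log odds -> odds
    return odds / (1 + odds)    # odds -> probability


def team_expected_goals(shots):
    """Sum the shot probabilities; each shot is a dict of keyword arguments."""
    return sum(team_shot_probability(**shot) for shot in shots)


# The worked example: zone 3, headed, off a corner kick.
example = {"zone": 3, "headed": True, "from_corner": True}
print(round(team_shot_probability(**example), 3))
```

Passing a list of such shot dictionaries to `team_expected_goals` gives a team total for a game or a season.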

Expected Goals for Shooter

  • Start with -0.28
  • Subtract 0.83 if the shot was headed (0.0 if it was kicked or othered).
  • Subtract 0.65 if the shot was taken from a corner kick (by Opta definition).
  • Add 2.54 if the shot was a penalty kick.
  • Add 0.71 if the shot was taken on a fastbreak (by Opta definition).
  • Add 0.16 if the shot was taken from a set piece (by Opta definition).
  • Subtract one of the following amounts for the shot’s location:
    Zone 1 – 0.0
    Zone 2 – 1.06
    Zone 3 – 2.32
    Zone 4 – 2.61
    Zone 5 – 3.48
    Zone 6 – 2.99

Now you have what are called log odds of that shot going in. To find the odds of that shot going in, put the log odds in an exponent over the number “e”. 

Finally, to find the estimated probability of that shot going in, take the odds and divide by 1 + odds.

Example: A penalty kick (taken from zone 2)

-0.28 + 2.54 – 1.06 = 1.2
e^(1.2) = 3.320
3.320/ (1 + 3.320) = 0.769 or a 76.9% chance of going in.
A player that took a penalty would gain an additional 0.769 Expected Goals. If he missed, then he would be underperforming his Expected Goals by 0.769.
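Here is a hedged Python sketch of the shooter version, including the over- or underperformance bookkeeping described above. The coefficients come from this section; the trait labels, tuple layout, and function names are assumptions made for the example.

```python
import math

# Coefficients from the shooter model above, on the log-odds scale.
SHOOTER_INTERCEPT = -0.28
SHOOTER_ADJUSTMENTS = {
    "headed": -0.83,
    "corner": -0.65,
    "penalty": 2.54,
    "fastbreak": 0.71,
    "set_piece": 0.16,
}
SHOOTER_ZONE = {1: 0.0, 2: -1.06, 3: -2.32, 4: -2.61, 5: -3.48, 6: -2.99}


def shooter_shot_probability(zone, traits=()):
    """Estimated probability that the shot goes in."""
    log_odds = SHOOTER_INTERCEPT + SHOOTER_ZONE[zone]
    log_odds += sum(SHOOTER_ADJUSTMENTS[t] for t in traits)
    # Algebraically the same as odds / (1 + odds) with odds = e^(log odds).
    return 1 / (1 + math.exp(-log_odds))


def goals_above_expected(shots):
    """shots: iterable of (zone, traits, scored) with scored as 0 or 1."""
    return sum(
        scored - shooter_shot_probability(zone, traits)
        for zone, traits, scored in shots
    )


# A missed penalty from zone 2 leaves the shooter below expectation.
print(round(goals_above_expected([(2, ["penalty"], 0)]), 3))
```

A positive total from `goals_above_expected` means the player has scored more than the model expected from the chances he took.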

Expected Goals for Goalkeeper

*These are calculated only from shots on target.

  • Start with 1.61
  • Subtract 0.72 if the shot was headed (0.0 if it was kicked or othered).
  • Add 1.58 if the shot was a penalty kick.
  • Add 0.42 if the shot was taken from a set piece (by Opta definition).
  • Subtract one of the following amounts for the shot’s location:
    Zone 1 – 0.0
    Zone 2 – 1.10
    Zone 3 – 2.57
    Zone 4 – 2.58
    Zone 5 – 3.33
    Zone 6 – 3.21
  • Subtract 1.37 if the shot was taken toward the middle third of the goal (horizontally).
  • Subtract 0.29 if the shot was taken at the lower half of the goal (vertically).
  • Add 0.35 if the shot was taken outside the width of the six-yard box and was directed toward the far post.

Now you have what are called log odds of that shot going in. To find the odds of that shot going in, put the log odds in an exponent over the number “e”. 

Finally, to find the estimated probability of that shot going in, take the odds and divide by 1 + odds.

Example: Shot from zone 2, kicked toward lower corner, from the run of play.

1.61 – 1.10 – 0.29 = 0.22

e^(0.22) = 1.246

1.246/ (1 + 1.246) = 0.555 or a 55.5% chance of going in.

A keeper that took on one of these shots would gain an additional 0.555 Expected Goals against. If he saved it, then he would be outperforming his Expected Goals by 0.555.
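The keeper calculation can be sketched the same way in Python. Coefficients are the ones listed in this section and apply to shots on target only; the trait labels and function names are hypothetical, chosen just for this example.

```python
import math

# Coefficients from the goalkeeper model above (shots on target only).
KEEPER_INTERCEPT = 1.61
KEEPER_ADJUSTMENTS = {
    "headed": -0.72,
    "penalty": 1.58,
    "set_piece": 0.42,
    "middle_third": -1.37,   # aimed at the middle third of the goal
    "lower_half": -0.29,     # aimed at the lower half of the goal
    "wide_far_post": 0.35,   # from outside the six-yard width, toward far post
}
KEEPER_ZONE = {1: 0.0, 2: -1.10, 3: -2.57, 4: -2.58, 5: -3.33, 6: -3.21}


def keeper_shot_probability(zone, traits=()):
    """Estimated probability that a shot on target beats the keeper."""
    log_odds = KEEPER_INTERCEPT + KEEPER_ZONE[zone]
    log_odds += sum(KEEPER_ADJUSTMENTS[t] for t in traits)
    odds = math.exp(log_odds)
    return odds / (1 + odds)


def goals_saved_above_expected(shots_on_target):
    """shots_on_target: iterable of (zone, traits, conceded), conceded 0 or 1."""
    return sum(
        keeper_shot_probability(zone, traits) - conceded
        for zone, traits, conceded in shots_on_target
    )


# The worked example: zone 2, kicked toward a lower corner, and saved.
print(round(goals_saved_above_expected([(2, ["lower_half"], 0)]), 3))
```

Here a positive total means the keeper has conceded fewer goals than the model expected from the shots he faced.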

Frequently Asked Questions

1. Why a regression model? Why not just subset each shot in a pivot table by its type across all variables?
I think a lot of information–degrees of freedom we call it–would be lost if I were to partition each shot into a specific type by location, pattern of play, body part, and for keepers, placement. The regression gets more information about, say, headed shots in general, rather than “headed shots from zone 2 off corner kicks,” of which there are far fewer data points.
2. Why don’t you include info about penalty kicks in the team model?
Penalty kicks are not earned in a stable manner. Teams that get lots of PK’s early in the season are no more likely to get additional PK’s later in the season. Since we want this metric to be predictive at the team level, including penalty kicks would cloud that prediction for teams that have received an extreme number of PK’s thus far.
3. The formula looks quite a bit different for shooters versus for keepers. How is that possible since one is just taking a shot on the other?
There are a few reasons for this. The first is that the regression model for keepers is based only on shots on target. It is meant only to assess their ability to produce quality saves. A different data set leads to different regression results. Also, we are now accounting for the shooter’s placement. It is very possible that corner kicks are finished less often than shots from other patterns of play because they are harder to place. By including shot placement information in the keeper model, the information about whether the shot came off a corner is now no longer needed for assessing the keeper’s ability.
4. Why don’t you include placement for shooters, then?
We wish to assess a shooter’s ability to create goals beyond what’s expected. Part of that skill is placement. When a shooter has recorded more goals than his expected goals, it indicates a player that is outperforming his expectation. It could be because he places well, or that he is deceptive, or he is good at getting opportunities that are better than what the model thinks. In any case, we want the expected goals to reflect the opportunities earned, and thus the actual goals should help us to measure finishing ability to some extent.


Looking for the model-busting formula

Well that title is a little contradictory, no? If there’s a formula to beat the model then it should be part of the model and thus no longer a model buster. But I digress. That article about RSL last week sparked some good conversation about figuring out what makes one team’s shots potentially worth more than those of another team. RSL scored 56 goals (by their own bodies) last season, but were only expected to score 44, a 12-goal discrepancy. Before getting into where that came from, here’s how our Expected Goals data values each shot:

  1. Shot Location: Where the shot was taken
  2. Body part: Headed or kicked
  3. Gamestate: xGD is calculated in total, and also specifically during even gamestates when teams are most likely playing more, shall we say, competitively.
  4. Pattern of Play: What the situation on the field was like. For instance, shots taken off corner kicks have a lower chance of going in, likely due to a packed 18-yard box. These things are considered, based on the Opta definitions for pattern of play.

But these exclude some potentially important information, as Steve Fenn and Jared Young pointed out. I would say, based on their comments, that the two primary hindrances to our model are:

  1. How to differentiate between the “sub-zones” of each zone. As Steve put it, was the shot from the far corner of Zone 2, more than 18 yards from goal? Or was it from right up next to zone 1, about 6.5 yards from goal?
  2. How clean a look the shooter got. A proportion of blocked shots could help to explain some of that, but we’re still missing the time component and the goalkeeper’s positioning. How much time did the shooter have to place his shot and how open was the net?

Unfortunately, I can’t go get a better data set right now so hindrance number 1 will have to wait. But I can use the data set that I already have to explore some other trends that may help to identify potential sources of RSL’s ability to finish. My focus here will be on their offense, using some of the ideas from the second point about getting a clean look at goal.

Since we have information about shot placement, let’s look at that first. I broke down each shot on target by which sixth of the goal it targeted to assess RSL’s accuracy and placement. Since the 2013 season, RSL is second in the league in getting its shots on goal (37.25%), and among those shots, RSL places the ball better than any other team. Below is a graphic of the league’s placement rates versus those of RSL over that same time period. (The corner shots were consolidated for this analysis because it didn’t matter to which corner the shot was placed.)

Placement Distribution - RSL vs. League


RSL obviously placed shots where the keeper was not likely to be: the corners. That’s a good strategy, I hear. If I include shot placement in the model, RSL’s 12-goal difference in 2013 completely evaporates. This new model expected them to score 55.87 goals in 2013, almost exactly the 56 they scored.

Admittedly, it isn’t earth-shattering news that teams score by shooting at the corners, but I still think it’s important. In baseball, we sometimes assess hitters and pitchers by their batting average on balls in play (BABIP), a success rate during specific instances only when the ball is contacted. It’s obvious that batters with higher BABIPs will also have higher overall batting averages, just like teams that shoot toward the corners will score more goals.

But just because it is obvious doesn’t mean that this information is worthless. On the contrary, baseball’s sabermetricians have figured out that BABIP takes a long time to stabilize, and that a player who is outperforming or underperforming his BABIP is likely to regress. Now that we know that RSL is beating the model due to its shot placement, this begs the question, do accuracy and placement stabilize at the team level?

To some degree, yes! First, there is a relationship between a team’s shots on target totals from the first half of the season and the second half of the season. Between 2011 and 2013, the correlation coefficient for 56 team-seasons was 0.29. Not huge, but it does exist. Looking further, I calculated the differences between teams’ expected goals in our current model and teams’ expected goals in this new shot placement model. The correlation from first half to second half on that one was 0.54.
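For readers who want to run this kind of first-half versus second-half check themselves, here is a minimal Python sketch of a Pearson correlation. The helper name and the four-team sample numbers are made up purely to show the call; they are not the 56 team-seasons used above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up shots-on-target totals for four teams, one value per half-season.
first_half = [70, 82, 65, 90]
second_half = [75, 78, 70, 85]
r = pearson(first_half, second_half)
print(round(r, 2))
```

Each position in the two lists has to refer to the same team-season, otherwise the coefficient means nothing.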

To summarize, getting shots on goal can be repeated to a small degree, but where those shots are placed in the goal is more repeatable at the team level. There is some stabilization going on. This gives RSL fans hope that at least some of this model-busting is due to a skill that will stick around.

Of course, that still doesn’t tell us why RSL is placing shots well as a team. Are their players more skilled? Or is it the system that creates a greater proportion of wide-open looks?

Seeking details that may indicate a better shot opportunity, I will start with assisted shots. A large proportion of assisted shots may indicate that a team will find open players in front of net more often, thus creating more time and space for shots. However, an assisted shot is no more likely to go in than an unassisted one, and RSL’s 74.9-percent assist rate is only marginally better than the league’s 73.1 percent, anyway. RSL actually scored about six fewer goals than expected on assisted shots, and six more goals than expected on unassisted shots. It becomes apparent that we’re barking up the wrong tree here.*

Are some teams more capable of not getting their shots blocked? If so, then those teams would likely finish better than the league average. One little problem with this theory is that RSL gets its shots blocked more often than the league average. Plus, in 2013, blocked shot percentages from the first half of the season had a (statistically insignificant) negative correlation to blocked shots in the second half of the season, suggesting strongly that blocked shots are more influenced by randomness and the defense, rather than by the offense which is taking the shots.

Maybe some teams get easier looks by forcing rebounds and following them up efficiently. Indeed, in 2013 RSL led the league in “rebound goals scored” with nine, where a rebounded shot is one that occurs within five seconds of the previous shot. That beat their expected goals on those particular shots by 5.6 goals. However, earning rebounds does not appear to be much of a skill, and neither does finishing them. The correlation between first-half and second-half rebound chances was a meager–and statistically insignificant–0.13, while the added value of a “rebound variable” to the expected goals model was virtually unnoticeable. RSL could be the best team at tucking away rebounds, but that’s not a repeatable league-wide skill. And much of that 5.6-goal advantage is explained by the fact that RSL places the ball well, regardless of whether or not the shot came off a rebound.
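The rebound definition used above (a shot within five seconds of the previous shot) is easy to sketch in Python. This version assumes the timestamps are in seconds and all belong to one team in one match, and it treats exactly five seconds as a rebound; those choices, along with the function name, are assumptions for illustration.

```python
REBOUND_WINDOW_SECONDS = 5  # from the definition in the text


def flag_rebounds(shot_times):
    """Return one flag per shot, in chronological order.

    A shot is flagged True when it comes within the rebound window
    of the shot immediately before it.
    """
    flags = []
    previous = None
    for t in sorted(shot_times):
        is_rebound = previous is not None and t - previous <= REBOUND_WINDOW_SECONDS
        flags.append(is_rebound)
        previous = t
    return flags


# Hypothetical shot times, in seconds from kickoff.
print(flag_rebounds([10, 13, 40, 46]))
```

Counting the True flags by team and half-season gives the "rebound chances" totals that a split-half correlation like the one above would compare.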

Jared did some research for us showing that teams that get an extremely high number of shots within a game are less likely to score on each shot. It probably has something to do with going for quantity rather than quality, and possibly playing from behind and having to fire away against a packed box. While that applies within a game, it does not seem to apply over the course of a season. Between 2011 and 2013, the correlation between a team’s attempts per game and finishing rate per attempt was virtually zero.

If RSL spends a lot of time in the lead and very little time playing from behind–true for many winning teams–then its chances may come more often against stretched defenses. RSL spent the fourth most minutes in 2013 with the lead, and the fifth fewest minutes playing from behind. In 2013, there was a 0.47 correlation between teams’ abilities to outperform Expected Goals and the ratio of time they spent in positive versus negative gamestates.

If RSL’s boost in scoring comes mostly from those times when they are in the lead, that would be bad news since their Expected Goals data in even gamestates was not impressive then, and is not impressive now. But if the difference comes more from shot placement, then the team could retain some of its goal-scoring prowess. 8.3 goals of that 12-goal discrepancy I’m trying to explain in 2013 came during even gamestates, when perhaps their ability to place shots helped them to beat the expectations. But the other 4-ish additional goals likely came from spending increased time in positive gamestates. It is my guess that RSL won’t be able to outperform their even gamestate expectation by nearly as much this season, but at this point, I wouldn’t put it past them either.

We come to the unsatisfying conclusion that we still don’t know exactly why RSL is beating the model. Maybe the players are more skilled, maybe the attack leaves defenses out of position, maybe it spent more time in positive gamestates than it “should have.” And maybe RSL just gets a bunch of shots from the closest edge of each zone. Better data sets will hopefully sort this out someday.

*This doesn’t necessarily suggest that assisted shots have no advantage. It could be that assisted shots are more commonly taken by less-skilled finishers, and that unassisted shots are taken by the most-skilled finishers. However, even if that is true, it wouldn’t explain why RSL is finishing better than expected, which is the point of this article.