The world cup is almost upon us once more, which means interminable weeks of breathless coverage, punditry, and heartfelt professions that each match will be played at 110%. In an effort to inject some more quantitative rigour into a field which, apparently, could do with some, let’s try to predict how the whole thing will play out.

The world cup consists of a group stage, where 8 groups of 4 countries play a mini-league, followed by the knockout rounds where the top 2 countries in each group play up to 4 rounds to win the cup. Each country can enter the knockout stage in one of two places, depending on the group they are in, and the other countries in each group can strongly influence the likelihood of making it out of the group stage – being stuck with Germany and Brazil might mean an early flight home from Russia, for example.

Simulating a football match is no simple thing: there are an incredible number of variables to take into account for any given match – current state of the team, friendliness of the crowd, weather conditions, historical levels of rivalry, injuries… I’m sure some very sophisticated analysis is done by betting firms, but I’m going to skip all of that and look at one number.

There is a very useful website here which has helpfully calculated the up-to-date Elo rating of each national football team. This is a single number which indicates the expected performance of each team. To calculate the probability that team A beats team B, simply calculate

$$P(A \text{ beats } B) = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

where $R_A$ and $R_B$ are the Elo ratings of the two teams. Simple!

A difference in Elo score of a few hundred or more means that one team has a significantly higher probability of winning than the other.
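To put numbers on that, here is a short worked example. It assumes the standard Elo expected-score formula with the usual 400-point scale, and the rating gaps of 200 and 400 points are illustrative rather than taken from any particular pair of teams:

$$
P_{200} = \frac{1}{1 + 10^{-200/400}} = \frac{1}{1 + 10^{-0.5}} \approx 0.76,
\qquad
P_{400} = \frac{1}{1 + 10^{-400/400}} = \frac{1}{1.1} \approx 0.91
$$

So the stronger team wins roughly three times in four with a 200-point lead, and about ten times in eleven with a 400-point lead.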

**Simulations**

With this overly simplified view in hand, I then simulated a million world cups. I *ignored the possibility of draws* for now, and for each game just flipped a coin weighted by the Elo scores of the two teams. It was therefore possible in this setup for Japan to beat Brazil, but not very likely (6%)!
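The coin-flipping procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the results here: it assumes the standard Elo expected-score formula, uses made-up ratings for four placeholder teams, simulates a single group only, and breaks ties on wins at random, which is an assumption of the sketch. The function names (`win_probability`, `play_match`, `play_group`, `simulate_group`) are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def win_probability(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_match(team_a, team_b, ratings, rng):
    """Flip a coin weighted by the Elo ratings; draws are ignored."""
    p_a = win_probability(ratings[team_a], ratings[team_b])
    return team_a if rng.random() < p_a else team_b


def play_group(teams, ratings, rng):
    """Round robin; rank by wins, breaking ties at random (an assumption)."""
    wins = {team: 0 for team in teams}
    for team_a, team_b in combinations(teams, 2):
        wins[play_match(team_a, team_b, ratings, rng)] += 1
    ranked = sorted(teams, key=lambda t: (wins[t], rng.random()), reverse=True)
    return ranked[0], ranked[1]  # the top two advance


def simulate_group(teams, ratings, n_sims=100_000, seed=0):
    """Estimate each team's chance of finishing in the top two."""
    rng = random.Random(seed)
    advanced = Counter()
    for _ in range(n_sims):
        advanced.update(play_group(teams, ratings, rng))
    return {team: advanced[team] / n_sims for team in teams}


# Made-up ratings for illustration only, not real Elo values.
ratings = {"Team A": 2100, "Team B": 1900, "Team C": 1750, "Team D": 1650}
for team, p in simulate_group(list(ratings), ratings, n_sims=20_000).items():
    print(f"{team}: {p:.1%} chance of leaving the group")
```

The knockout rounds would reuse `play_match` on the group qualifiers, and the per-team tallies over many runs give the percentages.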

Without further ado then, let’s look at the odds this system comes up with for overall winner:

- Brazil – 26%
- Germany – 19%
- Spain – 15%
- France – 7%
- Argentina – 6%
- Portugal – 5%
- England – 5%
- Belgium – 3%
- Colombia – 3%
- Peru – 2%

and compare with current odds offered by online betting companies:

- **Brazil – 22-25%**
- **Germany – 20-22%**
- **Spain – 16-20%**
- France – 14-16%
- Argentina – 10-13%
- Belgium – 9-10%
- **England – 5-7%**
- **Portugal – 4-5%**
- Uruguay – 3-4%
- Croatia – 2.5-4%

The bolded countries are those where I was within a percent or so of the official odds – not too shabby given the simplicity of the model! I think it’s safe to say that we expect a Germany-Brazil match at some point…

Of course, I have the entire world cup simulated, so there is lots more detail to extract. In the following large image, I have plotted the probabilities that the given countries will participate in each match:

It is interesting that some matches almost certainly have their entrants pre-determined, e.g. the winner of group E will very likely be Brazil, so the Round of 16 match containing the winner of E will feature Brazil with 65% probability:

On the other hand, some matches are much less well determined, like the first quarter final:

For a given country, we can plot their likely route through the entire proceedings.

Let’s look at how the world cup will play out for England:

It looks like England should get through the groups, with less than a 10% chance of flunking out early, but they’ll probably lose their first or second knockout match (as usual).

Why might this be?

Ah yes, that’s right, Brazil. Brazil and England can both take 2 routes through the cup, but if they meet it will definitely be at the quarter finals. And as you can clearly see above by the mass of orange, Brazil will probably steamroller right through. It is also more likely for Brazil to win the final than to lose at any previous stage.

Plotting Germany as well, though, there is a different option for the world cup:

It will probably be the case that Brazil takes the top route, and Germany the bottom. It is therefore Germany who demolish England at the quarters, and then don’t meet Brazil until a thrilling final.

Whatever happens, I’m now confident in the knowledge that I can play along with any football chat I might be dragged into over the summer, with the requisite stats to back it up.

Intuitively I would expect the average of many simulations to simply order the teams following their Elo scores – is this the case? If not, why – e.g. does the order of teams within the initial groups prevent it?

And you’d be absolutely correct for the top ten here! It’s a good question that I haven’t spent any time thinking about to be honest. One thing I wonder is whether ‘groups of death’ make it less likely than you’d expect that a team leaves the group stages. Over many simulations perhaps that effect averages out.

Hi Jason!

I love your blog – just the sort of content I love to dive into myself.

I was wondering what you used to plot the diagrams – it looks like pyplot but how did you get the lines in multiple different places? Was it a custom plotting function?

Cheers

Dev

Hey, thanks, I’m glad you like my blog!

It was indeed pyplot. To plot the fuzzy lines I generated the coordinates for a given line, then plotted 50 lines at low opacity with the line vertices randomly offset by a small amount.
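A minimal sketch of that idea follows. It is not the original plotting code: the helper names (`jittered_copies`, `draw_fuzzy_line`), the Gaussian jitter, and the default `scale` and `alpha` values are all assumptions chosen for illustration. The drawing function takes an existing matplotlib `Axes` object and only calls its `plot` method.

```python
import random


def jittered_copies(xs, ys, n_copies=50, scale=0.05, seed=None):
    """Return n_copies of the line with every vertex randomly offset."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        jx = [x + rng.gauss(0, scale) for x in xs]
        jy = [y + rng.gauss(0, scale) for y in ys]
        copies.append((jx, jy))
    return copies


def draw_fuzzy_line(ax, xs, ys, color="tab:orange", n_copies=50,
                    scale=0.05, alpha=0.05, seed=None):
    """Overlay many faint, slightly shifted copies of one line on `ax`."""
    for jx, jy in jittered_copies(xs, ys, n_copies, scale, seed):
        ax.plot(jx, jy, color=color, alpha=alpha, linewidth=2)


# Typical use with pyplot (shown as comments so the sketch has no
# hard dependency on matplotlib being installed):
#   import matplotlib.pyplot as plt
#   fig, ax = plt.subplots()
#   draw_fuzzy_line(ax, [0, 1, 2, 3], [0, 1, 1, 2], seed=1)
#   fig.savefig("fuzzy_line.png")
```

Because each copy is drawn at low opacity, the overlapping region looks solid while the edges fade out; a larger `scale` gives a wider, softer line.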

Hi Jason, did you update the Elo scores after each game?

I didn’t, though I do wonder what Mexico’s is right now!
