Improving my Expected Goals model

Previously I built a very simple expected goals model based on four buckets for shots: six-yard box, penalty area, outside the area, and penalties. This is an improvement on pure shot numbers but still fairly crude. Here I describe my attempts to refine the model.

Note: many people have already tried similar concepts, so this is not new. I am just trying to produce my own model to aid analysis of games.

Improving the model

One of the constraints on any model is the data. I have obtained more granular data and used it to build two more sophisticated models.

I now have more detailed position data, whether an attempt is a header or not, whether it is a direct free kick, and whether it came from a cross or not.

The two families of model are broadly:

1. Probability based

This is just an extension of my original model but with more buckets.
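A minimal sketch of how a bucket-based model of this kind could be computed, assuming each historical shot has already been assigned to a bucket. The bucket labels and the tiny data set below are made up for illustration and are not the buckets or data used here.

```python
from collections import defaultdict


def bucket_probabilities(shots):
    """Estimate P(goal) for each bucket as goals / shots in that bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [goals, shots]
    for bucket, is_goal in shots:
        totals[bucket][0] += int(is_goal)
        totals[bucket][1] += 1
    return {bucket: goals / n for bucket, (goals, n) in totals.items()}


def expected_goals(shot_buckets, probs):
    """Expected goals for a set of shots: the sum of their bucket probabilities."""
    return sum(probs[bucket] for bucket in shot_buckets)


# Hypothetical history: (bucket, scored?) pairs. A bucket here is a
# (zone, body part) tuple, which is an assumption for this example.
history = [
    (("six_yard", "header"), True),
    (("six_yard", "header"), False),
    (("penalty_area", "foot"), False),
    (("penalty_area", "foot"), False),
    (("penalty_area", "foot"), True),
    (("penalty_area", "foot"), False),
]

probs = bucket_probabilities(history)
match_xg = expected_goals(
    [("six_yard", "header"), ("penalty_area", "foot")], probs
)
print(probs, match_xg)
```

Adding more buckets makes each probability more specific but leaves fewer shots to estimate it from, which is the usual trade-off with this approach.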

2. Logistic Regression

Here I have encoded a shot's location through the distance from goal and the angle. I have then used logistic regression to build a parameterised probability model. I separated out headers from other shots and regressed these separately.
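The sketch below illustrates the general idea with a small logistic regression fitted by plain gradient descent. Several things in it are assumptions rather than details from this post: the coordinate system (goal centre at the origin, y pointing out into the pitch), the definition of the angle (angle off the centre line), the learning rate, and the made-up shot data. A real model would normally be fitted with a statistics library on far more shots.

```python
import math


def shot_features(x, y):
    """Distance to the goal centre and angle off the centre line (radians).

    Assumes the goal centre is at (0, 0), x runs along the goal line and
    y runs out into the pitch. This convention is illustrative only.
    """
    return math.hypot(x, y), math.atan2(abs(x), y)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, feats):
    """P(goal) for one shot; weights[0] is the intercept."""
    xs = (1.0,) + tuple(feats)
    return sigmoid(sum(w * x for w, x in zip(weights, xs)))


def fit_logistic(rows, goals, lr=0.01, epochs=5000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for feats, goal in zip(rows, goals):
            err = predict(weights, feats) - goal
            for j, xj in enumerate((1.0,) + tuple(feats)):
                grad[j] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical shots: (x, y, is_header, scored).
shots = [
    (0, 5, False, 1), (2, 6, False, 1), (-3, 8, False, 0), (0, 11, False, 1),
    (5, 12, False, 0), (-8, 16, False, 0), (0, 20, False, 0),
    (10, 22, False, 0), (3, 7, False, 1), (-6, 25, False, 0),
    (0, 4, True, 1), (1, 6, True, 1), (-2, 8, True, 0), (3, 10, True, 0),
    (0, 9, True, 1),
]

# One regression for headers and one for all other shots.
models = {}
for is_header in (False, True):
    subset = [s for s in shots if s[2] == is_header]
    rows = [shot_features(x, y) for x, y, _, _ in subset]
    goals = [scored for _, _, _, scored in subset]
    models[is_header] = fit_logistic(rows, goals)

p_close = predict(models[False], shot_features(0, 6))
p_far = predict(models[False], shot_features(0, 25))
print(p_close, p_far)
```

Fitting the two groups separately lets the effect of distance and angle differ between headers and other shots, rather than forcing them to share one set of parameters.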

I used shot data from the last six years to find the probabilities (for the first model) and the parameters (for the second).

Testing the model

To find which of the two models best describes a game, I tested how well each correlated with actual goals over a set of games.

I split each team’s games for each season into chunks of six. This gave 720 groups of six games. I then tested the linear correlation between the expected goals numbers for those games and the actual goal numbers.
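A rough sketch of the chunking and the correlation test, using only the standard library. The per-game and per-chunk numbers are invented for illustration. The helper drops any leftover games that do not fill a chunk; that is an assumption, though it would give six chunks per team for a 38-game season.

```python
def chunks_of(games, size=6):
    """Split one team-season's games into consecutive full chunks of `size`."""
    return [games[i:i + size] for i in range(0, len(games) - size + 1, size)]


def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical per-game xG for one team over a 38-game season.
season_xg = [1.0 + 0.1 * (i % 5) for i in range(38)]
xg_per_chunk = [sum(chunk) for chunk in chunks_of(season_xg)]

# Hypothetical chunk totals across several teams: xG and actual goals.
xg_totals = [7.2, 9.1, 5.4, 8.0, 6.3, 10.5]
goal_totals = [6, 10, 5, 9, 5, 12]
r2 = r_squared(xg_totals, goal_totals)
print(len(xg_per_chunk), r2)
```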

I also compared actual points against both a ratio of expected goals (xG for / (xG for + xG against)) and an Expected Points measure based on Monte Carlo simulations of the shot probabilities.
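Both measures can be sketched in a few lines. The simulation below treats every shot as an independent event with its own scoring probability and awards 3 points for a win and 1 for a draw. The independence assumption, the number of simulations, the fixed seed and the shot probabilities are all choices made for this example.

```python
import random


def xg_ratio(xg_for, xg_against):
    """Share of the total expected goals in a match that belongs to one team."""
    return xg_for / (xg_for + xg_against)


def simulate_goals(shot_probs, rng):
    """Simulate one match's goals: each shot scores with its own probability."""
    return sum(rng.random() < p for p in shot_probs)


def expected_points(shots_for, shots_against, n_sims=10000, seed=0):
    """Average points per simulated match (3 for a win, 1 for a draw)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    points = 0
    for _ in range(n_sims):
        goals_for = simulate_goals(shots_for, rng)
        goals_against = simulate_goals(shots_against, rng)
        if goals_for > goals_against:
            points += 3
        elif goals_for == goals_against:
            points += 1
    return points / n_sims


# Hypothetical shot probabilities for the two teams in one match.
home_shots = [0.4, 0.3, 0.1, 0.1]
away_shots = [0.05, 0.05]

ratio = xg_ratio(sum(home_shots), sum(away_shots))
home_xp = expected_points(home_shots, away_shots)
away_xp = expected_points(away_shots, home_shots)
print(ratio, home_xp, away_xp)
```

Unlike the ratio, the simulated measure depends on how the expected goals are spread across shots, so two matches with the same xG totals can give different Expected Points.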

Here are the R Squared values – I have included pure shot numbers for comparison:

[Image: xG_rsquared]

The regression based Expected Goals model outperforms the probability based measure in all categories, although not by much.

Both xG measures perform significantly better than shots.

Conclusions

Both versions of Expected Goals give a better description of a match than simple shot numbers. The logistic regression version is marginally better than the pure probability version.

Further work

Next, I would like to test out the predictive capabilities of these measures.

There is plenty of scope for improving the model – taking into account score effects, the coordinates of the pass that set up the chance etc. I would also like to try replacing logistic regression with a Neural Network based model.

Follow me on twitter for statistical tweets – ABPNumbers
