I previously built a very simple expected goals model based on four buckets for shots – six yard box, penalty area, outside the area, penalties. This is an improvement on pure shot numbers but still fairly crude. Here I describe my attempts to refine the model.
Note: many people have already tried similar approaches, so this is not new. I am just trying to produce my own model to aid analysis of games.
Improving the model
One of the constraints on any model is the data. I have obtained more granular data and used it to build two more sophisticated models.
I now have more detailed position data, whether an attempt is a header or not, whether it is a direct free kick, and whether it came from a cross or not.
The two families of model are broadly:
1. Probability based
This is just an extension of my original model but with more buckets.
2. Logistic Regression
Here I have encoded a shot's location through the distance from goal and the angle. I have then used logistic regression to build a parameterised probability model. I separated out headers from other shots and regressed these separately.
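A minimal sketch of how a shot's location could be turned into distance and angle and fed through a logistic function. The coordinate system, the use of the angle subtended by the goal mouth, and the names `shot_features` and `xg_logistic` are all assumptions for illustration; the coefficients are placeholders that would come from the fitted regression, with one set for headers and one for other shots.

```python
import math

# Assumption: coordinates in metres, goal centre at (0, 0),
# x = distance from the goal line, y = sideways offset from the goal centre.
GOAL_WIDTH = 7.32  # standard goal width in metres


def shot_features(x, y):
    """Return (distance, angle) for a shot taken from (x, y)."""
    distance = math.hypot(x, y)
    # Angle subtended by the two posts as seen from the shot location.
    to_near_post = math.atan2(y + GOAL_WIDTH / 2, x)
    to_far_post = math.atan2(y - GOAL_WIDTH / 2, x)
    angle = to_near_post - to_far_post
    return distance, angle


def xg_logistic(distance, angle, b0, b_distance, b_angle):
    """Logistic model: probability that a shot with these features is scored.

    b0, b_distance and b_angle are hypothetical fitted parameters.
    """
    z = b0 + b_distance * distance + b_angle * angle
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative `b_distance` and a positive `b_angle`, shots from closer in and from more central positions get a higher probability, which is the behaviour you would expect the fit to produce.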
I used data from shots from the last six years to find the probabilities / parameters.
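For the probability based family, the "fit" is just counting. The sketch below assumes each historical shot has been reduced to a bucket label and a scored flag; the bucket names in the usage note and the function names are made up for illustration.

```python
from collections import defaultdict


def fit_bucket_probabilities(shots):
    """Estimate P(goal) per bucket as goals / shots.

    shots: iterable of (bucket, scored) pairs, where bucket is any hashable
    label (for example a zone combined with header / free kick / cross flags).
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [goals, shots]
    for bucket, scored in shots:
        counts[bucket][0] += int(scored)
        counts[bucket][1] += 1
    return {bucket: goals / n for bucket, (goals, n) in counts.items()}


def match_xg(shot_buckets, probabilities):
    """Expected goals for one team in one game: sum of its shots' bucket probabilities."""
    return sum(probabilities[bucket] for bucket in shot_buckets)
```

For example, with toy data where one of two `"six_yard"` shots and one of four `"outside"` shots were scored, a game with one six yard shot and two shots from outside the area would be worth 0.5 + 0.25 + 0.25 = 1.0 expected goals.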
Testing the model
To find which of these two models best describes a game, I tested how well each correlated with actual goals over a set of games.
I split each team’s games for each season into chunks of six. This gave 720 groups of six games. I then tested the linear correlation between the expected goals numbers for those games and the actual goal numbers.
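The chunking and correlation step could look something like the sketch below. It assumes a team's per-game values for a season are held in a list in match order, and that any games left over after the last full group of six are dropped; both are assumptions about the set-up rather than details stated above. (Under those assumptions a 38-game season would give six full groups per team.)

```python
def chunks_of(values, size=6):
    """Split per-game values into consecutive full groups, dropping any remainder."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def chunk_totals(values, size=6):
    """Total of each group, e.g. xG or goals summed over six games."""
    return [sum(chunk) for chunk in chunks_of(values, size)]


def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

The group totals for expected goals and for actual goals, pooled across all teams and seasons, would then be passed to `r_squared` as `xs` and `ys`.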
I also compared points, both with a ratio of expected goals (xG for / (xG for + xG against)) and with an Expected Points measure based on Monte Carlo simulations of the shot probabilities.
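A rough sketch of both measures. The simulation details are assumptions: each shot is treated as an independent coin flip with its xG as the probability of scoring, and points are awarded 3 / 1 / 0 for a win / draw / loss. The function names, the number of simulations and the seed are illustrative only.

```python
import random


def xg_ratio(xg_for, xg_against):
    """Share of the total expected goals in a set of games that went to this team."""
    return xg_for / (xg_for + xg_against)


def expected_points(probs_for, probs_against, n_sims=10000, seed=0):
    """Monte Carlo estimate of expected points for one game.

    probs_for / probs_against: per-shot scoring probabilities for each side.
    Assumes shots are independent of each other.
    """
    rng = random.Random(seed)
    total_points = 0
    for _ in range(n_sims):
        goals_for = sum(rng.random() < p for p in probs_for)
        goals_against = sum(rng.random() < p for p in probs_against)
        if goals_for > goals_against:
            total_points += 3
        elif goals_for == goals_against:
            total_points += 1
    return total_points / n_sims
```

Simulating shot by shot, rather than just comparing xG totals, means a team with one big chance and a team with many small chances of the same total xG can come out with different expected points.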
Here are the R Squared values – I have included pure shot numbers for comparison:
The regression based Expected Goals model outperforms the probability based measure in all categories, although not by much.
Both xG measures perform significantly better than shots.
Both versions of Expected Goals give a better description of a match than simple shot numbers. The logistic regression version is marginally better than the pure probability version.
Next, I would like to test out the predictive capabilities of these measures.
There is plenty of scope for improving the model – taking into account score effects, the coordinates of the pass that set up the chance, and so on. I would also like to try replacing logistic regression with a neural network based model.
Follow me on twitter for statistical tweets – ABPNumbers