Previously I created an Expected Goals model based on logistic regression.
I wanted to improve this model. Rather than add new features and work out how to include them in the regression equations, I decided a simpler way would be to use a Machine Learning algorithm to do it for me. So I decided to convert my model to use a Neural Network.
The advantage to using a Neural Network is that I can add more features and retrain the model and it should automatically learn how to incorporate these into the calculation. The disadvantage is that the process becomes a bit more opaque.
Building the Neural Network
There are various libraries available that provide Machine Learning tools but I decided to implement my own, partly as a learning exercise and partly because I wanted to have more understanding of the processes involved.
I built a deep Neural Network in Java, along with an implementation of mini-batch back-propagation with gradient descent to train it. For the activation function I am using the Logistic function.
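To give a flavour of the mechanics, here is a minimal sketch of a single logistic unit trained with mini-batch gradient descent. It is an illustration of the same ingredients – the Logistic activation and the averaged gradient step – rather than the actual network code, and the class and method names are my own.

```java
// Minimal sketch: one logistic unit trained with mini-batch gradient descent.
// Illustrative only -- the real network has multiple layers of such units.
public class LogisticUnit {
    double[] w;   // weights, one per input feature
    double b;     // bias

    LogisticUnit(int nFeatures) {
        w = new double[nFeatures];
    }

    // Logistic (sigmoid) activation: squashes any input into (0, 1),
    // so the output can be read as a goal probability.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double predict(double[] x) {
        double z = b;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One mini-batch update: average the gradient of the log-loss
    // over the batch, then step downhill by the learning rate.
    void trainBatch(double[][] xs, double[] ys, double learningRate) {
        double[] gw = new double[w.length];
        double gb = 0;
        for (int n = 0; n < xs.length; n++) {
            double err = predict(xs[n]) - ys[n];  // dLoss/dz for log-loss
            for (int i = 0; i < w.length; i++) gw[i] += err * xs[n][i];
            gb += err;
        }
        for (int i = 0; i < w.length; i++) w[i] -= learningRate * gw[i] / xs.length;
        b -= learningRate * gb / xs.length;
    }

    public static void main(String[] args) {
        // Toy data: label is 1 when the single feature is positive.
        double[][] xs = {{-2}, {-1}, {1}, {2}};
        double[] ys = {0, 0, 1, 1};
        LogisticUnit unit = new LogisticUnit(1);
        for (int epoch = 0; epoch < 2000; epoch++) unit.trainBatch(xs, ys, 0.5);
        System.out.printf("p(x=-2)=%.2f p(x=2)=%.2f%n",
                unit.predict(new double[]{-2}), unit.predict(new double[]{2}));
    }
}
```

Back-propagation extends this same error-times-input gradient through each layer in turn, which is what makes it easy to bolt on extra input features later.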
I split the data – shots from the last six seasons of the Premier League – randomly into two sets, 70% going into the training set and 30% into the validation set. I then trained the network over several iterations on the former and tested its performance on the latter. This is a common technique to guard against over-fitting to the training set.
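The split itself can be sketched as follows. The helper below shuffles the shots with a fixed seed and cuts the list at 70% – the names and the generic element type are my own illustration, not the model's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of a random 70/30 split for holding out a validation set.
public class TrainTestSplit {
    // Shuffles the items with a fixed seed (so the split is reproducible)
    // and returns {trainingSet, validationSet}.
    static <T> List<List<T>> split(List<T> shots, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(shots);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> result = new ArrayList<>();
        result.add(new ArrayList<>(shuffled.subList(0, cut)));
        result.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> shots = new ArrayList<>();
        for (int i = 0; i < 100; i++) shots.add(i);
        List<List<Integer>> parts = split(shots, 0.7, 42L);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size());  // 70 / 30
    }
}
```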
Slightly disappointingly, the results were very similar to those from the Logistic Regression-based model – though perhaps not surprising given that it used the same training data and inputs.
Extensions To The Model
Once I had a working Neural Network it was easy to add more features and retrain. Apart from some normalisation, I just added the data as is, without the need to hand craft formulas.
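The normalisation mentioned is the only pre-processing the new features need. A typical approach – and the one sketched here, as my own illustration rather than the model's actual code – is to rescale each feature column to zero mean and unit variance:

```java
// Sketch of feature normalisation: rescale each column of the input
// data to zero mean and unit variance before training on it.
public class Normaliser {
    static void standardise(double[][] data) {
        int nFeatures = data[0].length;
        for (int j = 0; j < nFeatures; j++) {
            double mean = 0, var = 0;
            for (double[] row : data) mean += row[j];
            mean /= data.length;
            for (double[] row : data) var += (row[j] - mean) * (row[j] - mean);
            double sd = Math.sqrt(var / data.length);
            // Guard against constant columns, which have zero spread.
            for (double[] row : data) row[j] = sd == 0 ? 0 : (row[j] - mean) / sd;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{10, 1}, {20, 2}, {30, 3}};
        standardise(data);
        System.out.printf("%.3f %.3f %.3f%n", data[0][0], data[1][0], data[2][0]);
    }
}
```

Keeping all the inputs on a similar scale stops any one raw feature from dominating the early gradient steps.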
The additional features mainly described the assist, if there was one.
The Big Chance dilemma
Some data companies mark particularly good opportunities with a Big Chance tag. The problem is that, unlike say shot position, this is highly subjective. It is also prone to outcome bias – if a goal is scored, that must surely influence the judgement of the person coding the data.
When looking for more information on this I found a very interesting article comparing Expected Goals models from Nils Mackay – see here. (This also gave me a new way of evaluating my model – see below).
This article, and the others linked to it, are well worth a read and will point you to some of the originators of the Expected Goals concept, along with some of the most successful implementations. It has convinced me that using this tag is a bad idea, even if it does offer a short cut to seemingly improved performance.
For interest, I trained one Neural Network that uses the Big Chance tag and one that doesn't.
Evaluating the models
I evaluated performance on unseen data – the 90 games so far in this season's Premier League. This sample is a little small but should still give a good idea of the relative performance.
Nils Mackay's article mentioned above makes a good case for using RMSEP (Root Mean Square Error Percentage) to test performance, so I decided to try this alongside the old staple, R-Squared.
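For reference, both metrics are straightforward to compute. R-Squared is standard; my reading of RMSEP below – the root mean square error expressed as a percentage of the mean observed value – is an assumption on my part, so check Mackay's article for his exact definition.

```java
// Sketch of the two evaluation metrics. R-squared is the standard
// definition; the RMSEP form (RMSE as a percentage of the mean actual
// value) is my own reading of the metric, not a quoted definition.
public class Metrics {
    static double rSquared(double[] actual, double[] predicted) {
        double mean = 0;
        for (double a : actual) mean += a;
        mean /= actual.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < actual.length; i++) {
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            ssTot += (actual[i] - mean) * (actual[i] - mean);
        }
        return 1 - ssRes / ssTot;
    }

    static double rmsep(double[] actual, double[] predicted) {
        double mse = 0, mean = 0;
        for (int i = 0; i < actual.length; i++) {
            mse += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            mean += actual[i];
        }
        mse /= actual.length;
        mean /= actual.length;
        return 100.0 * Math.sqrt(mse) / mean;
    }

    public static void main(String[] args) {
        double[] goals = {1, 2, 0, 3};               // actual goals per game
        double[] xg = {1.2, 1.8, 0.4, 2.6};          // model's expected goals
        System.out.printf("R2=%.3f RMSEP=%.1f%%%n",
                rSquared(goals, xg), rmsep(goals, xg));  // R2=0.920 RMSEP=21.1%
    }
}
```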
Note: This is just looking at individual games, so I don’t expect a huge correlation. Expected Goals is measured against real goals, and Expected Points (from simulations of the shots) against real points. A pure shots model is included for comparison.
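The shot-simulation step can be sketched like this: play each game many times, letting each shot score with probability equal to its Expected Goals value, and average the points awarded. This is my own illustration of the idea, not the model's actual simulation code.

```java
import java.util.Random;

// Sketch of deriving Expected Points from shot xG values: simulate a
// game many times, scoring each shot with probability equal to its xG,
// and average the league points the home side earns.
public class ExpectedPoints {
    static double homeExpectedPoints(double[] homeXg, double[] awayXg,
                                     int simulations, long seed) {
        Random rng = new Random(seed);
        double points = 0;
        for (int s = 0; s < simulations; s++) {
            int home = 0, away = 0;
            for (double p : homeXg) if (rng.nextDouble() < p) home++;
            for (double p : awayXg) if (rng.nextDouble() < p) away++;
            if (home > away) points += 3;        // win
            else if (home == away) points += 1;  // draw
        }
        return points / simulations;
    }

    public static void main(String[] args) {
        double[] home = {0.5, 0.3, 0.1};  // xG of each home shot
        double[] away = {0.1, 0.1};       // xG of each away shot
        System.out.printf("xP(home)=%.2f%n",
                homeExpectedPoints(home, away, 100000, 7L));
    }
}
```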
As expected, all models significantly outperform the pure shots model.
The Logistic Regression model and the initial Neural Network model have very similar performance.
The improved Neural Network is a small but significant improvement on the Logistic Regression model. This is as expected – it has more input data to describe each shot but the most important information was already part of the model.
Adding the Big Chance tag gives a big boost but it is questionable whether this is a good thing.
A Neural Network is a convenient way to build a reasonable Expected Goals model. It makes it easy to add new features, although it arguably makes the calculation less transparent.
I intend to move to the improved Neural Network model for analysing games but not using the Big Chance tag. I will publish some examples here next.