Introduction
The Brownlow Medal, one of the most prestigious AFL awards, is notoriously difficult to predict. You can read this article from Fat Stats for a great look at why. Here’s a quick summary:
- Umpires are human and have their own biases
- Highly predictive stats from last year may not be predictive this year
- Media influence may impact umpires when they vote
Another point that I’d like to throw into the mix:
Only 3 out of 44 players can actually receive a vote!
That’s only about 7% of players per game receiving a vote! Furthermore, only 1 out of 44 receives the maximum 3 votes. That’s a ridiculously imbalanced target, and predicting rare events is one of the trickiest things to do in predictive modelling. I’m gonna try anyway.
The plan
In this post I’ll be exploring the underlying structure of the data that players accumulate per game. To do this I’ll be employing two different dimensionality reduction techniques. Dimensionality reduction is a way of representing a high-dimensional dataset in 2 or 3 dimensions while retaining as much information about the original dataset as possible.
Once a low-dimensional representation has been calculated and visualised, we expect to see different structures between 0-vote games and vote-worthy games. Here are the two methods I’ll be using.
- Principal Component Analysis (PCA, linear, fairly common, robust)
- Uniform Manifold Approximation and Projection (UMAP, non-linear, relatively new, tricky)
PCA reduces the dimensions of the dataset while preserving as much of the total variability as possible. Each principal component is a linear combination of the original variables, and each component explains a proportion of the variability in the dataset. 1
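Concretely, each principal component is a weighted sum of the game statistics. With our stats as inputs, the first component looks something like this (the weights, or loadings, are what the algorithm estimates):

$$PC_1 = w_1\,\text{kicks} + w_2\,\text{marks} + w_3\,\text{handballs} + \dots + w_p\,\text{time\_on\_ground}$$

The loadings are chosen so that $PC_1$ captures the largest possible share of the total variance, $PC_2$ the next largest while staying uncorrelated with $PC_1$, and so on.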
UMAP is a non-linear dimension reduction technique and uses a fuzzy topological structure to model the manifold. The end result is a low dimensional projection of the data that is closest to its equivalent fuzzy topological structure. 2
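As a preview of how UMAP gets called in practice, here’s a minimal sketch assuming the uwot R package; the `player_stats` data frame and the `n_neighbors`/`min_dist` values here are illustrative, not tuned choices:

```r
library(uwot)

# player_stats: one row per player per game, numeric stat columns only
# (kicks, marks, handballs, ...). Scale first so no single stat dominates.
set.seed(42)  # UMAP is stochastic; fix a seed for reproducible layouts
embedding <- umap(
  scale(player_stats),
  n_neighbors  = 15,   # size of the local neighbourhood UMAP tries to preserve
  min_dist     = 0.1,  # how tightly points are allowed to pack together
  n_components = 2     # project down to 2 dimensions for plotting
)
```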
Using each method, we can compare the visualisations and make judgments about which method better represents the structure of the data and better separates vote-worthy games from 0-vote games.
What do I mean by dimensionality reduction?
Each statistic that you can record in the AFL is a dimension (e.g. kicks, marks and handballs). Let’s plot these on a 3D chart and colour each point by the Brownlow votes they received.
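A chart like the one below can be built with plotly; here’s a minimal sketch, assuming a `round_1` data frame with one row per player (the `player` and `brownlow_votes` column names are illustrative):

```r
library(plotly)

# round_1: one row per player for round 1, 2019, with their game stats
# and the Brownlow votes they received (0, 1, 2 or 3).
plot_ly(
  round_1,
  x = ~kicks, y = ~marks, z = ~handballs,
  color = ~factor(brownlow_votes),  # colour points by votes received
  text  = ~player,                  # player name shown on hover
  type  = "scatter3d",
  mode  = "markers"
)
```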
This is a 3-dimensional graph where each point is a player from round 1, 2019, positioned by their stats for that game. Humans can only realistically understand three spatial dimensions, but computers and algorithms have the luxury of analysing hundreds of dimensions, a fact that we can exploit. You can hover over points to see who the player is and their corresponding stats.
In reality, the AFL records hundreds of statistics including free kicks, frees against, bounces, defensive one-on-ones and contested possessions, just to name a few. We can’t visualise all of these at once, so we need a representation of this data in a form we can understand. Dimensionality reduction is the process of taking these hundreds of dimensions and reducing them down to the 2 (or 3) that most accurately reflect the data’s higher-dimensional shape.
The first attempts
Principal component analysis (no scaling)
For the next sections I’ll be using the 2018 Home and Away season, sourcing the data from `fitzRoy`. I’ll be using `prcomp` from `stats` to generate the principal components; a sketch of the setup follows the variable list below.
These are the statistics that are going to be used in the following analyses: kicks, marks, handballs, goals, behinds, hit_outs, tackles, rebounds, inside_50_s, clearances, clangers, frees_for, frees_against, contested_possessions, uncontested_possessions, contested_marks, marks_inside_50, one_percenters, bounces, goal_assists, time_on_ground.
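Here’s roughly what that setup looks like; a minimal sketch assuming a recent fitzRoy API (`fetch_player_stats`; older releases used `get_afltables_stats`) and janitor-cleaned column names matching the list above:

```r
library(fitzRoy)
library(dplyr)

# Pull 2018 Home and Away player stats (the fetching function depends on
# your fitzRoy version; check the package docs for your release).
stats_2018 <- fetch_player_stats(season = 2018, source = "afltables")

# The per-game statistics used in the analysis, as listed above.
stat_cols <- c(
  "kicks", "marks", "handballs", "goals", "behinds", "hit_outs",
  "tackles", "rebounds", "inside_50_s", "clearances", "clangers",
  "frees_for", "frees_against", "contested_possessions",
  "uncontested_possessions", "contested_marks", "marks_inside_50",
  "one_percenters", "bounces", "goal_assists", "time_on_ground"
)

x <- stats_2018 %>%
  janitor::clean_names() %>%   # snake_case names to match stat_cols
  select(all_of(stat_cols))

# No scaling (yet): prcomp centres each column but leaves variances as-is.
pca_fit <- prcomp(x, center = TRUE, scale. = FALSE)
summary(pca_fit)  # proportion of variance explained by each component
```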