Introduction to AFL Analytics with Useless AFL Stats

2022-08-10 · 3324 words · 16 minute read

AFL Data Tutorial

R · Useless AFL Stats · Introduction to R

I’m co-admin of Useless AFL Stats, a little page on Facebook that caters to a niche audience of AFL statistics nerds. There, founder and co-admin Aaron and I discover stats that have no relevance to anything at all, and will never be useful to anyone, ever. And that got me thinking: why should I have all the fun finding these nuggets of gold?

I’m a firm believer in open source programming and open data: the philosophy that as much data as possible should be made publicly available to the largest possible audience. A thousand or so brains are going to be more innovative, better at analysing trends, and faster at fact checking (and creating useless AFL stats) than just two.

That’s why I’ll be taking you on a journey from beginner to expert, to give you the tools to be the master of your own AFL data analysis. The key points covered in this first post are:

Setup of your coding workspace
- R installation
- Rstudio installation and setup
Basics of R
Downloading AFL data
Creating your first AFL stat

If you have any questions during this tutorial, you can tweet at me @crow_data_sci. If you’ve got a useless stat, feel free to tweet at us @UselessStatsAFL or message our Facebook page.

Ok, let’s get into the setup!

R and RStudio Installation

The program that we’ll be using to generate stats is R, a powerful tool commonly used by professional statisticians and data scientists in academia and industry, but we’ll be using it to shitpost about AFL stats.

Download R from https://cloud.r-project.org/ and choose the version for your OS. I’d recommend all the default file locations and setup. If you are using macOS, click R-4.2.x.pkg, which is the installer.

We’ll also be using RStudio, a Graphical User Interface (GUI) that allows for easier use of the R language. Download the free version here at https://rstudio.com/products/rstudio/download/.

Once they’ve been downloaded, open up your newly installed RStudio program. It should automatically find R and open up an interface. Think of the R language as the frame, steering wheel and engine of a car. You can get pretty far with just that, but RStudio completes the car, adding all the bells and whistles for a more comfortable journey. God that’s a terrible analogy.

Basics of R

Anyway, you should see a screen with a tab named ‘Console’. This is where all the commands get executed. Let’s try a couple out.

1+1

## [1] 2

2*4

## [1] 8

3^3

## [1] 27

What you’ll see is that the answer has been calculated and the result printed on the next line.

Let’s save these results, as we may want to use them later. In R we use the assignment operator <-, which looks like a backwards arrow. You can assign a result to a variable with almost any name, as long as it starts with a letter and uses only letters, numbers, dots and underscores. You can also assign characters (like names, teams, locations, etc.) to variables.

a <- 1+1
b <- 2*4
c <- 3^3
first_name <- 'John'

If you look at the ‘Environment’ tab in the top right section, we can see our newly created variables. Variables can be used in commands with each other too. You can’t add strings to numbers though.

a + b * c

## [1] 218
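To see what happens when you try to add a string to a number, here is a small sketch. The variable names are just for illustration. tryCatch captures the error so that nothing crashes, and paste is one common way to join text and numbers together instead.

```r
a <- 1 + 1
first_name <- "John"

# Adding a number to a string raises an error; tryCatch lets us capture the message
result <- tryCatch(
  a + first_name,
  error = function(e) conditionMessage(e)
)
result  # holds the text of R's error message

# To combine text and numbers, build a string with paste() instead
label <- paste(first_name, "has", a, "goals")
label
```

paste puts a space between each piece by default, which is handy when building captions for a stat.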

Another important concept is the vector, a data structure that can hold multiple values. We use c() to create a vector and can insert multiple values.

d <- c(3,7,8,2)
teams <- c('WCE','Freo','Geel')
d

## [1] 3 7 8 2

d*2

## [1]  6 14 16  4

d[3]

## [1] 8

teams[2]

## [1] "Freo"

We can multiply and add to vectors and use square brackets to pull out certain indexes (positions) of the vector. Press the up arrow to cycle backwards through your previous commands.
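A few more things you can do with the same vectors, as a rough sketch rather than a complete list. Note that R starts counting positions at 1, not 0, which trips up people coming from other languages.

```r
d <- c(3, 7, 8, 2)
teams <- c("WCE", "Freo", "Geel")

length(d)    # how many values the vector holds
d[c(1, 3)]   # pull out the first and third values
d[d > 5]     # keep only the values greater than 5
sum(d)       # add up every value
teams[-1]    # a negative index drops that position instead of selecting it
```

The d[d > 5] pattern is worth remembering: the same idea of keeping only the values that pass a test comes back when we filter AFL data.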

Projects

Projects help contain all of your data and files in an easy to maintain structure. We’ll create one for all of our AFL data analysis.

Click the dropdown menu in the top right corner
New project (and save)
New Directory
New Project
Name it (AFL_Scripts or something similar)

This initialises your new project, and we’ll do all our analysis in this project. Use the command getwd() in the console to find out the file path of this directory. You should see something similar to C:/Users/your_name/Documents/AFL_Scripts.

Scripts

Scripts are an easy way to store commands that you want to come back to later, and all of our data analysis will be written in scripts. To create a new script:

File (top left)
New File
R Script

Writing a command in the script and pressing Enter will not execute it, but will take you to a new line. To execute a line, use Ctrl + Enter. Press Ctrl + S to save your script, and you should see it appear in the ‘Files’ tab on the right of RStudio.

Let’s get into some AFL Analytics!

Before writing anything in our new script, we first need to install some packages and load them into our workspace by executing the following commands in the console (not the script).

Installing packages

install.packages("devtools") #allows us to download from github
install.packages("dplyr")    #data manipulation tools
install.packages("tidyr")    #more data manipulation tools
install.packages("snakecase")#data cleaning tool
install.packages("hms")      #time formatting
install.packages("fitzRoy")  #get AFL data - mind the capital R
# devtools::install_github("jimmyday12/fitzRoy") #get the dev version - advanced

Loading packages

library(dplyr)
library(tidyr)
library(snakecase)
library(fitzRoy)

Here’s another analogy: think of install.packages as screwing in a light bulb and library as flicking the switch. You only need to install a package once (unless you update R), and can use the library function to turn it on whenever you need it.
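If you are not sure which light bulbs are already screwed in, the sketch below checks before installing. The helper name missing_packages is made up for this example (it is not part of any package), and the install line is left commented out so nothing gets downloaded until you choose to run it.

```r
# Packages this tutorial relies on
needed <- c("dplyr", "tidyr", "snakecase", "fitzRoy")

# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
to_install  # empty once everything is installed

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```

This way you can rerun the check as often as you like without reinstalling packages you already have.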

Side note: using a hash (#) is a programming technique called commenting. Anything after a # will not be run, which lets the programmer add notes, like what a certain line or function does.

Loading in AFL data

Now that we have everything set up, we can dive right in to the stats. We are going to load in data from afltables.com using fitzRoy, an R package put together by James Day that contains most of the match data in a consistent structure.

Let’s load in all the data from the 2021 season and assign it to a variable.

afltables <- fetch_player_stats_afltables(season = 2021) #loads in 2021 data

## i Looking for data from 2021-01-01 to 2021-12-31
## i fetching cached data from <github.com>
## v fetching cached data from <github.com> ... done
## i No new data found - returning cached data
## Finished getting afltables data

# afltables <- fetch_player_stats_afltables(season = 2000:2010) #loads in data from 2000 to 2010

Now that the data is loaded into our workspace, you can see in the ‘Environment’ tab that we have 9,527 rows (observations) and 59 columns (variables). We can confirm this with the dim (short for dimensions) function. Loading in all the data from 1897 onwards will give over 600k rows.

dim(afltables) #rows by columns

## [1] 9527   59

Let’s use the head command, which shows the first six rows of our dataset.

head(afltables)

## # A tibble: 6 x 59
##   Season Round Date       Local.start.time Venue  Attendance Home.team  HQ1G
##    <dbl> <chr> <date>                <int> <chr>       <dbl> <chr>     <int>
## 1   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## 2   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## 3   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## 4   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## 5   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## 6   2021 1     2021-03-18             1925 M.C.G.      49218 Richmond      3
## # ... with 51 more variables: HQ1B <int>, HQ2G <int>, HQ2B <int>, HQ3G <int>,
## #   HQ3B <int>, HQ4G <int>, HQ4B <int>, Home.score <int>, Away.team <chr>,
## #   AQ1G <int>, AQ1B <int>, AQ2G <int>, AQ2B <int>, AQ3G <int>, AQ3B <int>,
## #   AQ4G <int>, AQ4B <int>, Away.score <int>, First.name <chr>, Surname <chr>,
## #   ID <dbl>, Jumper.No. <chr>, Playing.for <chr>, Kicks <dbl>, Marks <dbl>,
## #   Handballs <dbl>, Goals <dbl>, Behinds <dbl>, Hit.Outs <dbl>, Tackles <dbl>,
## #   Rebounds <dbl>, Inside.50s <dbl>, Clearances <dbl>, Clangers <dbl>, ...
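head shows six rows by default, but you can ask for a different number, and tail does the same from the bottom. Here is a small sketch on a made-up data frame called toy, which only stands in for afltables so the example runs without downloading anything.

```r
# A tiny made-up data frame standing in for afltables
toy <- data.frame(
  surname = c("Aarts", "Astbury", "Baker", "Balta", "Bolton"),
  kicks   = c(7, 4, 4, 10, 13)
)

head(toy, n = 2)  # first two rows
tail(toy, n = 2)  # last two rows
nrow(toy)         # total number of rows
```

On the real data, head(afltables, n = 20) would work the same way.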

And use the command names to check the column names, to help get a sense of what this data holds.

names(afltables)

##  [1] "Season"                  "Round"                  
##  [3] "Date"                    "Local.start.time"       
##  [5] "Venue"                   "Attendance"             
##  [7] "Home.team"               "HQ1G"                   
##  [9] "HQ1B"                    "HQ2G"                   
## [11] "HQ2B"                    "HQ3G"                   
## [13] "HQ3B"                    "HQ4G"                   
## [15] "HQ4B"                    "Home.score"             
## [17] "Away.team"               "AQ1G"                   
## [19] "AQ1B"                    "AQ2G"                   
## [21] "AQ2B"                    "AQ3G"                   
## [23] "AQ3B"                    "AQ4G"                   
## [25] "AQ4B"                    "Away.score"             
## [27] "First.name"              "Surname"                
## [29] "ID"                      "Jumper.No."             
## [31] "Playing.for"             "Kicks"                  
## [33] "Marks"                   "Handballs"              
## [35] "Goals"                   "Behinds"                
## [37] "Hit.Outs"                "Tackles"                
## [39] "Rebounds"                "Inside.50s"             
## [41] "Clearances"              "Clangers"               
## [43] "Frees.For"               "Frees.Against"          
## [45] "Brownlow.Votes"          "Contested.Possessions"  
## [47] "Uncontested.Possessions" "Contested.Marks"        
## [49] "Marks.Inside.50"         "One.Percenters"         
## [51] "Bounces"                 "Goal.Assists"           
## [53] "Time.on.Ground.."        "Substitute"             
## [55] "Umpire.1"                "Umpire.2"               
## [57] "Umpire.3"                "Umpire.4"               
## [59] "group_id"

This next step, which I like to include, cleans up some of the naming and puts it in a more consistent format that is less likely to break a function later down the line.

#rename all the columns to a snakecase format
names(afltables) <- to_snake_case(names(afltables))
names(afltables) # now the column headers are in lowercase and have dots replaced with underscores

##  [1] "season"                  "round"                  
##  [3] "date"                    "local_start_time"       
##  [5] "venue"                   "attendance"             
##  [7] "home_team"               "hq_1_g"                 
##  [9] "hq_1_b"                  "hq_2_g"                 
## [11] "hq_2_b"                  "hq_3_g"                 
## [13] "hq_3_b"                  "hq_4_g"                 
## [15] "hq_4_b"                  "home_score"             
## [17] "away_team"               "aq_1_g"                 
## [19] "aq_1_b"                  "aq_2_g"                 
## [21] "aq_2_b"                  "aq_3_g"                 
## [23] "aq_3_b"                  "aq_4_g"                 
## [25] "aq_4_b"                  "away_score"             
## [27] "first_name"              "surname"                
## [29] "id"                      "jumper_no"              
## [31] "playing_for"             "kicks"                  
## [33] "marks"                   "handballs"              
## [35] "goals"                   "behinds"                
## [37] "hit_outs"                "tackles"                
## [39] "rebounds"                "inside_50_s"            
## [41] "clearances"              "clangers"               
## [43] "frees_for"               "frees_against"          
## [45] "brownlow_votes"          "contested_possessions"  
## [47] "uncontested_possessions" "contested_marks"        
## [49] "marks_inside_50"         "one_percenters"         
## [51] "bounces"                 "goal_assists"           
## [53] "time_on_ground"          "substitute"             
## [55] "umpire_1"                "umpire_2"               
## [57] "umpire_3"                "umpire_4"               
## [59] "group_id"

AFL Stats

Let’s work towards two stats:

Who has the highest number of disposals equal to their tackle count?

and:

Which team has the highest accuracy?

Selecting columns

Now that the data is loaded and in a format we can easily manipulate, let’s take a look at some basic functions. dplyr has built-in functions to make this process as painless as possible. Firstly, let’s look at select, a function that keeps the columns we want to investigate. We are also going to make use of %>%, known as a pipe, to channel our data through various functions. The RStudio shortcut for the pipe is Ctrl + Shift + M.

afltables %>% 
  select(season, round, id, first_name, surname, kicks, handballs, tackles)

## # A tibble: 9,527 x 8
##    season round    id first_name surname  kicks handballs tackles
##     <dbl> <chr> <dbl> <chr>      <chr>    <dbl>     <dbl>   <dbl>
##  1   2021 1     12790 Jake       Aarts        7         5       2
##  2   2021 1     11828 David      Astbury      4         5       1
##  3   2021 1     12661 Liam       Baker        4        11       0
##  4   2021 1     12686 Noah       Balta       10         1       1
##  5   2021 1     12535 Shai       Bolton      13        12       1
##  6   2021 1     12456 Nathan     Broad        4         5       0
##  7   2021 1     12010 Josh       Caddy       12         5       2
##  8   2021 1     12431 Jason      Castagna     8         5       0
##  9   2021 1     11557 Shane      Edwards     11        16       3
## 10   2021 1     12576 Jack       Graham      22        11       3
## # ... with 9,517 more rows
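If the pipe still feels a bit magic, the sketch below shows that it simply hands the data on its left to the function on its right as the first argument. It uses a tiny made-up table called toy (borrowing a couple of rows from the output above) so it runs without downloading anything, and also shows two other ways of picking columns.

```r
library(dplyr)

# A tiny made-up table with the same column names as the cleaned afltables data
toy <- tibble(
  surname   = c("Aarts", "Astbury"),
  kicks     = c(7, 4),
  handballs = c(5, 5),
  tackles   = c(2, 1)
)

# These two lines do the same thing; the pipe passes toy in as the first argument
piped  <- toy %>% select(surname, kicks)
direct <- select(toy, surname, kicks)
identical(piped, direct)

# A minus sign drops a column instead of keeping it
toy %>% select(-tackles)

# A colon keeps a run of neighbouring columns
toy %>% select(surname, kicks:tackles)
```

The piped version pays off once several steps are chained together, because it reads top to bottom in the order things happen.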

Combining (mutating) columns

Nice, now that we’ve got the data we want to investigate, we can create new columns using a function called mutate. What’s interesting is that this data source doesn’t have a disposals count column, but we can easily recreate it by adding handballs to kicks with one line of code.

afltables %>% 
  select(season, round, id, first_name, surname, kicks, handballs, tackles) %>% 
  mutate(disposals = kicks + handballs) #name of our new column goes on the left hand side

## # A tibble: 9,527 x 9
##    season round    id first_name surname  kicks handballs tackles disposals
##     <dbl> <chr> <dbl> <chr>      <chr>    <dbl>     <dbl>   <dbl>     <dbl>
##  1   2021 1     12790 Jake       Aarts        7         5       2        12
##  2   2021 1     11828 David      Astbury      4         5       1         9
##  3   2021 1     12661 Liam       Baker        4        11       0        15
##  4   2021 1     12686 Noah       Balta       10         1       1        11
##  5   2021 1     12535 Shai       Bolton      13        12       1        25
##  6   2021 1     12456 Nathan     Broad        4         5       0         9
##  7   2021 1     12010 Josh       Caddy       12         5       2        17
##  8   2021 1     12431 Jason      Castagna     8         5       0        13
##  9   2021 1     11557 Shane      Edwards     11        16       3        27
## 10   2021 1     12576 Jack       Graham      22        11       3        33
## # ... with 9,517 more rows
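mutate can also create several columns in one go, and a later column can use one created just before it. A small sketch on a made-up table (toy and kick_share are names invented for this example):

```r
library(dplyr)

# A tiny made-up table standing in for afltables
toy <- tibble(
  surname   = c("Aarts", "Astbury"),
  kicks     = c(7, 4),
  handballs = c(5, 5)
)

with_share <- toy %>%
  mutate(
    disposals  = kicks + handballs,
    kick_share = kicks / disposals * 100  # uses the disposals column created one line earlier
  )
with_share
```

Separating the new columns with commas keeps everything inside a single mutate call rather than chaining several.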

Filtering our data

Let’s find all the times the disposal count was equal to the tackles. We can achieve this by using the filter function.

afltables %>% 
  select(season, round, id, first_name, surname, kicks, handballs, tackles) %>% 
  mutate(disposals = kicks + handballs) %>% 
  filter(disposals == tackles)

## # A tibble: 263 x 9
##    season round    id first_name surname   kicks handballs tackles disposals
##     <dbl> <chr> <dbl> <chr>      <chr>     <dbl>     <dbl>   <dbl>     <dbl>
##  1   2021 1     12545 Callum     Brown         0         0       0         0
##  2   2021 1     12748 Rhylee     West          0         0       0         0
##  3   2021 1     12756 Kade       Chandler      0         0       0         0
##  4   2021 1     12265 Tom        Cutler        0         0       0         0
##  5   2021 1     12857 Connor     Downie        0         0       0         0
##  6   2021 1     12443 Rhys       Mathieson     0         0       0         0
##  7   2021 1     12509 Will       Hayward       0         0       0         0
##  8   2021 1     12865 Charlie    Lazzaro       1         0       1         1
##  9   2021 1     12821 Xavier     OHalloran     0         0       0         0
## 10   2021 1     12312 Mason      Wood          0         0       0         0
## # ... with 253 more rows
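You’ll notice most of those rows are players with zero disposals and zero tackles, which technically match but aren’t very interesting. filter accepts more than one condition separated by commas, and keeps only the rows that pass all of them. A sketch on a made-up table using a few rows from the outputs above:

```r
library(dplyr)

# A tiny made-up table standing in for afltables
toy <- tibble(
  surname   = c("Brown", "Lazzaro", "Zorko", "Aarts"),
  kicks     = c(0, 1, 9, 7),
  handballs = c(0, 0, 3, 5),
  tackles   = c(0, 1, 12, 2)
)

matched <- toy %>%
  mutate(disposals = kicks + handballs) %>%
  filter(
    disposals == tackles,  # disposals equal to tackles
    disposals > 0          # and at least one disposal
  )
matched
```

Only Lazzaro and Zorko survive here: Brown is removed by the second condition and Aarts by the first.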

Arranging by a column

Ok, so we have all the occurrences when tackles were equal to disposals, but what was the largest? We can use the arrange function on a column to sort in ascending or descending order. The default arrangement for a column is ascending (smallest at the top to biggest at the bottom), so we’ll wrap the column name in desc() to get descending order.

afltables %>% 
  select(season, round, id, first_name, surname, kicks, handballs, tackles) %>% 
  mutate(disposals = kicks + handballs) %>% 
  filter(disposals == tackles) %>% 
  arrange(desc(disposals))

## # A tibble: 263 x 9
##    season round    id first_name surname kicks handballs tackles disposals
##     <dbl> <chr> <dbl> <chr>      <chr>   <dbl>     <dbl>   <dbl>     <dbl>
##  1   2021 16    12076 Dayne      Zorko       9         3      12        12
##  2   2021 17    12771 Kysaiah    Pickett     6         3       9         9
##  3   2021 21    12904 Kieren     Briggs      6         3       9         9
##  4   2021 9     12905 Ronin      OConnor     1         7       8         8
##  5   2021 16    11994 Scott      Lycett      5         3       8         8
##  6   2021 PF    12695 Willem     Drew        5         3       8         8
##  7   2021 8     12849 Sam        Berry       3         4       7         7
##  8   2021 9     12637 Jamaine    Jones       3         4       7         7
##  9   2021 13    12485 Mabior     Chol        4         3       7         7
## 10   2021 17    12596 Lachie     Fogarty     3         4       7         7
## # ... with 253 more rows
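There are a few ties in that list. arrange takes more than one column, using the later ones as tie-breakers, and head can then trim the result to a top few. A sketch on a made-up table holding four of the rows above:

```r
library(dplyr)

# A tiny made-up table with four of the players from the output above
toy <- tibble(
  surname   = c("Pickett", "Zorko", "Lycett", "Briggs"),
  disposals = c(9, 12, 8, 9)
)

# Sort by disposals (largest first), then alphabetically by surname to break ties
ranked <- toy %>%
  arrange(desc(disposals), surname)
ranked

# Keep just the top two rows
top_two <- ranked %>% head(2)
top_two
```

Briggs and Pickett are tied on 9, so the surname column decides their order.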

And there we have it, your first AFL stat. You should see Dayne Zorko up the top with 12 disposals and tackles in round 16.

Group by and Summarise

Grouping is a powerful tool we use to split our data by the values in a column. An example of this would be season, where each year is essentially its own category, and we can run commands that (for example) take the average number of goals per team. Let’s put this into practice with a simple example based on data we already have, answering the following question:

Which team has the highest accuracy?

Let’s pull in the data we need to create this stat. We need to sum the total goals and behinds per team.

afltables %>% 
  select(playing_for, goals, behinds) %>% 
  group_by(playing_for) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'
  )

## # A tibble: 18 x 3
##    playing_for            sum_g sum_b
##    <chr>                  <dbl> <dbl>
##  1 Adelaide                 230   197
##  2 Brisbane Lions           333   222
##  3 Carlton                  250   201
##  4 Collingwood              225   166
##  5 Essendon                 291   200
##  6 Fremantle                219   220
##  7 Geelong                  295   213
##  8 Gold Coast               201   180
##  9 Greater Western Sydney   279   190
## 10 Hawthorn                 239   145
## 11 Melbourne                323   242
## 12 North Melbourne          213   157
## 13 Port Adelaide            294   213
## 14 Richmond                 253   183
## 15 St Kilda                 237   184
## 16 Sydney                   303   195
## 17 West Coast               257   168
## 18 Western Bulldogs         339   238

Now that we have the total counts per team, we can use mutate to calculate accuracy, which is \(\dfrac{Goals}{Shots}\), where shots are goals plus behinds. We are also going to arrange the result to see which team has the highest accuracy in 2021.

afltables %>% 
  select(playing_for, goals, behinds) %>% 
  group_by(playing_for) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'  #we also need to drop the grouping after running the command
  ) %>% 
  mutate(
    accuracy = sum_g/(sum_g+sum_b)*100 #multiply by 100 to get a %
  ) %>% 
  arrange(desc(accuracy))

## # A tibble: 18 x 4
##    playing_for            sum_g sum_b accuracy
##    <chr>                  <dbl> <dbl>    <dbl>
##  1 Hawthorn                 239   145     62.2
##  2 Sydney                   303   195     60.8
##  3 West Coast               257   168     60.5
##  4 Brisbane Lions           333   222     60  
##  5 Greater Western Sydney   279   190     59.5
##  6 Essendon                 291   200     59.3
##  7 Western Bulldogs         339   238     58.8
##  8 Geelong                  295   213     58.1
##  9 Richmond                 253   183     58.0
## 10 Port Adelaide            294   213     58.0
## 11 North Melbourne          213   157     57.6
## 12 Collingwood              225   166     57.5
## 13 Melbourne                323   242     57.2
## 14 St Kilda                 237   184     56.3
## 15 Carlton                  250   201     55.4
## 16 Adelaide                 230   197     53.9
## 17 Gold Coast               201   180     52.8
## 18 Fremantle                219   220     49.9
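As a sanity check, the top row can be recalculated by hand in the console using Hawthorn’s totals from the table above. The variable names here are just for illustration.

```r
goals   <- 239
behinds <- 145

shots    <- goals + behinds       # total scoring shots
accuracy <- goals / shots * 100   # same formula used inside mutate

round(accuracy, 1)                # matches the 62.2 shown in the table
```

Checking one row by hand like this is a cheap way to catch a formula typo before posting a stat.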

Nice, we can see that Hawthorn has the highest accuracy in 2021. Another question we might ask is which team in each round had the highest accuracy? We can group by a second variable, round.

afltables %>% 
  select(playing_for, round, goals, behinds) %>% 
  group_by(playing_for, round) %>% 
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = 'drop'
  ) %>% 
  mutate(
    accuracy = sum_g/(sum_g+sum_b)*100
  ) %>% 
  arrange(desc(accuracy))

## # A tibble: 414 x 5
##    playing_for            round sum_g sum_b accuracy
##    <chr>                  <chr> <dbl> <dbl>    <dbl>
##  1 Hawthorn               21       15     2     88.2
##  2 West Coast             4        13     2     86.7
##  3 Greater Western Sydney 3        11     2     84.6
##  4 Richmond               17       11     2     84.6
##  5 Adelaide               6        16     3     84.2
##  6 Hawthorn               13       14     3     82.4
##  7 Western Bulldogs       18       14     3     82.4
##  8 Carlton                20       18     4     81.8
##  9 Adelaide               19       16     4     80  
## 10 Hawthorn               5         8     2     80  
## # ... with 404 more rows

Hawthorn in Round 21 topping the charts with a whopping 88% accuracy. group_by and summarise work really well together, and you can switch out sum with mean for the average, or max and min for the maximum and minimum in each group.
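Here is a sketch of what swapping sum for the other summary functions looks like. The toy table and its goal counts are made up purely so the example runs on its own, and n() is a dplyr function that counts the rows in each group.

```r
library(dplyr)

# Made-up goals for a handful of players, purely for illustration
toy <- tibble(
  playing_for = c("Hawthorn", "Hawthorn", "Hawthorn", "Sydney", "Sydney"),
  goals       = c(3, 0, 1, 2, 2)
)

per_team <- toy %>%
  group_by(playing_for) %>%
  summarise(
    total   = sum(goals),   # total goals per team
    average = mean(goals),  # average goals per player row
    best    = max(goals),   # biggest individual haul
    players = n(),          # number of rows in each group
    .groups = "drop"
  )
per_team
```

Each line inside summarise becomes one column in the result, with one row per group.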

You could also switch out playing_for with id, first_name and surname to get each individual player’s accuracy. Another variation is to import the data from 2000 to 2022. The possibilities are endless.
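A rough sketch of the per-player version is below. The players and numbers are invented so the example runs without downloading anything. One thing to watch for: a player with no goals and no behinds would give 0 divided by 0, so this sketch filters those players out before calculating accuracy (that filter is my own addition, not something the team-level code needed).

```r
library(dplyr)

# Made-up players and games, purely for illustration
toy <- tibble(
  id         = c(1, 1, 2, 2, 3),
  first_name = c("Alex", "Alex", "Sam", "Sam", "Jo"),
  surname    = c("Example", "Example", "Sample", "Sample", "Placeholder"),
  goals      = c(2, 1, 0, 1, 0),
  behinds    = c(1, 0, 2, 1, 0)
)

player_accuracy <- toy %>%
  group_by(id, first_name, surname) %>%
  summarise(
    sum_g = sum(goals),
    sum_b = sum(behinds),
    .groups = "drop"
  ) %>%
  filter(sum_g + sum_b > 0) %>%  # drop players with no shots to avoid 0/0
  mutate(accuracy = sum_g / (sum_g + sum_b) * 100) %>%
  arrange(desc(accuracy))
player_accuracy
```

On the real data you would probably also want a minimum number of shots, otherwise someone who kicked one goal from one shot tops the list at 100%.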

Conclusion

Thanks for making it this far. Hopefully it’s given you a taste of the potential insights (useful or useless) that R and AFL have to offer. This is hopefully the first in a series of tutorials about AFL analytics in R. You can contact us on Twitter or Facebook to let us know what you found interesting, insightful or difficult, or any other types of stats you’d like to see recreated in upcoming posts. @crow_data_sci