Traveling Salesman Problem in R with Location Data

Being locked down gave me an idea:

If I could travel, whats the shortest distance to visit all Australian football grounds?

To answer this question, I’m going to be exploring one of the most famous optimisation problems in mathematics: the Traveling Salesman Problem, or a Travelling Footballer Problem.

The data

Here’s some data I prepared earlier, 17 venues with corresponding latitude and longitude. These are Australian based venues that have been used in the last 3 years of the AFL regular season.

library(dplyr)
library(tidyr)
afl_venue_locations <- read.csv("post_data/afl_venue_locations.csv")
library(leaflet)

leaflet(data = afl_venue_locations) %>% addTiles() %>%
  addMarkers(~longitude, ~latitude, popup = ~venue, label = ~venue)

Given these 17 locations, and we want to find the shortest path that allows us to visit each of the grounds and finish where we started. To start off we need a distance matrix, a calculation of the distance between each venue. We are going to use the geosphere package with built in functions to calculate distances from the venues’ latitude and longitude coordinates.

The data prep

library(geosphere)

afl_venue_coords <- afl_venue_locations %>% select(longitude, latitude)

distance_matrix <- as.matrix(
  distm(afl_venue_coords, fun = distHaversine)
)/1000 #convert metres to kilometres

rownames(distance_matrix) <- afl_venue_locations$venue
colnames(distance_matrix) <- afl_venue_locations$venue

Our distance matrix has been computed. You’ll notice that the diagonals are all 0, which makes sense given its the same location. Also the matrix is symmetrical along the diagonal: Walking from the MCG to Docklands is the same as the return journey. Please note that this is a “As the crow flies” implementation, a straight line from venue to venue.

Model Building

I chose to use the ompr implementation of the TSP, which I found to be pretty stable and quick*. We will also need the development version of ompr to get access to the vectorised version of the algorithm.

*I haven’t tried any others.

install.packages("ompr")
devtools::install_github("dirkschumacher/ompr.roi") #or cran version higher than 0.8.0.9
install.packages("ROI.plugin.glpk")
library(ompr)
#specify the dimensions of the distance matrix
n <- length(afl_venue_locations$id)

#create a distance extraction function
dist_fun <- function(i, j) {
    vapply(seq_along(i), function(k) distance_matrix[i[k], j[k]], numeric(1L))
}

model <- MILPModel() %>%
    # we create a variable that is 1 iff we travel from city i to j
    add_variable(x[i, j], i = 1:n, j = 1:n, 
                 type = "integer", lb = 0, ub = 1) %>%
    
    # a helper variable for the MTZ formulation of the tsp
    add_variable(u[i], i = 1:n, lb = 1, ub = n) %>% 
    
    # minimize travel distance
    set_objective(sum_expr(colwise(dist_fun(i, j)) * x[i, j], i = 1:n, j = 1:n), "min") %>%
    
    # you cannot go to the same city
    set_bounds(x[i, i], ub = 0, i = 1:n) %>%
    
    # leave each city
    add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
    
    # visit each city
    add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
    
    # ensure no subtours (arc constraints)
    add_constraint(u[i] >= 2, i = 2:n) %>% 
    add_constraint(u[i] - u[j] + 1 <= (n - 1) * (1 - x[i, j]), i = 2:n, j = 2:n)
model
## Mixed integer linear optimization problem
## Variables:
##   Continuous: 17 
##   Integer: 289 
##   Binary: 0 
## Model sense: minimize 
## Constraints: 306

The model has been built, now lets solve it.

#Solving the model

library(ompr.roi)
library(ROI.plugin.glpk)

result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
## <SOLVER MSG>  ----
## GLPK Simplex Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
##       0: obj =  0.000000000e+000  infeas = 5.000e+001 (34)
## *    54: obj =  2.453647438e+004  infeas = 0.000e+000 (1)
## *   128: obj =  8.936395032e+003  infeas = 0.000e+000 (1)
## OPTIMAL SOLUTION FOUND
## GLPK Integer Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 289 integer variables, 272 of which are binary
## Integer optimization begins...
## +   128: mip =     not found yet >=              -inf        (1; 0)
## +   877: >>>>>  1.603882389e+004 >=  9.723428934e+003  39.4% (82; 1)
## +  1144: >>>>>  1.507632658e+004 >=  9.723917868e+003  35.5% (109; 5)
## +  2068: >>>>>  1.507157320e+004 >=  1.051733544e+004  30.2% (158; 31)
## +  2215: >>>>>  1.167150030e+004 >=  1.069026063e+004   8.4% (163; 32)
## +  3581: >>>>>  1.166830513e+004 >=  1.160505429e+004   0.5% (11; 353)
## +  3671: mip =  1.166830513e+004 >=     tree is empty   0.0% (0; 399)
## INTEGER OPTIMAL SOLUTION FOUND
## <!SOLVER MSG> ----
result_val <- round(objective_value(result), 2)
result_val
## [1] 11668.31

Nice! The shortest distance travelled to visit all venues is 11668km.

Results

How does this route look on a map?

solution <- get_solution(result, x[i, j]) %>% 
    filter(value > 0)

paths <- select(solution, i, j) %>% 
    rename(from = i, to = j) %>% 
    mutate(trip_id = row_number()) %>% 
    inner_join(afl_venue_locations, by = c("from" = "id"))

paths_leaflet <- paths[1,]
paths_row <- paths[1,]

for (i in 1:n) {
    paths_row <- paths %>% filter(from == paths_row$to[1])
    
    paths_leaflet <- rbind(paths_leaflet, paths_row)
}

leaflet() %>% 
    addTiles() %>%
    addMarkers(data = afl_venue_locations, ~longitude, ~latitude, popup = ~venue, label = ~venue) %>% 
    addPolylines(data = paths_leaflet, ~longitude, ~latitude, weight = 2)

Seems pretty intuitive, essentially a big loop around Australia.

Maths behind the TSP

How many combinations were there? Lets start with a simple case of 4 locations and the following path, ending where we started.

\(A \rightarrow B \rightarrow C \rightarrow D \rightarrow A\)

is the same as

\(B \rightarrow C \rightarrow D \rightarrow A \rightarrow B\)

This means the “starting venue” is already decided, we can just apply a rotation to the cycle. We can also travel in either direction (thanks to a symmetric matrix).

\(A \rightarrow B \rightarrow C \rightarrow D \rightarrow A\)

is the same as

\(A \rightarrow D \rightarrow C \rightarrow B \rightarrow A\)

which means we can remove half of the possible cycles, which are just reflections of each other.

This gives us the resulting formula:

\[ \texttt{NComb} = \dfrac{(n-1)!}{2}\\ n=17, \therefore \texttt{NComb} = \dfrac{(17-1)!}{2} = 10,461,394,944,000 \]

To brute force a 17 venue TSP, 10 trillion combinations would need to be evaluated, even when removing cycling and reflections, and the ompr R implementation solved it in about 1 second! Amazing!

Bonus Round

What’s the maximum distance? Easy. Replace the objective from min to max.

model_max <- MILPModel() %>%
    # we create a variable that is 1 iff we travel from city i to j
    add_variable(x[i, j], i = 1:n, j = 1:n, 
                 type = "integer", lb = 0, ub = 1) %>%
    
    # a helper variable for the MTZ formulation of the tsp
    add_variable(u[i], i = 1:n, lb = 1, ub = n) %>% 
    
    # maximise travel distance
    set_objective(sum_expr(colwise(dist_fun(i, j)) * x[i, j], i = 1:n, j = 1:n), "max") %>%
    
    # you cannot go to the same city
    set_bounds(x[i, i], ub = 0, i = 1:n) %>%
    
    # leave each city
    add_constraint(sum_expr(x[i, j], j = 1:n) == 1, i = 1:n) %>%
    
    # visit each city
    add_constraint(sum_expr(x[i, j], i = 1:n) == 1, j = 1:n) %>%
    
    # ensure no subtours (arc constraints)
    add_constraint(u[i] >= 2, i = 2:n) %>% 
    add_constraint(u[i] - u[j] + 1 <= (n - 1) * (1 - x[i, j]), i = 2:n, j = 2:n)
model_max
## Mixed integer linear optimization problem
## Variables:
##   Continuous: 17 
##   Integer: 289 
##   Binary: 0 
## Model sense: maximize 
## Constraints: 306
result_max <- solve_model(model_max, with_ROI(solver = "glpk", verbose = TRUE))
## <SOLVER MSG>  ----
## GLPK Simplex Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
##       0: obj =  0.000000000e+000  infeas = 5.000e+001 (34)
## *    54: obj =  2.453647438e+004  infeas = 0.000e+000 (1)
## *   166: obj =  3.605414246e+004  infeas = 0.000e+000 (1)
## OPTIMAL SOLUTION FOUND
## GLPK Integer Optimizer, v4.47
## 306 rows, 306 columns, 1330 non-zeros
## 289 integer variables, 272 of which are binary
## Integer optimization begins...
## +   166: mip =     not found yet <=              +inf        (1; 0)
## +   181: >>>>>  3.605414246e+004 <=  3.605414246e+004 < 0.1% (2; 0)
## +   181: mip =  3.605414246e+004 <=     tree is empty   0.0% (0; 3)
## INTEGER OPTIMAL SOLUTION FOUND
## <!SOLVER MSG> ----
result_val_max <- round(objective_value(result_max))

paste0('Total distance: ',result_val_max,'km')
## [1] "Total distance: 36054km"
solution_max <- get_solution(result_max, x[i, j]) %>% 
    filter(value > 0)

paths_max <- select(solution_max, i, j) %>% 
    rename(from = i, to = j) %>% 
    mutate(trip_id = row_number()) %>% 
    inner_join(afl_venue_locations, by = c("from" = "id"))

paths_max_leaflet <- paths_max[1,]
paths_max_row <- paths_max[1,]

for (i in 1:n) {
    paths_max_row <- paths_max %>% filter(from == paths_max_row$to[1])
    
    paths_max_leaflet <- rbind(paths_max_leaflet, paths_max_row)
}

leaflet() %>% 
    addTiles() %>%
    addMarkers(data = afl_venue_locations, ~longitude, ~latitude, popup = ~venue, label = ~venue) %>% 
    addPolylines(data = paths_max_leaflet, ~longitude, ~latitude, weight = 2)

Lots of criss-crossing paths. Imagine travelling from Canberra to Sydney via Perth.

Conclusion

Thanks for making it this far, you can check out the script I used to generate this analysis here. You can ask me questions about analysis at my twitter @crow_data_sci.

Make sure to check out ompr on github and follow the creator of this package on twitter, @dirk_sch.

Share