Using R for everything…

While R is popular in biology, it is also used in many other fields. This lab will not be covered in class; it is only here to illustrate some of the things that can be done with R outside of biology.

Here, we will look at an application of R to analyzing sports statistics. More precisely, soccer statistics!

Most of the code/data for this lab comes from: https://towardsdatascience.com/how-to-visualize-football-data-using-r-ee963b3a0ba4 which uses data from https://statsbomb.com/

Only a small subset of the data collected by StatsBomb is available for free, but it is enough to illustrate the possibilities.

Required packages

Let’s start by installing the required packages. Because some of them are not in the official CRAN repository, they need to be installed (and compiled) from GitHub using devtools. You will also need to install Git on your system: https://git-scm.com/

# If not already installed, make sure to install the packages devtools and tidyverse:
install.packages('devtools')
install.packages('tidyverse')


library(devtools)
devtools::install_github("statsbomb/SDMTools")
devtools::install_github("statsbomb/StatsBombR")
install.packages('ggsoccer')
library(tidyverse)
library(StatsBombR)
library(ggsoccer)
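
The installation calls above run again every time the script is executed. A minimal sketch of one way to skip packages that are already installed is shown below. `missing_packages` is a hypothetical helper name (not part of any package), and the actual install call is left commented out.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed.
# requireNamespace() returns FALSE (instead of failing) when a package is absent.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

cran_pkgs <- c("devtools", "tidyverse", "ggsoccer")
to_install <- missing_packages(cran_pkgs)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The same check could be applied before the `devtools::install_github()` calls, using the package names `SDMTools` and `StatsBombR`.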

# Retrieve all available competitions
Comp <- FreeCompetitions()

# Filter the competition (UEFA Champions League 2012/2013 final)
ucl_german <- Comp %>%
  filter(competition_id==16 & season_name=="2012/2013")


# Retrieve all available matches
matches <- FreeMatches(ucl_german)

# Retrieve the event data
events_df <- get.matchFree(matches)

# Preprocess the data
clean_df <- allclean(events_df)

# You can do the same thing without the intermediate variables by using pipes (%>%):
clean_df <- FreeCompetitions() %>%
  filter(competition_id==16 & season_name=="2012/2013") %>%
  FreeMatches() %>%
  get.matchFree() %>%
  allclean()
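
To see why the two versions are equivalent, here is a small sketch on plain numbers rather than StatsBomb data. It uses the native pipe `|>`, which assumes R 4.1 or later; `%>%` from the tidyverse behaves the same way in this simple case.

```r
# Nested calls are read inside-out...
nested <- sum(sqrt(c(1, 4, 9)))

# ...whereas a pipe passes the left-hand result as the first argument
# of the next function, so the steps read left to right.
piped <- c(1, 4, 9) |> sqrt() |> sum()

identical(nested, piped)
```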

# Passing Map
muller_pass <- clean_df %>%
  filter(player.name == 'Thomas Müller') %>%
  filter(type.name == 'Pass')

ggplot(muller_pass) +
  annotate_pitch(dimensions = pitch_statsbomb) +
  geom_segment(aes(x=location.x, y=location.y, xend=pass.end_location.x, yend=pass.end_location.y),
               colour = "coral",
               arrow = arrow(length = unit(0.15, "cm"),
                             type = "closed")) +
  labs(title="Thomas Müller's Passing Map",
       subtitle="UEFA Champions League Final 12/13",
       caption="Data Source: StatsBomb")
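
The coordinates used for the arrows can also be summarised numerically. As a rough sketch, the length of each pass is the Euclidean distance between its start and end points. The data frame below is invented for illustration (it only borrows the column names used in the plot), and `pass_length` is a hypothetical helper; with the real data you would pass `muller_pass` instead.

```r
# Toy passes using the same column names as the plot above (values are invented).
toy_passes <- data.frame(
  location.x = c(60, 30),
  location.y = c(40, 20),
  pass.end_location.x = c(63, 30),
  pass.end_location.y = c(44, 50)
)

# Hypothetical helper: straight-line distance between the start and end of
# each pass, in the units of the pitch coordinates.
pass_length <- function(df) {
  sqrt((df$pass.end_location.x - df$location.x)^2 +
       (df$pass.end_location.y - df$location.y)^2)
}

pass_length(toy_passes)
```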

Pressure heat map:

# Pressure Heat Map
bayern_pressure <- clean_df %>%
  filter(team.name == 'Bayern Munich') %>%
  filter(type.name == 'Pressure')

ggplot(bayern_pressure) +
  annotate_pitch(dimensions = pitch_statsbomb, fill='#021e3f', colour='#DDDDDD') +
  geom_density2d_filled(aes(location.x, location.y, fill=after_stat(level)), alpha=0.4, contour_var='ndensity') +
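  # Zoom to the pitch dimensions (the first positional argument of scale_*_continuous() is the axis name, not the limits)
  coord_cartesian(xlim = c(0, 120), ylim = c(0, 80)) +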
  labs(title="Bayern Munich's Pressure Heat Map",
       subtitle="UEFA Champions League Final 12/13",
       caption="Data Source: StatsBomb") + 
  theme_minimal() +
  theme(
    plot.background = element_rect(fill='#021e3f', color='#021e3f'),
    panel.background = element_rect(fill='#021e3f', color='#021e3f'),
    plot.title = element_text(hjust=0.5, vjust=0, size=14),
    plot.subtitle = element_text(hjust=0.5, vjust=0, size=8),
    plot.caption = element_text(hjust=0.5),
    text = element_text(family="Geneva", color='white'),
    panel.grid = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    legend.position = "none"
  )

Home-side advantage?

Are teams more likely to win when playing at home than on the road?

# Using all the competitions with free data (minus international competitions, for which home/away is decided on a random draw)
competitions <- FreeCompetitions() %>% filter(competition_international == FALSE)
## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please credit StatsBomb as your data source when using the data and visit https://statsbomb.com/media-pack/ to obtain our logos for public use."
matches <- FreeMatches(Competitions = competitions)
## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please credit StatsBomb as your data source when using the data and visit https://statsbomb.com/media-pack/ to obtain our logos for public use."
# Count each outcome; sum() on a logical vector counts its TRUE values
home_wins <- sum(matches$home_score > matches$away_score, na.rm = TRUE)
away_wins <- sum(matches$home_score < matches$away_score, na.rm = TRUE)
ties <- sum(matches$home_score == matches$away_score, na.rm = TRUE)
total <- home_wins + away_wins + ties

prop.test( c(home_wins, away_wins), c(total, total) )
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(home_wins, away_wins) out of c(total, total)
## X-squared = 18.268, df = 1, p-value = 1.919e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.04992755 0.13571838
## sample estimates:
##    prop 1    prop 2 
## 0.4535885 0.3607656
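
The test above treats home wins and away wins as two independent samples, although both counts come from the same set of matches. One possible alternative, sketched here under that caveat, is to drop the ties and ask whether home wins make up more than half of the decisive matches, using an exact binomial test. The scores below are invented purely to make the example runnable; with the real data you would use `matches$home_score` and `matches$away_score`.

```r
# Invented scores for illustration only (not StatsBomb data).
toy_home_score <- c(2, 1, 0, 3, 1, 2, 0, 1)
toy_away_score <- c(1, 1, 2, 0, 0, 2, 1, 0)

toy_home_wins <- sum(toy_home_score > toy_away_score)
toy_away_wins <- sum(toy_home_score < toy_away_score)

# Under "no home advantage", a decisive match is a home win with probability 0.5.
toy_result <- binom.test(toy_home_wins, toy_home_wins + toy_away_wins, p = 0.5)
toy_result$p.value
```

With only a handful of toy matches the p-value is not meaningful; the point is the shape of the call, not the number it returns.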