Mapping and analyzing a tornadoes dataset


Part 1 - Displaying the data on maps

Download the file tornadoes.csv and load it into your R environment with the read.csv function:


t <- read.csv("data/tornadoes.csv")


First, let’s inspect the data and plot the location of tornadoes since 2020 on a map.


library(ggmap)
library(knitr)

kable(head(t))
date         time       tz   st   mag     lat       lon    len   wid
----------   --------   ---  ---  ----  ------  --------  -----  ----
10/1/1950    21:00:00   3    OK   1      36.73   -102.52   15.8    10
10/9/1950    2:15:00    3    NC   3      34.17    -78.60    2.0   880
11/20/1950   2:20:00    3    KY   2      37.37    -87.20    0.1    10
11/20/1950   4:00:00    3    KY   1      38.20    -84.50    0.1    10
11/20/1950   7:30:00    3    MS   1      32.42    -89.13    2.0    37
11/4/1950    17:00:00   3    PA   3      40.20    -76.12   15.9   100
# The column "date" is read as a character by R. You can check this with:
class(t$date)
## [1] "character"
# Preparing a column with the date as an actual date object
t$Date <- as.Date(t$date, format = "%m/%d/%Y")

# This is now a Date object:
class(t$Date)
## [1] "Date"
# We will need to isolate the year for some later work...
t$year <- as.numeric(format(t$Date, "%Y"))
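As a quick sanity check of the `%m/%d/%Y` format string, here is a minimal sketch parsing one literal date taken from the first row of the table above:

```r
# Parse a sample date string with the same format specifier:
d <- as.Date("10/1/1950", format = "%m/%d/%Y")
d
## [1] "1950-10-01"
as.numeric(format(d, "%Y"))
## [1] 1950
```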

# Preparing the coordinates of the borders for our map:
map_bounds <- c(left = -129, bottom = 23, right = -60, top = 50)
myMap <- get_map(location=map_bounds, zoom = 4, source="stamen")
plot(myMap)

# Some of the latitude/longitude values in the dataframe are aberrant.
# Let's keep only the data that falls in our mapping area:
tr <- t[which(t$lat >= map_bounds["bottom"] & t$lat <= map_bounds["top"] &
              t$lon >= map_bounds["left"]  & t$lon <= map_bounds["right"]) , ]

ggMap <- ggmap(myMap, extent="device", legend="none")

# Selecting only the data for the tornadoes since January 1st, 2020
# (from tr, so that the aberrant coordinates stay excluded)
trd <- tr[which(tr$Date >= as.Date("2020-01-01", format = "%Y-%m-%d")) , ]
ggMapWithDots <-  ggMap + 
  geom_point(data=trd,  aes(x=lon, y=lat, size = len), fill="red", shape=21, alpha=0.8)
plot(ggMapWithDots)


Let’s animate this map, showing the tornadoes recorded in the first year of each decade:

library(gganimate)

trd <- tr[which(tr$year%%10 == 0) , ]

ggMapWithDots <-  ggMap + 
  geom_point(data=trd,  aes(x=lon, y=lat, size = len), fill="red", shape=21, alpha=0.8) +
  transition_states(year, transition_length=10, state_length=30) +
  labs(title = 'Year: {closest_state}')

animate(ggMapWithDots)


Part 2 - Statistics


A current topic of research is: does climate change cause an increase in the frequency of extreme weather events? We are not going to answer this (complex) question in this lab. However, we will see how we can use statistics to attach confidence levels to some observations.


Let’s start with the question: have tornadoes become more dangerous over the past decades? Our proxy for this will be the recorded length of the tornado track (column len) in the dataframe.

We can start with a visual inspection of the data with a boxplot:

# Preparing a "decade" column to group the records
t$decade <- t$year - t$year %% 10

# Removing all the records with a track length <= 0 (this is missing data)
tr <- t[which(t$len > 0) , ]

boxplot(tr$len~tr$decade)


The spread of the data makes it very difficult to see anything. The solution? Use a log-transformation:

boxplot( log(tr$len)~tr$decade)


Much better, right? But it still does not answer our question.
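Why does the log transform help? Orders of magnitude become evenly spaced on the log scale, which tames the heavy right tail of the track lengths. A toy sketch (the values are illustrative, not from the dataset):

```r
# Values spanning three orders of magnitude become evenly spaced under log():
x <- c(0.1, 1, 10, 100)
round(log(x), 2)
## [1] -2.30  0.00  2.30  4.61
```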

Let’s use a statistical test to compare the length of tornado tracks in the 2010s vs the 2020s. Our null hypothesis is: tornado track lengths in the 2020s were, on average, no longer than in the 2010s.
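A note on the direction of the test below: in t.test(x, y, alternative = "less"), the alternative hypothesis is that mean(x) is less than mean(y), i.e. here that 2010s tracks were shorter than 2020s tracks. A quick sketch with simulated toy data (not from the dataset) illustrates the convention:

```r
# With x drawn around 0 and y around 1, alternative = "less" should be
# strongly supported, giving a very small p-value:
set.seed(1)
x <- rnorm(100, mean = 0)
y <- rnorm(100, mean = 1)
t.test(x, y, alternative = "less")$p.value   # very small
```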


t.test(tr$len[which(tr$decade==2010)] , tr$len[which(tr$decade==2020)], alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  tr$len[which(tr$decade == 2010)] and tr$len[which(tr$decade == 2020)]
## t = -2.1273, df = 3172.7, p-value = 0.01673
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         -Inf -0.08715215
## sample estimates:
## mean of x mean of y 
##  3.725735  4.110381


The t-test used here makes a number of assumptions regarding our data. An alternative way to compute a p-value (and free yourself from these assumptions) is to perform random draws, i.e. a permutation test. In this case, the general idea is as follows:

1. We observe that the average length of tornado tracks in the 2020s was x.

2. If we randomly shuffle the track lengths from the 2010s and 2020s and draw from this shuffle the same number of points as there were observations in the 2020s, what are the odds of observing a mean value >= x?

We will answer this point by performing a number of random shuffles:

# Observed mean track length in the 2020s
mean_len_recent <- mean(t$len[which(t$decade==2020)])

# Keeping only the records from the two decades we compare
tr <- t[which(t$decade %in% c(2010, 2020)) , ]

NB_SIMULATIONS <- 10000
v_simulation <- numeric(NB_SIMULATIONS)
for(i in (1:NB_SIMULATIONS)){
  # Shuffle the years across all records, breaking any real association
  # between year and track length
  tr$sy <- sample(tr$year)
  # Mean track length of the records that landed in the shuffled "2020s"
  simulated_average <- mean(tr$len[which(tr$sy>=2020)])
  v_simulation[i] <- simulated_average
  if( i %% 1000 == 0){ print(i) }  # progress indicator
}
## [1] 1000
## [1] 2000
## [1] 3000
## [1] 4000
## [1] 5000
## [1] 6000
## [1] 7000
## [1] 8000
## [1] 9000
## [1] 10000
hist(v_simulation)
rug(x = mean_len_recent, lwd = 4, col = "red")

# Empirical p-value: fraction of shuffles with a mean at least as large as observed
length(which(v_simulation>=mean_len_recent)) / NB_SIMULATIONS
## [1] 0.0107
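This empirical p-value is itself a Monte Carlo estimate based on a finite number of shuffles, so we can attach an uncertainty to it with a simple binomial standard error (a sketch; 0.0107 is the value printed above and will vary slightly between runs):

```r
# Binomial standard error of the simulated p-value:
p_hat <- 0.0107
n_sim <- 10000
se <- sqrt(p_hat * (1 - p_hat) / n_sim)
round(c(p_hat - 2 * se, p_hat + 2 * se), 4)   # rough 95% band, about 0.0086 to 0.0128
```

Both approaches therefore point in the same direction: the t-test gave p ≈ 0.017, and the permutation approach gives p ≈ 0.011.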