Download the file tornadoes.csv and load it into your R environment with the read.csv function:
t <- read.csv("data/tornadoes.csv")
First, let’s inspect the data and plot the locations of tornadoes
since 2020 on a map.
library(ggmap)
library(knitr)
kable(head(t))
date | time | tz | st | mag | lat | lon | len | wid |
---|---|---|---|---|---|---|---|---|
10/1/1950 | 21:00:00 | 3 | OK | 1 | 36.73 | -102.52 | 15.8 | 10 |
10/9/1950 | 2:15:00 | 3 | NC | 3 | 34.17 | -78.60 | 2.0 | 880 |
11/20/1950 | 2:20:00 | 3 | KY | 2 | 37.37 | -87.20 | 0.1 | 10 |
11/20/1950 | 4:00:00 | 3 | KY | 1 | 38.20 | -84.50 | 0.1 | 10 |
11/20/1950 | 7:30:00 | 3 | MS | 1 | 32.42 | -89.13 | 2.0 | 37 |
11/4/1950 | 17:00:00 | 3 | PA | 3 | 40.20 | -76.12 | 15.9 | 100 |
# The column "date" is read as a character by R. You can check this with:
class(t$date)
## [1] "character"
# Preparing a column with the date as an actual date object
t$Date <- as.Date(t$date, format = "%m/%d/%Y")
# This is now a Date object:
class(t$Date)
## [1] "Date"
# We will need to isolate the year for some later work...
t$year <- as.numeric(format(t$Date, "%Y"))
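As a quick sanity check, the format string can be tested on a single value (the literal date below is taken from the first row of the table above):

```r
d <- as.Date("10/1/1950", format = "%m/%d/%Y")
d                            # a Date object: "1950-10-01"
as.numeric(format(d, "%Y"))  # the year as a number: 1950
```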
# Preparing the coordinates of the borders for our map:
map_bounds = c(left = -129, bottom = 23, right = -60, top = 50)
myMap <- get_map(location=map_bounds, zoom = 4, source="stamen")
plot(myMap)
# Some of the latitude/longitude values in the dataframe are aberrant.
# Let's keep only the data that falls within our mapping area:
tr <- t[which(t$lat >= map_bounds["bottom"] & t$lat <= map_bounds["top"] &
              t$lon >= map_bounds["left"] & t$lon <= map_bounds["right"]), ]
ggMap <- ggmap(myMap, extent="device", legend="none")
# Selecting only the data for the tornadoes since January 1st, 2020
trd <- tr[which(tr$Date >= as.Date("2020-01-01", format = "%Y-%m-%d")) , ]
ggMapWithDots <- ggMap +
geom_point(data=trd, aes(x=lon, y=lat, size = len), fill="red", shape=21, alpha=0.8)
plot(ggMapWithDots)
Let’s animate this map, displaying the tornadoes from the first
year of each decade:
library(gganimate)
trd <- tr[which(tr$year%%10 == 0) , ]
ggMapWithDots <- ggMap +
geom_point(data=trd, aes(x=lon, y=lat, size = len), fill="red", shape=21, alpha=0.8) +
transition_states(year, transition_length=10, state_length=30) +
labs(title = 'Year: {closest_state}')
animate(ggMapWithDots)
A current topic of research is: does climate change cause an
increase in the frequency of extreme weather events? We are not going to
answer this (complex) question in this lab. However, we will see how we
can use statistics to attach confidence levels to some observations.
Let’s start with the question: have tornadoes become more dangerous
over the past decades? Our proxy for this will be the recorded length of
the tornado track (column len) in the dataframe.
We can start with a visual inspection of the data with a
boxplot:
# I start by removing all the records with a track length <= 0 (this is missing data)
t$decade <- t$year - t$year %% 10
tr <- t[which(t$len>0) , ]
boxplot(tr$len~tr$decade)
The spread of the data makes it very difficult to see anything.
The solution? Use a log-transformation:
boxplot( log(tr$len)~tr$decade)
Much better, right? But it still does not answer our
question.
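The effect of the log-transformation is easy to reproduce on simulated data: track lengths are heavily right-skewed, much like an exponential distribution (the choice of distribution here is purely illustrative):

```r
set.seed(7)
x <- rexp(1000, rate = 1)  # right-skewed positive values, like track lengths
par(mfrow = c(1, 2))
# Most of the mass is squeezed against 0 on the raw scale...
boxplot(x, main = "raw")
# ...while the log scale spreads it out much more symmetrically:
boxplot(log(x), main = "log")
```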
Let’s use a statistical test to compare the length of tornado
tracks in the 2010s vs. the 2020s. Our null hypothesis is: the mean
tornado track length in the 2020s was not greater than in the
2010s.
t.test(tr$len[which(tr$decade==2010)] , tr$len[which(tr$decade==2020)], alternative = "less")
##
## Welch Two Sample t-test
##
## data: tr$len[which(tr$decade == 2010)] and tr$len[which(tr$decade == 2020)]
## t = -2.1273, df = 3172.7, p-value = 0.01673
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.08715215
## sample estimates:
## mean of x mean of y
## 3.725735 4.110381
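The mechanics of the one-sided alternative can be checked on simulated data, where the true answer is known (the sample sizes and means below are arbitrary):

```r
set.seed(1)
x <- rnorm(200, mean = 0)  # plays the role of the 2010s lengths
y <- rnorm(200, mean = 1)  # plays the role of the 2020s lengths (truly longer)
# With alternative = "less", the alternative hypothesis is mean(x) < mean(y),
# so a small p-value rejects the null "y is not bigger than x":
res <- t.test(x, y, alternative = "less")
res$p.value
```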
The t-test used here makes a number of assumptions regarding our
data. An alternative way to compute a p-value (and free yourself from
these assumptions) is to perform random drawings. In this case, the
general idea is as follows:
1. We observe that the average length of tornado tracks in the
2020s was some value x.
2. We shuffle the year labels among all the 2010s and 2020s records,
destroying any real association between year and track length.
3. We recompute the average track length of the records labelled as
2020s after the shuffle, and repeat this many times.
4. The fraction of shuffles yielding an average at least as large as x
is our p-value.

We implement this by performing a number of random shuffles:
mean_len_recent <- mean(t$len[which(t$decade==2020)])
tr <- t[which(t$decade %in% c(2010, 2020)) , ]
NB_SIMULATIONS <- 10000
v_simulation <- numeric(NB_SIMULATIONS)
for(i in (1:NB_SIMULATIONS)){
tr$sy <- sample(tr$year)
simulated_average <- mean(tr$len[which(tr$sy>=2020)])
v_simulation[i] <- simulated_average
if( i %% 1000 == 0){ print(i) }
}
## [1] 1000
## [1] 2000
## [1] 3000
## [1] 4000
## [1] 5000
## [1] 6000
## [1] 7000
## [1] 8000
## [1] 9000
## [1] 10000
hist(v_simulation)
rug(x = mean_len_recent, lwd = 4, col = "red")
length(which(v_simulation>=mean_len_recent)) / NB_SIMULATIONS
## [1] 0.0107
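The shuffle-based logic above generalizes; here is a minimal sketch of the same idea packaged as a reusable function, run on simulated data (the function name and the simulated distributions are illustrative, not part of the lab data):

```r
perm_pvalue <- function(a, b, n_sim = 2000) {
  obs <- mean(b)                     # observed average of the "recent" group
  pooled <- c(a, b)
  is_b <- c(rep(FALSE, length(a)), rep(TRUE, length(b)))
  # Shuffle the group labels and recompute the "recent" average each time:
  sims <- replicate(n_sim, mean(pooled[sample(is_b)]))
  # Fraction of shuffles at least as extreme as the observation:
  mean(sims >= obs)
}

set.seed(42)
a <- rexp(500, rate = 1)    # stand-in for 2010s track lengths
b <- rexp(500, rate = 0.8)  # stand-in for 2020s track lengths (longer on average)
perm_pvalue(a, b)
```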