Applying the basics of efficient R programming



For this exercise, you will need to download the file: temp_stations.csv

Load the content of this file into a variable in your R environment with the following command:

t <- read.csv("data/temp_stations.csv") # You may need to adjust the path to the file, depending on how you have organized your folders.


This file contains daily temperature records for weather stations located in Starkville, Columbus, Meridian and Biloxi:

head(t)
##     datetime starkville   columbus   meridian     biloxi
## 1 2022-01-01       18.1 15.3903241 19.1763887 19.8906805
## 2 2022-01-02       -0.9  0.7630577 -0.3803543  0.8910247
## 3 2022-01-03       -1.9 -7.3876835 -0.8971593  1.0825752
## 4 2022-01-04         NA  0.3948917 -0.1458191  5.6266367
## 5 2022-01-05        1.1 -1.7254413 -1.8997366  8.0543028
## 6 2022-01-06       -1.6 -4.0392856 -0.9410461 -0.2371790


Before we start the actual exercise, let’s have a quick look at the data. Your first task is to plot the temperatures for the Starkville station as a function of the date. Your plot should look like this:

Part 1 - No more for loops…

For the following exercise, you will not write a single for loop! You will only need the functions listed below (do not use any other functions!):

colnames (optional)
apply
min
which.min
names (optional)
table
month
by
t.test

1.1 Add a column with the minimum temperature recorded each day.


Your dataframe should now look like this:

date        starkville    columbus    meridian      biloxi     mintemp
2022-01-01        18.1  15.3903241  19.1763887  19.8906805  15.3903241
2022-01-02        -0.9   0.7630577  -0.3803543   0.8910247  -0.9000000
2022-01-03        -1.9  -7.3876835  -0.8971593   1.0825752  -7.3876835
2022-01-04          NA   0.3948917  -0.1458191   5.6266367  -0.1458191
2022-01-05         1.1  -1.7254413  -1.8997366   8.0543028  -1.8997366
2022-01-06        -1.6  -4.0392856  -0.9410461  -0.2371790  -4.0392856
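
One possible way to obtain the mintemp column is sketched below. It is only a sketch: to stay self-contained it rebuilds the six rows printed by head(t) instead of reading the file, and the helper vector stations is a name introduced here for illustration. Note the NA in Starkville on 2022-01-04: the expected result for that day is -0.1458191, which suggests that missing values are ignored (na.rm = TRUE).

```r
# Stand-in for the real file: the six rows shown by head(t)
t <- data.frame(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03",
                 "2022-01-04", "2022-01-05", "2022-01-06"),
  starkville = c(18.1, -0.9, -1.9, NA, 1.1, -1.6),
  columbus   = c(15.3903241, 0.7630577, -7.3876835,
                 0.3948917, -1.7254413, -4.0392856),
  meridian   = c(19.1763887, -0.3803543, -0.8971593,
                 -0.1458191, -1.8997366, -0.9410461),
  biloxi     = c(19.8906805, 0.8910247, 1.0825752,
                 5.6266367, 8.0543028, -0.2371790)
)

# Hypothetical helper: the names of the four temperature columns
stations <- c("starkville", "columbus", "meridian", "biloxi")

# MARGIN = 1 applies min() to each row; extra arguments such as
# na.rm = TRUE are passed on to min()
t$mintemp <- apply(t[, stations], 1, min, na.rm = TRUE)
t
```

Only the station columns are passed to apply: including the datetime column would turn every row into character strings before min is called.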

1.2 Find how many times each station was the coldest of the four.


To do so, we will start by adding a column indicating which station was the coldest. We’ll first record the index (starkville = 1, columbus = 2, meridian = 3, …) and match the index to the station name later.

date        starkville    columbus    meridian      biloxi     mintemp  coldestStation
2022-01-01        18.1  15.3903241  19.1763887  19.8906805  15.3903241               2
2022-01-02        -0.9   0.7630577  -0.3803543   0.8910247  -0.9000000               1
2022-01-03        -1.9  -7.3876835  -0.8971593   1.0825752  -7.3876835               2
2022-01-04          NA   0.3948917  -0.1458191   5.6266367  -0.1458191               3
2022-01-05         1.1  -1.7254413  -1.8997366   8.0543028  -1.8997366               3
2022-01-06        -1.6  -4.0392856  -0.9410461  -0.2371790  -4.0392856               2
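
The coldestStation column could be produced along the following lines. This is a sketch under the same assumptions as before: the data frame is rebuilt from the six rows printed by head(t), and stations is a helper name introduced here.

```r
# Stand-in for the real file: the six rows shown by head(t)
t <- data.frame(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03",
                 "2022-01-04", "2022-01-05", "2022-01-06"),
  starkville = c(18.1, -0.9, -1.9, NA, 1.1, -1.6),
  columbus   = c(15.3903241, 0.7630577, -7.3876835,
                 0.3948917, -1.7254413, -4.0392856),
  meridian   = c(19.1763887, -0.3803543, -0.8971593,
                 -0.1458191, -1.8997366, -0.9410461),
  biloxi     = c(19.8906805, 0.8910247, 1.0825752,
                 5.6266367, 8.0543028, -0.2371790)
)
stations <- c("starkville", "columbus", "meridian", "biloxi")

# which.min() returns the position of the smallest value in each row
# and skips NA values, so no na.rm argument is needed here
t$coldestStation <- apply(t[, stations], 1, which.min)
t$coldestStation
```

The positions refer to the order of the columns passed to apply, so starkville = 1 only holds as long as the station columns are selected in that order.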


You can now use the function table to find how many times each index appears in the column coldestStation. Store the result in a variable named cold.

## 
##   1   2   3   4 
##  82 143 137   3


Use the function names to make your variable cold look nicer:

## starkville   columbus   meridian     biloxi 
##         82        143        137          3
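
A possible way to count and label the coldest stations is sketched here on the six index values from the table above (not the full year, so the counts differ from the output shown). With the full data set, where all four indices occur, names(cold) <- colnames(t)[2:5] would be enough. The lookup through as.integer is an extra, introduced here as an assumption, so that the labels stay correct when a station never happens to be the coldest.

```r
stations <- c("starkville", "columbus", "meridian", "biloxi")

# Index of the coldest station for the six days shown earlier
coldestStation <- c(2, 1, 2, 3, 3, 2)

# Count how often each index occurs
cold <- table(coldestStation)

# The names of a table are the observed values ("1", "2", "3" here);
# use them as positions in `stations` to look up the matching labels
names(cold) <- stations[as.integer(names(cold))]
cold
```

In this small sample index 4 never occurs, so cold has three entries only; assigning four names directly would fail with a length mismatch.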


1.3 Summary per month


Add a column containing the month of each observation and then use the function by to obtain the minimum temperature recorded in Starkville for each month.


## t$month: April
## [1] 0.1
## ------------------------------------------------------------ 
## t$month: August
## [1] 20.1
## ------------------------------------------------------------ 
## t$month: December
## [1] -13.9
## ------------------------------------------------------------ 
## t$month: February
## [1] -6.1
## ------------------------------------------------------------ 
## t$month: January
## [1] -8.3
## ------------------------------------------------------------ 
## t$month: July
## [1] 21.1
## ------------------------------------------------------------ 
## t$month: June
## [1] 16.1
## ------------------------------------------------------------ 
## t$month: March
## [1] -4.9
## ------------------------------------------------------------ 
## t$month: May
## [1] 10.1
## ------------------------------------------------------------ 
## t$month: November
## [1] -4.9
## ------------------------------------------------------------ 
## t$month: October
## [1] -1.9
## ------------------------------------------------------------ 
## t$month: September
## [1] 8.1
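
The monthly summary could be sketched as follows. Two assumptions are made here: the function list names month, which is provided by add-on packages such as lubridate, whereas this sketch uses the base R function months together with as.Date (neither is in the list); and the data are again the six rows printed by head(t), so only one month appears in the result.

```r
# Stand-in for the real file: dates and Starkville values from head(t)
t <- data.frame(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03",
                 "2022-01-04", "2022-01-05", "2022-01-06"),
  starkville = c(18.1, -0.9, -1.9, NA, 1.1, -1.6)
)

# months() returns the month name (in the language of the current locale)
t$month <- months(as.Date(t$datetime))

# by() splits the Starkville values by month and applies min() to each
# group; na.rm = TRUE is passed on to min()
monthly_min <- by(t$starkville, t$month, min, na.rm = TRUE)
monthly_min
```

Because the month is stored as a character string, by sorts the groups alphabetically, which is consistent with the order of the output above (April, August, December, …).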


Part 2 - Statistics


According to this data, were the 2022 temperatures statistically significantly different between Starkville and Columbus? (at the 5% significance level)
How about between Starkville and Biloxi?
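
The mechanics of such a comparison can be sketched with t.test. The sketch below runs on the six rows printed by head(t) only, so its result says nothing about the full year; you need the complete file to answer the questions. Whether a paired test or the default Welch test is more appropriate is a judgment call: the paired version is shown because both stations are measured on the same days.

```r
# Stand-in for the real file: two stations from the six rows of head(t)
starkville <- c(18.1, -0.9, -1.9, NA, 1.1, -1.6)
columbus   <- c(15.3903241, 0.7630577, -7.3876835,
                0.3948917, -1.7254413, -4.0392856)

# Paired t-test: each day contributes one Starkville/Columbus pair;
# days with a missing value in either station are dropped
res <- t.test(starkville, columbus, paired = TRUE)
res

# Decision at the 5% significance level
res$p.value < 0.05

# Alternative: Welch two-sample test (the default of t.test)
# t.test(starkville, columbus)
```

The same call with biloxi in place of columbus covers the second question.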

Part 3 - Tibbles, tidyr and dplyr


We will use the functionalities of tibbles and a few R packages to quickly generate a beautiful plot of the temperatures as a function of the date for each station.

3.1 - From dataframes to tibbles


Convert the dataframe t into a tibble tb:

library("tibble")
library("dplyr")
library("tidyr")
library("ggplot2")

# Let's restart from a fresh version of the t dataframe:
t <- read.csv("data/temp_stations.csv")

tb <- as_tibble(t) # as_tibble() is the usual way to convert an existing dataframe
tb
## # A tibble: 365 × 5
##    datetime   starkville columbus meridian biloxi
##    <chr>           <dbl>    <dbl>    <dbl>  <dbl>
##  1 2022-01-01       18.1   15.4     19.2   19.9  
##  2 2022-01-02       -0.9    0.763   -0.380  0.891
##  3 2022-01-03       -1.9   -7.39    -0.897  1.08 
##  4 2022-01-04       NA      0.395   -0.146  5.63 
##  5 2022-01-05        1.1   -1.73    -1.90   8.05 
##  6 2022-01-06       -1.6   -4.04    -0.941 -0.237
##  7 2022-01-07       -3.9   -4.84    -4.30   1.73 
##  8 2022-01-08       -2.9   -2.87    -2.16  -0.184
##  9 2022-01-09        6.2    2.84     3.11   9.73 
## 10 2022-01-10       -0.9    2.45    -1.92   3.30 
## # … with 355 more rows


Compare what you see when you type tb with what is displayed for a dataframe (when you type t in the R console).
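
Printing is not the only difference. The short sketch below, built on three rows taken from the data above (df is a name introduced here), illustrates one more: selecting a single column with [ gives a vector for a dataframe but a one-column tibble for a tibble.

```r
library(tibble)

# Small stand-in dataframe: three rows from the data shown above
df <- data.frame(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03"),
  starkville = c(18.1, -0.9, -1.9)
)
tb <- as_tibble(df)

class(df)  # a plain data.frame
class(tb)  # a tibble is a data.frame with additional classes

df[, "starkville"]  # dataframe: the result is simplified to a numeric vector
tb[, "starkville"]  # tibble: the result is still a (one-column) tibble
```

Use tb$starkville or tb[["starkville"]] when you need the column as a vector.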


3.2 Preparing the data for ggplot


Before we can plot the data with ggplot, we need to reshape our tibble so that there is only one column with the value to be plotted (temperature) and one column indicating the name of the station. Use the function pivot_longer (from the package tidyr) to perform this miracle and make sure that your tibble now looks like this:

## # A tibble: 1,460 × 3
##    datetime   station    temperature
##    <chr>      <chr>            <dbl>
##  1 2022-01-01 starkville      18.1  
##  2 2022-01-01 columbus        15.4  
##  3 2022-01-01 meridian        19.2  
##  4 2022-01-01 biloxi          19.9  
##  5 2022-01-02 starkville      -0.9  
##  6 2022-01-02 columbus         0.763
##  7 2022-01-02 meridian        -0.380
##  8 2022-01-02 biloxi           0.891
##  9 2022-01-03 starkville      -1.9  
## 10 2022-01-03 columbus        -7.39 
## # … with 1,450 more rows
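
A minimal sketch of this reshaping step, assuming the column names station and temperature from the output above. The tibble is rebuilt from the six rows printed by head(t), and tb_long is a name introduced here for illustration.

```r
library(tibble)
library(tidyr)

# Stand-in for the real file: the six rows shown by head(t)
tb <- tibble(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03",
                 "2022-01-04", "2022-01-05", "2022-01-06"),
  starkville = c(18.1, -0.9, -1.9, NA, 1.1, -1.6),
  columbus   = c(15.3903241, 0.7630577, -7.3876835,
                 0.3948917, -1.7254413, -4.0392856),
  meridian   = c(19.1763887, -0.3803543, -0.8971593,
                 -0.1458191, -1.8997366, -0.9410461),
  biloxi     = c(19.8906805, 0.8910247, 1.0825752,
                 5.6266367, 8.0543028, -0.2371790)
)

# Stack every column except datetime: the old column names go into
# `station`, the cell values into `temperature`
tb_long <- pivot_longer(tb,
                        cols      = -datetime,
                        names_to  = "station",
                        values_to = "temperature")
tb_long
```

Each day now occupies four rows, one per station, which is why the full tibble has 4 × 365 = 1,460 rows.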

3.3 Plotting the data


Use the function mutate to generate a column date which contains a date object.
You now have everything ready to generate a plot like this:
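
As a rough sketch of how these last steps could fit together, the code below converts the dates and draws one line per station. It uses only two stations and three days from the data above; the axis labels and the names tb_long and p are choices made here, not requirements of the exercise.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)

# Small stand-in: two stations, three days from the data shown above
tb <- tibble(
  datetime   = c("2022-01-01", "2022-01-02", "2022-01-03"),
  starkville = c(18.1, -0.9, -1.9),
  biloxi     = c(19.8906805, 0.8910247, 1.0825752)
)

tb_long <- tb %>%
  pivot_longer(cols = -datetime,
               names_to = "station", values_to = "temperature") %>%
  # datetime is read as text; as.Date() turns it into a Date object
  mutate(date = as.Date(datetime))

# One line per station, distinguished by colour
p <- ggplot(tb_long, aes(x = date, y = temperature, colour = station)) +
  geom_line() +
  labs(x = "Date", y = "Temperature")
p
```

With a proper Date column, ggplot treats the x axis as a time axis and chooses readable date breaks instead of one label per character string.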