In this computer lab, you will learn how to import and manipulate data in R.
Start by creating a folder somewhere on your computer to store all
the files needed and created for this lab. Let’s name this folder
R and assume that it is a subfolder of a
CMB8013 folder which you had already created to store
all your class-related documents.
Create a sub-folder named data in which you will store
all the data files provided in this class.
Download the file min_temp.tab to the folder that you have just created. This file contains the monthly minimum temperature recorded by my personal weather station in the past 5 months. Open the file with a text editor (such as notepad in Windows) to see what the file looks like.
Our goal now is to load this data in R so that we can start to manipulate it.
To load this data in R, we will use the function read.table which has the following syntax:
myData = read.table("C:/path_to_folder/fileName")
Rather than using the full path to the file (starting from the root
of the file system, which is likely to be “C:” in Windows and “/” in
Mac/Linux systems), we will set the current working directory of the R
session to be in the folder that we have just created.
This can be done with the graphical interface in RStudio:
Session -> Set Working directory -> Choose
directory and navigate to your folder.
If you know the path to your folder, you can also set the working
directory directly in command line:
setwd("C:/lab/MSSTATE/Class/CMB8013/2022/R")
Make sure you use slash (/) and no backslash (\) in the path because the backslash is a special character in R (and many programming languages…)
You should now be able to run the command:
myData = read.table("data/min_temp.tab")
And display the content of your variable myData like so:
myData
## V1 V2 V3
## 1 2021 January -4.6
## 2 2020 December -6.8
## 3 2020 November -3.8
## 4 2020 October 1.7
## 5 2020 September 7.4
Note how R automatically named the different columns V1, V2, V3
You can specify the name you would like with the following function:
colnames(myData) = c("year", "month", "minTemp")
myData
## year month minTemp
## 1 2021 January -4.6
## 2 2020 December -6.8
## 3 2020 November -3.8
## 4 2020 October 1.7
## 5 2020 September 7.4
Sometimes (often), files will contain a first line (header) with the name for each column. For example, you can download the min_temp-v2.tab file
In this situation, you need to tell R that the file you are reading contains a header line with the option: header = TRUE:
myData = read.table("./data/min_temp-v2.tab", header = TRUE)
myData
## year month minTemp
## 1 2021 January -4.6
## 2 2020 December -6.8
## 3 2020 November -3.8
## 4 2020 October 1.7
## 5 2020 September 7.4
And below is the abreviated version:
myData = read.table("./data/min_temp-v2.tab", h = T)
To find out what options are available for a given function, use the ? command. For example:
?read.table
By default, the read.table function expect a
tabulation as the field seperator (= the character between each column).
I’ve prepared a file min_temp_underscore.tab
Download the file, inspect it, and load it in R. Use the R documentation
to figure out how to get the columns read correctly…
A common alternative to tabulation as a separator is the comma. It is so common, that there is a filetype named for it: csv (or comma-separated values). Simple tables can be exported using this format in MS Excel.
Download the file 2020-12-weather.csv. You can open this file with Excel (or LibreOffice/OpenOffice/…) if you want to see what’s in it. This file contains data from my personal weather station in Starkville for December 2020. The data is stored in plain text, one entry per line and columns are separated by commas (open the file with a text editor such as notepad if you want to see this).
Let’s read the file and store the result in a variable named t (as in table). This variable is a data.frame.
# Reading data from a csv (comma-separated values) file
t = read.csv("data/2020-12-weather.csv", comment.char="#")
t
Date | temperature_max | temperature_avg | temperature_low | humidity_high | humidity_avg | humidity_low | wind_max | wind_avg |
---|---|---|---|---|---|---|---|---|
12/1/2020 | 11.8 | 0.7 | -5.6 | 0.90 | 0.64 | 0.17 | 9.0 | 0.4 |
12/2/2020 | 13.4 | 2.1 | -6.3 | 0.90 | 0.67 | 0.19 | 9.0 | 0.3 |
12/3/2020 | 11.8 | 7.3 | 0.8 | 0.91 | 0.66 | 0.42 | 9.0 | 0.5 |
12/4/2020 | 11.2 | 6.6 | 0.3 | 0.91 | 0.84 | 0.60 | 11.2 | 0.6 |
12/5/2020 | 17.1 | 5.9 | -1.3 | 0.92 | 0.75 | 0.30 | 6.5 | 0.2 |
12/6/2020 | 14.5 | 6.8 | -0.8 | 0.92 | 0.74 | 0.43 | 4.0 | 0.0 |
12/7/2020 | 9.6 | 3.8 | -2.0 | 0.91 | 0.70 | 0.40 | 14.8 | 1.0 |
12/8/2020 | 15.5 | 4.3 | -3.7 | 0.91 | 0.75 | 0.37 | 5.0 | 0.2 |
12/9/2020 | 21.9 | 7.7 | -1.7 | 0.92 | 0.66 | 0.18 | 7.6 | 0.5 |
12/10/2020 | 24.0 | 9.8 | 0.6 | 0.91 | 0.67 | 0.21 | 6.5 | 0.2 |
12/11/2020 | 19.8 | 12.0 | 2.6 | 0.91 | 0.79 | 0.62 | 7.2 | 0.2 |
12/12/2020 | 18.4 | 13.5 | 6.1 | 0.92 | 0.88 | 0.71 | 6.1 | 0.3 |
12/14/2020 | 6.9 | 5.2 | 1.4 | 0.90 | 0.82 | 0.71 | 13.7 | 2.0 |
12/15/2020 | 10.8 | 5.1 | -1.5 | 0.90 | 0.74 | 0.51 | 7.9 | 0.3 |
12/16/2020 | 6.1 | 4.0 | 2.3 | 0.91 | 0.84 | 0.74 | 8.6 | 0.7 |
12/17/2020 | 8.6 | 2.6 | -3.1 | 0.90 | 0.67 | 0.33 | 9.7 | 0.7 |
12/18/2020 | 11.4 | 1.9 | -4.8 | 0.91 | 0.73 | 0.35 | 10.4 | 0.3 |
12/19/2020 | 14.1 | 5.9 | -2.4 | 0.91 | 0.74 | 0.37 | 14.0 | 0.7 |
12/20/2020 | 13.9 | 9.9 | 7.9 | 0.91 | 0.87 | 0.78 | 4.7 | 0.1 |
12/21/2020 | 19.1 | 9.5 | 2.6 | 0.91 | 0.67 | 0.22 | 10.1 | 0.9 |
12/22/2020 | 19.8 | 8.5 | 0.2 | 0.91 | 0.63 | 0.16 | 6.1 | 0.3 |
12/23/2020 | 16.6 | 9.4 | 3.3 | 0.91 | 0.73 | 0.59 | 12.2 | 1.0 |
12/24/2020 | 15.8 | 3.7 | -3.9 | 0.91 | 0.61 | 0.27 | 16.6 | 4.4 |
12/25/2020 | 5.3 | -0.8 | -6.0 | 0.81 | 0.44 | 0.16 | 12.2 | 1.8 |
12/26/2020 | 15.2 | 2.7 | -6.4 | 0.88 | 0.56 | 0.20 | 10.4 | 0.3 |
12/27/2020 | 19.2 | 7.6 | -2.1 | 0.83 | 0.52 | 0.21 | 12.2 | 0.7 |
12/28/2020 | 17.6 | 10.6 | 6.4 | 0.81 | 0.69 | 0.56 | 4.0 | 0.1 |
For very large tables, it is not recommended to print the entire table in the R console. You can use the command head to display only the first few entries:
head(t)
Note: the read.csv command is (almost) exactly the same as read.table with options: header = T, sep=“,”
Here are some useful commands to inspect the data:
# Obtaining the dimensions of the data.frame:
dim(t)
## [1] 27 9
# Number of rows:
nrow(t)
## [1] 27
# Number of columns:
ncol(t)
## [1] 9
# The name of each column:
colnames(t)
## [1] "Date" "temperature_max" "temperature_avg" "temperature_low"
## [5] "humidity_high" "humidity_avg" "humidity_low" "wind_max"
## [9] "wind_avg"
# A summary of all the data in t:
summary(t)
## Date temperature_max temperature_avg temperature_low
## Length:27 Min. : 5.30 Min. :-0.800 Min. :-6.4000
## Class :character 1st Qu.:11.30 1st Qu.: 3.750 1st Qu.:-3.4000
## Mode :character Median :14.50 Median : 5.900 Median :-1.3000
## Mean :14.42 Mean : 6.159 Mean :-0.6333
## 3rd Qu.:18.00 3rd Qu.: 8.950 3rd Qu.: 1.8500
## Max. :24.00 Max. :13.500 Max. : 7.9000
## humidity_high humidity_avg humidity_low wind_max
## Min. :0.8100 Min. :0.4400 Min. :0.1600 Min. : 4.000
## 1st Qu.:0.9000 1st Qu.:0.6600 1st Qu.:0.2100 1st Qu.: 6.500
## Median :0.9100 Median :0.7000 Median :0.3700 Median : 9.000
## Mean :0.8981 Mean :0.7041 Mean :0.3985 Mean : 9.211
## 3rd Qu.:0.9100 3rd Qu.:0.7500 3rd Qu.:0.5750 3rd Qu.:11.700
## Max. :0.9200 Max. :0.8800 Max. :0.7800 Max. :16.600
## wind_avg
## Min. :0.0000
## 1st Qu.:0.2500
## Median :0.4000
## Mean :0.6926
## 3rd Qu.:0.7000
## Max. :4.4000
Here are a few different ways to access the data in a data.frame:
# Accessing the value in row #4, column #5:
t[4,5]
## [1] 0.91
# Accessing all the data in a row #3:
t[3 , ]
## Date temperature_max temperature_avg temperature_low humidity_high
## 3 12/3/2020 11.8 7.3 0.8 0.91
## humidity_avg humidity_low wind_max wind_avg
## 3 0.66 0.42 9 0.5
# Accessing all the data in a column #2:
t[ , 2]
## [1] 11.8 13.4 11.8 11.2 17.1 14.5 9.6 15.5 21.9 24.0 19.8 18.4 6.9 10.8 6.1
## [16] 8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8 5.3 15.2 19.2 17.6
# A different way to do the same thing (accessing the column by name rather than index):
t[ , "temperature_max"]
## [1] 11.8 13.4 11.8 11.2 17.1 14.5 9.6 15.5 21.9 24.0 19.8 18.4 6.9 10.8 6.1
## [16] 8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8 5.3 15.2 19.2 17.6
# And a different notation to do the exact same thing:
t$temperature_max
## [1] 11.8 13.4 11.8 11.2 17.1 14.5 9.6 15.5 21.9 24.0 19.8 18.4 6.9 10.8 6.1
## [16] 8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8 5.3 15.2 19.2 17.6
# Accessing the data in rows 5 to 10, column #2:
t[5:10 , 2]
## [1] 17.1 14.5 9.6 15.5 21.9 24.0
# Same thing:
t$temperature_max[5:10]
## [1] 17.1 14.5 9.6 15.5 21.9 24.0
Let’s find the maximum temperature recorded by the weather station during this period (R provides a function max which returns the maximum value in a vector):
max(t$temperature_max)
## [1] 24
We can also perform some basic manipulation of the data. Let’s compute the daily difference between highest and lowest temperature:
t$deltaTemp = t$temperature_max - t$temperature_low
And let’s have a look at what these numbers look like, using a graphical representation (histogram):
hist(t$deltaTemp)
We could also use a boxplot representation:
boxplot(t$deltaTemp)
Let’s go beyond these basic steps and start harvesting the power of R by doing some statistical analysis of the data. For this toy example, we would like to test if there is a correlation between wind speed and temperature. More specifically, we want to test for a significant correlation between the average daily temperature and the maximum wind speed recorded on that day.
We can start by visualizing the data with a scatter plot:
plot(t$temperature_avg~t$wind_max)
Note: you would have obtained the same result with this slightly different notation:
plot(t$wind_max, t$temperature_avg)
It looks like there might be a slight negative correlation. To test for the significance of this correlation, we are going to use linear models and the function lm:
summary(lm(t$temperature_avg~t$wind_max))
##
## Call:
## lm(formula = t$temperature_avg ~ t$wind_max)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6056 -2.6784 0.8872 1.9951 5.9317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3310 1.8499 5.585 8.28e-06 ***
## t$wind_max -0.4529 0.1889 -2.398 0.0243 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.265 on 25 degrees of freedom
## Multiple R-squared: 0.187, Adjusted R-squared: 0.1544
## F-statistic: 5.749 on 1 and 25 DF, p-value: 0.02429
We can see that the adjusted R-squared value is 0.15 and the corresponding p-value 0.02, suggesting a possible correlation between the two variables.
Write a function (myCor) which computes the Pearson correlation coefficient between two vectors of variables. As a reminder, the formula for the Pearson correlation coefficient is given by:
Here is the type of result you should obtain:
myCor(t$temperature_avg,t$wind_max)
## [1] -0.4323936
Once you have this function, write some code using a loop to find the Pearson correlation coefficient between the value in a given column and all the other columns.
Here is what running this function should look like:
testAllCorrelations(t, "temperature_avg")
## Correlation between temperature_avg and temperature_max = 0.6826209
## Correlation between temperature_avg and temperature_low = 0.8455516
## Correlation between temperature_avg and humidity_high = 0.1710817
## Correlation between temperature_avg and humidity_avg = 0.4458135
## Correlation between temperature_avg and humidity_low = 0.4385809
## Correlation between temperature_avg and wind_max = -0.4323936
## Correlation between temperature_avg and wind_avg = -0.2926063
## Correlation between temperature_avg and deltaTemp = 0.0008163158
testAllCorrelations(t, "wrong_column_name")
## wrong_column_name is not a valid column name.
testAllCorrelations(t, "Date")
## The values in Date are not numerics...
And here are two pieces of information that should help you for this exercise:
is.element("Date", colnames(t))
## [1] TRUE
is.element("blabla", colnames(t))
## [1] FALSE