Importing and manipulating data

In this computer lab, you will learn how to import and manipulate data in R.

Start by creating a folder somewhere on your computer to store all the files needed and created for this lab. Let’s name this folder R and assume that it is a subfolder of a CMB8013 folder which you had already created to store all your class-related documents.
Create a sub-folder named data in which you will store all the data files provided in this class.

Download the file min_temp.tab to the folder that you have just created. This file contains the monthly minimum temperature recorded by my personal weather station in the past 5 months. Open the file with a text editor (such as notepad in Windows) to see what the file looks like.

Checking the file in a text editor

Our goal now is to load this data in R so that we can start to manipulate it.

Locating and loading your data.

To load this data in R, we will use the function read.table which has the following syntax:

myData = read.table("C:/path_to_folder/fileName")

Rather than using the full path to the file (starting from the root of the file system, which is likely to be “C:” in Windows and “/” in Mac/Linux systems), we will set the current working directory of the R session to be in the folder that we have just created.
This can be done with the graphical interface in RStudio: Session -> Set Working directory -> Choose directory and navigate to your folder.
If you know the path to your folder, you can also set the working directory directly in command line:

setwd("C:/lab/MSSTATE/Class/CMB8013/2022/R")

Make sure you use slash (/) and no backslash (\) in the path because the backslash is a special character in R (and many programming languages…)

You should now be able to run the command:

myData = read.table("data/min_temp.tab")

And display the content of your variable myData like so:

myData

##     V1        V2   V3
## 1 2021   January -4.6
## 2 2020  December -6.8
## 3 2020  November -3.8
## 4 2020   October  1.7
## 5 2020 September  7.4

Note how R automatically named the different columns V1, V2, V3

You can specify the name you would like with the following function:

colnames(myData) = c("year", "month", "minTemp")
myData

##   year     month minTemp
## 1 2021   January    -4.6
## 2 2020  December    -6.8
## 3 2020  November    -3.8
## 4 2020   October     1.7
## 5 2020 September     7.4

Sometimes (often), files will contain a first line (header) with the name for each column. For example, you can download the min_temp-v2.tab file

min_temp-v2.tab

In this situation, you need to tell R that the file you are reading contains a header line with the option: header = TRUE:

myData = read.table("./data/min_temp-v2.tab", header = TRUE)
myData

##   year     month minTemp
## 1 2021   January    -4.6
## 2 2020  December    -6.8
## 3 2020  November    -3.8
## 4 2020   October     1.7
## 5 2020 September     7.4

And below is the abreviated version:

myData = read.table("./data/min_temp-v2.tab", h = T)

To find out what options are available for a given function, use the ? command. For example:

?read.table

Mini exercise.

By default, the read.table function expect a tabulation as the field seperator (= the character between each column). I’ve prepared a file min_temp_underscore.tab
Download the file, inspect it, and load it in R. Use the R documentation to figure out how to get the columns read correctly…

Reading data from a CSV (comma-separated values) file.

A common alternative to tabulation as a separator is the comma. It is so common, that there is a filetype named for it: csv (or comma-separated values). Simple tables can be exported using this format in MS Excel.

Download the file 2020-12-weather.csv. You can open this file with Excel (or LibreOffice/OpenOffice/…) if you want to see what’s in it. This file contains data from my personal weather station in Starkville for December 2020. The data is stored in plain text, one entry per line and columns are separated by commas (open the file with a text editor such as notepad if you want to see this).

Let’s read the file and store the result in a variable named t (as in table). This variable is a data.frame.

# Reading data from a csv (comma-separated values) file
t = read.csv("data/2020-12-weather.csv", comment.char="#")
t

Date	temperature_max	temperature_avg	temperature_low	humidity_high	humidity_avg	humidity_low	wind_max	wind_avg
12/1/2020	11.8	0.7	-5.6	0.90	0.64	0.17	9.0	0.4
12/2/2020	13.4	2.1	-6.3	0.90	0.67	0.19	9.0	0.3
12/3/2020	11.8	7.3	0.8	0.91	0.66	0.42	9.0	0.5
12/4/2020	11.2	6.6	0.3	0.91	0.84	0.60	11.2	0.6
12/5/2020	17.1	5.9	-1.3	0.92	0.75	0.30	6.5	0.2
12/6/2020	14.5	6.8	-0.8	0.92	0.74	0.43	4.0	0.0
12/7/2020	9.6	3.8	-2.0	0.91	0.70	0.40	14.8	1.0
12/8/2020	15.5	4.3	-3.7	0.91	0.75	0.37	5.0	0.2
12/9/2020	21.9	7.7	-1.7	0.92	0.66	0.18	7.6	0.5
12/10/2020	24.0	9.8	0.6	0.91	0.67	0.21	6.5	0.2
12/11/2020	19.8	12.0	2.6	0.91	0.79	0.62	7.2	0.2
12/12/2020	18.4	13.5	6.1	0.92	0.88	0.71	6.1	0.3
12/14/2020	6.9	5.2	1.4	0.90	0.82	0.71	13.7	2.0
12/15/2020	10.8	5.1	-1.5	0.90	0.74	0.51	7.9	0.3
12/16/2020	6.1	4.0	2.3	0.91	0.84	0.74	8.6	0.7
12/17/2020	8.6	2.6	-3.1	0.90	0.67	0.33	9.7	0.7
12/18/2020	11.4	1.9	-4.8	0.91	0.73	0.35	10.4	0.3
12/19/2020	14.1	5.9	-2.4	0.91	0.74	0.37	14.0	0.7
12/20/2020	13.9	9.9	7.9	0.91	0.87	0.78	4.7	0.1
12/21/2020	19.1	9.5	2.6	0.91	0.67	0.22	10.1	0.9
12/22/2020	19.8	8.5	0.2	0.91	0.63	0.16	6.1	0.3
12/23/2020	16.6	9.4	3.3	0.91	0.73	0.59	12.2	1.0
12/24/2020	15.8	3.7	-3.9	0.91	0.61	0.27	16.6	4.4
12/25/2020	5.3	-0.8	-6.0	0.81	0.44	0.16	12.2	1.8
12/26/2020	15.2	2.7	-6.4	0.88	0.56	0.20	10.4	0.3
12/27/2020	19.2	7.6	-2.1	0.83	0.52	0.21	12.2	0.7
12/28/2020	17.6	10.6	6.4	0.81	0.69	0.56	4.0	0.1

For very large tables, it is not recommended to print the entire table in the R console. You can use the command head to display only the first few entries:

head(t)

Note: the read.csv command is (almost) exactly the same as read.table with options: header = T, sep=“,”

Inspecting the data.

Here are some useful commands to inspect the data:

# Obtaining the dimensions of the data.frame:
dim(t)

## [1] 27  9

# Number of rows:
nrow(t)

## [1] 27

# Number of columns:
ncol(t)

## [1] 9

# The name of each column:
colnames(t)

## [1] "Date"            "temperature_max" "temperature_avg" "temperature_low"
## [5] "humidity_high"   "humidity_avg"    "humidity_low"    "wind_max"       
## [9] "wind_avg"

# A summary of all the data in t:
summary(t)

##      Date           temperature_max temperature_avg  temperature_low  
##  Length:27          Min.   : 5.30   Min.   :-0.800   Min.   :-6.4000  
##  Class :character   1st Qu.:11.30   1st Qu.: 3.750   1st Qu.:-3.4000  
##  Mode  :character   Median :14.50   Median : 5.900   Median :-1.3000  
##                     Mean   :14.42   Mean   : 6.159   Mean   :-0.6333  
##                     3rd Qu.:18.00   3rd Qu.: 8.950   3rd Qu.: 1.8500  
##                     Max.   :24.00   Max.   :13.500   Max.   : 7.9000  
##  humidity_high     humidity_avg     humidity_low       wind_max     
##  Min.   :0.8100   Min.   :0.4400   Min.   :0.1600   Min.   : 4.000  
##  1st Qu.:0.9000   1st Qu.:0.6600   1st Qu.:0.2100   1st Qu.: 6.500  
##  Median :0.9100   Median :0.7000   Median :0.3700   Median : 9.000  
##  Mean   :0.8981   Mean   :0.7041   Mean   :0.3985   Mean   : 9.211  
##  3rd Qu.:0.9100   3rd Qu.:0.7500   3rd Qu.:0.5750   3rd Qu.:11.700  
##  Max.   :0.9200   Max.   :0.8800   Max.   :0.7800   Max.   :16.600  
##     wind_avg     
##  Min.   :0.0000  
##  1st Qu.:0.2500  
##  Median :0.4000  
##  Mean   :0.6926  
##  3rd Qu.:0.7000  
##  Max.   :4.4000

Accessing the data.

Here are a few different ways to access the data in a data.frame:

# Accessing the value in row #4, column #5:
t[4,5]

## [1] 0.91

# Accessing all the data in a row #3:
t[3 , ]

##        Date temperature_max temperature_avg temperature_low humidity_high
## 3 12/3/2020            11.8             7.3             0.8          0.91
##   humidity_avg humidity_low wind_max wind_avg
## 3         0.66         0.42        9      0.5

# Accessing all the data in a column #2:
t[ , 2]

##  [1] 11.8 13.4 11.8 11.2 17.1 14.5  9.6 15.5 21.9 24.0 19.8 18.4  6.9 10.8  6.1
## [16]  8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8  5.3 15.2 19.2 17.6

# A different way to do the same thing (accessing the column by name rather than index):
t[ , "temperature_max"]

##  [1] 11.8 13.4 11.8 11.2 17.1 14.5  9.6 15.5 21.9 24.0 19.8 18.4  6.9 10.8  6.1
## [16]  8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8  5.3 15.2 19.2 17.6

# And a different notation to do the exact same thing:
t$temperature_max

##  [1] 11.8 13.4 11.8 11.2 17.1 14.5  9.6 15.5 21.9 24.0 19.8 18.4  6.9 10.8  6.1
## [16]  8.6 11.4 14.1 13.9 19.1 19.8 16.6 15.8  5.3 15.2 19.2 17.6

# Accessing the data in rows 5 to 10,  column #2:
t[5:10 , 2]

## [1] 17.1 14.5  9.6 15.5 21.9 24.0

# Same thing:
t$temperature_max[5:10]

## [1] 17.1 14.5  9.6 15.5 21.9 24.0

Let’s find the maximum temperature recorded by the weather station during this period (R provides a function max which returns the maximum value in a vector):

max(t$temperature_max)

## [1] 24

Simple exercise: find the lowest temperature recorded by the weather station.

——————————–

We can also perform some basic manipulation of the data. Let’s compute the daily difference between highest and lowest temperature:

t$deltaTemp = t$temperature_max - t$temperature_low

And let’s have a look at what these numbers look like, using a graphical representation (histogram):

hist(t$deltaTemp)

We could also use a boxplot representation:

boxplot(t$deltaTemp)

Analyzing the data.

Let’s go beyond these basic steps and start harvesting the power of R by doing some statistical analysis of the data. For this toy example, we would like to test if there is a correlation between wind speed and temperature. More specifically, we want to test for a significant correlation between the average daily temperature and the maximum wind speed recorded on that day.

We can start by visualizing the data with a scatter plot:

plot(t$temperature_avg~t$wind_max)

Note: you would have obtained the same result with this slightly different notation:

plot(t$wind_max, t$temperature_avg)

It looks like there might be a slight negative correlation. To test for the significance of this correlation, we are going to use linear models and the function lm:

summary(lm(t$temperature_avg~t$wind_max))

## 
## Call:
## lm(formula = t$temperature_avg ~ t$wind_max)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6056 -2.6784  0.8872  1.9951  5.9317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.3310     1.8499   5.585 8.28e-06 ***
## t$wind_max   -0.4529     0.1889  -2.398   0.0243 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 25 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.1544 
## F-statistic: 5.749 on 1 and 25 DF,  p-value: 0.02429

We can see that the adjusted R-squared value is 0.15 and the corresponding p-value 0.02, suggesting a possible correlation between the two variables.

Simple exercise: test the correlation between maximum daily wind speed (wind_max) and lowest daily humidity (humidity_low)

Lab report exercise

Part 1:

Write a function (myCor) which computes the Pearson correlation coefficient between two vectors of variables. As a reminder, the formula for the Pearson correlation coefficient is given by:

Here is the type of result you should obtain:

myCor(t$temperature_avg,t$wind_max)

## [1] -0.4323936

Part 2:

Once you have this function, write some code using a loop to find the Pearson correlation coefficient between the value in a given column and all the other columns.

Here is what running this function should look like:

testAllCorrelations(t, "temperature_avg")

## Correlation between  temperature_avg  and  temperature_max  =  0.6826209 
## Correlation between  temperature_avg  and  temperature_low  =  0.8455516 
## Correlation between  temperature_avg  and  humidity_high  =  0.1710817 
## Correlation between  temperature_avg  and  humidity_avg  =  0.4458135 
## Correlation between  temperature_avg  and  humidity_low  =  0.4385809 
## Correlation between  temperature_avg  and  wind_max  =  -0.4323936 
## Correlation between  temperature_avg  and  wind_avg  =  -0.2926063 
## Correlation between  temperature_avg  and  deltaTemp  =  0.0008163158

testAllCorrelations(t, "wrong_column_name")

## wrong_column_name  is not a valid column name.

testAllCorrelations(t, "Date")

## The values in  Date  are not numerics...

And here are two pieces of information that should help you for this exercise:

is.element(variable, vector) tests if “variable” is part of “vector” or not.
For example:

is.element("Date", colnames(t))

## [1] TRUE

is.element("blabla", colnames(t))

## [1] FALSE

You can test if data is numeric with the function: is.numeric