A few extras from our second lab.

Linear models and extracting data from lm

In the previous lab, we use the function lm to fit a linear model. The goal was to test for a correlation between temperature and wind in the “weather data”. We used the function summary to display the information returned by lm. Let’s look at how we can extract the useful data returned by lm.

t = read.csv("data/2020-12-weather.csv", comment.char="#")

res = lm(t$temperature_avg~t$wind_max)

res
## 
## Call:
## lm(formula = t$temperature_avg ~ t$wind_max)
## 
## Coefficients:
## (Intercept)   t$wind_max  
##     10.3310      -0.4529
summary(res)
## 
## Call:
## lm(formula = t$temperature_avg ~ t$wind_max)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6056 -2.6784  0.8872  1.9951  5.9317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.3310     1.8499   5.585 8.28e-06 ***
## t$wind_max   -0.4529     0.1889  -2.398   0.0243 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 25 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.1544 
## F-statistic: 5.749 on 1 and 25 DF,  p-value: 0.02429

Let’s write our own version of the summary function to give us control over how the data is presented. Our goal is to have a function that displays the output with this format:

## The adjusted r-squared is X and the p-value is Y.

To do so, we could use the object returned by lm and perform our own calculations of r-squared and p-value. But to keep things easy, we’ll “cheat” and simply extract this information from summary. To know which field need to be accessed, we can use the following code:

attributes( summary(res) )
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"

Note: we could have writen it this way (by adding an intermediary variable):

res_summary = summary(res)
attributes(res_summary)
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"

You can access the content of a given field with the following syntax:

## obejctName$fieldName

For example:

res_summary$r.squared
## [1] 0.1869642

Note that you could have used this notation to perform the exact same thing:

summary(res)$r.squared
## [1] 0.1869642

You might notice that the is no field named p-value. So, where is the p-value stored? The answer is: it is actually re-computed on the fly from the fstatistic value.

summary(res)$fstatistic
##     value     numdf     dendf 
##  5.748955  1.000000 25.000000

Here is how to compute the p-value from this f statistic:

fstat = summary(res)$fstatistic
pval = pf(fstat[1],fstat[2],fstat[3],lower.tail=F)
pval
##      value 
## 0.02428936

If you are not familiar with F statistics: https://stattrek.com/probability-distributions/f-distribution.aspx

Now, write the function mySummary that will perform the following action:

res = lm(t$temperature_avg~t$wind_max)
mySummary(res)
## The adjusted r-squared is  0.1869642 and the p-value is  0.02428936 .

Extras:
1. Find a way to avoid the extra space after the pvalue.
2. Find a way to round up the numbers to the Xth decimal point (X being passed as an argument to the function and having by default a value of 2).

mySummaryExtra(res)
## The adjusted r-squared is 0.19 and the p-value is 0.02.
mySummaryExtra(res, precision=4)
## The adjusted r-squared is 0.187 and the p-value is 0.0243.