# Contents

1. What is linear regression analysis and why is it used?
2. The Least Squares Regression equation
3. Assumptions of regression modeling
4. Validating the assumptions of regression modeling
5. Exercises
6. References

# 1. What is linear regression analysis and why is it used? (1/3)

• We will be working with an extended version of the dataset by Balota et al. (2007) that we have seen before (Levshina, 2015):
• Word
• Length: word length
• SUBTLWF: normalized word frequency
• POS: part-of-speech
• Mean_RT: Mean reaction time
• Research question:
• To what extent and how do word length, word frequency, and part-of-speech condition subjects’ reaction times in a lexical decision task?
library(readr)
library(dplyr)
library(ggplot2)

# dataSet is assumed to have been loaded beforehand (e.g., with readr::read_csv())
glimpse(dataSet)
## Observations: 880
## Variables: 5
## $ Word    <chr> "rackets", "stepmother", "delineated", "swimmers", "um...
## $ Length  <int> 7, 10, 10, 8, 6, 5, 5, 8, 8, 6, 8, 12, 8, 6, 7, 3, 3, ...
## $ SUBTLWF <dbl> 0.96, 4.24, 0.04, 1.49, 1.06, 3.33, 0.10, 0.06, 0.43, ...
## $ POS     <chr> "NN", "NN", "VB", "NN", "NN", "NN", "VB", "NN", "NN", ...
## $ Mean_RT <dbl> ...
dataSet <- dataSet[abs(scale(dataSet$Length)) <= 2, ]

# 4.4 There are no outliers/overly influential observations

• Remove outliers based on MAD-scores/the interquartile range:
• The SUBTLWF variable has the typical power-law distribution of word frequencies
• Here we have two possibilities:
• We exclude all observations with a value smaller/larger than the first/third quartile minus/plus 1.5 times the interquartile range. Note that we have to use which to find those values, because we need to specify an AND statement (written as &)
• We exclude all observations with an absolute value greater than 3.5 after the median is subtracted from the value and the result is divided by the median absolute deviation
dataSet <- dataSet[abs((dataSet$SUBTLWF - median(dataSet$SUBTLWF)) / mad(dataSet$SUBTLWF)) <= 3.5, ]
# Alternative: the interquartile-range filter
# dataSet <- dataSet[which(dataSet$SUBTLWF >= quantile(dataSet$SUBTLWF, 0.25) - (1.5 * IQR(dataSet$SUBTLWF)) &
#                          dataSet$SUBTLWF <= quantile(dataSet$SUBTLWF, 0.75) + (1.5 * IQR(dataSet$SUBTLWF))), ]
# We refit our model to the changed data
mod <- lm(Mean_RT ~ Length + SUBTLWF + POS, data = dataSet)
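The MAD-based rule above can be illustrated on a small toy vector (the values here are made up for illustration and are not the SUBTLWF data):

```r
# Toy frequency vector with one extreme value (made-up data)
freqs <- c(0.5, 1.2, 0.8, 2.1, 1.7, 0.9, 250)

# Robust z-scores: (value - median) / MAD
robust_z <- abs((freqs - median(freqs)) / mad(freqs))

# Keep only observations with a robust z-score of at most 3.5
kept <- freqs[robust_z <= 3.5]
```

The extreme value 250 gets a very large robust z-score and is dropped, while the ordinary values survive the filter.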

# 4.4 There are no outliers/overly influential observations

• After these steps, we may define an initial regression model
• The influencePlot function from the car package identifies the row indices of overly influential observations

# 4.4 There are no outliers/overly influential observations

• The plot combines three values:
• Studentized residuals (Y-axis). These values represent the difference between observed and predicted values. Values more extreme than 2 should be investigated
• Hat-values or leverage (X-axis). How much influence an observation may have on the predicted values
• Cook’s distance (size of the bubbles). How much influence removing an observation may have on the regression coefficients
library(car)
influencePlot(mod)
##      StudRes        Hat       CookD
## 58  3.777528 0.01748908 0.049795555
## 229 1.395448 0.02453357 0.009780932
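The three quantities on the plot can also be computed directly with base R functions, without the car package; a minimal sketch on simulated data (not the lexical decision data):

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
m <- lm(y ~ x)

stud  <- rstudent(m)        # studentized residuals (Y-axis)
lev   <- hatvalues(m)       # hat-values / leverage (X-axis)
cooks <- cooks.distance(m)  # Cook's distance (bubble size)

# Observations flagged by the |studentized residual| > 2 rule of thumb
suspicious <- which(abs(stud) > 2)
```

influencePlot combines exactly these three diagnostics into one display.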

# 4.4 There are no outliers/overly influential observations

• The function returns a data.frame with the statistics for the most suspicious observations
• The row.names of the data.frame are the indices of the observations that should be removed
• It is usually a good idea to remove these
library(car)
indices <- row.names(influencePlot(mod))
indices <- as.numeric(indices)
dataSet <- dataSet[-indices, ]
# We refit our model to the changed data
mod <- lm(Mean_RT ~ Length +   SUBTLWF + POS, data = dataSet)

# 4.5 The relationship between the dependent and independent variables is linear

• The first thing you should do is plot the relationship between your numeric predictors and the dependent variable, to fix the easy-to-spot non-linear trends
• For regression models the crPlot function in the car package plots the relationship between X and Y
• There are two lines on the plot:
• The dashed red line plots the expected linear relationship
• The green line plots the observed relationship
library(car)
crPlot(mod, var = "Length")

# 4.5 The relationship between the dependent and independent variables is linear

• If the two lines diverge greatly, the relationship is not linear
• The relationships are not perfectly linear, but they come pretty close
• Non-linear relationships may be made linear with transformations of the predictor variable
• In other cases, it is more appropriate to model them with polynomial regression terms, linear approximations of the curvy relationship (see Topic 12). This makes the model less easily interpretable
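Such a polynomial term can be added with poly(); a minimal sketch on simulated curvilinear data (not the lexical decision data):

```r
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 0.5 * x - 0.08 * x^2 + rnorm(100, sd = 0.5)  # a truly curved relationship

lin  <- lm(y ~ x)           # straight-line fit
quad <- lm(y ~ poly(x, 2))  # second-degree polynomial fit

# The polynomial fit explains more variance on these curved data
summary(quad)$r.squared > summary(lin)$r.squared
```

The price, as noted above, is interpretability: the coefficients of a polynomial term no longer have a simple "one unit up" reading.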
library(car)
crPlot(mod, var = "SUBTLWF")

# 4.6 There is no heteroskedasticity

• The previous two plots reveal that there is a bigger issue than some non-linearity: heteroskedasticity. Both for the Length and the SUBTLWF variable the plots display a funnel-like pattern
• This means that the variability of the residuals covaries (i.e., increases/decreases) with the explanatory variables or the response

# 4.6 There is no heteroskedasticity

plot(mod, which = 1)

# 4.7 There is no heteroskedasticity

• The car package includes the ncvTest function
• If p < 0.05, the variance of the residuals is not constant across the fitted (predicted) values
• This test establishes whether or not heteroskedasticity is an issue for the model
• Here it is clear that our model suffers from heteroskedasticity
library(car)
ncvTest(mod)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.719775    Df = 1     p = 0.005461922

# 4.7 There is no heteroskedasticity

• We should also check whether the residuals have constant variance depending on the values of quantitative explanatory variables
• We perform the same ncvTest test, but we add the predictor variable
• This allows us to check if the problem is due to specific variables. Here we find that the SUBTLWF variable is the main culprit
library(car)
ncvTest(mod, ~ Length)
## Non-constant Variance Score Test
## Variance formula: ~ Length
## Chisquare = 2.077236    Df = 1     p = 0.1495107
ncvTest(mod, ~ SUBTLWF)
## Non-constant Variance Score Test
## Variance formula: ~ SUBTLWF
## Chisquare = 5.790808    Df = 1     p = 0.01611018
ncvTest(mod, ~POS)
## Non-constant Variance Score Test
## Variance formula: ~ POS
## Chisquare = 1.986296    Df = 2     p = 0.3704088

# 4.7 There is no heteroskedasticity

• Heteroskedasticity can be fixed with a power transformation of the dependent variable
• We can try to find the right transformation by trial and error, or we can use a boxCox plot (from the car package)
• The X-value with the highest Y-value corresponds to the power of the optimal transformation for our dependent variable
• If the Y-value is zero, the optimal transformation is the natural logarithm
library(car)
boxCox(mod)

# 4.7 There is no heteroskedasticity

• We can find the exact x-value that corresponds with the highest y-value as follows:
• We store the data for the plot in an object
• We set plotit to FALSE so we don’t get a plot output
• We subset the x-vector by calling which.max (which vector item has the highest value) on the y-values
library(car)
testValues <- boxCox(mod, plotit=FALSE)
power <- testValues$x[which.max(testValues$y)]
power
## [1] -0.6
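Applying the result, including the log special case for a power of 0, can be sketched with a small helper (the function name bcTransform is our own, not from car; it applies the plain power used in these slides, not the scaled Box-Cox formula):

```r
# Hypothetical helper: apply a power transformation, with log for power 0
bcTransform <- function(y, lambda) {
  if (lambda == 0) log(y) else y^lambda
}

bcTransform(c(1, 4, 9), 0.5)   # square root
bcTransform(c(1, 10, 100), 0)  # natural logarithm
bcTransform(c(1, 2, 4), -0.6)  # the power found above
```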

# 4.7 There is no heteroskedasticity

• All we have to do now to fix the problem is raise Mean_RT to the power of -0.6 and compute the model again
• When we perform the ncvTest again, the p-value is higher than 0.05, indicating that heteroskedasticity is no longer an issue
library(car)
dataSet$Mean_RT <- dataSet$Mean_RT^power
mod <- lm(Mean_RT ~ Length +  SUBTLWF, data = dataSet)
ncvTest(mod)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2232123    Df = 1     p = 0.6366031

# 4.8 The explanatory variables are not correlated (there is no multicollinearity)

• For numerical variables:
• Perform pairwise correlation tests (see Topic 5)
• You will want to see p-values higher than 0.05 and/or weak to moderate correlations
• For categorical variables:
• Perform pairwise Chi-squared tests of independence and calculate the effect sizes (see Topic 6)
• You will want to see p-values higher than 0.05 and/or small to moderate effect sizes
• You can also assess multicollinearity directly from a fitted regression model:
• Calculate the Variance Inflation Factors (VIF)
• VIF <= 5 indicates that there is no severe multicollinearity
• vif in the car package
library(car)
vif(mod)
##   Length  SUBTLWF
## 1.019339 1.019339
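The VIF can also be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the remaining predictors; a sketch on simulated, deliberately collinear data:

```r
set.seed(7)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.3)  # deliberately collinear with x1
y  <- x1 + x2 + rnorm(100)

# VIF for x1: regress x1 on the other predictor(s)
r2     <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
```

With strongly correlated predictors as here, vif_x1 comes out well above the threshold of 5, whereas in the slide output above both VIFs are close to 1.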

# 4.8 The explanatory variables are not correlated (there is no multicollinearity)

• If some of your variables are collinear you should drop one of them
• If it is of theoretical interest to investigate both variables, you can run parallel regression models, which each include one of the variables
• If interpretability is not of primary concern (say, when the collinear variables only control for biases), techniques such as principal components (for quantitative variables) or factor analysis (for factors) can combine the two collinear variables into one or more non-correlated variables
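For the quantitative case, combining two collinear variables into uncorrelated components can be done with base R's prcomp; a minimal sketch on simulated data:

```r
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)  # strongly correlated with x1

# Principal components of the two collinear variables
pca    <- prcomp(cbind(x1, x2), scale. = TRUE)
scores <- pca$x  # the component scores are uncorrelated by construction

cor(scores[, 1], scores[, 2])  # effectively zero
```

The component scores can then enter the regression in place of the original collinear variables, at the cost of less direct interpretability.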

# 4.9 The residuals are not autocorrelated

• Just as the observations may not be autocorrelated, the residuals may not be autocorrelated either
• To detect autocorrelation, we can perform a durbinWatsonTest
• durbinWatsonTest in the car package
• If p > 0.05 autocorrelation is not an issue
library(car)
durbinWatsonTest(mod)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.04514735      1.908066   0.252
##  Alternative hypothesis: rho != 0
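The Durbin-Watson statistic itself is simple to compute by hand: the sum of squared differences between successive residuals divided by the sum of squared residuals, with values near 2 indicating no autocorrelation. A sketch on simulated, uncorrelated "residuals" (a stand-in for resid(mod)):

```r
set.seed(9)
e <- rnorm(100)  # stand-in for resid(mod); no autocorrelation built in

# Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw <- sum(diff(e)^2) / sum(e^2)
```

For independent residuals dw hovers around 2, matching the D-W statistic of 1.91 in the output above; durbinWatsonTest additionally supplies the p-value.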

# 4.9 The residuals are not autocorrelated

• If the residuals are autocorrelated this may mean that:
• We did not include a variable that ought to be included
• Solution: Think, plot, and add the predictor
• We misspecified the shape of some critical relationship. This happens when we fit a straight line for a predictor that has a non-linear relationship to the dependent variable.
• Solution: Add a non-linear (polynomial) term (see Topic 12)
• We have made some consistent error while measuring the dependent variable
• Solution: Find the observations with errors and exclude them
• Autocorrelation will not affect the size or the numerical stability of the regression coefficients, but it will affect their standard errors, t-values, and p-values
• Autocorrelation will also affect the predictive accuracy of your model

# 4.10 The residuals follow the normal distribution

• If everything is in order, the residuals (the “prediction errors” of our model) should be completely random, resulting in normally distributed values
• If the residuals are not normally distributed, this could indicate that our model misses important information
• E.g., We may have failed to include an important variable
• To inspect the residuals you can simply draw a Q-Q plot with a Q-Q line of the residuals (extracted with the resid function)
• Alternatively, you can draw the plot with plot(mod, which = 2), which adds index labels for problematic observations
qqnorm(resid(mod))
qqline(resid(mod)) 
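A formal complement to the Q-Q plot is the Shapiro-Wilk test from base R (shapiro.test): if p > 0.05, normality of the residuals is not rejected. A sketch on simulated residual-like values (a stand-in for resid(mod)):

```r
set.seed(5)
res <- rnorm(200)         # stand-in for resid(mod)

shapiro.test(res)$p.value # p-value for the null hypothesis of normality
```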