- What is linear regression analysis and why is it used?
- The Least Squares Regression equation
- Assumptions of regression modeling
- Validating the assumptions of regression modeling
- Exercises
- References

- We will be working with an extended version of the dataset by Balota et al. (2007) we have seen before (Levshina, 2015):
  - `Word`
  - `Length`: word length
  - `SUBTLWF`: normalized word frequency
  - `POS`: part-of-speech
  - `Mean_RT`: mean reaction time

- Research question:
  *To what extent and how do word length, word frequency, and part-of-speech condition subjects’ reaction times in a lexical decision task?*

```
library(readr)
library(dplyr)
library(ggplot2)
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class7_Balota_et_al_2007.csv")
glimpse(dataSet)
```

```
## Observations: 880
## Variables: 5
## $ Word <chr> "rackets", "stepmother", "delineated", "swimmers", "um...
## $ Length <int> 7, 10, 10, 8, 6, 5, 5, 8, 8, 6, 8, 12, 8, 6, 7, 3, 3, ...
## $ SUBTLWF <dbl> 0.96, 4.24, 0.04, 1.49, 1.06, 3.33, 0.10, 0.06, 0.43, ...
## $ POS <chr> "NN", "NN", "VB", "NN", "NN", "NN", "VB", "NN", "NN", ...
## $ Mean_RT <dbl> 790.87, 692.55, 960.45, 771.13, 882.50, 645.85, 760.29...
```

- Linear regression is a form of statistical modeling:
  - We model the values of a continuous variable as a function of the values of one or more explanatory variables
  - E.g., grammatical complexity can be understood as a function of age, speakers’ vowel quality can be understood as a function of social variables and contextual features, …

- A linear regression model is a statistical implementation of a series of hypotheses about the way something works
- We model the values of a continuous variable as a function of the values of one or more explanatory variables
- In its simplest form, linear regression is closely related to correlation analysis:
  - Correlation analysis **describes** the strength and significance of the association between two continuous variables
  - Linear regression attempts to **predict** the values of a continuous variable using the values of another continuous variable

- Linear regression has two main purposes:
  - It is used to model a quantity as the additive effect of **several variables**
    - E.g., reaction times can be modeled as the joint effect of word length, word frequency, age, and years of schooling
  - It is used to measure the effect of **one variable** while **controlling** for biases caused by other variables, so as to make the effect estimates more meaningful and precise
    - E.g., if we want to investigate the effect of *word length* on reaction times across different subjects, we may want to control for age, word frequency, years of schooling, etc.

- Linear regression predicts a linear trend line for Y at each point of X
- As you can see on the plot, the line does not touch most points
- The regression equation estimates the **slope** (steepness) that minimizes the squared sum of the differences between the observed values (the dots) and the predicted values (the line):
  - Differences between the predicted and the observed values are called **residuals**
  - The squared sum of the residuals is called the **sum of squares**
  - Linear regression is also called **ordinary least squares** regression

- Least squares regression is a very powerful technique, but it has some limitations
- In order to truly understand its limitations, it is necessary to have a high-level understanding of what the technique actually does
- Here, we will consider only single-term regression: `Y ~ X`
- In a single-term regression, there are three elements that enter the equation:
  - The `slope coefficient` (the steepness of the line on the plot)
  - The `intercept`: the predicted value of Y when X = 0; the Y value where the line crosses/intercepts the Y-axis
  - The predicted value of `Y`
- Let us take a deeper look at each one of these elements

`y.predicted <- (slope*X) + intercept`

- The `slope coefficient` equals the `Pearson product-moment correlation coefficient` of the relationship between X and Y, multiplied by the standard deviation of Y divided by the standard deviation of X
- This brings the correlation coefficient more or less to the same scale as X and Y

```
slope <- cor(dataSet$Length, dataSet$Mean_RT) * (sd(dataSet$Mean_RT)/sd(dataSet$Length))
slope
```

`## [1] 28.2816`

- The `intercept` equals the average value of Y minus the average value of X multiplied by the slope coefficient

```
intercept <- mean(dataSet$Mean_RT) - (slope*mean(dataSet$Length))
intercept
```

`## [1] 554.3454`

- With these values and our formula we can hand-draw the regression line for X

```
# We generate all values between the min and the max of Length
X<- min(dataSet$Length):max(dataSet$Length)
# We apply our equation
y.predicted <-(slope*X)+intercept
```

- If we store the values in a `data.frame`, we can easily plot the observed and the predicted values

```
values <- data.frame(x=X, y=y.predicted)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
  geom_point() +
  geom_line(data=values, aes(x=x, y=y), color="red")
```
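
- As a minimal sketch of what the least-squares criterion means in practice (reusing the `slope` and `intercept` objects computed above), we can obtain the residuals and their squared sum by hand:

```
# Predicted Mean_RT for every word, using our hand-made equation
predicted <- (slope * dataSet$Length) + intercept
# Residuals: observed values minus predicted values
residuals.by.hand <- dataSet$Mean_RT - predicted
# The quantity that least squares minimizes: the sum of the squared residuals
sum(residuals.by.hand^2)
```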

- With more than one predictor variable, the equation remains identical, but calculating the intercept becomes more complicated:
  - In `multiple regression` or `multivariate regression`, the intercept equals the value of the dependent variable when all predictor variables are zero
  - The predicted value is calculated as the sum of the intercept and, for each variable, its slope estimate multiplied by the variable’s value
  - The geometric form of the equation is no longer a line, but a *hyperplane*: a flat k-dimensional surface in (k+1)-dimensional space, where k is the number of predictors
- In most cases you will want to construct a multivariate model to parcel out the effects of each individual variable

`y.predicted <- intercept + (slope1*dataSet$Length) + (slope2*dataSet$SUBTLWF) + ...`

- The `lm` (“linear model”) function in R will take care of calculating the intercept, slopes, and predicted values for us
- It uses `formula notation`:
  - The leftmost string is the dependent variable (`Mean_RT`)
  - The `tilde` marks that the right-hand variable names are independent variables
  - The `+` signs refer to the additive effect of the variables in the least squares equation
- Don’t forget to specify the data.frame that holds your data!
- If your model includes all the columns in your data.frame, you can use short-hand notation: `Mean_RT ~ .`

`mod <- lm(Mean_RT ~ Length + SUBTLWF + POS, data = dataSet)`
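
- Once the model is stored in `mod`, a quick way to inspect the estimated intercept and slope coefficients (with their standard errors and p-values) is `summary()`; output not shown here:

```
# Inspect the fitted coefficients, standard errors, and p-values
summary(mod)
# Extract only the intercept and slope estimates
coef(mod)
```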

- The least-squares regression equation immediately exposes some of its assumptions and limitations:
  - The slope is calculated using the Pearson product-moment correlation coefficient:
    - The technique will inherit the assumptions of the Pearson product-moment correlation coefficient (see Topic 5):
      - Y is interval- or ratio-scaled (there are ways of dealing with ordinal or categorical data in X)
      - The relationship between X and Y is **monotonic**
      - The relationship between X and Y is **linear**
      - There are **no outliers** in X and Y

- The least-squares regression equation immediately exposes some of its assumptions and limitations:
  - The slope is calculated using the standard deviations of X and Y:
    - X and Y should follow the `normal distribution`, otherwise their standard deviations will be biased and the resulting slope estimate will be biased as well
  - The intercept is calculated using the `means` of X and Y:
    - X and Y will have to be normally distributed, or the means will not be good indicators of their central tendencies, resulting in poor intercept estimates

- The least-squares regression equation immediately exposes some of its assumptions and limitations:
  - The predicted value of `Y` is calculated by adding up `(slope1*Predictor1) + (slope2*Predictor2) ...`:
    - The effects of the independent variables should be additive; this means that the independent variables cannot be strongly correlated (there is no `multicollinearity`)

- Levshina (2015: 155) lists the following additional assumptions:
  - The observations are independent from one another (there is no `autocorrelation`)
  - The variability of the residuals does not increase or decrease with the explanatory variables or the response (i.e., there is no `heteroskedasticity`)
  - The residuals are not autocorrelated
  - The residuals follow the normal distribution. This is less important if the sample size is large

- Before you think of fitting a linear regression model, you have to:
  - Investigate the distributions of all of your variables:
    - What are their shapes: are they normally distributed?
    - What are their central tendencies?
    - What is their dispersion?
    - Are there any outliers?

- Before you think of fitting a linear regression model, you have to:
  - Examine the relationship of the independent variables to your dependent variable:
    - Are they linear?
    - Are they homoscedastic?
  - Examine the relationships between each of your individual predictors:
    - Are they correlated? To what extent?
- After checking all of this, you can specify a first regression model to validate the regression assumptions further

- Each value of the response variable should be independent from the values of other observations:
  - E.g., the reaction times for *marveled* should not depend on the reaction times for another word
  - For study designs where you gather multiple observations for the same subject/word, a `mixed-effects regression model` (a model that takes this fact into account) is more appropriate than the simple linear model
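
- For illustration only: in R such models are typically fitted with the `lme4` package. The sketch below assumes hypothetical trial-level data (`trialData`) with a `Subject` column, which the aggregated dataset used here does not contain:

```
library(lme4)
# Hypothetical example: trial-level reaction times with a random intercept per subject
# mixed.mod <- lmer(RT ~ Length + SUBTLWF + POS + (1 | Subject), data = trialData)
```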

- Outliers may distort the regression analysis and should be removed:
  - A boxplot of each of your variables will help you identify outliers
- A `density plot` or a `QQ-plot` will help you estimate the shape of the distribution of each variable (see the sketch after this list)
- Depending on the shape of the distribution of your variable, you may want to:
  - For normally distributed variables:
    - Remove outliers based on z-scores
  - For data that does not follow the normal distribution:
    - Remove outliers based on MAD-scores or the interquartile range
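
- As a quick illustration (reusing the `dataSet` and `ggplot2` objects loaded above), these diagnostic plots can be drawn as follows:

```
# Boxplot of the dependent variable to spot outliers
ggplot(dataSet, aes(y=Mean_RT)) + geom_boxplot()
# Density plot to inspect the shape of the distribution
ggplot(dataSet, aes(x=Mean_RT)) + geom_density()
# QQ-plot of Mean_RT against the normal distribution
qqnorm(dataSet$Mean_RT)
qqline(dataSet$Mean_RT)
```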

- Remove outliers based on z-scores:
  - Here our dependent `Mean_RT` variable and our `Length` variable are mostly normally distributed, except for a couple of atypical observations
  - Values that yield an absolute z-score of more than 2 are outliers (2 standard deviations higher/lower than the mean)

```
# Keep only the rows whose Mean_RT lies within 2 standard deviations of the mean
dataSet <- dataSet[abs(scale(dataSet$Mean_RT)) <= 2, ]
# Same criterion for Length
dataSet <- dataSet[abs(scale(dataSet$Length)) <= 2, ]
```

- Remove outliers based on MAD-scores/interquartile range:
  - The `SUBTLWF` variable has the typical power-law distribution of word frequencies
  - Here we have two possibilities:
    - We exclude all observations that have a value smaller/larger than the first/third quartile minus/plus `1.5 times the interquartile range`. Note that we have to use `which` to find those values, because we need to specify an `AND` statement (written as `&`)
    - We exclude all observations that have an absolute value greater than `3.5` when the median is subtracted from the value and the result is divided by the median absolute deviation

```
# Keep only rows whose MAD-score for SUBTLWF is at most 3.5
dataSet <- dataSet[abs((dataSet$SUBTLWF - median(dataSet$SUBTLWF))/mad(dataSet$SUBTLWF)) <= 3.5, ]
# Alternative: interquartile-range filter
# dataSet <- dataSet[which(dataSet$SUBTLWF >= quantile(dataSet$SUBTLWF, 0.25) - (1.5*IQR(dataSet$SUBTLWF)) &
#                          dataSet$SUBTLWF <= quantile(dataSet$SUBTLWF, 0.75) + (1.5*IQR(dataSet$SUBTLWF))), ]
# We refit our model to the changed data
mod <- lm(Mean_RT ~ Length + SUBTLWF + POS, data = dataSet)
```

- After these steps, we may define an initial regression model
- The `influencePlot` function from the `car` package identifies the row indices of overly influential observations
- The plot combines three values:
  - Studentized residuals (Y-axis). These values represent the difference between observed and predicted values. Values more extreme than `2` should be investigated
  - Hat-values or leverage (X-axis). How much influence an observation may have on the predicted values
  - Cook’s distance (size of the bubbles). How much influence removing an observation may have on the regression coefficients

```
library(car)
influencePlot(mod)
```

```
## StudRes Hat CookD
## 58 3.777528 0.01748908 0.049795555
## 229 1.395448 0.02453357 0.009780932
```

- The function returns a data.frame with the statistics for the most suspicious observations
- The `row.names` of the data.frame are the indices of the observations that should be removed
- It is usually a good idea to remove these

```
library(car)
indices <- row.names(influencePlot(mod))
```

```
indices <-as.numeric(indices)
dataSet <- dataSet[-indices, ]
# We refit our model to the changed data
mod <- lm(Mean_RT ~ Length + SUBTLWF + POS, data = dataSet)
```

- The first thing you should do is plot the relationship of your numeric predictors to the dependent variable, to catch the easy-to-spot non-linear trends
- For regression models, the `crPlot` function in the `car` package plots the relationship between X and Y
- There are two lines on the plot:
  - The dashed red line plots the expected linear relationship
  - The green line plots the observed relationship

```
library(car)
crPlot(mod, var = "Length")
```

- If the two lines diverge greatly, the relationship is not linear
- The relationships are not perfectly linear, but they come pretty close
- Non-linear relationships may be made linear with transformations of the predictor variable
- In other cases, it is more appropriate to model them with `polynomial regression terms`, linear approximations of the curved relationship (see Topic 12 and the sketch below). This makes the model less easily interpretable
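
- As a minimal sketch of what such a term looks like in R (a hypothetical `mod.poly` object; see Topic 12 for details):

```
# A quadratic (second-degree polynomial) term for Length, purely as an illustration
mod.poly <- lm(Mean_RT ~ poly(Length, 2) + SUBTLWF + POS, data = dataSet)
```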

```
library(car)
crPlot(mod, var = "SUBTLWF")
```

- The previous two plots reveal that there is a bigger issue than some non-linearity: heteroskedasticity. Both for the `Length` and the `SUBTLWF` variable, the plots display a funnel-like pattern
- This means that the variability of the residuals covaries (i.e., increases/decreases) with the explanatory variables or the response

`plot(mod, which = 1)`

- The `car` package includes the `ncvTest` function
- If p < 0.05, the residuals have non-constant variance across the fitted (predicted) values
- This test establishes whether or not heteroskedasticity is an issue for the model
- Here it is clear that our model suffers from heteroskedasticity

```
library(car)
ncvTest(mod)
```

```
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.719775 Df = 1 p = 0.005461922
```

- We should also check whether the residuals have constant variance across the values of the explanatory variables
- We perform the same `ncvTest`, but we add the predictor variable to the call
- This allows us to check whether the problem is due to specific variables. Here we find that the `SUBTLWF` variable is the main culprit

```
library(car)
ncvTest(mod, ~ Length)
```

```
## Non-constant Variance Score Test
## Variance formula: ~ Length
## Chisquare = 2.077236 Df = 1 p = 0.1495107
```

`ncvTest(mod, ~ SUBTLWF)`

```
## Non-constant Variance Score Test
## Variance formula: ~ SUBTLWF
## Chisquare = 5.790808 Df = 1 p = 0.01611018
```

`ncvTest(mod, ~POS)`

```
## Non-constant Variance Score Test
## Variance formula: ~ POS
## Chisquare = 1.986296 Df = 2 p = 0.3704088
```

- Heteroskedasticity can be fixed with a power transformation of the **dependent variable**
- We can try to find the right transformation by trial and error, or we can use a `boxCox` plot (from the `car` package)
- The X-value with the highest Y-value corresponds to the power of the optimal transformation for our dependent variable
- If that X-value is zero, the optimal transformation is the natural logarithm

```
library(car)
boxCox(mod)
```

- We can find the exact X-value that corresponds with the highest Y-value as follows:
  - We store the data for the plot in an object
  - We set `plotit` to `FALSE` so we don’t get a plot output
  - We subset the x-vector by calling `which.max` (which vector item has the highest value) on the y-values

```
library(car)
testValues <- boxCox(mod, plotit=FALSE)
power <- testValues$x[which.max(testValues$y)]
power
```

`## [1] -0.6`

- All we have to do now to fix the problem is raise `Mean_RT` to the power of `-0.6` and compute the model again
- When we perform the `ncvTest` again, the p-value is higher than 0.05, indicating that heteroskedasticity is no longer an issue

```
library(car)
dataSet$Mean_RT <- dataSet$Mean_RT^power
mod <- lm(Mean_RT ~ Length + SUBTLWF, data = dataSet)
ncvTest(mod)
```

```
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2232123 Df = 1 p = 0.6366031
```
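
- Keep in mind that the model now predicts `Mean_RT^-0.6`, not raw reaction times; as a quick sketch, predictions can be put back on the millisecond scale by raising them to `1/power`:

```
# Predictions are on the transformed (Mean_RT^power) scale
predicted.transformed <- predict(mod)
# Back-transform to the original millisecond scale
predicted.ms <- predicted.transformed^(1/power)
head(predicted.ms)
```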

- The independent variables cannot be strongly correlated (there is no `multicollinearity`):
  - For numerical variables:
    - Perform pairwise correlation tests (see Topic 5 and the sketch after the VIF example below)
    - You will want to see p-values higher than 0.05 and/or weak to moderate correlations
  - For categorical variables:
    - Perform pairwise Chi-squared tests of independence and calculate the effect sizes (see Topic 6)
    - You will want to see p-values higher than 0.05 and/or small to moderate effect sizes
- You can also assess multicollinearity directly from a fitted regression model:
  - Calculate the Variance Inflation Factors (VIF): `VIF <= 5` indicates that there is no severe multicollinearity
  - Use the `vif` function in the `car` package

```
library(car)
vif(mod)
```

```
## Length SUBTLWF
## 1.019339 1.019339
```
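
- As a quick sketch of the pairwise checks mentioned above (using the two numeric predictors in our data; the chi-squared line is purely hypothetical, because this dataset contains only one categorical variable):

```
# Pairwise correlation test between the two numeric predictors
cor.test(dataSet$Length, dataSet$SUBTLWF)
# For two categorical predictors one would use a chi-squared test, e.g.:
# chisq.test(table(dataSet$POS, dataSet$SomeOtherFactor))  # hypothetical second factor
```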

- The independent variables cannot be strongly correlated (there is no `multicollinearity`):
  - If some of your variables are collinear, you should drop one of them
  - If it is of theoretical interest to investigate both variables, you can run parallel regression models, which each include one of the variables
  - If interpretability is not of primary concern (say, when the collinear variables only control for biases), techniques such as `principal components` (for quantitative variables) or `factor analysis` (for factors) can combine the two collinear variables into one or more non-correlated variables

- Just like the observations cannot be autocorrelated, the residuals cannot be autocorrelated either
- To detect autocorrelation, we can perform a `durbinWatsonTest` (the `durbinWatsonTest` function in the `car` package)
  - If p > 0.05, autocorrelation is not an issue

```
library(car)
durbinWatsonTest(mod)
```

```
## lag Autocorrelation D-W Statistic p-value
## 1 0.04514735 1.908066 0.252
## Alternative hypothesis: rho != 0
```

- If the residuals are autocorrelated, this may mean that:
  - We did not include a variable that ought to be included. **Solution**: think, plot, and add the predictor
  - We misspecified the shape of some critical relationship. This happens when we fit a straight line for a predictor that has a non-linear relationship to the dependent variable. **Solution**: add a non-linear (polynomial) term (see Topic 12)
  - We have made some consistent error while measuring the dependent variable. **Solution**: find the observations with errors and exclude them
- Autocorrelation will not affect the sizes and the numeric stability of the regression coefficients, but it will affect their standard errors, t-values, and p-values
- Autocorrelation will also affect the predictive accuracy of your model

- If everything is in order, the residuals (the “prediction errors” of our model) should be completely random, resulting in normally distributed values
- If the residuals are not normally distributed, this could indicate that our model misses important information
  - E.g., we may have failed to include an important variable
- To inspect the residuals, you can simply draw a QQ-plot of the residuals (`resid` function) with `qqnorm` and add a `qqline`
- Alternatively, you can draw the plot with `plot(mod, which = 2)`, which adds index labels for problematic observations

```
qqnorm(resid(mod))
qqline(resid(mod))
```

- Please go to http://www.jeroenclaes.be/statistics_for_linguistics/class7.html and perform the exercises

- Balota, D. A., Yap, M. J., Cortese, M. J., et al. (2007). The English Lexicon Project. *Behavior Research Methods*, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015).
- Levshina, N. (2015). *How to do Linguistics with R: Data exploration and statistical analysis*. Amsterdam/Philadelphia, PA: John Benjamins.
- Urdan, T. (2010). *Statistics in Plain English*. New York, NY/London: Routledge.