# Contents

1. What is polynomial regression and why is it used?
2. How to specify polynomial regression terms in R
3. How to check for polynomial effects
4. How to check if polynomials improve the model fit
5. How to interpret polynomial regression terms in R
6. What is Generalized Additive Modeling and why is it used?
7. How to specify a GAM in R
8. How to specify random intercepts and slopes in a GAM
9. How to interpret a GAM
10. How to assess the fit of a GAM
11. Exercises
12. References

# 1. What is polynomial regression and why is it used? (1/6)

• Remember that one of the assumptions of logistic and linear regression is that the numeric independent variables are linearly related to either the dependent variable or, in logistic regression, the logit of the dependent variable
• Often, however, the relationship is not linear. In previous classes, we attempted to bring these non-linear relationships to a linear form with transformations of the independent or the dependent variables
• For certain types of non-linear relationships, however, it is much more appropriate to incorporate the non-linearity in the model specification

# 1. What is polynomial regression and why is it used? (2/6)

• Let us first generate some random data
library(tidyverse) # loads ggplot2, dplyr, and readr, which are used throughout
dataSet <- data.frame(x=rnorm(500))
dataSet$y <- rnorm(500) + (dataSet$x + dataSet$x^2)
ggplot(dataSet, aes(x=x, y=y)) +
geom_point()

# 1. What is polynomial regression and why is it used? (3/6)

• If we were to regress y ~ x with a linear model, the fit would not be very close
• To solve this, we could try to find a transformation of x or y (e.g., sqrt) that renders the relationship more linear
• However, this will never be more than an approximation, and it comes at the cost of losing some interpretability
• Also, transforming the dependent variable to establish a linear relationship with an independent variable will only work if there is just one predictor that needs such a transformation
ggplot(dataSet, aes(x=x, y=sqrt(y))) +
geom_point() +
geom_smooth(method="lm")

# 1. What is polynomial regression and why is it used? (4/6)

• Another, more principled option is to incorporate the curvature into the model specification itself
• To do so, we add a squared transformation of x alongside x in our regression equation:
• y = intercept + (x * effect1) + (x^2 * effect2) + ... + error
• In algebra, functions of the form f(x) = 2*x + 2*x^2 are known as polynomials
• Polynomials allow us to model certain non-linear relationships for what they are
• This will not work for variables that need, e.g., a log-transformation
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x + I(x^2)", color="red")

# 1. What is polynomial regression and why is it used? (5/6)

• The gist of the idea is that you include the same numeric predictor several times:
• The numeric predictor as it is; this models the part of the curve that is actually linear
• The numeric predictor raised to an exponent. If the exponent is 2, you only include the predictor raised to the power of 2. If it is higher, then you include the predictor raised to the power of 2, the predictor raised to the power of 3, and so on, up to the chosen exponent

# 1. What is polynomial regression and why is it used? (6/6)

• The exponent is called the order of the polynomial:
• Second-order polynomial: exponent 2, called quadratic. This polynomial order turns up as a parabola (u-shape) on a plot
• Third-order polynomial: exponent 3, called cubic. This polynomial order turns up as an s-shape on a plot
• Fourth-order polynomial: exponent 4, called quartic. This polynomial order turns up as a w-shape (or m-shape) on a plot

# 2. How to specify polynomial regression terms in R

• To incorporate polynomials, we include a call to the poly function in our formula specification
• This function takes two arguments:
• The predictor column
• The order of the polynomial
• Note that, by default, poly builds orthogonal polynomials; add raw=TRUE (e.g., poly(x, 2, raw=TRUE)) if you want raw powers of x
• Here we will take a look at how polynomials work for lm models, but they work in the same way for glm, lmer, and glmer models
mod <- lm(y ~ poly(x, 2), dataSet)

# 3. How to check for polynomial effects

# 3. How to check for polynomial effects: linear models

• To find polynomial effects in your data, in the case of linear regression, you plot your dependent variable against your independent variable to see which functional form is most adequate
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x", color="red") +
geom_smooth(method="lm", formula="y ~ poly(x, 2)")

# 3. How to check for polynomial effects: logistic models

• In the logistic case, you can plot your independent variable vs. the predicted probabilities derived from a simple model that includes only your independent variable
• In this case, the relationship is not linear and a second-order polynomial appears to provide the best fit
# Load some data
logDataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class3_claes_2017.csv") %>%
mutate_if(is.character, as.factor)
# Specify simple logistic model
logMod <- glm(type ~ characters_before_noun, family="binomial", logDataSet)
# Generate all possible values between minimum and maximum of independent variable
plotData <- data.frame(characters_before_noun=min(logDataSet$characters_before_noun):max(logDataSet$characters_before_noun))
# Extract predicted probabilities
plotData$predicted <- predict(logMod, newdata = plotData, type="response")
# Plot
ggplot(plotData, aes(x=characters_before_noun, y=predicted)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x", color="red") +
geom_smooth(method="lm", formula="y ~ poly(x, 2)")

# 3. How to check for polynomial effects: the code

• In both cases, by adding an lm regression line to the plot, you can get an idea of how linear the relationship is
• If you spot something that looks like it is not quite linear, you can incorporate a polynomial by editing the formula argument of geom_smooth:
• e.g., formula="y ~ poly(x, 2)". The mapping of x and y to columns in the dataset is made in the aes mapping of the main ggplot call

# 4. How to check if the polynomials improve the model fit (1/2)

• Which order of polynomial you include in your model is a matter of empirical adequacy
• You will want to model your data as closely as possible, but you also want to avoid too high a polynomial order, which would overfit the data
• You start with a second-order polynomial. Then you plot the relationship and explore whether a higher-order polynomial is necessary (see the sketch below)
• Polynomials of orders higher than three or four are usually a bad idea
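For example, a quick visual check can overlay candidate orders on the scatterplot (a minimal sketch reusing dataSet from section 1; the color choices are arbitrary):
# Overlay a quadratic and a cubic smooth to judge which order the data support
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm", formula="y ~ poly(x, 2)", color="blue") +
geom_smooth(method="lm", formula="y ~ poly(x, 3)", color="red")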

# 4. How to check if the polynomials improve the model fit (2/2)

• Polynomial models are more complex than models without polynomials
• Models with higher-order polynomials are more complex than models with lower-order polynomials
• This means that we can use the AIC statistic to evaluate whether polynomials contribute to the model fit: the model with the lower AIC offers the better trade-off between fit and complexity
mod <- lm(y ~ x, dataSet)
mod1 <- lm(y~poly(x, 2), dataSet)
AIC(mod)-AIC(mod1)
## [1] 457.6562
mod2 <- lm(y~poly(x, 3), dataSet)
AIC(mod1)-AIC(mod2)
## [1] -1.56036
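The quadratic term lowers the AIC by roughly 458 points, while the cubic term raises it slightly, so the second-order polynomial is preferred here. Because these models are nested, they can also be compared with an F-test via anova (a sketch; this comparison is not part of the original code):
# F-tests for nested models: a significant p-value indicates that the
# higher-order polynomial reliably improves the fit
anova(mod, mod1) # linear vs. quadratic
anova(mod1, mod2) # quadratic vs. cubic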

# 5. How to interpret polynomial regression terms in R (1/2)

• Polynomial regression terms are hard to interpret:
• The effect estimates are spread over two or more terms:
• The first order of the polynomial describes the effect of x
• The second order of the polynomial describes the effect of x^2
• We can no longer interpret the coefficients as indicating that "for each one-unit increase in X, there is an N-unit increase in Y"
• If you want to interpret the terms, it is best to plot the relationship of the independent variable to the dependent variable. This way you can interpret the shape of the relationship and describe it in plain text (see the sketch after the model summary below)
summary(mod)
##
## Call:
## lm(formula = y ~ x, data = dataSet)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8625 -1.0729 -0.1901  0.8474  7.1301
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.92714    0.06884   13.47   <2e-16 ***
## x            1.00394    0.07066   14.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.538 on 498 degrees of freedom
## Multiple R-squared:  0.2884, Adjusted R-squared:  0.287
## F-statistic: 201.9 on 1 and 498 DF,  p-value: < 2.2e-16
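To visualize what the quadratic model estimates, you can plot its fitted curve over the raw data (a minimal sketch reusing mod1 from section 4; the grid size of 200 is an arbitrary choice):
# Evaluate the fitted quadratic on an evenly spaced grid of x values
plotData <- data.frame(x=seq(min(dataSet$x), max(dataSet$x), length.out=200))
plotData$fitted <- predict(mod1, newdata=plotData)
# Plot the raw data with the fitted polynomial curve on top
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_line(data=plotData, aes(x=x, y=fitted), color="red")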

# 5. How to interpret polynomial regression terms in R (2/2)

• For logistic models, you can plot the polynomial term vs. the predicted probabilities of the dependent variable, which can be obtained with predict(mod, type="response")
logMod<-glm(type ~ poly(characters_before_noun, 2) + negation + Typical.Action.Chain.Pos + corpus + tense, family="binomial", logDataSet)
# Extract predicted probabilities
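Following the pattern of the earlier plotting code, a minimal sketch to complete this example (the predicted column and the quadratic geom_smooth formula are assumptions, not part of the original):
logDataSet$predicted <- predict(logMod, type="response") # one predicted probability per observation
# Plot the polynomial term vs. the predicted probabilities
ggplot(logDataSet, aes(x=characters_before_noun, y=predicted)) +
geom_point() +
geom_smooth(method="lm", formula="y ~ poly(x, 2)")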