- What is polynomial regression and why is it used?
- How to check for polynomial effects
- How to specify polynomial regression terms in R
- How to check if polynomials improve the model fit
- How to interpret polynomial regression terms in R
- What is Generalized Additive Modeling and why is it used?
- How to specify a GAM in R
- How to specify random intercepts and slopes in a GAM
- How to interpret a GAM
- How to assess the fit of a GAM
- Exercises
- References

- Remember that one of the assumptions of logistic and linear regression is that the numeric independent variables are linearly related to either the dependent variable or, in logistic regression, the logit of the dependent variable
- Often, however, the relationship is not linear. In previous classes, we have attempted to bring these non-linear relationships to a linear form with tranformations of the independent or the dependent variables
- For certain types of non-linear relationships, however, it is much more appropriate to incoporate the non-linearity in the model specification

- Let us first generate some random data

```
dataSet <- data.frame(x=rnorm(500))
dataSet$y<- rnorm(500)+ (dataSet$x+ dataSet$x^2)
ggplot(dataSet, aes(x=x, y=y)) +
geom_point()
```

- If we were to regress
`y ~x`

with a linear model, this would not be a very close fit - To solve this, we could try to find a transformation of
`x`

or`y`

(e.g.,`sqrt`

) that would render the relationship more linear - However, this will never be more than an approximation, which comes at the cost of losing some interpretability
- Also transforming the dependent variable to establish a linear relationship with an independent variable will only work if there’s only one predictor needing such a transformation

```
ggplot(dataSet, aes(x=x, y=sqrt(y))) +
geom_point() +
geom_smooth(method="lm")
```

- Another more valid option would be to incorporate the curvature into the model specification
- To do so, we can add a square transformation of
`x`

alongside of`x`

in our regression equation:`y = intercept + (x*effect1) + (x^2*effect1) + ... + error`

- In algebra, functions of the form
`f(x) = 2*x + 2*x^2`

are known as`polynomials`

- Polynomials allow us to model certain non-linear relationships for what they are
- This will not work for variables that need e.g., a log-transformation

```
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x + I(x^2)", color="red")
```

- The gist of the idea is that you include the same numeric predictor a number of times:
- The numeric predictor as it is, this will model the bit of the curve that is actually linear
- The numeric predictor raised to an exponent. If the exponent is 2, then you only include the predictor raised to the power of 2. If it is higher, than you include the predictor raised to the power of 2, the predictor raised to the power of 3, etc.

- The exponent is called the
`order`

of the polynomial:- Second-order polynomial: exponent 2; called
`quadratic`

. This polynomial order will turn up as a`parabola`

shape on a plot - Third-order polynomial: exponent 3, called
`cubic`

. Polynomial order will turn up as a`s-shape`

on a plot - Fourth-order polynomial: exponent 4, called
`quartic`

. This polynomial will turn up as a`u-shape`

on a plot

- Second-order polynomial: exponent 2; called

- To incorporate polynomials, we can include a call to the
`poly`

function in our formula specification - This function takes two arguments:
- Predictor column
- Order of the polynomial

- Here we will take a look at how polynomials work for
`lm`

models, but they work in the same way for`glm`

,`lmer`

, and`glmer`

models

`mod <- lm(y ~ poly(x, 2), dataSet)`

- To find polynomial effects in your data, in the case of linear regression, you need to plot your dependent variable vs your independent variable to see which form is most adequate

```
ggplot(dataSet, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x", color="red") +
geom_smooth(method="lm", formula="y ~poly(x, 2)")
```

- In the logistic case, you can plot your independent variable vs. predicted probabilities derived from a simple model that includes only your independent variable
- In this case, the relationship is not linear and a second-order polynomial appears to provide the best fit

```
# Load some data
logDataSet<-read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class3_claes_2017.csv") %>%
mutate_if(is.character, as.factor)
# Specify simple logistic model
logMod<-glm(type ~ characters_before_noun, family="binomial", logDataSet)
# Generate all possible values between minimum and maximum of independent variable
plotData <- data.frame(characters_before_noun=min(logDataSet$characters_before_noun):max(logDataSet$characters_before_noun))
# Extract predicted probabilities
plotData$predicted<-predict(logMod, newdata = plotData, type="response")
# Plot
ggplot(plotData, aes(x=characters_before_noun, y=predicted)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x", color="red") +
geom_smooth(method="lm", formula="y ~ poly(x, 2)")
```

- In both cases, by adding a
`lm`

regression line to the plot, you can get an idea of how linear the relationships are - If you spot something that looks like it is not quite linear, you can incorporate a polynomial, by editing the
`formula`

argument of`geom_smooth`

:- e.g.,
`formula="y ~ poly(x,2)"`

. The mapping of`x`

and`y`

to columns in the dataset is made in the`aes`

mapping of the main`ggplot`

call

- e.g.,

- Which order of polynomial you include in your model is a matter of emperical adequacy.
- You will want to model your data as close as possible, but you want to avoid having too high a degree of polynomiality
- You start with second-order polynomials. Then you plot the relationship, and you explore if a higher-order polynomial is necessary
- Polynomials of orders higher than three or four are usually a bad idea

- Polynomial models are more complex than models without polynomials
- Models with higher-order polynomials are more complex than models with lower-order polynomials
- This means that we can use the AIC statistic to compare if polynomials contribute to the model fit

```
mod <- lm(y ~ x, dataSet)
mod1 <- lm(y~poly(x, 2), dataSet)
AIC(mod)-AIC(mod1)
```

`## [1] 457.6562`

```
mod2 <- lm(y~poly(x, 3), dataSet)
AIC(mod1)-AIC(mod2)
```

`## [1] -1.56036`

- Polynomial regression terms are hard to interpret:
- The effect estimates are spread over two or more variables:
- The first order of the polynomial describes the effect of
`x`

- The second order of the polynomial describes the effect of
`x^2`

- The first order of the polynomial describes the effect of

- The effect estimates are spread over two or more variables:
- We can no longer interprete the coefficients as indicating that “for each increase in X, there’s a N-unit increase in Y”
- If you want to interpret the terms, it is best to plot the relationship of the independent variable to the dependent variable. This way you can interpret the shape of the relationship and describe it in plain text

`summary(mod)`

```
##
## Call:
## lm(formula = y ~ x, data = dataSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8625 -1.0729 -0.1901 0.8474 7.1301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92714 0.06884 13.47 <2e-16 ***
## x 1.00394 0.07066 14.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.538 on 498 degrees of freedom
## Multiple R-squared: 0.2884, Adjusted R-squared: 0.287
## F-statistic: 201.9 on 1 and 498 DF, p-value: < 2.2e-16
```

- For logistic models, you can plot the polynomial term vs the predicted values of the dependent variable, which can be obtained with
`predict(mod, type="response")`

```
logMod<-glm(type ~ poly(characters_before_noun, 2) + negation + Typical.Action.Chain.Pos + corpus + tense, family="binomial", logDataSet)
# Extract predicted probabilities
logDataSet$predicted<-predict(logMod, type="response")
# Plot
ggplot(plotData, aes(x=characters_before_noun, y=predicted)) +
geom_point() +
geom_smooth(method="lm", formula="y ~ poly(x, 2)")
```

- Generalized additive modeling is a technique that builds on
`generalized linear models`

(the type of model implemented in`glm`

) - Like polynomials, GAMs allow us to model non-linear relationships as non-linear relationships
- However, whereas polynomial regression models can only model non-linear relationships that can be described by a polynomial function, GAMs can handle any type of non-linearity
- The non-linear relationships are modeled by smoothing functions
- The predictors for which we specify a linear relationship are evaluated as in a ‘normal’ regression model

- The assumptions of GAMs are the same as those that we looked at for, respectively, logistic and linear regression
- The only difference is that the linearity assumption is relaxed, i.e., as long as we specify our model in a correct way so that non-linear relationships are estimated using the smoothing functions, there is no strict linearity assumption
- Note that linear relationships should be estimated using ‘normal’ regression terms

- Wieling, Nerbonne & Baayen (2011) introduced generalized additive modeling to the field of linguistics
- Their study showed that GAMs allow us to measure the non-linear influence of geography (lat, long coordinates) on a dependent variable, while controlling for other predictor variables
- GAMs have mainly been used in quantitative dialectology and sociolinguistics

- We will use data by Wieling et al. (2011)
- The dataset contains 19,446 observations of the way speakers from different Dutch localities pronounce a particular word:
`LD`

: dependent variable, which records the Levenshtein distance between the pronunciation of the speaker and standard Dutch. The Levenshtein distance measures the number of string edits that are necessary to go from one string to another. The variable was scaled and centered around zero`Location`

: Data collection site`Word`

: Word that is being pronounced`Longitude`

,`Latitude`

: Data collection site coordinates`PopCnt`

: Population count, centered and scaled`PopAge`

: Mean population age of the locality, centered and scaled`PopIncome`

: Mean population income of the locality, centered and scaled`IsNoun`

: 1 (noun), 0 (other word)`WordFreq`

: Word frequency, log-transformed`WordLength`

: Word length, log-transformed

```
library(readr)
library(dplyr)
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class12_Wieling_et_al_2011.csv") %>%
mutate_if(is.character, as.factor)
summary(dataSet)
```

```
## LD Location Word Longitude
## Min. :-0.49888 Almen Gl : 48 rijst : 422 Min. :3.438
## 1st Qu.:-0.20164 Alphen NB : 48 schade : 421 1st Qu.:5.024
## Median : 0.01213 Amersfoort Ut: 48 veel : 420 Median :5.735
## Mean : 0.00000 Angerlo Gl : 48 kammen : 419 Mean :5.590
## 3rd Qu.: 0.23721 Anloo Dr : 48 wonen : 419 3rd Qu.:6.175
## Max. : 0.87059 Austerlitz Ut: 48 noten : 418 Max. :7.168
## (Other) :19158 (Other):16927
## Latitude PopCnt PopAge
## Min. :50.77 Min. :-2.6608800 Min. :-3.177204
## 1st Qu.:51.70 1st Qu.:-0.7225091 1st Qu.:-0.601558
## Median :52.12 Median :-0.0915865 Median :-0.044225
## Mean :52.21 Mean : 0.0003773 Mean :-0.003588
## 3rd Qu.:52.73 3rd Qu.: 0.6421021 3rd Qu.: 0.630994
## Max. :53.48 Max. : 3.1117973 Max. : 3.347749
##
## PopIncome IsNoun WordFreq
## Min. :-3.575051 Min. :0.000 Min. :-2.070331
## 1st Qu.:-0.586729 1st Qu.:0.000 1st Qu.:-0.503609
## Median :-0.012116 Median :0.000 Median :-0.044536
## Mean : 0.004545 Mean :0.436 Mean :-0.009755
## 3rd Qu.: 0.522902 3rd Qu.:1.000 3rd Qu.: 0.496171
## Max. : 4.555094 Max. :1.000 Max. : 3.164715
##
## WordLength
## Min. :-1.685932
## 1st Qu.:-0.498296
## Median : 0.422907
## Mean : 0.005858
## 3rd Qu.: 0.422907
## Max. : 2.363220
##
```

- GAMs can be fit in R using the
`mgcv`

package - The main function is
`bam`

- Like
`glm`

it takes three arguments:- Formula
- Distribution family:
- To fit a logistic model, we add
`family="binomial"`

- To model a normally distributed dependent variable, we add
`family="gaussian"`

- To fit a logistic model, we add
- Data

- The formula bit of a GAM is somewhat special:
- If you do not want to evaluate a predictor in a linear way, you wrap it with
`s()`

- If you want to evaluate the joint non-linear effect of two predictors on the outcome, you include the two predictors in the brackets after
`s`

, separated by comma - Here we will do this for
`Longitude`

and`Latitude`

so we can measure the effect of geography on pronunciation distance - If you use coordinates in a
`bam`

formula it is important that you order them as we did here

- If you do not want to evaluate a predictor in a linear way, you wrap it with

```
library(mgcv)
mod <- bam(LD ~ s(Longitude, Latitude) + PopCnt + PopAge + PopIncome + IsNoun + WordFreq + WordLength, family="gaussian", dataSet)
```

- In a GAM formula we can also specify random intercepts
- To do so, we wrap the variable name with
`s`

and we add the argument`bs="re"`

to the call to`s`

- This way, the
`bam`

function knows that it should treat the variable as a random intercept - To include a
`random slope`

, we add another call to`s`

and we include the intercept name and the variable name, followed by`bs="re"`

`mod <- bam(LD ~ s(Longitude, Latitude) + s(Word, bs="re") + s(Word, PopCnt, bs="re") + PopCnt + PopAge + PopIncome + IsNoun + WordFreq + WordLength, family="gaussian", dataSet)`

- The output of a
`bam`

summary has two components:- Parametric coefficients
- Smooth terms

`summary(mod)`

```
##
## Family: gaussian
## Link function: identity
##
## Formula:
## LD ~ s(Longitude, Latitude) + s(Word, bs = "re") + s(Word, PopCnt,
## bs = "re") + PopCnt + PopAge + PopIncome + IsNoun + WordFreq +
## WordLength
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.050416 0.035895 -1.405 0.160176
## PopCnt -0.011451 0.002753 -4.159 3.2e-05 ***
## PopAge 0.007375 0.001900 3.883 0.000104 ***
## PopIncome -0.002777 0.001963 -1.415 0.157196
## IsNoun 0.116820 0.057465 2.033 0.042079 *
## WordFreq 0.056376 0.028608 1.971 0.048783 *
## WordLength 0.030142 0.026523 1.136 0.255783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Longitude,Latitude) 26.93 28.73 122.168 < 2e-16 ***
## s(Word) 43.81 44.00 233.598 < 2e-16 ***
## s(Word,PopCnt) 28.31 47.00 1.545 6.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.446 Deviance explained = 44.9%
## fREML = -427.49 Scale est. = 0.05473 n = 19446
```

- The interpretation of the
`parametric coefficients`

is the same as for a`glm`

or`lm`

model:`Estimate`

: regression coefficient, on the same scale as the dependent variable or on the log odds scale when we fit a logistic model`Std. Error`

: standard error for the regression coefficient. Large standard errors point to less certainty about the effect estimate`t-value`

(z-value in the logistic case): test statistic value that is used to calculate the significance of the effect`Pr(>|t|)`

: p-value of the effect estimate

- For the
`smooth terms`

we only get information on their significance - However, their effects on the dependent variable can be plotted
- One of the main uses of GAMs in linguistics so far has been to create maps of the regions in space where you can find e.g., a more distinct pronounciation, while controlling for other variables such as population size, income, word, etc.

- Plotting the effects of a smooth term can be done with the
`vis.gam`

function - To plot a map of
`LD`

over space while controlling for the other variables, we can use the following code - The result looks like a map of The Netherlands. The red lines separate regions for which the model estimates lower/higher Levenshtein distances

`vis.gam(mod, plot.type="contour", color="terrain", too.far=0.05, view=c("Longitude","Latitude"))`

- Like the summary of
`lm`

models, the summary output of`bam`

models includes an adjusted r-squared measure - This r-squared measure is also provided for logistic models
- In addition, for logistic models, we can calculate the c-index of concordance, using the
`Hmisc`

package

```
#not run
library(Hmisc)
logisticMod
somers2(fitted(logisticMod), as.numeric(data$dependent)-1)
```

- Please go to http://www.jeroenclaes.be/statistics_for_linguistics/class12.html and perform the exercise

- Wieling, M., Nerbonne, J., & Baayen, H. R. (2011). Quantitative Social Dialectology: Explaining Linguistic Variation Geographically and Socially.
*Plos One*. DOI: https://doi.org/10.1371/journal.pone.0023613.