- What is logistic regression and why is it used?
- Preliminary analyses
- Design choices
- `glm`: the Generalized Linear Model
- Assumptions of logistic regression
- How to fit logistic regression models in R
- How to interpret logistic regression models
- How to assess the fit of logistic regression models
- Testing for overfitting and numerical stability: the bootstrap
- Reporting on a logistic regression analysis
- Exercises
- References

- Recall that the `linear regression model` is a mathematical formula that allows us to calculate the value of the dependent variable as the sum of the linear effects of the independent variables, plus the intercept and the residual (prediction error): `y = intercept + (x1*effect1) + (x2*effect2) + ... + error`

- The `linear regression model` assumes that the dependent variable may range between `-Inf` and `+Inf` and follows the normal distribution

- Logistic regression is used to model discrete variables, e.g., `male` vs. `female`
- Logistic regression predicts the `probability` that a particular observation belongs to one of the classes
- Logistic regression can be applied to binary problems or to problems with more than two categories
- Here we will focus on the binary case

- Discrete variables pose problems for the linear model:
  - The dependent variable is not normally distributed (it has a `binomial distribution`)
  - Rather than ranging between `-Inf` and `+Inf` (as ratio- or interval-scaled variables do), for each combination of predictors, the two discrete values of the dependent variable occur with a certain probability
    - E.g., say we want to predict the gender of people as a binary opposition between `males` and `females` (however controversial that may be nowadays). We record hair length, muscle strength, use of lip gloss, and absence/presence of pink clothing elements. Given a large enough sample, for each combination of these variables, we will discover that a percentage of our sample is male

- The linear regression equation would predict values between `-Inf` and `+Inf`, rather than a probability of `male` vs. `female`
- Logistic regression solves this (Image credit: https://www.datacamp.com/community/tutorials/logistic-regression-R)

- Another problem that arises is that the independent variables are not linearly related to the probabilities of the dependent variable
- The result would be a terribly inaccurate model

- Logistic regression solves the two problems outlined in the previous slides:
  - The model formula is adapted so that it only generates probabilities between 0 and 1
  - The probabilities express the likelihood that the second class occurs (`P2` hereafter)
    - A probability greater than 0.5 can be interpreted as a classification vote for `class 2`
- For binary dependent variables, the logistic regression formula is defined as follows, in which `e` is the base of the natural logarithm, `2.71828`: `P2 = 1/(1+e^-(intercept + (x1*effect1) + (x2*effect2) + ... + error))`
- The exponent of `e` looks familiar: it’s the linear model!

- The previous slide shows that the relationship between the probabilities and the independent variables is not linear: rather than using the linear model to estimate the value of the probability of the second class, we used it to estimate the exponent of `e`
- To make the relationship linear, we can flip the equation: rather than specifying a non-linear form for our model that generates the predictions, we can apply a non-linear transformation to the values we are modeling: `P2/(1-P2) = e^(intercept + (x1*effect1) + (x2*effect2) + ... + error)`
- If we take the `log` of this transformation of P2, we end up with a plain and simple linear model! `log(P2/(1-P2)) = intercept + (x1*effect1) + (x2*effect2) + ... + error`

- In other words, in logistic regression the independent variables are not linearly related to the probabilities of the dependent variable, but rather to a non-linear transformation of the probabilities of the dependent variable
- This transformation is called a `link function`. The specific form `log(P2/(1-P2))` is called the `logit` or the `log odds` (as it literally is the logarithm of the odds of finding the second class)
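
Base R implements the logit and its inverse directly: `qlogis` computes `log(p/(1-p))` and `plogis` maps a log-odds value back to a probability between 0 and 1. A quick sanity check (not part of the original slides) against the formulas above:

```r
p <- 0.75                    # P2: probability of the second class
odds <- p / (1 - p)          # odds of the second class: 3 to 1
logit <- log(odds)           # the logit, or log odds
stopifnot(all.equal(logit, qlogis(p)))  # qlogis() is the logit function
stopifnot(all.equal(plogis(logit), p))  # plogis() is its inverse
```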

- The previous slides do not explain how the effect estimates that appear in the formula are generated
- To calculate the predictor effect estimates for the formula, R will iteratively try to find values for each effect so that the observed values of the dependent variable are the most likely ones to occur. This process is called `maximum likelihood estimation`
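
A small simulation (assumed data, not from the slides) shows maximum likelihood estimation at work: we generate binary outcomes whose true log odds are `0.5 + 2*x`, and `glm` recovers estimates close to those values:

```r
set.seed(1)
x <- rnorm(500)
# true model: log(P2/(1-P2)) = 0.5 + 2*x
y <- rbinom(500, size = 1, prob = plogis(0.5 + 2 * x))
fit <- glm(y ~ x, family = "binomial")
round(coef(fit), 2)  # estimates should lie near 0.5 (intercept) and 2 (slope)
```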

- In this class we will be working with an extended version of the dataset analyzed in Claes (2014, 2016):
  - In standard Spanish, the NP of existential *haber* is a direct object, for which the verb normally does not agree with plural NPs. In many varieties of Spanish, variable agreement can be found
- The data records the following characteristics:
  - `Type`: Singular or plural *haber*
  - `Freq.Const.Speaker`: Frequency of plural *haber* in the speech of the interviewee up until that point
  - `Agt.Freq.Noun`: Frequency with which the noun appears in an agentive role in a large spoken-language corpus
  - `Tense`: preterit and present vs. all others
  - `Pr.2.Pr`: The type of *haber* clause that was last used by the speaker, up until a 20-clause lag
  - `Co.2.Pr`: The type of *haber* clause that was last used by the interviewer, up until a 20-clause lag
  - `Polarity`: negation absent vs. present
  - `Style`: data collection method (interview, reading task, questionnaire task)
  - `Gender`: gender
  - `Age`: 21-35 years vs. 55+ years
  - `Education`: University vs. less

- We use the `dplyr` function `mutate_if` to convert all strings to factors, which is required for modeling

```
library(readr)
library(dplyr)
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class10_Claes_2014.csv")
dataSet <- dataSet %>%
mutate_if(is.character, as.factor)
summary(dataSet)
```

```
## Type Freq.Const.Speaker Agt.Freq.Noun
## Plural :684 Min. :0.0000 Min. :0.00000
## Singular:971 1st Qu.:0.2500 1st Qu.:0.00500
## Median :0.3333 Median :0.03000
## Mean :0.3470 Mean :0.05119
## 3rd Qu.:0.4667 3rd Qu.:0.07500
## Max. :1.0000 Max. :0.30000
## Tense Pr.2.Pr Co.2.Pr
## All.Others :1014 No.Priming:246 No.Priming:1355
## Present-Preterit: 641 Pluralized:558 Pluralized: 175
## Singular :851 Singular : 125
##
##
##
## Polarity Style Gender
## Negative: 430 Interview : 401 Female:836
## Positive:1225 Reading.Task-Questionnaire.Task:1254 Male :819
##
##
##
##
## Age Education
## Oldest.Speakers :775 Less :669
## Youngest.Speakers:880 University:986
##
##
##
##
```

- As was the case for linear regression, we have to know our research questions, hypotheses, data and predictors really well before we start fitting our first model:
- Assess each variable separately first
- Assess the relationship of each independent variable, one at a time, with the dependent variable
- Assess the pairwise relationships between each of your predictors:
- Interactions
- Correlations/non-independence of factors

- Of course, the limiting factor here will be that the response variable is not continuous, but rather discrete

- In terms of preliminary analyses, we will want to:
  - Assess each variable separately first:
    - For numeric predictors:
      - Obtain measures of central tendency and dispersion
      - Draw a boxplot
      - Is the variable normally distributed?
    - For categorical predictors:
      - What are the proportions of the different categories?
      - Are they distributed evenly?

- In terms of preliminary analyses, we will want to:
  - Assess the relationship of each independent variable, one at a time, with the dependent variable:
    - We will want to make sure that all of our combinations of the dependent and the independent variables have a frequency of more than one
    - Calculate a `chisq.test` with the dependent variable (for factors)
    - For numeric predictors, divide the numeric predictor into four quartile bins with the `dplyr` function `ntile` and compute a `chisq.test`
    - Draw a grouped barchart

`table(ntile(dataSet$Freq.Const.Speaker, 4), dataSet$Type)`

```
##
## Plural Singular
## 1 124 290
## 2 153 261
## 3 201 213
## 4 206 207
```

`chisq.test(ntile(dataSet$Freq.Const.Speaker, 4), dataSet$Type)`

```
##
## Pearson's Chi-squared test
##
## data: ntile(dataSet$Freq.Const.Speaker, 4) and dataSet$Type
## X-squared = 46.72, df = 3, p-value = 3.987e-10
```

- In terms of preliminary analyses, we will want to:
  - Assess the relationships between all of the independent variables with each other:
    - Perform pairwise correlation/Chi-squared tests
    - Are the independent variables too highly correlated with one another?

- In terms of preliminary analyses, we will want to:
  - Scout for interactions with pairwise, faceted plotting
    - For continuous predictors, we can again break up the continuum into four quartile groups with the `ntile` function

- As we did for the linear model, we have to choose how we want to handle categorical predictors:
  - `treatment contrasts`: the coefficients express the change in the logit of the dependent variable with respect to the first level of the factor
  - `sum contrasts`: the coefficients express the change in the logit of the dependent variable with respect to the overall mean
- Here, we’ll specify human-readable `sum contrasts` with the `contr.Sum` function of the `car` package

```
library(car)
# use car's contr.Sum for unordered factors, base contr.poly for ordered factors
options(contrasts = c("contr.Sum", "contr.poly"))
```
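
The difference between the two coding schemes shows up in the contrast matrices themselves. Base R’s `contr.treatment` and `contr.sum` are shown here for a three-level factor (`car`’s `contr.Sum` produces the same coding, but with readable column labels):

```r
contr.treatment(3)  # each column compares one level to the first level
contr.sum(3)        # columns compare levels to the grand mean; the last level gets -1s
```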

- The most popular rule is that you should not have more coefficients in your model than the frequency of the least frequent category of your dependent variable divided by 10 (Hosmer et al., 2013: 407-408)
  - E.g., our least frequent dependent variable level (plural *haber*) has 684 observations. Our model should not include more than `684/10 = 68` coefficients

`glm`: the Generalized Linear Model (1/3)

- The regression equation above showed that the logistic regression equation is actually an extension (or ‘generalization’) of the linear model:
  - In the linear model, the equation is used to predict the values of the dependent variable directly
  - In the logistic model, we apply a non-linear link function to our dependent variable before we apply the linear model
- The Generalized Linear Model is a set of techniques that allow us to model non-normally distributed response variables in the framework of linear regression

`glm`: the Generalized Linear Model (2/3)

- The key in each of these techniques is that the dependent variable is linearly related to the independent variables via a `link` function:
  - The `logit` for logistic regression, i.e., binomial data
  - The `natural logarithm` for count data (i.e., Poisson and negative binomial distributions)
- In R, the generalized linear model is implemented in the `glm` function. This function has three arguments:
  - The formula you wish to estimate
  - The dataset
  - The name of the distribution type you want to model: `binomial` for binary logistic regression

`glm`: the Generalized Linear Model (3/3)

- As we did for linear regression, we can use an initial model to check the assumptions of logistic regression modeling

`mod <- glm(Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense + Pr.2.Pr + Co.2.Pr + Polarity + Style + Gender + Age + Education, family="binomial", data=dataSet)`

- Compared to linear regression, logistic regression has almost no assumptions (Harrell, 2015: 236):
  - The observations are independent. This is an assumption that is usually violated in linguistics research, as our observations are actually grouped together by speakers and words (Johnson, 2009)
  - There are no outliers or overly influential observations in the data
  - The predictors are not heavily correlated (there is no `multicollinearity`)
  - The numeric predictors are linearly related to the logit of the probability of the dependent variable
  - The model does not generate correct predictions in 100% of the cases (there is no `complete separation`)
  - The model does not underpredict the variability in the dataset (there is no `overdispersion`)

- As we did for linear regression, we should check for overly influential observations and remove them, as they would otherwise distort our entire analysis
- We can use the `influencePlot` function from the `car` package to achieve this

```
library(car)
mod <- glm(Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense + Pr.2.Pr + Co.2.Pr + Polarity + Style + Gender + Age + Education, family="binomial", data=dataSet)
influences<-influencePlot(mod)
```

`dataSet <- dataSet[-as.numeric(rownames(influences)),]`

The predictors are not heavily correlated (there is no `multicollinearity`)

- Your preliminary analyses will have provided you with a good picture of the degree of association/correlation between your predictors
- As for linear regression models, we can use the `vif` function from the `car` package to check for predictors that are too heavily correlated
- Recall that `GVIF^(1/(2*Df))` > 5 points to strong correlations between predictors that should be addressed

```
library(car)
mod <- glm(Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense + Pr.2.Pr + Co.2.Pr + Polarity + Style + Gender + Age + Education, family="binomial", data=dataSet)
vif(mod)
```

```
## GVIF Df GVIF^(1/(2*Df))
## Freq.Const.Speaker 1.251422 1 1.118670
## Agt.Freq.Noun 1.091591 1 1.044793
## Tense 1.133283 1 1.064558
## Pr.2.Pr 1.360339 2 1.079970
## Co.2.Pr 2.064054 2 1.198617
## Polarity 1.098111 1 1.047908
## Style 2.570759 1 1.603359
## Gender 1.090725 1 1.044378
## Age 1.036245 1 1.017961
## Education 1.215575 1 1.102531
```

- Recall that logistic regression is still a form of linear modeling, so any linear predictor we add to the model must stand in a linear relationship to the logit of the dependent variable
- For a formal check of the linearity of a numeric predictor `A`, we can add the interaction `A*log(A)` to our model. If the interaction is significant, then the predictor is **not** linearly related to the logit of the dependent variable (Field et al., 2012: 345)
- Here we find that our numeric predictors are linearly related to the logit

```
mod_linearity <- glm(Type ~ Freq.Const.Speaker*log(1+Freq.Const.Speaker) + Agt.Freq.Noun*log(1+Agt.Freq.Noun) + Tense + Pr.2.Pr + Co.2.Pr + Polarity + Style + Gender + Age + Education, family="binomial", data=dataSet)
summary(mod_linearity)
```

```
##
## Call:
## glm(formula = Type ~ Freq.Const.Speaker * log(1 + Freq.Const.Speaker) +
## Agt.Freq.Noun * log(1 + Agt.Freq.Noun) + Tense + Pr.2.Pr +
## Co.2.Pr + Polarity + Style + Gender + Age + Education, family = "binomial",
## data = dataSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6437 -0.7935 0.2806 0.7284 2.2291
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.3347 0.4153
## Freq.Const.Speaker -22.3293 116.0338
## log(1 + Freq.Const.Speaker) 20.5556 118.5859
## Agt.Freq.Noun 2588.0242 2493.7600
## log(1 + Agt.Freq.Noun) -2604.9878 2499.2207
## TensePresent-Preterit 2.7888 0.1691
## Pr.2.PrPluralized -0.9997 0.2163
## Pr.2.PrSingular 0.1220 0.2089
## Co.2.PrPluralized -0.5567 0.2882
## Co.2.PrSingular 0.2975 0.3001
## PolarityPositive -0.5327 0.1566
## StyleReading.Task-Questionnaire.Task 0.3875 0.2521
## GenderMale 0.3841 0.1320
## AgeYoungest.Speakers -0.2466 0.1279
## EducationUniversity 0.5971 0.1447
## Freq.Const.Speaker:log(1 + Freq.Const.Speaker) 11.2889 49.4558
## Agt.Freq.Noun:log(1 + Agt.Freq.Noun) -1186.9718 1180.0805
## z value Pr(>|z|)
## (Intercept) 0.806 0.42039
## Freq.Const.Speaker -0.192 0.84740
## log(1 + Freq.Const.Speaker) 0.173 0.86238
## Agt.Freq.Noun 1.038 0.29936
## log(1 + Agt.Freq.Noun) -1.042 0.29726
## TensePresent-Preterit 16.496 < 2e-16 ***
## Pr.2.PrPluralized -4.621 3.82e-06 ***
## Pr.2.PrSingular 0.584 0.55918
## Co.2.PrPluralized -1.932 0.05338 .
## Co.2.PrSingular 0.991 0.32164
## PolarityPositive -3.402 0.00067 ***
## StyleReading.Task-Questionnaire.Task 1.537 0.12430
## GenderMale 2.910 0.00361 **
## AgeYoungest.Speakers -1.928 0.05389 .
## EducationUniversity 4.126 3.70e-05 ***
## Freq.Const.Speaker:log(1 + Freq.Const.Speaker) 0.228 0.81944
## Agt.Freq.Noun:log(1 + Agt.Freq.Noun) -1.006 0.31449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2240.4 on 1651 degrees of freedom
## Residual deviance: 1556.9 on 1635 degrees of freedom
## AIC: 1590.9
##
## Number of Fisher Scoring iterations: 5
```

- Logistic regression models break when the data allows for a `complete separation` or a `quasi-complete separation`. We do not have to test for this, because the model will not converge when this is the case
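
A toy example (assumed data, not from the *haber* dataset) makes the problem concrete: when a predictor separates the classes perfectly, the likelihood keeps improving as the slope grows, so the estimates blow up and R typically warns that fitted probabilities numerically 0 or 1 occurred:

```r
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)  # x < 0 always co-occurs with y = 0: complete separation
fit <- glm(y ~ x, family = "binomial")
# the slope estimate is huge, its standard error even larger, and the
# fitted probabilities collapse onto the observed 0s and 1s
summary(fit)$coefficients
```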

The model does not underpredict the variability in the dataset (there is no `overdispersion`)

- Overdispersion indicates a situation where the predictions of your model are less diverse (i.e., less dispersed) than the observations in your dataset
- Overdispersion can usually be fixed by incorporating the appropriate random-effects structure in a mixed-effects logistic regression (see Topic 11)
- If our model suffers from overdispersion, the residual deviance will be larger than the residual degrees of freedom (Speelman, 2014)

`mod$deviance > mod$df.residual`

`## [1] FALSE`

- “Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense.” (Hosmer et al., 2013: 89)

- The idea is the same as for linear regression models:
- We get to know our data very well
- We include all predictors that are relevant for our hypotheses, as well as all the interactions of interest
- We fit an initial model that we use to validate the model assumptions
- To this model we apply Occam’s razor:
- We eliminate predictors that have small effect sizes, large standard errors, and high p-values one at a time.
- After each elimination, we run the model again and we compare the AIC value of the new model with that of the original model or a previous more parsimonious model
- If the elimination results in a drop of 2 AIC units or more, we have more evidence in favor of the model without the predictor

- For this pre-final model, we perform a bootstrap validation with 1,000 replicates. If there are coefficients that are not stable, remove them from the model and redo the bootstrap

- Recall that we want to start by eliminating interactions, before we move on to the main effects
- `Co.2.Pr` has a high p-value, so let’s exclude that one
- The resulting model has an AIC value that is 2.42 units higher, so the exclusion of the variable does not appear to be warranted

```
mod1 <- glm(Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense + Pr.2.Pr + Polarity + Style + Gender + Age + Education, family="binomial", data=dataSet)
AIC(mod)-AIC(mod1)
```

`## [1] -2.421256`

```
mod <- glm(Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense + Pr.2.Pr + Polarity + Style + Gender + Age + Education + Co.2.Pr, family="binomial", data=dataSet)
summary(mod)
```

```
##
## Call:
## glm(formula = Type ~ Freq.Const.Speaker + Agt.Freq.Noun + Tense +
## Pr.2.Pr + Polarity + Style + Gender + Age + Education + Co.2.Pr,
## family = "binomial", data = dataSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5808 -0.8068 0.2892 0.7430 2.1726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08845 0.35692 -0.248 0.804284
## Freq.Const.Speaker -0.56511 0.40017 -1.412 0.157898
## Agt.Freq.Noun -3.43917 1.00890 -3.409 0.000652
## TensePresent-Preterit 2.79306 0.16629 16.797 < 2e-16
## Pr.2.PrPluralized -0.94892 0.21570 -4.399 1.09e-05
## Pr.2.PrSingular 0.15888 0.20921 0.759 0.447607
## PolarityPositive -0.49437 0.15462 -3.197 0.001387
## StyleReading.Task-Questionnaire.Task 0.28641 0.24460 1.171 0.241621
## GenderMale 0.40653 0.13004 3.126 0.001770
## AgeYoungest.Speakers -0.24065 0.12713 -1.893 0.058359
## EducationUniversity 0.61660 0.14054 4.387 1.15e-05
## Co.2.PrPluralized -0.53009 0.28490 -1.861 0.062802
## Co.2.PrSingular 0.28388 0.29926 0.949 0.342812
##
## (Intercept)
## Freq.Const.Speaker
## Agt.Freq.Noun ***
## TensePresent-Preterit ***
## Pr.2.PrPluralized ***
## Pr.2.PrSingular
## PolarityPositive **
## StyleReading.Task-Questionnaire.Task
## GenderMale **
## AgeYoungest.Speakers .
## EducationUniversity ***
## Co.2.PrPluralized .
## Co.2.PrSingular
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2240.4 on 1651 degrees of freedom
## Residual deviance: 1568.3 on 1639 degrees of freedom
## AIC: 1594.3
##
## Number of Fisher Scoring iterations: 5
```

- `Estimate`: regression coefficient, on the log odds scale
  - The coefficients always take the vantage point of the **second** level of your dependent variable. If you need your coefficients to take the vantage point of the other level, you need to use `dataSet$Type <- relevel(dataSet$Type, ref=2)` to change the ordering of the levels
- `Std. Error`: standard error for the regression coefficient. Large standard errors point to less certainty about the effect estimate
- `z value`: test statistic value that is used to calculate the significance of the effect
- `Pr(>|z|)`: p-value of the effect estimate
- `Null deviance`: likelihood measure of the model with only an intercept term. This model just predicts the most frequent outcome category every time
- `Residual deviance`: likelihood measure of the fitted model. The difference between the null deviance and the residual deviance shows you how much your predictors have improved the model’s predictive capacity with respect to the null model. Lower numbers indicate a better fit

- Linear regression coefficients are reported on the same scale as the Y-axis
- Logistic regression coefficients are reported on the `log odds` (or logit) scale
- The `logit` is the logarithm of the odds of finding the **second** level of our dependent variable (singular, in this case) vs. the first level:
  - E.g., singular *haber* occurs with -0.9489203 log odds in contexts primed by the speaker with plural *haber*. This corresponds with odds of 0.3871588 to one, or a probability of 0.279102
  - E.g., singular *haber* occurs with 0.1588754 log odds in contexts primed by the speaker with singular *haber*. This corresponds with odds of 1.1721918 to one, or a probability of 0.5396355
- To convert log odds to odds, we can exponentiate them with `exp`
- To convert log odds to probabilities, we can use the `plogis` function
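
Applying these conversions to the `Pr.2.PrPluralized` coefficient from the model summary reproduces the numbers in the example:

```r
b <- -0.9489203  # log odds of singular haber after a plural prime
exp(b)           # odds: about 0.387 to one
plogis(b)        # probability: about 0.279, i.e., exp(b) / (1 + exp(b))
```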

- To assess the fit of linear regression models, we used R-squared. However, this measure is not as straightforwardly interpreted in logistic regression as it is in linear regression, where it represents the amount of variance explained
- We can calculate a pseudo-R-squared for `glm` models with the `r.squaredGLMM` function of the `MuMIn` package
- This function will calculate Nakagawa & Schielzeth’s (2013) pseudo-R-squared
- Values above 0.50 are indicative of a good fit, as the pseudo-R-squared hardly reaches the high levels true R-squared reaches in linear regression

```
library(MuMIn)
r.squaredGLMM(mod)
```

```
## R2m R2c
## 0.4584851 0.4584851
```

- To assess the discriminative power of logistic regression models, the C-index of concordance is commonly used (also known as the *Area under the Receiver Operating Characteristic curve*; AUROC). It can be calculated with the `somers2` function of the `Hmisc` package
- The C-index expresses how good the model is at classifying the observations in the two classes. In particular, it expresses the probability that the model will predict the second class in cases where the second class is observed (Harrell, 2015: 257)
- Hosmer et al. (2013: 177) propose the following classification of C-index values:
  - 0.5: No discrimination, the model performs at chance level
  - 0.5-0.7: Poor discrimination, hardly better than chance
  - 0.7-0.8: Acceptable discrimination
  - 0.8-0.9: Excellent discrimination
  - 0.9-1: Outstanding discrimination

```
library(Hmisc)
somers2(fitted(mod), as.numeric(dataSet$Type)-1)
```

```
## C Dxy n Missing
## 0.8454777 0.6909555 1652.0000000 0.0000000
```
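
To make the definition concrete, here is a base-R sketch (a hypothetical helper, not from the slides) of what the C-index measures: the proportion of (class 1, class 2) observation pairs in which the class-2 observation receives the higher predicted probability, counting ties as half:

```r
c_index <- function(pred, y) {
  pos <- pred[y == 1]                 # predictions for observed class 2
  neg <- pred[y == 0]                 # predictions for observed class 1
  wins <- sum(outer(pos, neg, ">")) + 0.5 * sum(outer(pos, neg, "=="))
  wins / (length(pos) * length(neg))  # concordant pairs / all cross-class pairs
}
pred <- c(0.9, 0.8, 0.4, 0.3, 0.2)
y    <- c(1,   1,   0,   1,   0)
c_index(pred, y)  # 5 of the 6 cross-class pairs are concordant
```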

- Three other measures that are frequently used in the context of logistic regression (especially in the field of machine learning) are *accuracy*, *precision* and *recall*:
  - `Accuracy`: `Correctly classified` / `Classified observations`
  - `Precision`: `Correctly classified as class 2` / `Total number of times class 2 is predicted`
  - `Recall`: `Correctly classified as class 2` / (`Correctly classified as class 2` + `Incorrectly classified as class 1`)
- To compute precision and recall, we first have to get predictions from our model

```
# Predict probabilities rather than logits
dataSet$predicted<-predict(mod, type="response")
# Probability of more than 0.5 is a classification vote for the second level of our dependent variable
dataSet$predicted<-ifelse(dataSet$predicted > 0.5, "Singular", "Plural")
```

- We can obtain the number of correct and incorrect classifications with the `table` function
- The corresponding table is known as a `confusion matrix` in statistics and machine learning

```
tab <- table(dataSet$Type, dataSet$predicted)
tab
```

```
##
## Plural Singular
## Plural 516 167
## Singular 200 769
```

```
accuracy <- (tab[1] + tab[4]) /nrow(dataSet)
accuracy
```

`## [1] 0.777845`

```
precision<- tab[1]/(tab[1]+tab[2])
precision
```

`## [1] 0.7206704`

```
recall <- tab[1]/(tab[1]+tab[3])
recall
```

`## [1] 0.7554905`

- It is important that we compare the accuracy/precision of our model to the accuracy/precision we would get by just predicting the most frequent class every time

- Here we find that our precision score of 0.7206704 represents a 13.4 percent lift with regard to the precision we would get by going for the most frequent class all of the time
- Accuracy, precision and recall have been heavily criticized by statisticians (e.g., Harrell, 2015: 236), who feel that these measures should be avoided, as they involve breaking up the probabilities into two classes using an arbitrary threshold

`prop.table(table(dataSet$Type))`

```
##
## Plural Singular
## 0.4134383 0.5865617
```

- The function we defined in Topic 8 to perform our bootstrap validation on the linear model also works for `glm` objects if we update the model-fitting function from `lm` to `glm`

```
bootstrap <- function(model, replicates=1000) {
  library(dplyr)
  library(broom)
  intervals <- lapply(1:replicates, FUN=function(x) {
    # resample the model frame with replacement and refit on each resample
    indices <- sample(1:nrow(model$model), nrow(model$model), replace = TRUE)
    dat <- model$model[indices, ]
    mod <- glm(as.formula(model$call$formula), family="binomial", data=dat)
    return(tidy(mod))
  }) %>%
    bind_rows() %>%
    group_by(term) %>%
    summarise(`2.5%` = quantile(estimate, 0.025), `97.5%` = quantile(estimate, 0.975))
  return(intervals)
}
```

- Our confidence intervals suggest that `Age`, `Agt.Freq.Noun` and `Freq.Const.Speaker` are unstable under bootstrap validation. We should be careful when we draw inferences from them:
  - Confidence interval > 1 log-odds unit
  - Confidence interval includes 0

```
confintervals<-bootstrap(mod, 1000)
confintervals
```

```
## # A tibble: 13 x 3
## term `2.5%` `97.5%`
## <chr> <dbl> <dbl>
## 1 (Intercept) -0.7570479 0.589300653
## 2 AgeYoungest.Speakers -0.5202631 -0.009101188
## 3 Agt.Freq.Noun -5.3899452 -1.591043881
## 4 Co.2.PrPluralized -1.0265038 -0.037571223
## 5 Co.2.PrSingular -0.2413008 0.867268771
## 6 EducationUniversity 0.3327503 0.929624371
## 7 Freq.Const.Speaker -1.4607143 0.234267286
## 8 GenderMale 0.1611661 0.662909579
## 9 PolarityPositive -0.8089560 -0.179222796
## 10 Pr.2.PrPluralized -1.4107325 -0.544301748
## 11 Pr.2.PrSingular -0.2561728 0.576957670
## 12 StyleReading.Task-Questionnaire.Task -0.1514004 0.729603130
## 13 TensePresent-Preterit 2.4929937 3.205086196
```

When reporting on a logistic regression analysis, you should include:

- All of your data transformation steps. Your analysis should be reproducible based upon your description
- Your coefficients, standard errors, test statistics and p-values
- Bootstrap confidence intervals
- Summary statistics: C-index, pseudo-R-squared, AIC
- Sample size

- Please go to: http://www.jeroenclaes.be/statistics_for_linguistics/class10.html and perform the exercises

- Harrell, F. (2015). *Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis*. New York, NY: Springer.
- Hosmer, D., Lemeshow, S., & Sturdivant, R. (2013). *Applied logistic regression*. Oxford: Wiley.
- Johnson, D. E. (2009). Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. *Language and Linguistics Compass* 3(1). 359-383.
- Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R² from Generalized Linear Mixed-effects Models. *Methods in Ecology and Evolution* 4. 133-142.
- Speelman, D. (2014). Logistic regression: A confirmatory technique for comparisons in corpus linguistics. In *Corpus Methods for Semantics: Quantitative studies in polysemy and synonymy* (487-533). Amsterdam/Philadelphia, PA: John Benjamins.