- Cross-tabulation: the basic tool to explore relations between qualitative variables
- Chi-square test of independence
- Fisher exact test
- Effect sizes for relations between qualitative variables
- Plots to explore the associations between qualitative variables
- Reporting on associations between qualitative variables
- Exercises
- Questions
- References

- We will be working with a dataset by Claes (2017)
- Corpus investigation into existential agreement variation in Peninsular Spanish
- Data drawn from *Twitter* and *Corpus Oral y Sonoro del Español Rural (COSER)*
- Random sample of 500 lines from the dataset

```
library(readr)
library(dplyr)
library(ggplot2) # needed for the barplots later in these slides
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class3_claes_2017.csv")
head(dataSet)
```

```
## # A tibble: 6 x 17
## type noun Typical.Action.Chain.Pos province
## <chr> <chr> <chr> <chr>
## 1 singular guapas head Lérida
## 2 singular probabilidades tail.setting Cádiz
## 3 singular matarifes head Madrid
## 4 plural fotos tail.setting Barcelona
## 5 singular olivas tail.setting Toledo
## 6 singular auroras tail.setting Madrid
## # ... with 13 more variables: state <chr>, negation <chr>,
## # broad.regions <chr>, locality <chr>, before <chr>, token <chr>,
## # after <chr>, tense <chr>, corpus <chr>, long <dbl>, lat <dbl>,
## # characters_before_noun <int>, noun_length <int>
```

- To explore the relationship between two qualitative variables, we have to count how many times each combination of the two variables occurs
- This is called `crosstabulation`
- The output table is called a `contingency table`
- This can be done with the `table` function
- So, to count the number of times the `singular` and the `plural` existential occur in each of the `negation` groups, we use:

```
# 'x' axis first (groups/independent variable), then 'y' axis (dependent variable)
table(dataSet$negation, dataSet$type)
```

```
##
## plural singular
## absent 43 352
## present 7 98
```

- Absolute counts are nice, but we’re usually interested in the proportions of one option vs another
- This is what `prop.table` does. It takes two arguments:
  - A table
  - A `margin`:
    - `1` calculates the proportions by row (each **row** sums to 1)
    - `2` calculates the proportions by column (each **column** sums to 1)

```
# 'x' axis first (groups/independent variable), then 'y' axis (dependent variable)
tab<-table(dataSet$negation, dataSet$type)
# proportions by row
prop.table(tab, 1)
```

```
##
## plural singular
## absent 0.10886076 0.89113924
## present 0.06666667 0.93333333
```

```
# proportions by column
prop.table(tab, 2)
```

```
##
## plural singular
## absent 0.8600000 0.7822222
## present 0.1400000 0.2177778
```
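
- A minimal sketch of what `prop.table` computes, assuming the four counts from the contingency table above are typed in by hand (so the block runs without downloading the dataset). The names `row_props` and `col_props` are illustrative only.

```r
# Counts copied from the contingency table above (negation x type)
tab <- matrix(c(43, 7, 352, 98), nrow = 2,
              dimnames = list(negation = c("absent", "present"),
                              type = c("plural", "singular")))

# Row proportions: divide every cell by its row total (same idea as margin = 1)
row_props <- tab / rowSums(tab)

# Column proportions: divide every cell by its column total (same idea as margin = 2)
col_props <- sweep(tab, 2, colSums(tab), "/")

rowSums(row_props) # each row sums to 1
colSums(col_props) # each column sums to 1
```

- Which margin to use depends on the question: row proportions compare the `type` distribution across the `negation` groups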

- The row proportions show a difference between clauses with and without negation: the plural existential occurs in 10.9% of the clauses without negation vs 6.7% of the clauses with negation
- But is this difference reliably large? To test this, we need to perform a `Chi-squared test of independence`
- The `null hypothesis` of this test is that there is no association between the variables: *There is no influence of the presence of negation on the occurrence rate of the singular and the plural existential*
- The test compares the `observed frequency` to the frequency you would expect by chance (the `expected frequency`) to see how likely the results are if the null hypothesis is true

- To get the expected frequency for a table cell, we need to add the `marginal frequencies` to the table
- Marginal frequencies are the row and column totals
- We can get the marginal frequencies with `rowSums` and `colSums`

```
tab <- table(dataSet$negation, dataSet$type)
colSum<- colSums(tab)
rowSum<-rowSums(tab)
```
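
- As a possible shortcut, base R's `addmargins` appends the row and column totals to the table itself. The sketch below assumes the counts from the contingency table above, entered by hand; `with_margins` is an illustrative name.

```r
# Counts copied from the contingency table above (negation x type)
tab <- as.table(matrix(c(43, 7, 352, 98), nrow = 2,
                       dimnames = list(negation = c("absent", "present"),
                                       type = c("plural", "singular"))))

# Add a "Sum" row and a "Sum" column holding the marginal frequencies
with_margins <- addmargins(tab)
with_margins
```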

- For the cell in column 1, row 1, the expected frequency can be found with the following formula:

```
sampleSize <- sum(tab)
(colSum[1]/sampleSize) * (rowSum[1]/sampleSize) * sampleSize
```

```
## plural
## 39.5
```
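
- The same formula simplifies to (row total × column total) / sample size. A minimal sketch, assuming the counts from the contingency table above are entered by hand, that computes all four expected frequencies at once with `outer`:

```r
# Counts copied from the contingency table above (negation x type)
tab <- matrix(c(43, 7, 352, 98), nrow = 2,
              dimnames = list(negation = c("absent", "present"),
                              type = c("plural", "singular")))

# Expected frequency of each cell: row total * column total / sample size
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected
```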

- Basically, the chi-square test functions as follows:
  - Calculate the expected frequency for each cell
  - Subtract the expected frequency from the observed frequency. The resulting number is the `residual`
  - Divide this number by the square root of the expected frequency. The resulting number is the `Pearson residual`
  - Square all Pearson residuals and sum them. The resulting number is the `Chi-squared statistic`.
  - Look up the chi-squared value in a chi-squared probability table. The degrees of freedom are the number of rows minus one, multiplied by the number of columns minus one.

- The R function `chisq.test` does all of that for us
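
- The steps can be sketched by hand as follows, assuming the counts from the negation by type table are typed in directly. This sketch applies no continuity correction, so it yields the uncorrected Pearson statistic; variable names are illustrative.

```r
# Observed counts copied from the negation x type contingency table
observed <- matrix(c(43, 7, 352, 98), nrow = 2,
                   dimnames = list(negation = c("absent", "present"),
                                   type = c("plural", "singular")))

# Step 1: expected frequency of each cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Steps 2-3: residuals, then Pearson residuals
pearson_res <- (observed - expected) / sqrt(expected)

# Step 4: chi-squared statistic
chi_sq <- sum(pearson_res^2)

# Step 5: degrees of freedom, (rows - 1) * (columns - 1), and the p-value
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```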

- The chi-squared test makes the following assumptions:
  - The sample is randomly selected from the population of interest and the observations are independent.
  - Every observation can be classified into exactly one category according to the criterion represented by each variable (Levshina, 2015: 212)
  - The expected frequency for all cells is greater than 5
  - The numbers in the table are counts, not proportions (**NEVER** run association tests on percentages!)

- The function `chisq.test` takes two factor or character columns as its arguments
- It also accepts a matrix or a table of numeric values
- You don’t have to specify the directionality of the hypothesis, because it is always assumed to be two-tailed

`chisq.test(dataSet$negation, dataSet$type)`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dataSet$negation and dataSet$type
## X-squared = 1.2055, df = 1, p-value = 0.2722
```
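
- For 2x2 tables, `chisq.test` applies Yates' continuity correction by default, which is why the output above is labelled that way. A small sketch, assuming the counts from the contingency table are entered by hand, comparing the corrected and uncorrected statistic via the `correct` argument:

```r
# Counts copied from the negation x type contingency table
tab <- matrix(c(43, 7, 352, 98), nrow = 2,
              dimnames = list(negation = c("absent", "present"),
                              type = c("plural", "singular")))

# Default for 2x2 tables: Yates' continuity correction
with_correction <- chisq.test(tab)

# Plain Pearson chi-squared statistic, without the correction
without_correction <- chisq.test(tab, correct = FALSE)

with_correction$statistic
without_correction$statistic
```

- The corrected statistic is always the smaller of the two, so the default is the more conservative choice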

- If you run the test on proportions, R treats them as if they were counts and the result is meaningless: here you end up with a p-value of 1

`chisq.test(prop.table(table(dataSet$negation, dataSet$type),2))`

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: prop.table(table(dataSet$negation, dataSet$type), 2)
## X-squared = 4.3064e-33, df = 1, p-value = 1
```

- If you want to inspect the `expected frequencies`, you can pull them out of the test (or calculate them by hand with the formula above)
- Remember, if your expected frequency in one of the cells is lower than 5, the chi-squared test is not appropriate.
- R will issue the following warning when this is the case: `Chi-squared approximation may be incorrect`

`chisq.test(dataSet$negation, dataSet$type)$expected`

```
## dataSet$type
## dataSet$negation plural singular
## absent 39.5 355.5
## present 10.5 94.5
```

- The `chisq.test` function also provides access to the Pearson residuals
- If we know the values of the Pearson residuals, we may gain insight into which cells contributed significantly to the p-value
- To compare the Pearson residuals in a meaningful way, they are standardized by dividing them by their standard deviation
- `Standardized Pearson residuals` greater than `1.96` / smaller than `-1.96` indicate that a cell contributed significantly to the chi-squared value at the level of 0.05
- If only one “odd” table cell contributes nearly all of the chi-squared (without you expecting it to do so on theoretical grounds), there may be a data imbalance issue you will have to investigate further. This is more likely to occur in larger-dimensionality tables

`chisq.test(dataSet$negation, dataSet$type)$stdres`

```
## dataSet$type
## dataSet$negation plural singular
## absent 1.280969 -1.280969
## present -1.280969 1.280969
```
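
- A sketch of how the standardized residuals can be reproduced by hand, assuming the counts from the contingency table are entered directly. The estimated standard deviation of each residual is taken to be `sqrt(expected * (1 - row share) * (1 - column share))`; `adj` and `std_res` are illustrative names.

```r
# Counts copied from the negation x type contingency table
tab <- matrix(c(43, 7, 352, 98), nrow = 2,
              dimnames = list(negation = c("absent", "present"),
                              type = c("plural", "singular")))
n <- sum(tab)

# Expected frequencies under independence
expected <- outer(rowSums(tab), colSums(tab)) / n

# Adjustment per cell: (1 - row share) * (1 - column share)
adj <- outer(1 - rowSums(tab) / n, 1 - colSums(tab) / n)

# Standardized Pearson residuals
std_res <- (tab - expected) / sqrt(expected * adj)
std_res
```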

- When one of your expected frequencies is lower than 5, it is better to perform a Fisher-Yates exact test

`fisher.test(dataSet$negation, dataSet$type)`

```
##
## Fisher's Exact Test for Count Data
##
## data: dataSet$negation and dataSet$type
## p-value = 0.2713
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7313697 4.6443856
## sample estimates:
## odds ratio
## 1.708669
```
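
- The odds ratio in the output can be compared with the simple sample odds ratio computed from the counts. A sketch, assuming the counts from the contingency table are entered by hand; note that `fisher.test` reports a conditional maximum likelihood estimate, so its value differs slightly from the hand calculation.

```r
# Counts copied from the negation x type contingency table
tab <- matrix(c(43, 7, 352, 98), nrow = 2,
              dimnames = list(negation = c("absent", "present"),
                              type = c("plural", "singular")))

# Odds of plural (vs singular) without negation, divided by the same odds with negation
sample_or <- (tab["absent", "plural"] * tab["present", "singular"]) /
  (tab["absent", "singular"] * tab["present", "plural"])
sample_or
```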

- The p-values only tell you if the differences between the groups created by the X-variable are reliably large
- They tell you nothing about the sizes of the differences
- In the `vcd` package, the `assocstats` function computes several effect size statistics you can use to measure the strength of the differences
- They are computed on a `table`
```
library(vcd)
tab<-table(dataSet$negation, dataSet$type)
assocstats(tab)
```

```
## X^2 df P(> X^2)
## Likelihood Ratio 1.7875 1 0.18123
## Pearson 1.6409 1 0.20020
##
## Phi-Coefficient : 0.057
## Contingency Coeff.: 0.057
## Cramer's V : 0.057
```

- For 2x2 tables, Cramér’s *V* and *ϕ* are identical (but not for e.g., 4x2 tables!); the contingency coefficient is close to them only when the association is weak, as in the output above
  - Cramér’s *V*
  - *ϕ* coefficient (phi)
  - Contingency coefficient (used less often)
- For 2x2 tables, Cramér’s *V* and *ϕ* range from 0 to 1
- For larger tables, *ϕ* can exceed 1, so the function only returns Cramér’s *V* in that case (*ϕ* is `NA`)
- Rule of thumb for Cramér’s *V* (and *ϕ* in 2x2 tables):
  - 0: no association
  - 0-0.3: small effect size
  - 0.3-0.5: moderate effect
  - 0.5-0.99: large effect
  - 1: perfect association
```
library(vcd)
tab<-table(dataSet$province, dataSet$type)
assocstats(tab)
```

```
## X^2 df P(> X^2)
## Likelihood Ratio 106.56 48 2.4837e-06
## Pearson 118.01 48 7.9382e-08
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.437
## Cramer's V : 0.486
```

- Barplots are the method of choice to plot associations between qualitative variables
- They come in three flavors:
  - Side-by-side: (`position="dodge"`)
  - Stacked: (`position="stack"`)
  - Percentages: (`position="fill"`)
- Observe that we do not specify a `y` axis mapping here

`ggplot(dataSet, aes(x=negation, fill=type, color=type)) + geom_bar(position = "dodge")`


`ggplot(dataSet, aes(x=negation, fill=type, color=type)) + geom_bar(position = "stack")`

- For the percentage plot, we add `scale_y_continuous` to get nicely percent-formatted y-axis labels (requires the scales package)

```
library(scales)
ggplot(dataSet, aes(x=negation, fill=type, color=type)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
```

- We can add data labels to our `dodged` plots with a text layer
- For `stacked` and `filled` plots, this is very complicated

```
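# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot(dataSet, aes(x=negation, fill=type, color=type, label=after_stat(count))) +
geom_bar(position = "dodge") +
# width = 0.9 matches the default bar width, so the labels sit centered over the bars
geom_text(stat='count', position=position_dodge(width = 0.9), vjust=-1, color="black")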
```

- `stacked` plots are great to show your readers how many tokens you have for each ‘condition’ (e.g., for negation, for different locations, …)
- `dodged` plots are great to compare raw frequencies
- `filled` plots are preferred when you want to make the point that the relative frequency of your dependent variable varies by condition

- A report on an association between qualitative variables includes:
  - Contingency table
  - Test type (Chi-squared, Fisher’s exact)
  - Test statistic (Chi-squared)
  - Degrees of freedom
  - Effect sizes
  - Plots

- Please go to: http://www.jeroenclaes.be/statistics_for_linguistics/class6.html and perform the exercises.


- Claes, J. (2017). Cognitive and geographic constraints on morphosyntactic variation: The variable agreement of presentational haber in Peninsular Spanish. *Belgian Journal of Linguistics* (31), 28-53.
- Levshina, N. (2015). *How to do linguistics with R: Data exploration and statistical analysis*. Amsterdam/Philadelphia, PA: John Benjamins.
- Urdan,