# Contents

1. Comparing the median with box-and-whisker plots
2. The basis of inferential statistics: Hypothesis testing
3. Comparing normally distributed data with t-tests
4. A little side note on Standard Errors and confidence intervals
5. Comparing non-normally distributed data with Wilcoxon-tests
6. Comparing groups with plots
7. Questions
8. References

# 1. Data for this class

• 58 responses from a visual lexical decision latency experiment with beginning learners (8-year-old Dutch children) by Perdijk et al. (2006) (data provided by Baayen, 2013).
• Here we will work with the responses to just two words: grinniken 'to chuckle' and ui 'onion'.
• We will use a subset of three columns:
• Word
• Subject (participant code)
• LogRT (logarithm of the reaction times in the lexical decision experiment)
library(readr)
ui <- dataSet[dataSet$Word == "ui", ]
grinniken <- dataSet[dataSet$Word == "grinniken", ]
t.test(ui$LogRT, grinniken$LogRT, alternative = "less", paired = FALSE)
##
##  Welch Two Sample t-test
##
## data:  ui$LogRT and grinniken$LogRT
## t = -1.9195, df = 49.783, p-value = 0.03033
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.01631595
## sample estimates:
## mean of x mean of y
##  7.146464  7.275119

# 3.2 Performing an unpaired t-test in R (4/4)

• t.test also accepts a formula specification and a data argument
• Make sure that your grouping factor has the right order:
• The group that is hypothesized to have lower/higher scores must be the reference level
dataSet$Word <- as.factor(dataSet$Word)
dataSet$Word <- relevel(dataSet$Word, ref="ui")
t.test(LogRT ~ Word, data=dataSet, alternative="less", paired = FALSE)
##
##  Welch Two Sample t-test
##
## data:  LogRT by Word
## t = -1.9195, df = 49.783, p-value = 0.03033
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.01631595
## sample estimates:
##       mean in group ui mean in group grinniken
##               7.146464                7.275119

# 3.3 Paired t-test

• The paired t-test has some important assumptions (Levshina, 2015: 88):
• The subjects have been sampled randomly
• The variable is quantitative
• The differences between the pairs of scores (not the scores themselves!) are normally distributed, and/or the sample size is greater than 30

# 3.4 Performing a paired t-test in R (1/3)

• After some data wrangling we have a nice paired dataset of 32 observations based on our dataset:
• Reaction times for all the subjects for the two words

| Word | Subject | LogRT | diff |
|-----------|-----|-------|---------|
| grinniken | S12 | 7.348 | 0.1603 |
| ui | S12 | 7.188 | 0.1603 |
| grinniken | S13 | 6.92 | -0.4321 |
| ui | S13 | 7.352 | -0.4321 |
| grinniken | S15 | 7.379 | 0.2135 |
| ui | S15 | 7.165 | 0.2135 |
| grinniken | S20 | 7.64 | 0.03822 |
| ui | S20 | 7.602 | 0.03822 |
| grinniken | S21 | 7.161 | 0.1424 |
| ui | S21 | 7.018 | 0.1424 |

# 3.4 Performing a paired t-test in R (2/3)

• The sample size is greater than 30 (32 rows)
• The differences between our scores are normally distributed
shapiro.test(dataSet2[!duplicated(dataSet2$diff), ]$diff)
##
##  Shapiro-Wilk normality test
##
## data:  dataSet2[!duplicated(dataSet2$diff), ]$diff
## W = 0.97584, p-value = 0.922
qqnorm(dataSet2[!duplicated(dataSet2$diff), ]$diff)
qqline(dataSet2[!duplicated(dataSet2$diff), ]$diff)

# 3.4 Performing a paired t-test in R (3/3)

• By setting the argument paired=TRUE we tell R that we want a paired t-test
• Hypothesis:
• Subjects will behave differently for the two words
• Null hypothesis:
• There will be no difference between the two words
dataSet2$Word <- as.factor(dataSet2$Word)
dataSet2$Word <- relevel(dataSet2$Word, ref="ui")
t.test(LogRT ~ Word, data=dataSet2, alternative="two.sided", paired=TRUE)
##
##  Paired t-test
##
## data:  LogRT by Word
## t = -1.9013, df = 15, p-value = 0.07664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.24538967  0.01400181
## sample estimates:
## mean of the differences
##              -0.1156939

# 3.5 Interpreting a t-test output (1/4)

• Let's take a minute to interpret the rich output of t.test:
• Test statistic t
• df = degrees of freedom
• p-value
• Confidence interval:
• If we were to repeat the same experiment over and over again with different samples, 95% of the intervals constructed this way would contain the true difference between the means
• If your confidence interval includes 0 that's a bad thing, because the true difference may be zero

# 3.5 Interpreting a t-test output (2/4)

• Observe that the t-test only provides information on whether or not the difference between the means is significant
• It tells us nothing about the size or importance of the difference
• You should always include a measure of effect size:
• If you have only one variable, the difference between the means (or a standardized measure of effect size)
• If you have multiple variables, a standardized measure of effect size

# 3.5 Interpreting a t-test output: effect size (3/4)

• For differences between means, Cohen's d is the effect size measure of choice:
• (Mean 1 - Mean 2) / standard deviation
• Expresses the difference between the means in standard deviation units
• Here we find that the mean LogRT of ui is 0.47 standard deviation units smaller than the mean LogRT of grinniken
d <- (mean(dataSet[dataSet$Word=="ui",]$LogRT) -
  mean(dataSet[dataSet$Word=="grinniken",]$LogRT)) /
  sd(dataSet$LogRT)
d
## [1] -0.474053
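As a self-contained sketch (with simulated log reaction times; `groupA` and `groupB` are illustrative stand-ins, not the Perdijk et al. data), note that Cohen's d is sometimes computed with the pooled within-group standard deviation rather than the standard deviation of all observations together:

```r
# Sketch with simulated data: two ways to standardize a mean difference.
set.seed(42)
groupA <- rnorm(30, mean = 7.15, sd = 0.25)  # stand-in for 'ui'
groupB <- rnorm(30, mean = 7.28, sd = 0.25)  # stand-in for 'grinniken'

# As on the slide: divide by the standard deviation of all observations
d_overall <- (mean(groupA) - mean(groupB)) / sd(c(groupA, groupB))

# Common alternative: divide by the pooled within-group standard deviation
sPooled <- sqrt(((length(groupA) - 1) * var(groupA) +
                 (length(groupB) - 1) * var(groupB)) /
                (length(groupA) + length(groupB) - 2))
d_pooled <- (mean(groupA) - mean(groupB)) / sPooled

c(d_overall = d_overall, d_pooled = d_pooled)
```

The two versions usually agree closely; the pooled version is what most effect-size calculators report.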

# 3.5 Interpreting a t-test output (4/4)

• When you report differences between means, you report:
• The two means
• Which test you used (paired, unpaired, two-tailed, one-tailed)
• T-statistic value
• Degrees of freedom
• p-value (rounded: e.g., not p = 0.00001244, but rather: p < 0.05)
• If you are comparing multiple groups in a single text, a standardized effect size measure
• The standard error of the mean or the confidence interval of the mean

# 4.1 Standard Error (1/3)

• The standard error is an abstract concept that can be difficult to grasp
• Suppose the following:
• Our population consists of all whole numbers between one and ten
• We draw 100 samples of 100 random values each from this population, allowing the same value to be included more than once (sampling with replacement)
• We calculate the mean of each of these 100 samples
• This gives a new distribution of slightly different mean values, called the sampling distribution of the mean
# Our population
population <- c(1:10)

# Our sampling loop
means <- sapply(1:100, FUN=function(x) {
sample <- sample(population, 100, replace=TRUE)
return(mean(sample))
})

# Results
means
##   [1] 5.34 5.39 5.44 5.73 5.84 5.13 5.50 5.59 5.41 5.77 5.59 5.23 5.37 5.87
##  [15] 5.51 5.45 5.74 5.76 5.35 5.37 5.70 5.92 5.14 5.83 4.98 5.84 5.74 5.10
##  [29] 5.23 5.57 5.43 5.10 5.28 5.42 5.54 5.60 5.34 5.43 5.33 5.32 5.37 5.69
##  [43] 5.27 5.30 5.86 5.24 5.26 5.64 5.62 5.81 5.59 5.91 4.99 5.35 5.80 5.71
##  [57] 5.72 5.35 5.67 5.71 5.55 5.49 5.42 5.39 5.52 5.71 5.30 5.33 5.56 5.07
##  [71] 5.40 5.09 5.64 5.58 5.49 5.72 5.38 6.12 5.02 5.33 6.00 5.48 5.43 5.89
##  [85] 5.72 5.35 4.96 5.48 5.31 5.41 5.65 5.51 5.63 5.15 5.46 5.66 5.56 5.59
##  [99] 5.11 5.67

# 4.1 Standard Error (2/3)

• If we take the standard deviation of the sampling distribution of the mean of a population, we obtain the standard error (SE) of the mean
• The standard error is a hugely important statistic:
• It is the denominator in many statistical tests. E.g. the t-test without variance correction is defined as:
• (Mean 1 - Mean 2)/SE
SE <- sd(means)
SE
## [1] 0.2459502

# 4.1 Standard Error (3/3)

• Luckily, statisticians have come up with a way to estimate the standard error (i.e., the standard deviation of the sampling distribution of the mean) from a single sample:
• standard deviation/square root of sample size
sample <- sample(population, 100, replace=TRUE)
SE <- sd(sample)/sqrt(length(sample))
SE
## [1] 0.2888833
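The two routes to the standard error, simulating the sampling distribution versus the sd/sqrt(n) formula, can be compared directly. This is a sketch under the slides' setup (population 1:10, samples of 100); the exact numbers will differ slightly from the slides because of random sampling:

```r
# Sketch: standard error by simulation vs. by the analytic formula
set.seed(123)
population <- 1:10

# Simulation: sd of the means of many samples of size 100
means <- sapply(1:10000, function(i) mean(sample(population, 100, replace = TRUE)))
seSimulated <- sd(means)

# Formula: sd / sqrt(n). Note that R's sd() divides by n - 1, so for a
# fully enumerated population like 1:10 it slightly overestimates.
seFormula <- sd(population) / sqrt(100)

c(simulated = seSimulated, formula = seFormula)
```

With enough simulated samples the two values converge on roughly the same number, which is why the formula is a usable stand-in for the simulation.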

# 4.2 95% confidence intervals (1/3)

• With the standard error of the mean we can also calculate confidence intervals for the means of two groups and their difference
• Confidence interval:
• If we were to repeat the same experiment over and over again with different samples, 95% of the intervals constructed this way would contain the true population mean

# 4.2 95% confidence intervals (2/3)

• We multiply the standard error by 1.96, the t-value for infinite degrees of freedom and p = 0.05 (1 - 0.05 = 0.95)
• The lower bound is defined as the mean minus this value, the upper bound as the mean plus this value
• Confidence interval for the mean:
• Upper bound: mean + (1.96*SE)
• Lower bound: mean - (1.96*SE)
meanUi <- mean(dataSet[dataSet$Word=="ui",]$LogRT)
meanGrinniken <- mean(dataSet[dataSet$Word=="grinniken",]$LogRT)
SE <- sd(dataSet$LogRT)/sqrt(nrow(dataSet))
confmeanUi <- c(meanUi-(1.96*SE), meanUi+(1.96*SE))
confmeanUi
## [1] 7.076618 7.216310
confmeanGrinniken <- c(meanGrinniken-(1.96*SE), meanGrinniken+(1.96*SE))
confmeanGrinniken
## [1] 7.205273 7.344965

# 4.3 95% confidence interval for the difference of the means and Cohen's d

• We can use the same principle to calculate confidence intervals for the differences and the effect sizes
difference <- meanUi - meanGrinniken
confDifference <- c(difference-(1.96*SE), difference+(1.96*SE))
confDifference
## [1] -0.19850054 -0.05880881
effectSize <- (meanUi - meanGrinniken)/sd(dataSet$LogRT)
confEffect <- c(effectSize-(1.96*SE), effectSize+(1.96*SE))
confEffect
## [1] -0.5438989 -0.4042072
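The "repeated experiments" reading of a 95% confidence interval can be checked with a small simulation (a sketch on simulated data, not part of the original analysis): if we build the mean ± 1.96*SE interval for many independent samples, roughly 95% of those intervals contain the true mean.

```r
# Sketch: coverage of the mean +/- 1.96*SE interval over many experiments
set.seed(1)
trueMean <- 5.5
covered <- sapply(1:2000, function(i) {
  s  <- rnorm(100, mean = trueMean, sd = 1)
  se <- sd(s) / sqrt(length(s))
  (mean(s) - 1.96 * se) <= trueMean && trueMean <= (mean(s) + 1.96 * se)
})
mean(covered)  # proportion of intervals containing the true mean, close to 0.95
```

Any single interval either does or does not contain the true value; the 95% refers to the long-run behavior of the procedure.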

# 5.1 Data

• 3,381 frequent nouns and adjectives from the Corpus of Contemporary American English (COCA), adapted from http://www.wordfrequency.info (Davies, 2018)
• word
• pos (part of speech)
• coca_frequency
• word_length
dataSet2 <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class4_Davies_2008_coca_frequency.csv")
summary(dataSet2$coca_frequency)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4875    7474   12565   25928   26282  769254

# 5.2 Research design

• Research question:
• Are frequent nouns less frequent than frequent adjectives?
• Hypothesis:
• Frequent nouns occur less often than frequent adjectives
• Null hypothesis:
• Frequent nouns do not occur less often than frequent adjectives

# 5.3 When to use Wilcoxon tests

• If your data is very different from a normal distribution, it is best not to use a t-test, even when your sample is large enough
• A Wilcoxon test does not make any assumptions about the shape of the data
• It tests differences in medians:
• What is the relative position of the median value of group A in the values of group B?
• The null hypothesis is that the median does not shift left or right between the two groups (i.e., that its rank position does not change)
• We can use the Wilcoxon test if the data satisfies the following conditions (Levshina, 2015: 110):
• Each sample has been drawn randomly from the population
• The observations are independent, both between the two samples and within each sample
• The variable is quantitative
• The underlying population distributions should be of a similar shape

# 5.3 When to use Wilcoxon tests

ggplot(dataSet2[dataSet2$pos=="noun",], aes(x=coca_frequency)) +
  geom_line(stat="density") +
  geom_vline(aes(xintercept=mean(dataSet2$coca_frequency)), color="red") +
  labs(title="Nouns")

# 5.3 When to use Wilcoxon tests

ggplot(dataSet2[dataSet2$pos=="adjective",], aes(x=coca_frequency)) +
  geom_line(stat="density") +
  geom_vline(aes(xintercept=mean(dataSet2$coca_frequency)), color="red") +
  labs(title="Adjectives")

# 5.4 Computing a Wilcoxon test in R

• Like the t-test, the Wilcoxon test accepts two numeric vectors or a formula with data
• Like the t-test, the Wilcoxon test requires you to specify your alternative hypothesis:
• two.sided for 'different'
• less if the median of the second group is hypothesized to be smaller
• greater if the median of the second group is hypothesized to be larger
• If you use the Wilcoxon test for ordinal data (e.g., Likert scales), you have to set correct=TRUE
• For use on paired data, you have to set paired=TRUE
• conf.int=TRUE will give you a confidence interval of the difference between the medians of the two groups
• conf.level defaults to 0.95 (95% confidence intervals), but you can set it to higher values if needed
dataSet2$pos <- as.factor(dataSet2$pos)
dataSet2$pos <- relevel(dataSet2$pos, ref="noun")
wilcox.test(coca_frequency ~ pos, data = dataSet2, alternative="greater", conf.int=TRUE)
##
##  Wilcoxon rank sum test with continuity correction
##
## data:  coca_frequency by pos
## W = 1117600, p-value = 0.01674
## alternative hypothesis: true location shift is greater than 0
## 95 percent confidence interval:
##  123 Inf
## sample estimates:
## difference in location
##                    558

# 5.5 Interpreting the output of a Wilcoxon test

• The test suggests that nouns occur reliably less often than adjectives
• Unfortunately, standardized effect size measures are not readily available for describing the effect size of non-normally distributed data
• Cohen's d should not be used here, as it assumes normally distributed data

# 5.6 Confidence intervals for the median

• We cannot use the standard error to compute confidence intervals here
• The following (more advanced) code computes the rank orders of the lower/upper limits of the confidence interval
• We can turn it into a function to compute the confidence intervals of the median
confMedian <- function(x) {
  sort(x)[qbinom(c(.025, .975), length(x), 0.5)]
}
median(dataSet2$coca_frequency)
## [1] 12565
confMedian(dataSet2$coca_frequency)
## [1] 12091 13033

# 5.7 Interpreting the output of a Wilcoxon test

• When you report on a Wilcoxon test you provide:
• Medians of the two groups
• Difference of the medians
• Confidence interval of the difference of the medians
• W-statistic
• p-value

# 6. Comparing groups with plots

# 6. Comparing groups with plots (1/3)

• When you report a t-test, you will want to provide a barplot of the means
• When you report a Wilcoxon test, you will want to provide a barplot of the medians
• You can compute by-group means and confidence intervals easily with dplyr

# 6. Comparing means with plots (2/3)

• Consider the following code:
• %>% is a piping function: it passes the output of the previous function on to the first argument of the next function
• group_by: compute everything by Word group
• summarise: summarise the values into one line per group with several columns for each group
library(dplyr)

# Standard error
SE <- sd(dataSet$LogRT) / sqrt(nrow(dataSet))
# Summarization of values
dataForPLot <- dataSet %>%
group_by(Word) %>%
summarise( lowerBound=mean(LogRT) - (1.96*SE),
upperBound=mean(LogRT)+(1.96*SE),
mean=mean(LogRT))
dataForPLot
## # A tibble: 2 x 4
##        Word lowerBound upperBound     mean
##      <fctr>      <dbl>      <dbl>    <dbl>
## 1        ui   7.076618   7.216310 7.146464
## 2 grinniken   7.205273   7.344965 7.275119

# 6. Comparing means with plots (3/3)

• Now we can plot dataForPLot with ggplot2
ggplot(dataForPLot, aes(x=Word, y=mean, color=Word, fill=Word)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound), color="black", width=0.5) 

# 6. Comparing medians with plots (1/2)

• If we pass our confidence interval function to dplyr, it will calculate the confidence intervals for the median of each group
# Confidence intervals for the median
confMedian <- function(x) {
sort(x)[qbinom(c(.025,.975), length(x), 0.5)]
}

# Summarization of values
dataForPLot <- dataSet2 %>%
group_by(pos) %>%
summarise(lowerBound=confMedian(coca_frequency)[1],
upperBound=confMedian(coca_frequency)[2],
median=median(coca_frequency))
dataForPLot
## # A tibble: 2 x 4
##         pos lowerBound upperBound median
##      <fctr>      <int>      <int>  <dbl>
## 1      noun      12155      13541  12684
## 2 adjective      11299      12916  12011
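The binomial-rank confMedian function used above is one option; a nonparametric bootstrap is a common alternative and tends to give similar intervals. A sketch on simulated skewed, frequency-like data (not the COCA values):

```r
# Sketch: bootstrap 95% confidence interval for the median of skewed data
set.seed(7)
x <- rexp(500, rate = 1 / 12000)  # simulated skewed, frequency-like values

# Resample the data with replacement many times; take the median each time
bootMedians <- sapply(1:2000, function(i) {
  median(sample(x, length(x), replace = TRUE))
})

# The middle 95% of the bootstrap medians is the confidence interval
quantile(bootMedians, c(.025, .975))
```

The bootstrap is more computationally expensive, but it generalizes to statistics (trimmed means, ratios, etc.) for which no rank-based shortcut exists.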

# 6. Comparing medians with plots (2/2)

• Now we can plot the dataForPLot data with ggplot2
ggplot(dataForPLot, aes(x=pos, y=median, color=pos, fill=pos)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=lowerBound, ymax=upperBound), color="black", width=0.5) 

# 8. References

• Baayen, R. H. (2013). languageR: Data sets and functions with “Analyzing Linguistic Data: A practical introduction to statistics”. https://cran.r-project.org/web/packages/languageR/index.html.
• Davies, M. (2018). Top-5000 most frequent words in Corpus of Contemporary American English (COCA). http://www.wordfrequency.info.
• Levshina, N. (2015). How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam/Philadelphia, PA: John Benjamins.
• Perdijk, K., Schreuder, R., Verhoeven, L. and Baayen, R. H. (2006). Tracing individual differences in reading skills of young children with linear mixed-effects models. Manuscript, Radboud University Nijmegen.