- How to load and view data from a CSV file
- Working with data.frames
- Measures of central tendency
- Normal distribution
- Dispersion measures
- What to do when your data is not distributed normally

In this class, we will be working with data by Balota et al. (2007), provided by Levshina (2015):

- Data in CSV (comma-separated values)
- Experiment in which subjects had to decide whether or not a word is an existing word.
- Columns:
**Word**,**Word length**,**Word frequency**, and**Mean reaction time**

To load data, we use the package `readr`

and we load the data straight from the course website. This works the same for local data sets.

```
library(readr)
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class2_balota_et_al_2007.csv")
```

In Excel:

- Save as ‘CSV’ (Comma-separated values)
- Make sure you choose UTF-8 encoding (on Excel 2016 for Mac:
*CSV UTF-8 Encoding*) - If your Excel is set to Belgian locale, the separator will not be the comma
`,`

, but rather the semicolon`;`

. - In that case, you have to use the
`read_csv2`

function to load your data.

Now let’s check out how the object `dataSet`

looks like. Recall that we can use `str()`

for this purpose.

`str(dataSet)`

We can also get a nice table-like view of it

`View(dataSet)`

With `dim`

we compute the number of dimensions (columns and rows), with `nrow`

the number of rows, and with `ncol`

the number of columns.

`dim(dataSet)`

`## [1] 100 4`

`nrow(dataSet)`

`## [1] 100`

`ncol(dataSet)`

`## [1] 4`

`head`

Print the first N lines from the data.frame

```
#first 6 lines
head(dataSet)
```

```
## # A tibble: 6 x 4
## Length Freq Mean_RT Word
## <int> <int> <dbl> <chr>
## 1 8 131 819.19 marveled
## 2 10 82 977.63 persuaders
## 3 7 0 908.22 midmost
## 4 6 592 766.30 crutch
## 5 12 2 1125.42 resuspension
## 6 12 9 948.33 efflorescent
```

`tail`

Print the last N lines from the data.frame

```
#last 10 lines
tail(dataSet, 10)
```

```
## # A tibble: 10 x 4
## Length Freq Mean_RT Word
## <int> <int> <dbl> <chr>
## 1 6 757 641.06 club's
## 2 6 775 775.59 minced
## 3 5 14860 602.09 drain
## 4 7 31679 576.58 justice
## 5 10 181 792.93 numerology
## 6 6 158 876.56 powwow
## 7 8 429 710.63 backdoor
## 8 11 93 875.27 maladjusted
## 9 6 2434 627.11 Elaine
## 10 11 162 1458.75 diacritical
```

`glimpse`

For data with many columns, RStudio will omit the columns that do not fit the console window. There’s a function `glimpse`

in the `dplyr`

package that prints the first couple of lines of your data frame as a list.

```
library(dplyr)
glimpse(dataSet)
```

```
## Observations: 100
## Variables: 4
## $ Length <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
## $ Freq <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
## $ Word <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...
```

`summary`

The function `summary`

prints a general overview of the values (minimum, maximum, median,mean,…) in a data.frame.

`summary(dataSet)`

```
## Length Freq Mean_RT Word
## Min. : 3.00 Min. : 0.0 Min. : 564.2 Length:100
## 1st Qu.: 6.00 1st Qu.: 53.5 1st Qu.: 713.1 Class :character
## Median : 8.00 Median : 310.5 Median : 784.9 Mode :character
## Mean : 8.23 Mean : 3350.3 Mean : 808.3
## 3rd Qu.:10.00 3rd Qu.: 2103.2 3rd Qu.: 905.2
## Max. :15.00 Max. :75075.0 Max. :1458.8
```

To extract a column (a single vector) from the data.frame, we use the dollar sign operator e.g., `dataSet$Word`

```
#first 6 lines of just the column Word
head(dataSet$Word)
```

```
## [1] "marveled" "persuaders" "midmost" "crutch"
## [5] "resuspension" "efflorescent"
```

```
#last 10 lines
tail(dataSet$Word, 10)
```

```
## [1] "club's" "minced" "drain" "justice" "numerology"
## [6] "powwow" "backdoor" "maladjusted" "Elaine" "diacritical"
```

To extract multiple columns from a data.frame, we simply put a `character vector`

of the column names we want to extract between square brackets, preceded by a comma:

```
#first 6 lines of Word and Freq
head(dataSet[, c("Word", "Freq")])
```

```
## # A tibble: 6 x 2
## Word Freq
## <chr> <int>
## 1 marveled 131
## 2 persuaders 82
## 3 midmost 0
## 4 crutch 592
## 5 resuspension 2
## 6 efflorescent 9
```

To extract rows from a data.frame we use a numeric vector of row numbers between square brackets, followed by a comma

`dataSet[c(1, 2, 5), ]`

```
## # A tibble: 3 x 4
## Length Freq Mean_RT Word
## <int> <int> <dbl> <chr>
## 1 8 131 819.19 marveled
## 2 10 82 977.63 persuaders
## 3 12 2 1125.42 resuspension
```

- You will have noticed that the general pattern for subsetting dataframes is the following:
- dataframeName[rows, columns]

- Instead of writing the row numbers ourselves, we can also ask R to compute the indices of rows that fulfill some condition.
- The following statement will return the first six rows of all rows for which
`Freq > 100`

`head(dataSet[dataSet$Freq > 100, ])`

```
## # A tibble: 6 x 4
## Length Freq Mean_RT Word
## <int> <int> <dbl> <chr>
## 1 8 131 819.19 marveled
## 2 6 592 766.30 crutch
## 3 3 14013 641.67 row
## 4 5 290 654.12 jumpy
## 5 6 3264 583.81 enjoys
## 6 6 3523 667.18 screws
```

- Other useful conditions are:
`is.na()`

(e.g.,`dataSet[is.na(dataSet$Freq), ]`

) - “is empty value”`!is.na()`

- “is NOT is empty value”`<`

- “smaller than”`<=`

- “smaller than or equal to”`>`

- “larger than”`>=`

- “larger than or equal to”`%in%`

(e.g.,`dataSet[dataSet$Freq %in% c(1,2,4)]`

) - “one of list”`%in%`

with NOT operator (e.g.,`dataSet[!dataSet$Freq %in% c(1,2,4)]`

) - “not one of list”

- Adding columns is just as easy as selecting columns
- We simply type the dataframe name, followed by the dollar-sign operator and a name for the column. Then we can assign whatever value we want to the new column

```
dataSet$LogFreq <- log(dataSet$Freq)
glimpse(dataSet)
```

```
## Observations: 100
## Variables: 5
## $ Length <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
## $ Freq <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
## $ Word <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...
## $ LogFreq <dbl> 4.8751973, 4.4067192, -Inf, 6.3835066, 0.6931472, 2.19...
```

- Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html
- Perform the exercises under
*data.frames*

- In statistics a set of scores/values/numbers is called a
`distribution`

. - The values in the columns
`Length`

,`Freq`

, and`Mean_RT`

each constitute a distribution. - The typical values of a distribution are called its
`central tendency`

- Before we can establish relationships between variables, we have to understand their distributions first!

- We can describe distributions by calculating the following statistics:
- Mean (average)
- Median
- Quantiles/quartiles
- Mode
- Minimum
- Maximum

Each of these statistics will allow us to say something about the typical values (`central tendency`

) of the distribution.

- The
`mean`

of a distribution equals to the sum of the values of a distribution divided by the number of values in the distribution. - Summarizes the typical values of the distribution in a single number

```
#Mean word length
mean(dataSet$Length, na.rm=TRUE)
```

`## [1] 8.23`

- The
`median`

of a distribution equals to the value that marks the middle of the distribution. - If a distribution has 100 values and we rank them from small to large, the median will be the 50th value.

`median(dataSet$Length, na.rm=TRUE)`

`## [1] 8`

When combined with the mean, the median provides a good idea of the central tendency of a distribution:

- The mean is prone to shift upwards or downwards if there is a single atypical value (called
`outlier`

in statistics) - The median remains identical

```
#Compare
median(c(1, 2, 3, 4, 5, 6, 7, 8000, 190, 1000000)) # mean = 100821.8
```

`## [1] 5.5`

`median(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) # mean = 5.5`

`## [1] 5.5`

- A related concept are
`quantiles`

(median = 0.5 quantile) `0.25 quantile`

is also called the`first quartile`

, median is second quartile etc.- Quantiles express that a percentage of the values of the distribution is lower than a particular value
- Useful to study the spread of the values in the distribution

```
# One quantile
quantile(dataSet$Length, 0.25)
```

```
## 25%
## 6
```

```
# Multiple quantiles
quantile(dataSet$Length, c(0.25, 0.5, 0.75))
```

```
## 25% 50% 75%
## 6 8 10
```

- Most frequent value in a distribution
- Not widely used

```
# table() counts the number of times a value occurs in a vector
tab <- table(dataSet$Length)
# We sort the table by its values, in decreasing order
sorted <- sort(tab, decreasing = TRUE)
# We extract the first value, which is the highest.
mode <- sorted[1]
# The number 8 is the mode. It occurs 16 time
mode
```

```
## 8
## 16
```

```
# In one line:
sort(table(dataSet$Length), decreasing = TRUE)[1]
```

```
## 8
## 16
```

- Minimum: lowest value

`min(dataSet$Length, na.rm=TRUE)`

`## [1] 3`

- Maximum: highest value

`max(dataSet$Length, na.rm=TRUE)`

`## [1] 15`

By calling `summary`

on a single column, you can calculate many of the statistics discussed here with one command

`summary(dataSet$Length) `

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 8.00 8.23 10.00 15.00
```

- Mean
- Median

If the mean is lower than the median, the distribution is `negatively skewed`

. If the mean is higher than the median, it is `positively skewed`

.

- Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html
- Perform the excercises under
*2.Measures of central tendency*

- Statisticians have identified different types of distributions
- The most important is the
`normal`

or`Gaussian`

distribution - Most statistical tests assume that your data
`approaches`

this**theoretical**distribution:- Important to check whether your data is
`normally distributed`

- Otherwise statistical tests may produce unreliable results!

- Important to check whether your data is
- Most data drawn from sufficiently large samples will be normally distributed

- The
`mean`

and the`median`

are not very different (easy test) - The largest amount of values occur around the mean, the lowest amount of values occur near the minimum and maximum
- The distribution is symmetrical: there are about as many values that are smaller than the mean than there are values that are larger than the mean

- This pattern shows up as a
`bell-shaped curve`

in density plots

- Plotting (preferred)
- Shapiro-Wilk test

`ggplot`

- Let’s take a moment to learn the basic syntax of
`ggplot`

,a highly structured (almost language-like) plotting package - This command sets up the basis of
`ggplot`

:- It defines the data we are working with (
`dataSet`

) - It defines the aesthetics (
`aes()`

) that are inherited by the other plotting functions that follow behind the`+`

signs:- Our
`x`

axis (the horizontal axis) will plot the`Length`

variable - Our
`y`

axis (the vertical axis) will plot the`Mean_RT`

variable - If we don’t specify a
`y`

axis, this will have to be calculated by a plotting function (`stat="bin"`

or`stat="density"`

)

- Our

- It defines the data we are working with (

```
library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT))
```

`ggplot`

- After the basic setup, we can add multiple plot layers by adding plotting functions separated by a
`+`

sign:`geom_line()`

for line graphs`geom_bar()`

for bar charts`geom_vline()`

for vertical lines`geom_abline()`

for horizontal lines`geom_point()`

for points (`scatterplots`

)`stat_summary()`

to summarise the`y`

values in your data in function of`x`

before plotting (here we take the mean of all`Mean_RT`

values that correspond to the same`Length`

value)

```
library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
stat_summary(fun.y = "mean", geom = "line")
```

`ggplot`

```
library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) +
stat_summary(fun.y = "mean", geom = "line")
```

- Suppose we want to know if our variable
`Length`

approaches a normal distribution:- We draw up a density plot, an ordered plot that visualizes the number of occurrences of each value in our distribution
- We add a red vertical line at the mean
- We look for the characteristic bell curve and symmetrical pattern
- Observe that the
`geom_vline`

has its own aesthetic:`xintercept`

, which defines the point where it crosses/intercepts the`x`

axis (the x - intercept). We set this aesthetic to the mean of our`Length`

variable

```
library(ggplot2)
ggplot(dataSet, aes(x=Length)) +
geom_line(stat="density") +
geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")
```

```
library(ggplot2)
ggplot(dataSet, aes(x=Length)) +
geom_line(stat="density") +
geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")
```

- Suppose we want to know if our variable
`Length`

approaches a normal distribution:- We draw up a histogram, an ordered plot that visualizes the number of occurrences of each value in our distribution
- We add a red vertical line at the mean
- We look for the characteristic bell curve and symmetrical pattern

```
library(ggplot2)
ggplot(dataSet, aes(x=Length)) +
geom_histogram(bins=11) +
geom_vline(aes(xintercept=mean(dataSet$Length)), color="red")
```

- Another way to identify if your data are normally distributed are QQ-plots
- If the dots follow the line closely, the data are normally distributed

```
qqnorm(dataSet$Length)
qqline(dataSet$Length)
```

```
qqnorm(dataSet$Length)
qqline(dataSet$Length)
```

- If p < 0.05 (don’t worry about what that means for now), the data is
**not**normally distributed - Plotting is preferred (especially QQ-plots), because this test depends on sample size

`shapiro.test(dataSet$Length)`

```
##
## Shapiro-Wilk normality test
##
## data: dataSet$Length
## W = 0.97756, p-value = 0.08559
```

- Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html
- Do the excercises under
*3. The normal distribution*

- If you data are normally distributed, the
`mean`

and the`median`

are good representatives of the typical values of a distribution - Our analysis of the
`Freq`

and`Mean_RT`

variables have shown that this is not the case for non-normal distributions - Measures of dispersion express how much variation there is between values of a distribution. They include:
- Range
- Variance
- Standard deviation
- Interquartile Range
- Median absolute deviation

- A
`boxplot`

allows you to display dispersion graphically

The range is the difference between the maximum and the minimum value of a distribution

`max(dataSet$Length) - min(dataSet$Length)`

`## [1] 12`

- Sum of the squared deviations from the mean, divided by the sample size minus one.
- The larger the variance, the larger the differences between the individual values and the mean
- Not on the same scale as the data: squared (the sum of the deviations from the mean would be zero without squaring it)
- If the distribution is non-normal, this will lead us to overestimate the variation between the mean and an average score

```
# Same as:
# sum((dataSet$Length-mean(dataSet$Length))^2 )/(length(dataSet$Length)-1)
var(dataSet$Length)
```

`## [1] 6.259697`

- Then standard deviation is the square root of the
`variance`

- By taking the square root of the variance (which are the squared deviations from the mean) we get back to the original units of our data
- Expresses the average deviation from the mean
- If the distribution is non-normal, the standard deviation will lead us to overestimate the variation between the mean and an average score

`sd(dataSet$Length)`

`## [1] 2.501939`

- The interquartile range expresses the difference between the first quartile (the 25% lowest values) and the third quartile (the 75% lowest values)
- More robust for non-normal distributions than the variance and standard deviations

`IQR(dataSet$Length)`

`## [1] 4`

- The median of the deviations between the values in the distribution and the mean
- Based on the median, so it is not sensitive to outliers

`mad(dataSet$Length)`

`## [1] 2.9652`

- A box-and-whisker plot displays the
`mean`

and the`min`

and`max`

individual values - If there are outliers, the lines will be longer on one end of the graph

```
library(ggplot2)
ggplot(dataSet, aes(x=1, y=Length)) + geom_boxplot()
```

```
library(ggplot2)
ggplot(dataSet, aes(x=1, y=Length)) + geom_boxplot()
```

When you describe a distribution, you report:

- Mean
- Median
- AND:
- If the data are
**normally distributed**, you report the`standard deviation`

- If the data are not normally distributed, you report
`IQR`

or`mad`

scores

- If the data are
- Ideally you also provide a box-and-whisker plot

- Go to http://www.jeroenclaes.be/statistics_for_linguistics/class2.html
- Do the excercises under
*4. Measures of dispersion*

- We have seen that
`Freq`

and`Mean_RT`

are not normally distributed. - This does not mean that we cannot analyse them, but we will have to be careful. Most tests do not behave well when the data is not normally distributed or contains outliers (e.g., standard deviation and variance)
- In general when you encounter a non-normal distribution you have two options:
- Remove outliers
- Apply a transformation

- If the deviation from the normal distribution is caused by a few atypical values, you may exclude these from your analysis
- Let’s consider
`Mean_RT`

`summary(dataSet$Mean_RT)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 564.2 713.1 784.9 808.3 905.2 1458.8
```

- The summary shows that the interquartile range is 192.1
- The difference between the 3rd quartile and the maximum value is 553.6.
- This suggest that there may be a few exceptionally large values in the right tail of the distribution

- Subjects appear to have had much difficulty with recognizing the top-4 words
`diacritical`

,`dessertspoon`

,`acquisitiveness`

, and`resuspension`

- Maybe these were not the best test items?

Length | Freq | Mean_RT | Word | LogFreq |
---|---|---|---|---|

11 | 162 | 1459 | diacritical | 5.088 |

12 | 11 | 1314 | dessertspoon | 2.398 |

15 | 4 | 1217 | acquisitiveness | 1.386 |

12 | 2 | 1125 | resuspension | 0.6931 |

11 | 185 | 1089 | denominated | 5.22 |

11 | 46 | 1055 | euphemistic | 3.829 |

- It is usually a good idea to select outliers based on
`z-scores`

`z-scores`

express how far the value is removed from the mean in standard deviation units- Positive z-scores are values that are larger than the mean, negative z-scores are values that are smaller

```
#(dataSet$Mean_RT - mean(dataSet$Mean_RT))/sd(dataSet$Mean_RT)
dataSet$zscores <- scale(dataSet$Mean_RT)
summary(dataSet$zscores)
```

```
## V1
## Min. :-1.5931
## 1st Qu.:-0.6210
## Median :-0.1522
## Mean : 0.0000
## 3rd Qu.: 0.6328
## Max. : 4.2459
```

- With this score we remove all values that are more than two standard deviations removed from the mean
`(absolute value of z >= 2)`

(Urdan, 2010: 18) - We need the absolute values (we have to ignore their +/- sign), so we wrap the scores with
`abs`

`dataSet <- dataSet[abs(dataSet$zscores) <= 2, ]`

- To check if our problem is solved, we can plot our variable

```
ggplot(dataSet, aes(x=Mean_RT)) +
geom_line(stat="density") +
geom_vline(aes(xintercept=mean(dataSet$Mean_RT)), color="red")
```

- The variable
`Freq`

cannot be brought to a normal distribution by removing a few outliers - It has the typical
`power-law distribution`

of word frequencies (Zipf’s law):- A few words occur very often
- The rest of the words have a very low frequency

- Distribution with a very broad range
- Can only be corrected with a transformation

- Popular transformations are:
- Logarithm:
- Expresses to which power we have to raise the mathematical constant
`e = 2.718281828459`

(the base) to obtain the input number`log(1000) = 6.907755`

, because`2.718281828459 ^ 6.907755 = 1000`

- Other bases are:
`10`

(`log10(1000) = 3; 10 ^ 3 = 1000`

)`2`

(`log2(1000) = 9.965784; 2 ^ 9.965784 = 1000`

)

- Fixes
`power-law distributions`

- Exponentiate the base with the logarithm to undo

- Expresses to which power we have to raise the mathematical constant

- Logarithm:

- Popular transformations are:
- Logarithm (fixes
`power-law`

distributions) - Square root (fixes
`positively skewed distributions`

) - Square transformation (fixes
`negatively skewed distributions`

) - Reciprocal transformation (
`1/(vector + 1)`

), for distributions that look like a`j`

- Logarithm (fixes
- Best practice is to check which transformation works best for your data

- Let’s apply the logarithm to the
`Freq`

variable - The log of
`0`

is`-Inf`

, which would render our data useless. - We add
`1`

to the values of`Freq`

and then we apply the transformation (the log of`1`

is`0`

)

```
dataSet$transformedFreq <- log(1+dataSet$Freq)
qqnorm(dataSet$transformedFreq)
qqline(dataSet$transformedFreq)
```

- There’s always a trade-off to be made:
- If you need to have your distribution to be normal for some test, then you can consider a transformation
- Transforming your data makes it harder to interpret the values. E.g.,
`log(dataSet$Freq)`

is less intuitively understandable than`dataSet$Freq`

- We will revisit transformations in later classes

?

- Balota, D.A., Yap, M.J., & Cortese, M.J., et al. (2007). The English Lexicon Project.
*Behavior Research Methods*, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015). - Levshina, N. (2015).
*How to do Linguistics with R: Data exploration and statistical analysis*. Amsterdam/Philadelphia, PA: John Benjamins.