Topic 5: Relations between two quantitative variables

Jeroen Claes

Contents

  1. Different types of data
  2. What are correlations?
  3. Exploring correlations with plots
  4. Pearson product-moment coefficient r
  5. Spearman’s rho and Kendall’s tau
  6. Reporting on correlations
  7. Exercises
  8. References

1. Different types of data

1. Different types of data

  • Data can be of different types:
    • Nominal or categorical (e.g., yes vs. no)
    • Quantitative:
      • Ordinal-scaled
      • Interval-scaled
      • Ratio-scaled

1. Different types of data

  • Ordinal-scaled data:
    • The truly meaningful information is contained not in the values themselves, but in their ordering. E.g., Likert scales
    • 1 and 5 have no meaning in relation to each other beyond the relative degree of ‘agree’ vs. ‘disagree’ they represent (see the sketch below)
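A minimal sketch of how ordinal data can be stored in R as an ordered factor (the response values are hypothetical):
# Hypothetical Likert-scale responses, stored as an ordered factor
responses <- factor(c(1, 5, 3, 2, 4),
                    levels = 1:5,
                    labels = c("strongly disagree", "disagree", "neutral",
                               "agree", "strongly agree"),
                    ordered = TRUE)
# Only the ordering is meaningful: comparisons work, arithmetic does not
responses[1] < responses[2]
## [1] TRUE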

1. Different types of data

  • Interval-scaled data:
    • E.g., degrees Celsius
    • We know the ordering of the values
    • Each value is meaningful on the scale on its own. I.e., each value represents a temperature
    • There is no true zero:
      • 0 does not represent absence of temperature
      • It is a measurement like any other on the scale

1. Different types of data

  • Ratio-scaled data:
    • E.g., counts of bacteria on a surface
    • We know the ordering of the values
    • Each value is meaningful on the scale on its own
    • There is a true zero: 0 represents absence of bacteria on the surface

2. What are correlations?

2. What are correlations? (1/4)

  • A relationship between two interval-scaled or ratio-scaled variables in which the two increase or decrease in parallel:
    • If X increases by one unit, there will be a constant increase/decrease of N units in Y
    • If X decreases by one unit, there will be a constant increase/decrease of N units in Y
  • Data must be paired: for each value in X there must be a corresponding value in Y

2. What are correlations? (2/4)

  • If X increases and Y increases as well, this is called a positive correlation
    • E.g., Age correlates positively with vocabulary size in young children: the older they get, the more words they know.
  • If X increases and Y decreases, this is called a negative correlation
    • E.g., Zipf (1935: 25) found that word frequency is negatively correlated with word length: the more frequent a word is, the shorter it tends to be (see the simulated sketch below)
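To make both directions concrete, here is a quick sketch with simulated (invented) data:
# Simulated data: y_pos rises with x, y_neg falls with x
set.seed(10)
x <- 1:20
y_pos <- 2 * x + rnorm(20)   # positive correlation
y_neg <- -2 * x + rnorm(20)  # negative correlation
cor(x, y_pos)                # close to +1
cor(x, y_neg)                # close to -1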

2. What are correlations? (3/4)

  • When we analyze relationships between quantitative variables, we are interested in three aspects:
    1. The direction of the correlation: positive or negative correlation?
    2. The size of the correlation: how strong is the relationship between the two variables?
    3. Whether or not the relationship is statistically significant
  • The first two can be established with a correlation coefficient, the last one requires a hypothesis test based on the correlation coefficient

2. What are correlations? (4/4)

  • Observe that correlations are different from paired t-tests and Wilcoxon tests (see the sketch below):
    • The t-test compares the means of the pairs
    • The Wilcoxon test compares the medians of the pairs
    • Correlations, in contrast, explore the strength of the association between the values of the pairs; they offer no information on means, medians, or (by themselves) the statistical significance of the relationship
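A minimal sketch with simulated paired data to make the contrast concrete (the reaction times are invented):
# 30 invented paired reaction times: 'after' is on average 25 ms faster
set.seed(1)
before <- rnorm(30, mean = 700, sd = 50)
after  <- before - 25 + rnorm(30, sd = 15)
t.test(before, after, paired = TRUE) # compares the means of the pairs
cor(before, after)                   # strength of the association; says nothing about the means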

3. Exploring correlations with plots

3. Exploring correlations with plots (1/3)

  • We will be working again with the dataset by Balota et al. (2007)
  • Research question:
    • What is the relationship between the Length of a word and the Mean_RT in a lexical decision task?
  • Hypothesis:
    • Shorter words will be recognized faster than longer words
  • Null hypothesis:
    • There is no relationship between word length and reaction time
library(readr)
library(dplyr)
dataSet <- read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class2_balota_et_al_2007.csv")
glimpse(dataSet)
## Observations: 100
## Variables: 4
## $ Length  <int> 8, 10, 7, 6, 12, 12, 3, 11, 11, 5, 6, 6, 11, 4, 11, 8,...
## $ Freq    <int> 131, 82, 0, 592, 2, 9, 14013, 15, 48, 290, 3264, 3523,...
## $ Mean_RT <dbl> 819.19, 977.63, 908.22, 766.30, 1125.42, 948.33, 641.6...
## $ Word    <chr> "marveled", "persuaders", "midmost", "crutch", "resusp...

3. Exploring correlations with plots (2/3)

  • Before you do anything else, it is usually a good idea to plot the two variables and their relationship in a scatterplot:
    • geom_point plots a dot for each pair of X, Y values
    • geom_smooth fits a line through these dots, which lets us inspect the relationship between the variables:
      • Positive correlation: the line slopes upward, so its highest point is on the right-hand side
      • Negative correlation: the line slopes downward, so its lowest point is on the right-hand side
    • geom_smooth also adds a 95% confidence interval around the line, within which the true population relationship between the variables is likely to be situated
    • method="lm" tells ggplot2 to fit a linear relationship (a linear regression)
library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) + 
  geom_point() + 
  geom_smooth(method="lm")

3. Exploring correlations with plots (3/3)

library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) + 
  geom_point() + 
  geom_smooth(method="lm")
(The resulting scatterplot shows an upward-sloping regression line: longer words have higher mean reaction times.)

4. Pearson’s product-moment r

4. Pearson’s product-moment r

  • Just by inspecting the plot we can already see that there is a linear relationship between the length of the word and the time subjects take to recognize it
  • To calculate how strongly the two are associated, we can calculate the Pearson product-moment correlation coefficient

4.1 Assumptions of the Pearson product-moment correlation (1/4)

  • The relationship between the two variables is monotonic:
    • Each increase in X goes together with an increase in Y and each decrease in X with a decrease in Y, or vice versa
  • The relationship between the two variables is linear:
    • Each increase of one unit in X will trigger a constant increase of N units in Y
  • There are no outliers in the data (Levshina, 2015: 122); see the screening sketch below
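One quick, hedged way to screen for outliers is to standardize a variable and flag extreme values (the 2-standard-deviation cutoff is a common rule of thumb, not a law):
# Which observations lie more than 2 SDs from the mean of Mean_RT?
which(abs(scale(dataSet$Mean_RT)) > 2)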

4.1 Assumptions of the Pearson product-moment correlation (2/4)

  • If the data fail to meet these assumptions, the correlation coefficient will not be robust
  • Non-linear relationships:
    • Try a transformation of X or Y that makes the relationship linear (e.g., square, logarithm)
library(ggplot2)
ggplot(dataSet, aes(x=Freq, y=Mean_RT)) + 
  geom_point() + 
  geom_smooth(method="lm")

4.1 Assumptions of the Pearson product-moment correlation (3/4)

  • If the data fail to meet these assumptions, the correlation coefficient will not be robust
  • Non-linear relationships:
    • Try a transformation of X or Y that makes the relationship linear (e.g., square, logarithm); here we log-transform Freq
ggplot(dataSet, aes(x=log(1+Freq), y=Mean_RT)) + 
  geom_point() + 
  geom_smooth(method="lm")

4.1 Assumptions of the Pearson product-moment correlation (4/4)

  • Non-linear relationships:
    • Use Spearman’s rho or Kendall’s tau (see below)
  • Outliers:
    • Remove the outliers
  • Non-monotonic relationships:
    • The correlation coefficient will be close to 0 even when the variables are strongly related, so computing it is pointless (see the sketch below)
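A minimal simulated illustration of why a non-monotonic (here U-shaped) relationship defeats the correlation coefficient:
# y is perfectly determined by x, yet r is essentially 0
x <- seq(-3, 3, length.out = 100)
y <- x^2
cor(x, y) # approximately 0, despite the perfect (quadratic) dependence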

4.2 Pearson’s product-moment r (1/3)

  • Logic of the test:
    • The values of the two variables are scaled to Z-scores
    • Each scaled value of Mean_RT is multiplied with the corresponding scaled value of Length
    • The products are summed and divided by n - 1 (scale() standardizes with the sample standard deviation)
sum(scale(dataSet$Length)*scale(dataSet$Mean_RT))/
  (nrow(dataSet)-1)
## [1] 0.6147456

4.2 Pearson’s product-moment r (2/3)

  • In R, you can use the cor function to calculate the Pearson product-moment correlation coefficient r
cor(dataSet$Length, dataSet$Mean_RT)
## [1] 0.6147456

4.2 Pearson’s product-moment r (3/3)

  • The correlation coefficient is 0.6147456
  • This tells us that:
    • There is a positive correlation: if Length increases, Mean_RT increases too (if the correlation is negative, then the coefficient is negative)
    • The relationship is moderately strong
      • 0: No correlation
      • +/- 0-0.3: Weak correlation
      • +/- 0.3-0.7: Moderate correlation
      • +/- 0.7-1: Strong correlation
      • +/- 1: Perfect correlation

4.3 Dangers of ignoring the assumptions

  • If we remove outliers, the correlation coefficient will change, because outliers pull the fitted line on the plots up or down
dataSet <- dataSet[abs(scale(dataSet$Mean_RT))<2,]
cor(dataSet$Length, dataSet$Mean_RT)
## [1] 0.5723765
  • If we fail to recognize a non-linear relationship, the correlation coefficient may be substantially different. Compare:
# Frequency vs. Mean_RT, without log transformation of Frequency
cor(dataSet$Freq, dataSet$Mean_RT)
## [1] -0.4115368
# Frequency vs. Mean_RT, with log transformation of Frequency
cor(log(1+dataSet$Freq), dataSet$Mean_RT)
## [1] -0.6171241

4.4 Testing the significance of correlations

  • If we want to test whether a correlation is significant, a few additional assumptions should be met (Levshina, 2015: 126), besides the assumptions of the correlation coefficient itself.
  • These assumptions are shared by more advanced techniques such as linear regression:
    • The sample is randomly selected from the population it represents.
    • Both variables are interval- or ratio-scaled. Ordinal variables won’t work!
    • The sample size is greater than 30 and/or the Y-values that correspond to each value of X are normally distributed and vice versa (bivariate normal distribution)
    • The relationship between the variables is homoskedastic: the strength of the relationship between the variables is equal across the board
    • The values of X and Y are completely independent: there is no autocorrelation (correlation between successive values of X or between successive values of Y). E.g., temperature increases/decreases gradually over the course of a year: the temperature on Feb 26 is correlated with the temperatures of Feb 25 and Feb 27, because random jumps in temperature do not occur (see the simulated sketch below)
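A tiny simulated sketch of autocorrelation, using a random walk in which each value depends on the previous one (the ‘temperatures’ are invented):
# Each day's temperature is yesterday's plus a small random step
set.seed(3)
temp <- 10 + cumsum(rnorm(365, sd = 0.5))
cor(temp[-365], temp[-1]) # adjacent days are strongly correlated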

4.4.1 Validating the assumptions

  • The sample size is greater than 30 and/or the Y-values that correspond to each value of X are normally distributed and vice versa (bivariate normal distribution):
    • mvnorm.etest from the energy package
    • First argument:
      • data.frame with our two variables
    • Second argument:
      • the number of replicates used to compute the p-value (1000 is good practice)
    • Null hypothesis:
      • The data have a bivariate normal distribution (if p < 0.05, they do NOT have a bivariate normal distribution)
library(energy)

mvnorm.etest(dataSet[,c("Length", "Mean_RT")], 1000)
## 
##  Energy test of multivariate normality: estimated parameters
## 
## data:  x, sample size 96, dimension 2, replicates 1000
## E-statistic = 0.47346, p-value = 0.903

4.4.1 Validating the assumptions

  • The relationship between the variables is homoskedastic: the strength of the relationship between the variables is equal across the board:
    • Heteroskedasticity will show up as a funnel-like pattern on a scatterplot (see the simulated sketch below)
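A simulated sketch of what the funnel pattern looks like (the data are invented for illustration):
# Error variance grows with x, producing a funnel-shaped scatter
set.seed(5)
x <- runif(200, 1, 10)
y <- 2 * x + rnorm(200, sd = x)
ggplot(data.frame(x, y), aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")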

4.4.1 Validating the assumptions

  • Heteroskedasticity is not an issue here at first glance:
library(ggplot2)
ggplot(dataSet, aes(x=Length, y=Mean_RT)) + 
  geom_point() + 
  geom_smooth(method="lm")

4.4.1 Validating the assumptions

  • Heteroskedasticity is a serious problem for correlation analysis and its big brother, linear regression
  • The package car (“Companion to Applied Regression”) includes a function that tests for heteroskedasticity based on a linear regression model:
    • ncvTest (for non-constant variance test)
    • To be able to use it, we must first fit a linear regression model.
    • The null hypothesis is that the data are homoskedastic (NOT heteroskedastic). If p > 0.05, heteroskedasticity is not an issue
mod <- lm(Mean_RT~Length, data=dataSet)
library(car)
ncvTest(mod)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.4630842    Df = 1     p = 0.4961861

4.4.1 Validating the assumptions

  • The values of X and Y are completely independent: there is no autocorrelation:
    • Like heteroskedasticity, autocorrelation is a serious issue for linear regression
    • The package car includes an implementation of the Durbin-Watson test
    • The null hypothesis is that there is no autocorrelation. If p < 0.05, your data violate the assumption of no autocorrelation
mod <- lm(Mean_RT~Length, data=dataSet)
library(car)
durbinWatsonTest(mod)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02480565      1.937535    0.74
##  Alternative hypothesis: rho != 0

4.4.2 Performing the correlation test

  • Our data have passed all of the tests; the assumptions are satisfied.
  • We can now calculate our correlation coefficient and test whether it is significant:
  • cor.test accepts the following arguments:
    • Our two variables
    • alternative:
      • ‘less’ if the correlation is expected to be negative
      • ‘greater’ if the correlation is expected to be positive
      • ‘two.sided’ if the hypothesis is that there is a correlation (default)
cor.test(dataSet$Length, dataSet$Mean_RT, alternative="greater")
## 
##  Pearson's product-moment correlation
## 
## data:  dataSet$Length and dataSet$Mean_RT
## t = 6.7676, df = 94, p-value = 5.555e-10
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.4466334 1.0000000
## sample estimates:
##       cor 
## 0.5723765

4.4.2 Interpreting the output of the correlation test

  • The output of cor.test tells us the following:
    • p < 0.05:
      • The null hypothesis of no correlation can be rejected
    • r = 0.57:
      • The correlation is moderately strong
    • 95% confidence interval for r: 0.45 - 1
      • The correlation will be moderately strong to very strong at the population level
cor.test(dataSet$Length, dataSet$Mean_RT, alternative="greater")
## 
##  Pearson's product-moment correlation
## 
## data:  dataSet$Length and dataSet$Mean_RT
## t = 6.7676, df = 94, p-value = 5.555e-10
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.4466334 1.0000000
## sample estimates:
##       cor 
## 0.5723765

4.4.3 Calculating the variance explained (R2)

  • Once we have established the size and the significance of the correlation between Length and Mean_RT, we can use the correlation coefficient to estimate the amount of Mean_RT variance that is explained by Length by simply squaring the correlation coefficient (R2 or R-squared) (Urdan, 2010: 87-88)
    • Variance explained = r^2
    • Length explains 32.76 percent of the variance of Mean_RT
    • Answers the question: how well does Length explain/model/predict Mean_RT?
a<-cor.test(dataSet$Length, dataSet$Mean_RT, alternative="greater")
a$estimate^2
##       cor 
## 0.3276149

4.4.4 A note of warning

  • Correlation does not imply causation:
    • Be careful when interpreting correlations in terms of cause and effect. If two variables are statistically correlated, they do not necessarily have to be causally related (e.g., http://www.tylervigen.com/spurious-correlations)
    • Statistics may uncover a link between two variables; subsequent analysis/theoretical reflection has to make sense of it

5. Spearman’s rho and Kendall’s tau

5. Spearman’s rho and Kendall’s tau

  • Spearman’s rho and Kendall’s tau should be used when your data do not satisfy the assumptions of Pearson’s product-moment r
  • These tests can be used for ordinal, interval, and ratio data
  • The only assumption is that the relationship is monotonic (see the sketch below)
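Under the hood, Spearman’s rho is simply Pearson’s r computed on the ranks of the values; a minimal sketch with hypothetical numbers:
# Spearman's rho = Pearson's r on the ranks
x <- c(47, 89, 131, 186, 245)  # hypothetical lexicon sizes
y <- c(0, 2, 1, 3, 5)          # hypothetical complexity scores
cor(rank(x), rank(y))          # Pearson's r on the ranks
cor(x, y, method = "spearman") # identical result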

5.1 Data

  • Data from Bates & Goodman (1997):
    • Correlation between grammatical complexity and vocabulary size for 10 children between 16 and 30 months old
  • Research question:
    • Is there a relationship between the size of language learners’ lexicons and the complexity of their grammar?
  • Hypothesis:
    • Grammar develops in tandem with vocabulary size
  • Null hypothesis:
    • There is no correlation between grammar and vocabulary size
dataSet <-read_csv("http://www.jeroenclaes.be/statistics_for_linguistics/datasets/class5_Bates_and_Goodman_1997.csv")
glimpse(dataSet)
## Observations: 10
## Variables: 3
## $ subject          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ lexicon_size     <int> 47, 89, 131, 186, 245, 284, 362, 444, 553, 627
## $ complexity_score <int> 0, 2, 1, 3, 5, 9, 7, 16, 25, 34

5.1 Exploring the correlations between lexicon_size and complexity_score (1/4)

  • The data here are essentially monotonic: as lexicon_size increases, complexity_score increases as well (with one minor inversion)
  • The relationship is not linear, but it is positive and monotonic
ggplot(dataSet, aes(x=lexicon_size, y=complexity_score)) + 
  geom_point() + 
  geom_smooth(method="loess") 

5.1 Exploring the correlations between lexicon_size and complexity_score (2/4)

  • The relationship may not be linear, but we could apply a log-transformation to make it linear
  • This is what we would do if we were to perform a regression
  • If we transform the variable, we can use Pearson’s r, provided the data satisfy the other assumptions (see the sketch after the plot code below)
ggplot(dataSet, aes(x=lexicon_size, y=log(1+complexity_score))) + 
  geom_point() + 
  geom_smooth(method="lm") 

5.1 Exploring the correlations between lexicon_size and complexity_score (3/4)

  • For non-linear monotonic relationships, we cannot use the parametric (and conceptually relatively simple) Pearson’s r
  • The non-parametric methods Spearman’s rho and Kendall’s tau are better suited, as they assume only that the relationship is monotonic
  • These tests can also be used for ordinal data (e.g., Likert scales)
  • To use Spearman’s rho or Kendall’s tau, we simply add method="spearman" or method="kendall" to cor or cor.test
  • Kendall’s tau will generally yield less extreme correlation estimates than Spearman’s rho
cor.test(dataSet$lexicon_size, dataSet$complexity_score, method="spearman", alternative="greater")
## 
##  Spearman's rank correlation rho
## 
## data:  dataSet$lexicon_size and dataSet$complexity_score
## S = 4, p-value < 2.2e-16
## alternative hypothesis: true rho is greater than 0
## sample estimates:
##       rho 
## 0.9757576

5.1 Exploring the correlations between lexicon_size and complexity_score (4/4)

  • The same test with Kendall’s tau; note the less extreme correlation estimate (0.91 vs. 0.98):
cor.test(dataSet$lexicon_size, dataSet$complexity_score, method="kendall", alternative="greater")
## 
##  Kendall's rank correlation tau
## 
## data:  dataSet$lexicon_size and dataSet$complexity_score
## T = 43, p-value = 1.488e-05
## alternative hypothesis: true tau is greater than 0
## sample estimates:
##       tau 
## 0.9111111

6. Reporting on correlations

  • The correlation coefficient (r, rho, or tau)
  • Degrees of freedom (for Pearson’s r)
  • The p-value and test statistic (t for Pearson, S for Spearman, T for Kendall)
  • The type of test (one-tailed or two-tailed); see the example below
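For example, the Pearson result above could be reported as follows (the wording is only a suggestion; the numbers come from the cor.test output in 4.4.2): "Word length was positively correlated with mean reaction time, r(94) = 0.57, t = 6.77, p < 0.001, one-tailed."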

7. Exercises

8. References

  • Balota, D. A., Yap, M. J., Cortese, M. J., et al. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459. DOI: 10.3758/BF03193014. Data taken from Levshina (2015).
  • Bates, E., & Goodman, J. (1997). On the inseparability of grammar and the lexicon: Evidence from acquisition, aphasia and real-time processing. Language and Cognitive Processes, 12(5/6), 507–586.
  • Levshina, N. (2015). How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam/Philadelphia, PA: John Benjamins.
  • Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.