Pearson’s Chi-Squared Test

Pearson’s Chi-Squared Test is used to evaluate

  • The goodness of fit between observed and expected values.

  • Homogeneity between groups regarding their distribution among categorical variables.

  • Whether or not two variables whose frequencies are represented in a contingency table have statistical independence from one another.
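
All three uses rely on the same test statistic: the sum, over all cells, of (observed - expected)^2 / expected. A minimal sketch of that calculation (the function name pearson_chisq is just illustrative):

pearson_chisq <- function(observed, expected){
  #Sum of squared differences between observed and expected counts, scaled by the expected counts
  sum((observed - expected)^2 / expected)
}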



Pearson’s Chi-Squared Test for Independence


The Chi-Squared Test for Independence is used to test whether or not two categorical variables are statistically independent of each other.


Assumptions

  • 2 Categorical Variables

  • Random Sample

  • Each cell of the table has an expected count of at least 5 (a quick check is sketched below).
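
One way to run that check is to look at the expected counts that chisq.test() computes. A minimal sketch with made-up counts:

#Hypothetical 2x2 table of counts, used only to illustrate the check
counts <- matrix(c(12, 7, 9, 30), nrow = 2)

chisq.test(counts)$expected  #All expected counts should be >= 5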

Example

The table below shows the number of male and female Titanic passengers by survival.

library(dplyr)
library(tidyr)
library(knitr)

titanic_survival <- data.frame(Titanic) %>%
  group_by(Sex, Survived) %>%
  summarize(freq = sum(Freq)) %>%
  spread(Sex, freq) %>%
  subset(select = c("Survived", "Male", "Female"))

kable(titanic_survival)
Survived   Male   Female
No         1364      126
Yes         367      344

We can use a chi-squared test for independence to determine whether or not survival and sex are statistically independent of each other.

titanic_survival <- titanic_survival %>%
  subset(select = c("Male", "Female"))  #Drop the "Survived" column so chisq.test() reads the table as a 2x2 crosstab

chisq.test(titanic_survival)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  titanic_survival
## X-squared = 454.5, df = 1, p-value < 2.2e-16

A p-value < .01 tells us that, at the .01 significance level, we can conclude that sex and survival are not independent of each other.
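
As a sanity check, the reported statistic can be reproduced by hand from the row and column totals (a minimal sketch; O and E are just illustrative names, and titanic_survival is the 2x2 table built above):

O <- as.matrix(titanic_survival)         #Observed counts
E <- rowSums(O) %o% colSums(O) / sum(O)  #Expected counts from the row and column totals
sum((abs(O - E) - 0.5)^2 / E)            #Yates-corrected statistic, approximately 454.5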




Pearson’s Chi-Squared Test for Goodness of Fit


The Chi-Squared Test for Goodness of Fit allows us to assess whether or not there are statistically significant differences between an observed and an expected distribution. The p-value indicates the level of statistical significance of the difference between the observed & expected distributions.

  • Lower p-value = Greater difference between distributions

  • Higher p-value = Less difference between distributions


Let’s say we are expecting observations to be equally distributed between 3 groups (1/3, 1/3, 1/3).


Good Fit: Units evenly distributed between 3 groups.

#Create vectors of values for this exercise
observed_distribution <- c(10, 10, 10)           #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3)       #Expected distribution across groups


#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_distribution
## X-squared = 0, df = 2, p-value = 1

A p-value of 1 indicates no difference between the observed and the expected distribution. The expected distribution is a good fit for the observed data.
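
Equal probabilities are also chisq.test()'s default, so the same result is obtained if p is omitted:

chisq.test(observed_distribution)  #p defaults to equal probabilities across the groups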


Bad Fit: Units not evenly distributed between 3 groups.

#Create vectors of values for this exercise
observed_distribution <- c(3, 17, 10)           #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3)       #Expected distribution across groups

#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_distribution
## X-squared = 9.8, df = 2, p-value = 0.007447

A p-value of .007447 indicates a statistically significant difference between the observed and the expected distribution at a significance level of .01. The expected distribution is not a good fit for the observed data.
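
We can also verify the reported statistic by hand (a minimal sketch reusing the vectors defined above; O and E are just illustrative names):

O <- c(3, 17, 10)                    #Observed counts
E <- sum(O) * expected_distribution  #Expected counts: 10, 10, 10
sum((O - E)^2 / E)                   #9.8, matching chisq.test()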




Pearson’s Chi-Squared Test for Homogeneity


The Chi-Squared Test for Homogeneity allows us to evaluate whether or not two samples are distributed equally across various levels/categories. The p-value indicates the level of statistical significance of the difference between the observed & expected distributions.

  • Lower p-value = More Heterogeneous

  • Higher p-value = More Homogeneous


Let’s say we want to test for homogeneity between two samples (A and B) in how they are distributed between 4 categories.


Heterogeneous Example: Two samples differ in their distribution between the 4 categories/levels.

First, let’s produce sample data with a heterogeneous distribution across 4 categories/levels.
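
Note that sample() draws random values, so the exact counts and test output shown below will vary from run to run (no seed was fixed when this document was generated). Calling set.seed() first, with any arbitrary value, makes this example and the homogeneous one that follows reproducible:

set.seed(42)  #Arbitrary seed, purely for reproducibility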

A <- sample(1:4,                           #Levels
            200,                           #Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  #Probability of each level
            replace = TRUE)                #Sample with replacement (needed when size exceeds the number of levels)


B <- sample(1:4,                               #Levels
            200,                               #Number of observations
            prob = c(1/8, 1/16, 3/16, 2.5/4),  #Probability of each level
            replace = TRUE)                    #Sample with replacement (needed when size exceeds the number of levels)

AB <- rbind(table(A), table(B))

kable(AB)
  1    2    3    4
 46   55   48   51
 20   10   43  127


Now, let’s run the Chi Squared Test for Homogeneity

chisq.test(AB)
## 
##  Pearson's Chi-squared test
## 
## data:  AB
## X-squared = 74.12, df = 3, p-value = 5.593e-16

A p-value less than the significance level of .1/.05/.01 tells us that these groups show statistically significant differences in their distributions between the 4 categories. The samples are heterogeneous.
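
The difference is also easy to see directly in the row proportions, where each row is one sample's share across the 4 levels:

round(prop.table(AB, margin = 1), 2)  #Proportion of each sample falling in levels 1 through 4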



Homogeneous Example: Two samples show a similar distribution between 4 categories.

First, let’s produce two samples that share the same distribution across the 4 categories/levels.

A <- sample(1:4,                           #Levels
            200,                           #Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  #Probability of each level
            replace = TRUE)                #Sample with replacement (needed when size exceeds the number of levels)


B <- sample(1:4,                           #Levels
            200,                           #Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  #Probability of each level
            replace = TRUE)                #Sample with replacement (needed when size exceeds the number of levels)

AB <- rbind(table(A), table(B))

kable(AB)
  1    2    3    4
 53   50   54   43
 47   50   52   51

Now, let’s run the Chi Squared Test for Homogeneity

chisq.test(AB)
## 
##  Pearson's Chi-squared test
## 
## data:  AB
## X-squared = 1.0786, df = 3, p-value = 0.7822

A p-value greater than the significance level of .1/.05/.01 tells us that these groups do not show statistically significant differences in their distributions between the 4 categories. The samples are homogeneous.




Extracting Values from the chisq.test() output


We can extract the following values from the chisq.test() output:

  • data.name (name(s) of the data)
  • statistic (chi-squared test statistic)
  • p.value (p-value for the test)
  • method (Type of test performed)
  • parameter (Degrees of freedom)
  • observed (observations)
  • expected (expected counts under the null hypothesis)
  • residuals (Pearson residuals)

data.name (name(s) of the data)

MyTest <- chisq.test(AB)

MyTest$data.name
## [1] "AB"

statistic (chi-squared test statistic)

MyTest$statistic
## X-squared 
##  1.078587

parameter (degrees of freedom)

MyTest$parameter
## df 
##  3

p.value (p-value for the test)

MyTest$p.value
## [1] 0.7822456

method (Type of test performed)

MyTest$method
## [1] "Pearson's Chi-squared test"

observed (observations)

MyTest$observed
##       1  2  3  4
## [1,] 53 50 54 43
## [2,] 47 50 52 51

expected (expected counts under the null hypothesis or the given probabilities)

MyTest$expected
##       1  2  3  4
## [1,] 50 50 53 47
## [2,] 50 50 53 47

residuals (Pearson residuals)

MyTest$residuals
##               1 2          3        4
## [1,]  0.4242641 0  0.1373606 -0.58346
## [2,] -0.4242641 0 -0.1373606  0.58346
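
These pieces can be combined as needed; for example, a compact one-line summary of the test (just an illustration):

sprintf("%s: X-squared(%d) = %.3f, p = %.4f",
        MyTest$method, MyTest$parameter, MyTest$statistic, MyTest$p.value)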