Pearson’s Chi-Squared Test is used to evaluate:

- The goodness of fit between observed and expected values.
- Homogeneity between groups in their distribution across the levels of a categorical variable.
- Whether two variables whose frequencies are represented in a contingency table are statistically independent of one another.
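All three variants are built on the same statistic: for each cell, square the difference between the observed and expected counts, divide by the expected count, and sum. A minimal sketch (the helper name `pearson_x2` is our own, not from any package):

```r
# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E
# (pearson_x2 is a hypothetical helper, not a base R function)
pearson_x2 <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}

pearson_x2(c(10, 10, 10), c(10, 10, 10))  # 0: observed matches expected exactly
```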
The Chi-Squared Test for Independence is used to test whether two categorical variables are statistically independent of each other. It assumes:

- Two categorical variables
- A random sample
- An expected frequency of at least 5 in each cell
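The expected-frequency assumption can be checked directly, because chisq.test() returns the expected cell counts it used. A quick sketch on a made-up 2x2 table:

```r
# chisq.test() exposes the expected counts, so the >= 5 rule is easy to verify
tbl <- matrix(c(20, 30,
                25, 25), nrow = 2, byrow = TRUE)
all(chisq.test(tbl)$expected >= 5)  # TRUE for this table
```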
The table below shows the number of male and female Titanic passengers by survival status.
library(dplyr)
library(tidyr)
library(knitr)

titanic_survival <- data.frame(Titanic) %>%
  group_by(Sex, Survived) %>%
  summarize(freq = sum(Freq)) %>%
  spread(Sex, freq) %>%
  subset(select = c("Survived", "Male", "Female"))

kable(titanic_survival)
Survived | Male | Female |
---|---|---|
No | 1364 | 126 |
Yes | 367 | 344 |
We can use a chi-squared test for independence to determine whether survival and sex are statistically independent of each other.

titanic_survival <- titanic_survival %>%
  subset(select = c("Male", "Female")) # Drop the "Survived" column so chisq.test() reads the table as a crosstab
chisq.test(titanic_survival)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: titanic_survival
## X-squared = 454.5, df = 1, p-value < 2.2e-16
A p-value < .01 tells us to reject the null hypothesis of independence at the .01 significance level: sex and survival are not independent of each other.
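As a sanity check, the reported statistic can be reproduced by hand from the table above. Under independence, each expected count is the product of its row and column totals divided by the grand total; with Yates' continuity correction, each cell then contributes (|O − E| − 0.5)² / E:

```r
# Reproduce the Yates-corrected chi-squared statistic by hand
obs <- matrix(c(1364, 126,
                 367, 344), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected counts under independence
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)       # Yates continuity correction
round(x2, 1)  # 454.5, matching chisq.test()
```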
The Chi-Squared Test for Goodness of Fit allows us to assess whether there are statistically significant differences between an observed and an expected distribution. The p-value indicates the level of statistical significance of the difference between the observed and expected distributions.
Lower p-value = Greater difference between distributions
Higher p-value = Less difference between distributions
Good Fit: Units evenly distributed between 3 groups.
#Create lists of values for this exercise
observed_distribution <- c(10, 10, 10) #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) #Expected distribution across groups
#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 0, df = 2, p-value = 1
A p-value of 1 indicates no difference between the observed and the expected distribution. The expected distribution is a good fit for the observed data.
Bad Fit: Units not evenly distributed between 3 groups.
#Create lists of values for this exercise
observed_distribution <- c(3, 17, 10) #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) #Expected distribution across groups
#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 9.8, df = 2, p-value = 0.007447
A p-value of .007447 indicates a statistically significant difference at the .01 significance level between the observed and the expected distribution. The expected distribution is not a good fit for the observed data.
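This result can also be reproduced by hand: the expected count in each group is the total sample size times the expected proportion, each group contributes (O − E)² / E, and the p-value is the upper tail of the chi-squared distribution with (number of groups − 1) degrees of freedom:

```r
# Reproduce the goodness-of-fit statistic and p-value by hand
observed <- c(3, 17, 10)
expected <- sum(observed) * c(1/3, 1/3, 1/3)   # 10 per group
x2 <- sum((observed - expected)^2 / expected)  # 9.8
pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)  # 0.007447
```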
The Chi-Squared Test for Homogeneity allows us to evaluate whether two samples are distributed equally across various levels/categories. The p-value indicates the level of statistical significance of the difference between the samples' distributions.
Lower p-value = More Heterogeneous
Higher p-value = More Homogeneous
Heterogeneous Example: Two samples differ in their distribution between the 4 categories/levels.
First, let’s produce sample data with a heterogeneous distribution across 4 categories/levels.
A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

B <- sample(1:4,                               # Levels
            200,                               # Number of observations
            prob = c(1/8, 1/16, 3/16, 2.5/4),  # Probabilities for each level
            replace = TRUE)                    # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
1 | 2 | 3 | 4 |
---|---|---|---|
46 | 55 | 48 | 51 |
20 | 10 | 43 | 127 |
Now, let’s run the Chi-Squared Test for Homogeneity.
chisq.test(AB)
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 74.12, df = 3, p-value = 5.593e-16
A p-value below any conventional significance level (.1, .05, or .01) tells us that these groups show statistically significant differences in their distributions between the 4 categories. The samples are heterogeneous.
Homogeneous Example: Two samples show a similar distribution between 4 categories.
First, let’s produce sample data with a homogeneous distribution across 4 categories/levels.
A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

B <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
1 | 2 | 3 | 4 |
---|---|---|---|
53 | 50 | 54 | 43 |
47 | 50 | 52 | 51 |
Now, let’s run the Chi-Squared Test for Homogeneity.
chisq.test(AB)
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 1.0786, df = 3, p-value = 0.7822
A p-value above any conventional significance level (.1, .05, or .01) tells us that these groups do not show statistically significant differences in their distributions between the 4 categories. The samples are homogeneous.
We can extract the following values from the chisq.test() output:
MyTest <- chisq.test(AB)
MyTest$data.name
## [1] "AB"
MyTest$statistic
## X-squared
## 1.078587
MyTest$parameter
## df
## 3
MyTest$p.value
## [1] 0.7822456
MyTest$method
## [1] "Pearson's Chi-squared test"
MyTest$observed
## 1 2 3 4
## [1,] 53 50 54 43
## [2,] 47 50 52 51
MyTest$expected
## 1 2 3 4
## [1,] 50 50 53 47
## [2,] 50 50 53 47
MyTest$residuals
## 1 2 3 4
## [1,] 0.4242641 0 0.1373606 -0.58346
## [2,] -0.4242641 0 -0.1373606 0.58346
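The residuals above are Pearson residuals, (observed − expected) / sqrt(expected); cells with large absolute residuals are the ones driving the statistic. A quick check using the counts shown above (hard-coded here, since sample() produces different draws each run):

```r
# Pearson residuals = (O - E) / sqrt(E)
MyTest <- chisq.test(rbind(c(53, 50, 54, 43),
                           c(47, 50, 52, 51)))
manual <- (MyTest$observed - MyTest$expected) / sqrt(MyTest$expected)
max(abs(manual - MyTest$residuals))  # 0: the formulas agree
```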