# Pearson’s Chi-Squared Test

Pearson’s Chi-Squared Test is used to evaluate:

• The goodness of fit between observed and expected values.

• Homogeneity between groups in how they are distributed across the levels of a categorical variable.

• Whether or not two variables whose frequencies are represented in a contingency table are statistically independent of one another.

## Pearson’s Chi-Squared Test for Independence

The Chi-Squared Test for Independence is used to test whether or not two categorical variables are statistically independent of each other.

#### Assumptions

• 2 Categorical Variables

• Random Sample

• Each cell of the table has an expected count of at least 5 (a common rule of thumb).
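The cell-count assumption can be checked before interpreting the test. The sketch below is illustrative only: it rebuilds the Titanic counts used in the example that follows as a matrix (the name `titanic` is chosen here, not taken from the original code) and inspects the expected counts that `chisq.test()` computes under the null hypothesis.

```r
# Counts copied from the Titanic example table in this section.
# matrix() fills column by column: Male (No, Yes), then Female (No, Yes).
titanic <- matrix(c(1364, 367, 126, 344), nrow = 2,
                  dimnames = list(Survived = c("No", "Yes"),
                                  Sex = c("Male", "Female")))

# Expected counts under independence, as computed by chisq.test()
expected_counts <- chisq.test(titanic)$expected

# Rule-of-thumb check: every expected count should be at least 5
min_expected   <- min(expected_counts)
assumption_met <- all(expected_counts >= 5)

print(round(expected_counts, 1))
print(assumption_met)
```

If any expected count falls below 5, the chi-squared approximation may be unreliable, and an exact alternative such as `fisher.test()` is often considered instead.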

#### Example

The table below shows the number of male and female Titanic passengers by survival status.

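library(dplyr)
library(tidyr)
library(knitr)

# Total the passenger counts by sex and survival, then spread Sex into columns
titanic_survival <- data.frame(Titanic) %>%
  group_by(Sex, Survived) %>%
  summarize(freq = sum(Freq), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = freq) %>%
  select(Survived, Male, Female)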

kable(titanic_survival)

| Survived | Male | Female |
|----------|------|--------|
| No       | 1364 | 126    |
| Yes      | 367  | 344    |

We can use a chi-squared test for independence to determine whether or not survival and sex are statistically independent of each other.

# Drop the "Survived" column so that chisq.test() reads the table as a crosstab
titanic_survival <- titanic_survival %>%
  subset(select = c("Male", "Female"))

chisq.test(titanic_survival)
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  titanic_survival
## X-squared = 454.5, df = 1, p-value < 2.2e-16

A p-value below .01 means we reject the null hypothesis of independence at the 1% significance level: sex and survival are not independent of each other.
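To see where the test statistic comes from, the calculation can be reproduced by hand. This is a minimal sketch, assuming the counts from the table above; the variable names are illustrative. Expected counts under independence are (row total × column total) / grand total, and for a 2×2 table `chisq.test()` applies Yates' continuity correction by default, which subtracts 0.5 from each absolute difference.

```r
# Observed counts from the Titanic table (columns: Male, Female)
titanic <- matrix(c(1364, 367, 126, 344), nrow = 2,
                  dimnames = list(Survived = c("No", "Yes"),
                                  Sex = c("Male", "Female")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(titanic), colSums(titanic)) / sum(titanic)

# Statistic with Yates' continuity correction (the default for 2x2 tables)
chi_sq_yates <- sum((abs(titanic - expected) - 0.5)^2 / expected)

# Uncorrected Pearson statistic, for comparison
chi_sq_plain <- sum((titanic - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(titanic) - 1) * (ncol(titanic) - 1)

print(c(yates = chi_sq_yates, plain = chi_sq_plain, df = df))
```

The corrected value should agree with the `X-squared` reported by `chisq.test()`; passing `correct = FALSE` to `chisq.test()` gives the uncorrected version.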

## Pearson’s Chi-Squared Test for Goodness of Fit

The Chi-Squared Test for Goodness of Fit allows us to assess whether or not an observed distribution differs significantly from an expected distribution. The p-value is the probability of seeing a discrepancy at least as large as the one observed if the data really followed the expected distribution.

• Lower p-value = Stronger evidence that the observed and expected distributions differ

• Higher p-value = Weaker evidence that the observed and expected distributions differ

#### Let’s say we are expecting observations to be equally distributed between 3 groups (1/3, 1/3, 1/3).

Good Fit: Units evenly distributed between 3 groups.

#Create vectors of values for this exercise
observed_distribution <- c(10, 10, 10)           #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3)       #Expected distribution across groups

#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
##  Chi-squared test for given probabilities
##
## data:  observed_distribution
## X-squared = 0, df = 2, p-value = 1

A p-value of 1 indicates no difference between the observed and the expected distribution. The expected distribution is a good fit for the observed data.

Bad Fit: Units not evenly distributed between 3 groups.

#Create vectors of values for this exercise
observed_distribution <- c(3, 17, 10)           #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3)       #Expected distribution across groups

#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
##  Chi-squared test for given probabilities
##
## data:  observed_distribution
## X-squared = 9.8, df = 2, p-value = 0.007447

A p-value of .007447 indicates a statistically significant difference between the observed and the expected distribution at the .01 significance level. The expected distribution is not a good fit for the observed data.
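The statistic and p-value for the bad-fit case can be reproduced by hand, which makes the mechanics of the test explicit. The sketch below uses the same counts and probabilities as the example; the variable names are illustrative.

```r
# Observed counts and hypothesized probabilities from the bad-fit example
observed      <- c(3, 17, 10)
expected_prob <- c(1/3, 1/3, 1/3)

# Expected counts: total sample size times each hypothesized probability
expected_counts <- sum(observed) * expected_prob

# Pearson statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected_counts)^2 / expected_counts)

# Degrees of freedom: number of groups minus 1
df <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

Here each group is expected to hold 10 observations, so the statistic is (49 + 49 + 0) / 10 = 9.8, matching the `chisq.test()` output above.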

## Pearson’s Chi-Squared Test for Homogeneity

The Chi-Squared Test for Homogeneity allows us to evaluate whether or not two samples are distributed in the same way across various levels/categories. The p-value indicates how likely a difference at least this large between the samples' distributions would be if both samples came from the same distribution.

• Lower p-value = Stronger evidence of heterogeneity

• Higher p-value = Weaker evidence of heterogeneity (consistent with homogeneity)

#### Let’s say we want to test for homogeneity between two samples (A and B) in how they are distributed between 4 categories.

Heterogeneous Example: Two samples differ in their distribution between the 4 categories/levels.

First, let’s produce sample data with a heterogeneous distribution across 4 categories/levels.

A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probability of each level
            replace = TRUE)                # Sample with replacement (200 draws from 4 levels)

B <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/8, 1/16, 3/16, 5/8), # Probability of each level
            replace = TRUE)                # Sample with replacement (200 draws from 4 levels)

# Stack the two frequency tables into a 2 x 4 contingency table
AB <- rbind(table(A), table(B))

kable(AB)

| 1  | 2  | 3  | 4   |
|----|----|----|-----|
| 46 | 55 | 48 | 51  |
| 20 | 10 | 43 | 127 |

Now, let’s run the Chi-Squared Test for Homogeneity:

chisq.test(AB)
##
##  Pearson's Chi-squared test
##
## data:  AB
## X-squared = 74.12, df = 3, p-value = 5.593e-16

A p-value less than the significance level of .1/.05/.01 tells us that these groups show statistically significant differences in their distributions between the 4 categories. The samples are heterogeneous.

Homogeneous Example: Two samples show a similar distribution between 4 categories.

First, let’s produce sample data with a homogeneous distribution across 4 categories/levels.

A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probability of each level
            replace = TRUE)                # Sample with replacement (200 draws from 4 levels)

B <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probability of each level
            replace = TRUE)                # Sample with replacement (200 draws from 4 levels)

# Stack the two frequency tables into a 2 x 4 contingency table
AB <- rbind(table(A), table(B))

kable(AB)

| 1  | 2  | 3  | 4  |
|----|----|----|----|
| 53 | 50 | 54 | 43 |
| 47 | 50 | 52 | 51 |

Now, let’s run the Chi-Squared Test for Homogeneity:

chisq.test(AB)
##
##  Pearson's Chi-squared test
##
## data:  AB
## X-squared = 1.0786, df = 3, p-value = 0.7822

A p-value greater than the significance level of .1/.05/.01 tells us that these groups do not show statistically significant differences in their distributions between the 4 categories. The data are consistent with the samples being homogeneous.
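Because `sample()` draws at random, rerunning the code above will give different counts each time. The sketch below therefore hard-codes the counts printed in the homogeneous example and reproduces the test by hand; the row names and variable names are illustrative. Under homogeneity, each expected count is (row total × column total) / grand total.

```r
# Counts copied from the homogeneous example table
AB <- rbind(A = c(53, 50, 54, 43),
            B = c(47, 50, 52, 51))

# Expected counts if both samples share one distribution
expected <- outer(rowSums(AB), colSums(AB)) / sum(AB)

# Pearson statistic (no continuity correction is applied beyond 2x2 tables)
chi_sq <- sum((AB - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(AB) - 1) * (ncol(AB) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

Calling `set.seed()` with any fixed integer before the `sample()` calls would make the simulated tables reproducible, although the counts would then differ from the ones shown here.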

## Extracting Values from the chisq.test() output

We can extract the following values from the chisq.test() output:

• data.name (name(s) of the data)
• statistic (chi-squared test statistic)
• p.value (p-value for the test)
• method (type of test performed)
• parameter (degrees of freedom)
• observed (observations)
• expected (expected counts under the null hypothesis)
• residuals (Pearson residuals)

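#### data.name (name(s) of the data)

MyTest <- chisq.test(AB)
MyTest$data.name
## [1] "AB"

#### statistic (chi-squared test statistic)

MyTest$statistic
## X-squared
##  1.078587

#### p.value (p-value for the test)

MyTest$p.value
## [1] 0.7822456

#### method (type of test performed)

MyTest$method
## [1] "Pearson's Chi-squared test"

#### observed (observations)

MyTest$observed
##       1  2  3  4
## [1,] 53 50 54 43
## [2,] 47 50 52 51

#### expected (expected counts under the null hypothesis)

MyTest$expected
##       1  2  3  4
## [1,] 50 50 53 47
## [2,] 50 50 53 47

#### residuals (Pearson residuals)

MyTest$residuals
##               1 2          3        4
## [1,]  0.4242641 0  0.1373606 -0.58346
## [2,] -0.4242641 0 -0.1373606  0.58346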
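The remaining component from the list, `parameter`, can be pulled out the same way, and the key values can be gathered into a single data frame for reporting. This is a minimal sketch that hard-codes the counts from the homogeneous example so it runs on its own; `test_summary` is an illustrative name.

```r
# Counts copied from the homogeneous example table
AB <- rbind(c(53, 50, 54, 43),
            c(47, 50, 52, 51))
MyTest <- chisq.test(AB)

# List every component stored in the test object
print(names(MyTest))

# parameter holds the degrees of freedom as a named value
print(MyTest$parameter)

# Collect the headline numbers into one row; unname() drops the
# "X-squared" and "df" labels so the column names stay clean
test_summary <- data.frame(
  statistic = unname(MyTest$statistic),
  df        = unname(MyTest$parameter),
  p_value   = MyTest$p.value
)
print(test_summary)
```

The Pearson residuals in `MyTest$residuals` are (observed - expected) / sqrt(expected), so squaring and summing them recovers the test statistic.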