Pearson’s Chi-Squared Test is used to evaluate:

- The goodness of fit between observed and expected values.
- Homogeneity between groups in their distribution across the levels of a categorical variable.
- Whether two variables whose frequencies are represented in a contingency table are statistically independent of one another.
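All three variants are built on the same statistic: for each cell, square the difference between the observed and expected counts, divide by the expected count, and sum. A minimal sketch (the helper name `pearson_x2` is our own, not from any package):

```r
# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E
# (pearson_x2 is a hypothetical helper, not a base R function)
pearson_x2 <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}

pearson_x2(c(10, 10, 10), c(10, 10, 10))  # 0: observed matches expected exactly
```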
The Chi-Squared Test for Independence is used to test whether two categorical variables are statistically independent of each other. It assumes:

- Two categorical variables
- A random sample
- An expected frequency of at least 5 in each cell
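The expected-frequency assumption can be checked directly, because chisq.test() returns the expected cell counts it used. A quick sketch on a made-up 2x2 table:

```r
# chisq.test() exposes the expected counts, so the >= 5 rule is easy to verify
tbl <- matrix(c(20, 30,
                25, 25), nrow = 2, byrow = TRUE)
all(chisq.test(tbl)$expected >= 5)  # TRUE for this table
```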
The table below shows the number of male and female Titanic passengers by survival status.
library(dplyr)
library(tidyr)
library(knitr)

titanic_survival <- data.frame(Titanic) %>%
  group_by(Sex, Survived) %>%
  summarize(freq = sum(Freq)) %>%
  spread(Sex, freq) %>%
  subset(select = c("Survived", "Male", "Female"))

kable(titanic_survival)
Survived | Male | Female |
---|---|---|
No | 1364 | 126 |
Yes | 367 | 344 |
We can use a chi-squared test for independence to determine whether survival and sex are statistically independent of each other.

titanic_survival <- titanic_survival %>%
  subset(select = c("Male", "Female")) # Drop the "Survived" column so chisq.test() reads the table as a crosstab
chisq.test(titanic_survival)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: titanic_survival
## X-squared = 454.5, df = 1, p-value < 2.2e-16
A p-value < .01 tells us to reject the null hypothesis of independence at the .01 significance level: sex and survival are not independent of each other.
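As a sanity check, the reported statistic can be reproduced by hand from the table above. Under independence, each expected count is the product of its row and column totals divided by the grand total; with Yates' continuity correction, each cell then contributes (|O − E| − 0.5)² / E:

```r
# Reproduce the Yates-corrected chi-squared statistic by hand
obs <- matrix(c(1364, 126,
                 367, 344), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected counts under independence
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)       # Yates continuity correction
round(x2, 1)  # 454.5, matching chisq.test()
```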
The Chi-Squared Test for Goodness of Fit allows us to assess whether there are statistically significant differences between an observed and an expected distribution. The p-value indicates the level of statistical significance of the difference between the observed and expected distributions.
Lower p-value = Greater difference between distributions
Higher p-value = Less difference between distributions
Good Fit: Units evenly distributed between 3 groups.
#Create lists of values for this exercise
observed_distribution <- c(10, 10, 10) #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) #Expected distribution across groups
#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 0, df = 2, p-value = 1
A p-value of 1 indicates no difference between the observed and the expected distribution. The expected distribution is a good fit for the observed data.
Bad Fit: Units not evenly distributed between 3 groups.
#Create lists of values for this exercise
observed_distribution <- c(3, 17, 10) #Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) #Expected distribution across groups
#Run Chi Squared test
chisq.test(observed_distribution, p=expected_distribution)
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 9.8, df = 2, p-value = 0.007447
A p-value of .007447 indicates a statistically significant difference at the .01 significance level between the observed and the expected distribution. The expected distribution is not a good fit for the observed data.
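This result can also be reproduced by hand: the expected count in each group is the total sample size times the expected proportion, each group contributes (O − E)² / E, and the p-value is the upper tail of the chi-squared distribution with (number of groups − 1) degrees of freedom:

```r
# Reproduce the goodness-of-fit statistic and p-value by hand
observed <- c(3, 17, 10)
expected <- sum(observed) * c(1/3, 1/3, 1/3)   # 10 per group
x2 <- sum((observed - expected)^2 / expected)  # 9.8
pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)  # 0.007447
```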
The Chi-Squared Test for Homogeneity allows us to evaluate whether two samples are distributed equally across various levels/categories. The p-value indicates the level of statistical significance of the difference between the samples' distributions.
Lower p-value = More Heterogeneous
Higher p-value = More Homogeneous
Heterogeneous Example: Two samples differ in their distribution between the 4 categories/levels.
First, let’s produce sample data with a heterogeneous distribution across 4 categories/levels.
A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

B <- sample(1:4,                               # Levels
            200,                               # Number of observations
            prob = c(1/8, 1/16, 3/16, 2.5/4),  # Probabilities for each level
            replace = TRUE)                    # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
1 | 2 | 3 | 4 |
---|---|---|---|
46 | 55 | 48 | 51 |
20 | 10 | 43 | 127 |
Now, let’s run the Chi-Squared Test for Homogeneity.
chisq.test(AB)
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 74.12, df = 3, p-value = 5.593e-16
A p-value below any conventional significance level (.1, .05, or .01) tells us that these groups show statistically significant differences in their distributions between the 4 categories. The samples are heterogeneous.
Homogeneous Example: Two samples show a similar distribution between 4 categories.
First, let’s produce sample data with a homogeneous distribution across 4 categories/levels.
A <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

B <- sample(1:4,                           # Levels
            200,                           # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4),  # Probabilities for each level
            replace = TRUE)                # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
1 | 2 | 3 | 4 |
---|---|---|---|
53 | 50 | 54 | 43 |
47 | 50 | 52 | 51 |
Now, let’s run the Chi-Squared Test for Homogeneity.
chisq.test(AB)
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 1.0786, df = 3, p-value = 0.7822
A p-value above any conventional significance level (.1, .05, or .01) tells us that these groups do not show statistically significant differences in their distributions between the 4 categories. The samples are homogeneous.
We can extract the following values from the chisq.test() output:
MyTest <- chisq.test(AB)
MyTest$data.name
## [1] "AB"
MyTest$statistic
## X-squared
## 1.078587
MyTest$parameter
## df
## 3
MyTest$p.value
## [1] 0.7822456
MyTest$method
## [1] "Pearson's Chi-squared test"
MyTest$observed
## 1 2 3 4
## [1,] 53 50 54 43
## [2,] 47 50 52 51
MyTest$expected
## 1 2 3 4
## [1,] 50 50 53 47
## [2,] 50 50 53 47
MyTest$residuals
## 1 2 3 4
## [1,] 0.4242641 0 0.1373606 -0.58346
## [2,] -0.4242641 0 -0.1373606 0.58346
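The residuals above are Pearson residuals, (observed − expected) / sqrt(expected); cells with large absolute residuals are the ones driving the statistic. A quick check using the counts shown above (hard-coded here, since sample() produces different draws each run):

```r
# Pearson residuals = (O - E) / sqrt(E)
MyTest <- chisq.test(rbind(c(53, 50, 54, 43),
                           c(47, 50, 52, 51)))
manual <- (MyTest$observed - MyTest$expected) / sqrt(MyTest$expected)
max(abs(manual - MyTest$residuals))  # 0: the formulas agree
```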