Pearson’s Chi-Squared Test is used to evaluate:

- **Goodness of fit** between observed and expected values.
- **Homogeneity** between groups regarding their distribution among categorical variables.
- **Statistical independence** between two variables whose frequencies are represented in a contingency table.

The Chi-Squared Test for **Independence** is used to test whether or not two categorical variables are statistically independent of each other.

The test assumes:

- 2 categorical variables
- A random sample
- An expected count of at least 5 in each category

The table below shows the number of male and female Titanic passengers by survival.

```
library(dplyr)
library(tidyr)
library(knitr)

titanic_survival <- data.frame(Titanic) %>%
  group_by(Sex, Survived) %>%
  summarize(freq = sum(Freq)) %>%
  spread(Sex, freq) %>%
  subset(select = c("Survived", "Male", "Female"))

kable(titanic_survival)
```

Survived | Male | Female |
---|---|---|
No | 1364 | 126 |
Yes | 367 | 344 |

We can use a chi-squared test for independence to determine whether or not survival and sex are statistically independent of each other.

```
# Drop the "Survived" column so chisq.test() reads the table as a crosstab
titanic_survival <- titanic_survival %>%
  subset(select = c("Male", "Female"))

chisq.test(titanic_survival)
```

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: titanic_survival
## X-squared = 454.5, df = 1, p-value < 2.2e-16
```

A p-value < .01 tells us that we can reject the null hypothesis of independence at the .01 significance level: sex and survival are **not** independent of each other.
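As a sanity check, the reported statistic can be reproduced by hand: under independence, the expected count in each cell is (row total × column total) / grand total, and Yates' continuity correction subtracts 0.5 from each |O − E| before squaring. A minimal sketch (not part of the original analysis):

```r
# Rebuild the crosstab and reproduce X-squared with Yates' correction
observed <- matrix(c(1364, 126,
                      367, 344),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Survived = c("No", "Yes"),
                                   Sex = c("Male", "Female")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(x_squared, 1) # 454.5, matching the chisq.test() output above
```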

The Chi-Squared Test for **Goodness of Fit** allows us to assess whether or not there are statistically significant differences between an observed and an expected distribution. The **p-value** indicates the level of statistical significance of the difference between the observed & expected distributions.

Lower p-value = Greater difference between distributions

Higher p-value = Less difference between distributions

**Good Fit:** Units evenly distributed between 3 groups.

```
# Create vectors of values for this exercise
observed_distribution <- c(10, 10, 10)    # Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) # Expected proportion in each group

# Run the chi-squared test
chisq.test(observed_distribution, p = expected_distribution)
```

```
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 0, df = 2, p-value = 1
```

A p-value of 1 indicates no difference between the observed and the expected distribution. The expected distribution is a good fit for the observed data.

**Bad Fit:** Units not evenly distributed between 3 groups.

```
# Create vectors of values for this exercise
observed_distribution <- c(3, 17, 10)     # Number of observations in each group
expected_distribution <- c(1/3, 1/3, 1/3) # Expected proportion in each group

# Run the chi-squared test
chisq.test(observed_distribution, p = expected_distribution)
```

```
##
## Chi-squared test for given probabilities
##
## data: observed_distribution
## X-squared = 9.8, df = 2, p-value = 0.007447
```

A p-value of .007447 indicates a statistically significant difference, at a significance level of .01, between the observed and the expected distribution. The expected distribution is not a good fit for the observed data.
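The statistic above can be reproduced directly from the goodness-of-fit formula: the expected count in each group is n × p, and the statistic is Σ(O − E)² / E with k − 1 degrees of freedom. A quick sketch for illustration:

```r
# Reproduce the goodness-of-fit statistic and p-value by hand
observed <- c(3, 17, 10)
expected <- sum(observed) * c(1/3, 1/3, 1/3) # n * p = 10 per group
x_squared <- sum((observed - expected)^2 / expected)
p_value <- pchisq(x_squared, df = length(observed) - 1, lower.tail = FALSE)
c(x_squared, p_value) # 9.8 and ~0.007447, matching chisq.test()
```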

The Chi-Squared Test for **Homogeneity** allows us to evaluate whether or not two samples are distributed equally across various levels/categories. The **p-value** indicates the level of statistical significance of the difference between the observed & expected distributions.

Lower p-value = More Heterogeneous

Higher p-value = More Homogeneous

**Heterogeneous Example:** Two samples differ in their distribution between the 4 categories/levels.

*First, let’s produce sample data with a heterogeneous distribution across 4 categories/levels.*

```
A <- sample(1:4,                          # Levels
            200,                          # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4), # Probability of each level
            replace = TRUE)               # Sample with replacement

B <- sample(1:4,                              # Levels
            200,                              # Number of observations
            prob = c(1/8, 1/16, 3/16, 2.5/4), # Probability of each level
            replace = TRUE)                   # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
```

1 | 2 | 3 | 4 |
---|---|---|---|
46 | 55 | 48 | 51 |
20 | 10 | 43 | 127 |

*Now, let’s run the Chi-Squared Test for Homogeneity.*

`chisq.test(AB)`

```
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 74.12, df = 3, p-value = 5.593e-16
```

A p-value less than the significance level (.1/.05/.01) tells us that these groups show statistically significant differences in their distributions across the 4 categories. **The samples are heterogeneous.**
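Because both samples here contain 200 observations, the expected count in each cell is simply half the column total, and the reported statistic follows from Σ(O − E)² / E. A quick check using the counts from the table above (these are from one particular random draw, so your own run will differ):

```r
# Reproduce the homogeneity statistic from the sampled counts shown above
AB <- rbind(A = c(46, 55, 48, 51),
            B = c(20, 10, 43, 127))
expected <- outer(rowSums(AB), colSums(AB)) / sum(AB)
x_squared <- sum((AB - expected)^2 / expected)
round(x_squared, 2) # 74.12, matching chisq.test(AB)
```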

**Homogeneous Example:** Two samples show a similar distribution between 4 categories.

*First, let’s produce sample data with a homogeneous distribution across 4 categories/levels.*

```
A <- sample(1:4,                          # Levels
            200,                          # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4), # Probability of each level
            replace = TRUE)               # Sample with replacement

B <- sample(1:4,                          # Levels
            200,                          # Number of observations
            prob = c(1/4, 1/4, 1/4, 1/4), # Probability of each level
            replace = TRUE)               # Sample with replacement

AB <- rbind(table(A), table(B))
kable(AB)
```

1 | 2 | 3 | 4 |
---|---|---|---|
53 | 50 | 54 | 43 |
47 | 50 | 52 | 51 |

*Now, let’s run the Chi-Squared Test for Homogeneity.*

`chisq.test(AB)`

```
##
## Pearson's Chi-squared test
##
## data: AB
## X-squared = 1.0786, df = 3, p-value = 0.7822
```

A p-value greater than the significance level (.1/.05/.01) tells us that these groups do not show statistically significant differences in their distributions across the 4 categories. **The samples are homogeneous.**

We can extract the following values from the `chisq.test()` output:

- `data.name` (name of the data)
- `statistic` (chi-squared test statistic)
- `p.value` (p-value of the test)
- `method` (type of test performed)
- `parameter` (degrees of freedom)
- `observed` (observed counts)
- `expected` (expected counts under the null hypothesis)
- `residuals` (Pearson residuals)

```
MyTest <- chisq.test(AB)
MyTest$data.name
```

`## [1] "AB"`

`MyTest$statistic`

```
## X-squared
## 1.078587
```

`MyTest$parameter`

```
## df
## 3
```

`MyTest$p.value`

`## [1] 0.7822456`

`MyTest$method`

`## [1] "Pearson's Chi-squared test"`

`MyTest$observed`

```
## 1 2 3 4
## [1,] 53 50 54 43
## [2,] 47 50 52 51
```

`MyTest$expected`

```
## 1 2 3 4
## [1,] 50 50 53 47
## [2,] 50 50 53 47
```

`MyTest$residuals`

```
## 1 2 3 4
## [1,] 0.4242641 0 0.1373606 -0.58346
## [2,] -0.4242641 0 -0.1373606 0.58346
```
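The Pearson residuals returned above are (O − E) / √E, which makes it easy to see which cells contribute most to the statistic. They can be recomputed directly from the observed table (using the counts from this particular run):

```r
# Recompute Pearson residuals from the observed counts shown above
observed <- rbind(c(53, 50, 54, 43),
                  c(47, 50, 52, 51))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
residuals <- (observed - expected) / sqrt(expected)
round(residuals, 7) # matches MyTest$residuals
```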