There are many options for producing contingency tables and summary tables in R.
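We will review the following methods: group_by(), summarize(), and spread() from dplyr and tidyr; table() and its companions from base R; and CrossTable() from the gmodels package.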
The more things you can accomplish within the tidyverse of R packages, the better (IMO). Using dplyr to produce your summary statistics lets you continue the code seamlessly into the next task (filtering, plotting, etc.).
The group_by(), summarize(), and spread() commands are a useful combination for producing aggregate or summary values of our data.
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr) #for printing html-friendly tables.
The examples use the mpg dataset that ships with ggplot2. Its first six rows look like this:
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
Here, we can get the total number of cars with each class & cyl combination using group_by() and summarize().
mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  kable()
class | cyl | n |
---|---|---|
2seater | 8 | 5 |
compact | 4 | 32 |
compact | 5 | 2 |
compact | 6 | 13 |
midsize | 4 | 16 |
midsize | 6 | 23 |
midsize | 8 | 2 |
minivan | 4 | 1 |
minivan | 6 | 10 |
pickup | 4 | 3 |
pickup | 6 | 10 |
pickup | 8 | 20 |
subcompact | 4 | 21 |
subcompact | 5 | 2 |
subcompact | 6 | 7 |
subcompact | 8 | 5 |
suv | 4 | 8 |
suv | 6 | 16 |
suv | 8 | 38 |
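As a side note, dplyr also offers count() as a shorthand for the group_by() plus summarize(n = n()) pattern. The sketch below is only an illustration; the object names by_hand and by_count are made up for this example, and it assumes ggplot2 and dplyr are installed.

```r
library(ggplot2)  # supplies the mpg dataset
library(dplyr)

# Long form, as used above (ungroup() added so both results are plain tibbles)
by_hand <- mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  ungroup()

# Shorthand: count() groups, tallies, and ungroups in one step
by_count <- mpg %>%
  count(class, cyl)

by_count
```

Both versions should give one row per class & cyl combination that actually occurs in the data.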
To turn our summary data into a crosstab or contingency table, we need variable A (class) to be listed by row, and variable B (cyl) to be listed by column.
We can achieve this by including the spread() command, to create columns for each cyl value, with n as the crosstab response value.
mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  spread(cyl, n) %>%
  kable()
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 5 |
compact | 32 | 2 | 13 | NA |
midsize | 16 | NA | 23 | 2 |
minivan | 1 | NA | 10 | NA |
pickup | 3 | NA | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | NA | 16 | 38 |
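In current versions of tidyr, spread() is marked as superseded in favor of pivot_wider(). A rough equivalent of the crosstab above might look like the following sketch. It assumes tidyr 1.1.0 or later (for names_sort), and the object name class_by_cyl is invented for illustration. Filling empty cells with 0 instead of NA is a choice that makes sense for counts, not something the original table does.

```r
library(ggplot2)  # supplies the mpg dataset
library(dplyr)
library(tidyr)

class_by_cyl <- mpg %>%
  count(class, cyl) %>%            # same counts as group_by() + summarize(n = n())
  pivot_wider(
    names_from  = cyl,             # one column per cyl value
    values_from = n,               # cell value is the frequency
    values_fill = 0L,              # combinations that never occur become 0
    names_sort  = TRUE             # order the new columns 4, 5, 6, 8
  )

class_by_cyl
```

For summaries such as a mean or max, leaving the empty cells as NA is usually the safer choice, since 0 would look like a real value.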
One advantage of dplyr is that we can determine what kind of summary statistic we want to see very easily by adjusting our summarize() input.
Here, instead of displaying frequencies, we get the average city miles per gallon (cty) by class & cyl.
mpg %>%
  group_by(class, cyl) %>%
  summarize(mean_cty = mean(cty)) %>%
  spread(cyl, mean_cty) %>%
  kable()
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 15.40000 |
compact | 21.37500 | 21 | 16.92308 | NA |
midsize | 20.50000 | NA | 17.78261 | 16.00000 |
minivan | 18.00000 | NA | 15.60000 | NA |
pickup | 16.00000 | NA | 14.50000 | 11.80000 |
subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
suv | 18.00000 | NA | 14.50000 | 12.13158 |
Or the maximum city miles per gallon by class & cyl:
mpg %>%
  group_by(class, cyl) %>%
  summarize(max_cty = max(cty)) %>%
  spread(cyl, max_cty) %>%
  kable()
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 16 |
compact | 33 | 21 | 18 | NA |
midsize | 23 | NA | 19 | 16 |
minivan | 18 | NA | 17 | NA |
pickup | 17 | NA | 16 | 14 |
subcompact | 35 | 20 | 18 | 15 |
suv | 20 | NA | 17 | 14 |
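If several statistics are wanted at once, they can be requested in a single summarize() call and kept in long format, which avoids running the pipeline once per statistic. This is a minimal sketch; the name city_stats is hypothetical, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(ggplot2)  # supplies the mpg dataset
library(dplyr)

city_stats <- mpg %>%
  group_by(class, cyl) %>%
  summarize(
    n        = n(),        # frequency
    mean_cty = mean(cty),  # average city mpg
    max_cty  = max(cty),   # best city mpg
    .groups  = "drop"      # return an ungrouped tibble
  )

city_stats
```

A long table like this cannot be spread into a single crosstab directly, so pick one statistic column before calling spread().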
We can find proportions by creating a new, calculated variable that divides each row's frequency by the total frequency of the table.
mpg %>%
  group_by(class) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>% # our new proportion variable
  kable()
class | n | prop |
---|---|---|
2seater | 5 | 0.0213675 |
compact | 47 | 0.2008547 |
midsize | 41 | 0.1752137 |
minivan | 11 | 0.0470085 |
pickup | 33 | 0.1410256 |
subcompact | 35 | 0.1495726 |
suv | 62 | 0.2649573 |
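We can create a contingency table of proportion values by applying the same spread() command as before. Because summarize() removes only the last grouping variable (cyl), the result is still grouped by class, so sum(n) is computed within each class and each class column in the table below sums to 1. Vary the group_by() and spread() arguments to produce proportions of different variables.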
mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  subset(select = c("class", "cyl", "prop")) %>% # drop the frequency value
  spread(class, prop) %>%
  kable()
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
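Since the denominator of a proportion depends on how the data are grouped at the moment mutate() runs, it can help to make the grouping explicit. The sketch below computes both table-wide and within-class proportions side by side; the names counts, props, prop_of_table, and prop_of_class are invented for illustration.

```r
library(ggplot2)  # supplies the mpg dataset
library(dplyr)

counts <- mpg %>%
  count(class, cyl)                       # ungrouped tibble of frequencies

props <- counts %>%
  mutate(prop_of_table = n / sum(n)) %>%  # denominator: all 234 cars
  group_by(class) %>%
  mutate(prop_of_class = n / sum(n)) %>%  # denominator: cars in that class
  ungroup()

props
```

The prop_of_class column should match the values in the table above, while prop_of_table sums to 1 over the whole data set.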
table() is a quick way to pull together row/column frequencies and proportions for categorical variables.
Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.
table(mpg$class, mpg$cyl)
##
## 4 5 6 8
## 2seater 0 0 0 5
## compact 32 2 13 0
## midsize 16 0 23 2
## minivan 1 0 10 0
## pickup 3 0 10 20
## subcompact 21 2 7 5
## suv 8 0 16 38
The same frequencies can also be displayed with the ftable() command, which prints the table in a flat layout.
mpg_table <- table(mpg$class, mpg$cyl) # store the table as an object so it can be reused below
ftable(mpg_table)
## 4 5 6 8
##
## 2seater 0 0 0 5
## compact 32 2 13 0
## midsize 16 0 23 2
## minivan 1 0 10 0
## pickup 3 0 10 20
## subcompact 21 2 7 5
## suv 8 0 16 38
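With only two variables, ftable() prints much the same thing as table(). Its layout tends to be more useful with three or more variables, where table() would print one slice per level. As a hedged illustration, the sketch below adds drv (drive type) as a third variable; the object names three_way and flat are made up for this example.

```r
library(ggplot2)  # supplies the mpg dataset

# Three-way table: class by cyl by drv
three_way <- table(class = mpg$class, cyl = mpg$cyl, drv = mpg$drv)

# By default ftable() puts the last variable (drv) in the columns
# and every combination of the others (class x cyl) in the rows
flat <- ftable(three_way)

flat
```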
For row frequencies, we use the margin.table() command, with 1 as the margin argument.
margin.table(mpg_table, 1)
##
## 2seater compact midsize minivan pickup subcompact
## 5 47 41 11 33 35
## suv
## 62
For column frequencies, we use the margin.table() command, with 2 as the margin argument.
margin.table(mpg_table, 2)
##
## 4 5 6 8
## 81 4 79 70
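If the goal is to see the row and column totals alongside the cell counts, base R's addmargins() appends them to the table itself. A small sketch, with with_totals as an illustrative name:

```r
library(ggplot2)  # supplies the mpg dataset

mpg_table <- table(mpg$class, mpg$cyl)

# Adds a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(mpg_table)

with_totals
```

The "Sum" row and column should agree with the margin.table() results above.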
We can get the proportion values for our variable combinations as well.
For proportion of the entire table, we use the prop.table() command.
prop.table(mpg_table) #proportion of entire table
##
## 4 5 6 8
## 2seater 0.000000000 0.000000000 0.000000000 0.021367521
## compact 0.136752137 0.008547009 0.055555556 0.000000000
## midsize 0.068376068 0.000000000 0.098290598 0.008547009
## minivan 0.004273504 0.000000000 0.042735043 0.000000000
## pickup 0.012820513 0.000000000 0.042735043 0.085470085
## subcompact 0.089743590 0.008547009 0.029914530 0.021367521
## suv 0.034188034 0.000000000 0.068376068 0.162393162
For row proportions, we use the prop.table() command, with the 1 argument following the table name.
prop.table(mpg_table, 1) #proportion of entire row
##
## 4 5 6 8
## 2seater 0.00000000 0.00000000 0.00000000 1.00000000
## compact 0.68085106 0.04255319 0.27659574 0.00000000
## midsize 0.39024390 0.00000000 0.56097561 0.04878049
## minivan 0.09090909 0.00000000 0.90909091 0.00000000
## pickup 0.09090909 0.00000000 0.30303030 0.60606061
## subcompact 0.60000000 0.05714286 0.20000000 0.14285714
## suv 0.12903226 0.00000000 0.25806452 0.61290323
For column proportions, we use the prop.table() command, with the 2 argument following the table name.
prop.table(mpg_table, 2) #proportion of entire column
##
## 4 5 6 8
## 2seater 0.00000000 0.00000000 0.00000000 0.07142857
## compact 0.39506173 0.50000000 0.16455696 0.00000000
## midsize 0.19753086 0.00000000 0.29113924 0.02857143
## minivan 0.01234568 0.00000000 0.12658228 0.00000000
## pickup 0.03703704 0.00000000 0.12658228 0.28571429
## subcompact 0.25925926 0.50000000 0.08860759 0.07142857
## suv 0.09876543 0.00000000 0.20253165 0.54285714
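Long decimals are hard to scan, so one option is to round the proportions into percentages. The result can also be converted to a data frame if it needs to feed into dplyr or ggplot2 afterwards. This is a sketch only; row_props, row_pct, and row_long are hypothetical names.

```r
library(ggplot2)  # supplies the mpg dataset

mpg_table <- table(mpg$class, mpg$cyl)

row_props <- prop.table(mpg_table, margin = 1)  # each row sums to 1
rowSums(row_props)                              # quick sanity check

row_pct <- round(100 * row_props, 1)            # percentages, one decimal place
row_pct

# Long-format data frame with the default column names Var1, Var2, Freq
row_long <- as.data.frame(row_props)
head(row_long)
```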
The CrossTable() command from the gmodels package produces frequencies, and table, row, & column proportions with a single command. The values are not as easily pulled into tables of their own, or manipulated further, as they are with the dplyr/tidyr tables, but this is a handy command nonetheless.
install.packages("gmodels") # only needed once
library(gmodels)
CrossTable(mpg$class, mpg$cyl)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 234
##
##
## | mpg$cyl
## mpg$class | 4 | 5 | 6 | 8 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 2seater | 0 | 0 | 0 | 5 | 5 |
## | 1.731 | 0.085 | 1.688 | 8.210 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
## | 0.000 | 0.000 | 0.000 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.021 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## compact | 32 | 2 | 13 | 0 | 47 |
## | 15.210 | 1.782 | 0.518 | 14.060 | |
## | 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
## | 0.395 | 0.500 | 0.165 | 0.000 | |
## | 0.137 | 0.009 | 0.056 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## midsize | 16 | 0 | 23 | 2 | 41 |
## | 0.230 | 0.701 | 6.059 | 8.591 | |
## | 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
## | 0.198 | 0.000 | 0.291 | 0.029 | |
## | 0.068 | 0.000 | 0.098 | 0.009 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## minivan | 1 | 0 | 10 | 0 | 11 |
## | 2.070 | 0.188 | 10.641 | 3.291 | |
## | 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
## | 0.012 | 0.000 | 0.127 | 0.000 | |
## | 0.004 | 0.000 | 0.043 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## pickup | 3 | 0 | 10 | 20 | 33 |
## | 6.211 | 0.564 | 0.117 | 10.391 | |
## | 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
## | 0.037 | 0.000 | 0.127 | 0.286 | |
## | 0.013 | 0.000 | 0.043 | 0.085 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## subcompact | 21 | 2 | 7 | 5 | 35 |
## | 6.515 | 3.284 | 1.963 | 2.858 | |
## | 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
## | 0.259 | 0.500 | 0.089 | 0.071 | |
## | 0.090 | 0.009 | 0.030 | 0.021 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## suv | 8 | 0 | 16 | 38 | 62 |
## | 8.444 | 1.060 | 1.162 | 20.403 | |
## | 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
## | 0.099 | 0.000 | 0.203 | 0.543 | |
## | 0.034 | 0.000 | 0.068 | 0.162 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 81 | 4 | 79 | 70 | 234 |
## | 0.346 | 0.017 | 0.338 | 0.299 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
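The full output is fairly dense. CrossTable() accepts arguments that switch individual cell entries off, and it invisibly returns a list whose components include the counts and proportion matrices. The sketch below assumes gmodels is already installed and keeps only counts and row proportions; ct is an illustrative name.

```r
library(ggplot2)  # supplies the mpg dataset
library(gmodels)  # assumes the package has already been installed

ct <- CrossTable(
  mpg$class, mpg$cyl,
  prop.chisq = FALSE,  # hide chi-square contributions
  prop.c     = FALSE,  # hide column proportions
  prop.t     = FALSE,  # hide table proportions
  digits     = 2       # decimals shown for the remaining proportions
)

ct$t         # matrix of cell counts
ct$prop.row  # matrix of row proportions
```

Pulling ct$t or ct$prop.row out of the returned list is one way to reuse the numbers without retyping them from the printed output.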