A Data Frame is a list of vectors (columns of data) of equal length.
Example:
# Vectors
name <- c("Jane", "Jill", "Jim", "Joe")
age <- c("24", "24", "27", "26")
sex <- c("Female", "Female", "Male", "Male")
# Data Frame
dataframe <- data.frame(name, age, sex)
c=() is the concatonate command, which holds together multiple values in a common vector
The str() function will describe the dataframe, its variables, and the levels of each variable.
'data.frame': 4 obs. of 3 variables:
$ name: Factor w/ 4 levels "Alex","Kelvin",..: 3 4 2 1
$ age : Factor w/ 3 levels "24","26","27": 1 1 3 2
$ sex : Factor w/ 2 levels "Female","Male": 1 1 2 2
We can use the print() command to view the dataframe.
print(dataframe)
name age sex
1 Jane 24 Female
2 Jill 24 Female
3 Jim 27 Male
4 Joe 26 Male
We can print a single column.
dataframe["age"]
age
1 24
2 24
3 27
4 26
We can identify the column by name(above), or index number(below).
dataframe[2]
age
1 24
2 24
3 27
4 26
We can view a single row.
dataframe[3, ]
name age sex
3 Kelvin 27 Male
Or multiple rows.
dataframe[1:3, ]
name age sex
1 Kenia 24 Female
2 Sabrina 24 Female
3 Kelvin 27 Male
# Don't forget the comma!
To load a dataframe from a file on your computer
(You will need the file path for your data file)
mydata <- read.csv("filepath") #Read a .csv file
mydata <- read.delim("filepath") #Read other table data formats
Load the dplyr package. The dplyr package contains the filter() command that will be used in this tutorial.
install.packages("dplyr") #Install the package if you haven't before.
library(dplyr) #Load the package
Load the starwars dataframe.
data(starwars)
print(starwars)
# A tibble: 87 x 13
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, blue red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, grey light blue
7 Beru Whitesun lars 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darklighter 183 84 black light brown
10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
# ... with 77 more rows, and 7 more variables: birth_year <dbl>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Wait: Before we begin filtering, you should understand how the Magrittr package allows us to use pipes (“%>%”) to tie together multiple dplyr functions into a single executable string of commands.
Below, we can both (1)define a new dataframe, and (2)print the tablewith one single script of dplyr commands that are piped together with Magrittr “%>%” piping. Any commands piped together in this script will operate on the dataframe named at the top of the script, in the order in which the commands appear in text.
new_starwars <- starwars %>%
print()
# A tibble: 87 x 13
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, blue red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, grey light blue
7 Beru Whitesun lars 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darklighter 183 84 black light brown
10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
# ... with 77 more rows, and 7 more variables: birth_year <dbl>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We can filter observations based on numeric values by using the various comparative operators in r. Here are a few:
Description | Syntax |
---|---|
Less than | VariableName < Value |
Less than or Equal to | VariableName <= Value |
Greater than | VariableName > Value |
Greater than or Equal to | VariableName >= Value |
Equal to | VariableName == Value |
Not Less than | !VariableName < Value |
Not Less than or Equal to | !VariableName <= Value |
Not Greater than | !VariableName > Value |
Not Greater than or Equal to | !VariableName >= Value |
Not Equal to | !VariableName = Value |
We always need to use two “=” signs (==) when making comparisons. The single “=” takes on different purposes in different contexts.
Let’s filter “starwars” to only show characters who are more than 150cm tall.
We want to keep the changes we make, so we will have to define a new object. If we define a new object with the same name as the previous object, we overwrite the previous object.
tall_characters <- starwars %>%
filter(height > 150) %>%
print()
# A tibble: 69 x 13
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 Darth Vader 202 136 none white yellow
4 Owen Lars 178 120 brown, grey light blue
5 Beru Whitesun lars 165 75 brown light blue
6 Biggs Darklighter 183 84 black light brown
7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
8 Anakin Skywalker 188 84 blond fair blue
9 Wilhuff Tarkin 180 NA auburn, grey fair blue
10 Chewbacca 228 112 brown unknown blue
# ... with 59 more rows, and 7 more variables: birth_year <dbl>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We can filter based on results of computations between multiple variables in R. Some of the mathematical operators available in R are:
Description | Operator |
---|---|
Addition | + |
Subtraction | - |
Multiplication | * |
Division | / |
Square Root | sqrt(x) |
Absolute Value | abs(x) |
Exponentiation | ^ |
Natural Exponential Function | exp(x) |
Natural Logarithm | log(x) |
Common Logarithm | log10(x) |
Let’s filter “starwars” to only show characters whose height/weight ratio is greater than or equal to 3/1.
tall_and_skinny_characters <- tall_characters %>%
filter(height/mass >= 3) %>%
print()
# A tibble: 9 x 13
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Ayla Secura 178 55.0 none blue hazel
2 Adi Gallia 184 50.0 none dark blue
3 Luminara Unduli 170 56.2 black yellow blue
4 Barriss Offee 166 50.0 black yellow blue
5 Zam Wesell 168 55.0 blonde fair, green, yellow yellow
6 Wat Tambor 193 48.0 none green, grey unknown
7 Shaak Ti 178 57.0 none red, blue, white black
8 Sly Moore 178 48.0 none pale white
9 Padmé Amidala 165 45.0 brown light brown
# ... with 7 more variables: birth_year <dbl>, gender <chr>,
# homeworld <chr>, species <chr>, films <list>, vehicles <list>,
# starships <list>
We can filter based on exact matches of string values.
Let’s filter to only show human characters.
starwars %>%
filter(species == "Human") %>%
print()
# A tibble: 35 x 13
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 Darth Vader 202 136 none white yellow
3 Leia Organa 150 49 brown light brown
4 Owen Lars 178 120 brown, grey light blue
5 Beru Whitesun lars 165 75 brown light blue
6 Biggs Darklighter 183 84 black light brown
7 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
8 Anakin Skywalker 188 84 blond fair blue
9 Wilhuff Tarkin 180 NA auburn, grey fair blue
10 Han Solo 180 80 brown fair brown
# ... with 25 more rows, and 7 more variables: birth_year <dbl>,
# gender <chr>, homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
The stringr() package provides a neat and simple experience for dealing with strings.
We can filter the data to only show rows containing “Skywalker” by using the str_detect() function from the stringr package.
library(stringr)
starwars %>%
filter(str_detect(name, "Skywalker"))
# A tibble: 3 x 13
name height mass hair_color skin_color eye_color birth_year
<chr> <int> <dbl> <chr> <chr> <chr> <dbl>
1 Luke Skywalker 172 77 blond fair blue 19.0
2 Anakin Skywalker 188 84 blond fair blue 41.9
3 Shmi Skywalker 163 NA black fair brown 72.0
# ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
See the stringr() package documentation for more functions and package details.
We can also use negative statements to filter data by adding an “!” before the evaluation.
Let’s filter for all rows where the species is not human.
starwars %>%
filter(!species == "Human") %>%
print()
# A tibble: 47 x 13
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold
2 R2-D2 96 32 <NA> white, blue
3 R5-D4 97 32 <NA> white, red
4 Chewbacca 228 112 brown unknown
5 Greedo 173 74 <NA> green
6 Jabba Desilijic Tiure 175 1358 <NA> green-tan, brown
7 Yoda 66 17 white green
8 IG-88 200 140 none metal
9 Bossk 190 113 none green
10 Ackbar 180 83 none brown mottle
# ... with 37 more rows, and 8 more variables: eye_color <chr>,
# birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
So far, we have been filtering out rows or observations. We can also chose which columns we want to display in a dataframe by using a combination of the the subset() & select() functions.
Let’s only keep the columns for name, eye color, species, and homeworld.
starwars %>%
subset(select = c("name", "eye_color", "species", "homeworld")) %>%
print()
# A tibble: 87 x 4
name eye_color species homeworld
<chr> <chr> <chr> <chr>
1 Luke Skywalker blue Human Tatooine
2 C-3PO yellow Droid Tatooine
3 R2-D2 red Droid Naboo
4 Darth Vader yellow Human Tatooine
5 Leia Organa brown Human Alderaan
6 Owen Lars blue Human Tatooine
7 Beru Whitesun lars blue Human Tatooine
8 R5-D4 red Droid Tatooine
9 Biggs Darklighter brown Human Tatooine
10 Obi-Wan Kenobi blue-gray Human Stewjon
# ... with 77 more rows
We often need to produce summary information from our data. We can use the summary() function to produce some general summary stats.
summary(starwars)
name height mass hair_color
Length:87 Min. : 66.0 Min. : 15.00 Length:87
Class :character 1st Qu.:167.0 1st Qu.: 55.60 Class :character
Mode :character Median :180.0 Median : 79.00 Mode :character
Mean :174.4 Mean : 97.31
3rd Qu.:191.0 3rd Qu.: 84.50
skin_color eye_color birth_year gender
Length:87 Length:87 Min. : 8.00 Length:87
Class :character Class :character 1st Qu.: 35.00 Class :character
Mode :character Mode :character Median : 52.00 Mode :character
Mean : 87.57
3rd Qu.: 72.00
homeworld species
Length:87 Length:87
Class :character Class :character
Mode :character Mode :character
films.Length films.Class films.Mode
5 -none- character
6 -none- character
7 -none- character
4 -none- character
5 -none- character
vehicles.Length vehicles.Class vehicles.Mode
2 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
starships.Length starships.Class starships.Mode
2 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
[ reached getOption("max.print") -- omitted 82 rows ]
While this can show us number of observations, variables classes, and Min/Max/Median/Mean and quartile values for our numeric variables, sometimes we need specific summary values which are not auto-generated by such functions.
By combining the group_by() and summarize() functions from the dplyr package, we can easily create summary information from our raw data.
Let’s use group_by() and summarize() to compute average height by species.
starwars %>%
group_by(species) %>%
summarize(avgheight = mean(height))
# A tibble: 38 x 2
species avgheight
<chr> <dbl>
1 Aleena 79.0000
2 Besalisk 198.0000
3 Cerean 198.0000
4 Chagrian 196.0000
5 Clawdite 168.0000
6 Droid NA
7 Dug 112.0000
8 Ewok 88.0000
9 Geonosian 183.0000
10 Gungan 208.6667
# ... with 28 more rows
We can pass the n() argument to get a count of the rows that are collapsed into each summary row.
Let’s compute the number of characters of each species that appear in this dataset.
starwars %>%
group_by(species) %>%
summarize(number_of_characters = n())
# A tibble: 38 x 2
species number_of_characters
<chr> <int>
1 Aleena 1
2 Besalisk 1
3 Cerean 1
4 Chagrian 1
5 Clawdite 1
6 Droid 5
7 Dug 1
8 Ewok 1
9 Geonosian 1
10 Gungan 3
# ... with 28 more rows
We can create new variables in a dataframe by using the mutate() function from the dplyr package.
# First, create a subset to work from.
starwars_subset <- starwars %>%
subset(select = c("name", "height", "mass"))
# Create new variable
starwars_subset %>%
mutate(new_variable = "hello")
# A tibble: 87 x 4
name height mass new_variable
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 hello
2 C-3PO 167 75 hello
3 R2-D2 96 32 hello
4 Darth Vader 202 136 hello
5 Leia Organa 150 49 hello
6 Owen Lars 178 120 hello
7 Beru Whitesun lars 165 75 hello
8 R5-D4 97 32 hello
9 Biggs Darklighter 183 84 hello
10 Obi-Wan Kenobi 182 77 hello
# ... with 77 more rows
Having the same value in every single observation is usually not very helpful. New variables can be created computationally by performing mathematical operations on existing variables.
starwars_subset%>%
mutate(double_height = height*2,
HeightWeightRatio = height/mass)
# A tibble: 87 x 5
name height mass double_height HeightWeightRatio
<chr> <int> <dbl> <dbl> <dbl>
1 Luke Skywalker 172 77 344 2.233766
2 C-3PO 167 75 334 2.226667
3 R2-D2 96 32 192 3.000000
4 Darth Vader 202 136 404 1.485294
5 Leia Organa 150 49 300 3.061224
6 Owen Lars 178 120 356 1.483333
7 Beru Whitesun lars 165 75 330 2.200000
8 R5-D4 97 32 194 3.031250
9 Biggs Darklighter 183 84 366 2.178571
10 Obi-Wan Kenobi 182 77 364 2.363636
# ... with 77 more rows
It is often useful to use the ifelse() function to make logically determined value assignments to a new variable. ifelse() requires 3 parameters: ifelse(evaluation, value if true, value if false).
Lets create a new variable using the ifelse() function to classify characters as tall or short.
starwars_subset %>%
mutate(tallorshort = ifelse(height >= 120, "tall", "short"))
# A tibble: 87 x 4
name height mass tallorshort
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 tall
2 C-3PO 167 75 tall
3 R2-D2 96 32 short
4 Darth Vader 202 136 tall
5 Leia Organa 150 49 tall
6 Owen Lars 178 120 tall
7 Beru Whitesun lars 165 75 tall
8 R5-D4 97 32 short
9 Biggs Darklighter 183 84 tall
10 Obi-Wan Kenobi 182 77 tall
# ... with 77 more rows
Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (NA). dplyr: A Grammar of Data Manipulation. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
Hadley Wickham (2017). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.2.0. https://CRAN.R-project.org/package=stringr
Please feel free to leave feedback that could help improve this site. If you have questions, please leave them below as well and I will do my best to support you as soon as possible.