Data Frames

A Data Frame is a list of vectors (columns of data) of equal length.
Example:

# Vectors

name <- c("Jane", "Jill", "Jim", "Joe")
age <- c("24", "24", "27", "26")
sex <- c("Female", "Female", "Male", "Male")


# Data Frame

dataframe <- data.frame(name, age, sex)

c=() is the concatonate command, which holds together multiple values in a common vector

The str() function will describe the dataframe, its variables, and the levels of each variable.

'data.frame':   4 obs. of  3 variables:
 $ name: Factor w/ 4 levels "Alex","Kelvin",..: 3 4 2 1
 $ age : Factor w/ 3 levels "24","26","27": 1 1 3 2
 $ sex : Factor w/ 2 levels "Female","Male": 1 1 2 2

We can use the print() command to view the dataframe.

print(dataframe)

  name age    sex
1 Jane  24 Female
2 Jill  24 Female
3  Jim  27   Male
4  Joe  26   Male

We can print a single column.

dataframe["age"]

We can identify the column by name(above), or index number(below).

dataframe[2]

We can view a single row.

dataframe[3, ]

    name age  sex
3 Kelvin  27 Male

Or multiple rows.

dataframe[1:3, ]

     name age    sex
1   Kenia  24 Female
2 Sabrina  24 Female
3  Kelvin  27   Male

# Don't forget the comma!

To load a dataframe from a file on your computer

(You will need the file path for your data file)

mydata <- read.csv("filepath")  #Read a .csv file
mydata <- read.delim("filepath")  #Read other table data formats

Filter Data

Before we Begin

Load the dplyr package. The dplyr package contains the filter() command that will be used in this tutorial.

install.packages("dplyr")  #Install the package if you haven't before.
library(dplyr)  #Load the package

Load the starwars dataframe.

data(starwars)
print(starwars)

# A tibble: 87 x 13
                 name height  mass    hair_color  skin_color eye_color
                <chr>  <int> <dbl>         <chr>       <chr>     <chr>
 1     Luke Skywalker    172    77         blond        fair      blue
 2              C-3PO    167    75          <NA>        gold    yellow
 3              R2-D2     96    32          <NA> white, blue       red
 4        Darth Vader    202   136          none       white    yellow
 5        Leia Organa    150    49         brown       light     brown
 6          Owen Lars    178   120   brown, grey       light      blue
 7 Beru Whitesun lars    165    75         brown       light      blue
 8              R5-D4     97    32          <NA>  white, red       red
 9  Biggs Darklighter    183    84         black       light     brown
10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
# ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

About Magrittr

Wait: Before we begin filtering, you should understand how the Magrittr package allows us to use pipes (“%>%”) to tie together multiple dplyr functions into a single executable string of commands.

Below, we can both (1)define a new dataframe, and (2)print the tablewith one single script of dplyr commands that are piped together with Magrittr “%>%” piping. Any commands piped together in this script will operate on the dataframe named at the top of the script, in the order in which the commands appear in text.

new_starwars <- starwars %>% 
print()

# A tibble: 87 x 13
                 name height  mass    hair_color  skin_color eye_color
                <chr>  <int> <dbl>         <chr>       <chr>     <chr>
 1     Luke Skywalker    172    77         blond        fair      blue
 2              C-3PO    167    75          <NA>        gold    yellow
 3              R2-D2     96    32          <NA> white, blue       red
 4        Darth Vader    202   136          none       white    yellow
 5        Leia Organa    150    49         brown       light     brown
 6          Owen Lars    178   120   brown, grey       light      blue
 7 Beru Whitesun lars    165    75         brown       light      blue
 8              R5-D4     97    32          <NA>  white, red       red
 9  Biggs Darklighter    183    84         black       light     brown
10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
# ... with 77 more rows, and 7 more variables: birth_year <dbl>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filtering Observations

Filter: Numeric Values

We can filter observations based on numeric values by using the various comparative operators in r. Here are a few:

Description	Syntax
Less than	VariableName < Value
Less than or Equal to	VariableName <= Value
Greater than	VariableName > Value
Greater than or Equal to	VariableName >= Value
Equal to	VariableName == Value
Not Less than	!VariableName < Value
Not Less than or Equal to	!VariableName <= Value
Not Greater than	!VariableName > Value
Not Greater than or Equal to	!VariableName >= Value
Not Equal to	!VariableName = Value

We always need to use two “=” signs (==) when making comparisons. The single “=” takes on different purposes in different contexts.

Let’s filter “starwars” to only show characters who are more than 150cm tall.

We want to keep the changes we make, so we will have to define a new object. If we define a new object with the same name as the previous object, we overwrite the previous object.

tall_characters <- starwars %>% 
filter(height > 150) %>% 
print()

# A tibble: 69 x 13
                 name height  mass    hair_color skin_color eye_color
                <chr>  <int> <dbl>         <chr>      <chr>     <chr>
 1     Luke Skywalker    172    77         blond       fair      blue
 2              C-3PO    167    75          <NA>       gold    yellow
 3        Darth Vader    202   136          none      white    yellow
 4          Owen Lars    178   120   brown, grey      light      blue
 5 Beru Whitesun lars    165    75         brown      light      blue
 6  Biggs Darklighter    183    84         black      light     brown
 7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
 8   Anakin Skywalker    188    84         blond       fair      blue
 9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
10          Chewbacca    228   112         brown    unknown      blue
# ... with 59 more rows, and 7 more variables: birth_year <dbl>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Filter: Based on Computations

We can filter based on results of computations between multiple variables in R. Some of the mathematical operators available in R are:

Description	Operator
Addition	+
Subtraction	-
Multiplication	*
Division	/
Square Root	sqrt(x)
Absolute Value	abs(x)
Exponentiation	^
Natural Exponential Function	exp(x)
Natural Logarithm	log(x)
Common Logarithm	log10(x)

Let’s filter “starwars” to only show characters whose height/weight ratio is greater than or equal to 3/1.

tall_and_skinny_characters <- tall_characters %>% 
filter(height/mass >= 3) %>% 
print()

# A tibble: 9 x 13
             name height  mass hair_color          skin_color eye_color
            <chr>  <int> <dbl>      <chr>               <chr>     <chr>
1     Ayla Secura    178  55.0       none                blue     hazel
2      Adi Gallia    184  50.0       none                dark      blue
3 Luminara Unduli    170  56.2      black              yellow      blue
4   Barriss Offee    166  50.0      black              yellow      blue
5      Zam Wesell    168  55.0     blonde fair, green, yellow    yellow
6      Wat Tambor    193  48.0       none         green, grey   unknown
7        Shaak Ti    178  57.0       none    red, blue, white     black
8       Sly Moore    178  48.0       none                pale     white
9   Padmé Amidala    165  45.0      brown               light     brown
# ... with 7 more variables: birth_year <dbl>, gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

Filter: Strings

We can filter based on exact matches of string values.

Let’s filter to only show human characters.

starwars %>% 
filter(species == "Human") %>% 
print()

# A tibble: 35 x 13
                 name height  mass    hair_color skin_color eye_color
                <chr>  <int> <dbl>         <chr>      <chr>     <chr>
 1     Luke Skywalker    172    77         blond       fair      blue
 2        Darth Vader    202   136          none      white    yellow
 3        Leia Organa    150    49         brown      light     brown
 4          Owen Lars    178   120   brown, grey      light      blue
 5 Beru Whitesun lars    165    75         brown      light      blue
 6  Biggs Darklighter    183    84         black      light     brown
 7     Obi-Wan Kenobi    182    77 auburn, white       fair blue-gray
 8   Anakin Skywalker    188    84         blond       fair      blue
 9     Wilhuff Tarkin    180    NA  auburn, grey       fair      blue
10           Han Solo    180    80         brown       fair     brown
# ... with 25 more rows, and 7 more variables: birth_year <dbl>,
#   gender <chr>, homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

String Matching

The stringr() package provides a neat and simple experience for dealing with strings.

We can filter the data to only show rows containing “Skywalker” by using the str_detect() function from the stringr package.

library(stringr)

starwars %>% 
filter(str_detect(name, "Skywalker"))

# A tibble: 3 x 13
              name height  mass hair_color skin_color eye_color birth_year
             <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
1   Luke Skywalker    172    77      blond       fair      blue       19.0
2 Anakin Skywalker    188    84      blond       fair      blue       41.9
3   Shmi Skywalker    163    NA      black       fair     brown       72.0
# ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

See the stringr() package documentation for more functions and package details.

Filter: “Not Equal To”

We can also use negative statements to filter data by adding an “!” before the evaluation.

Let’s filter for all rows where the species is not human.

starwars %>% 
filter(!species == "Human") %>% 
print()

# A tibble: 47 x 13
                    name height  mass hair_color       skin_color
                   <chr>  <int> <dbl>      <chr>            <chr>
 1                 C-3PO    167    75       <NA>             gold
 2                 R2-D2     96    32       <NA>      white, blue
 3                 R5-D4     97    32       <NA>       white, red
 4             Chewbacca    228   112      brown          unknown
 5                Greedo    173    74       <NA>            green
 6 Jabba Desilijic Tiure    175  1358       <NA> green-tan, brown
 7                  Yoda     66    17      white            green
 8                 IG-88    200   140       none            metal
 9                 Bossk    190   113       none            green
10                Ackbar    180    83       none     brown mottle
# ... with 37 more rows, and 8 more variables: eye_color <chr>,
#   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Filtering Columns

So far, we have been filtering out rows or observations. We can also chose which columns we want to display in a dataframe by using a combination of the the subset() & select() functions.

Let’s only keep the columns for name, eye color, species, and homeworld.

starwars %>% 
subset(select = c("name", "eye_color", "species", "homeworld")) %>% 
print()

# A tibble: 87 x 4
                 name eye_color species homeworld
                <chr>     <chr>   <chr>     <chr>
 1     Luke Skywalker      blue   Human  Tatooine
 2              C-3PO    yellow   Droid  Tatooine
 3              R2-D2       red   Droid     Naboo
 4        Darth Vader    yellow   Human  Tatooine
 5        Leia Organa     brown   Human  Alderaan
 6          Owen Lars      blue   Human  Tatooine
 7 Beru Whitesun lars      blue   Human  Tatooine
 8              R5-D4       red   Droid  Tatooine
 9  Biggs Darklighter     brown   Human  Tatooine
10     Obi-Wan Kenobi blue-gray   Human   Stewjon
# ... with 77 more rows

Summarize Data

We often need to produce summary information from our data. We can use the summary() function to produce some general summary stats.

summary(starwars)

     name               height           mass          hair_color       
 Length:87          Min.   : 66.0   Min.   :  15.00   Length:87         
 Class :character   1st Qu.:167.0   1st Qu.:  55.60   Class :character  
 Mode  :character   Median :180.0   Median :  79.00   Mode  :character  
                    Mean   :174.4   Mean   :  97.31                     
                    3rd Qu.:191.0   3rd Qu.:  84.50                     
  skin_color         eye_color           birth_year        gender         
 Length:87          Length:87          Min.   :  8.00   Length:87         
 Class :character   Class :character   1st Qu.: 35.00   Class :character  
 Mode  :character   Mode  :character   Median : 52.00   Mode  :character  
                                       Mean   : 87.57                     
                                       3rd Qu.: 72.00                     
  homeworld           species         
 Length:87          Length:87         
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
 films.Length  films.Class  films.Mode
 5          -none-     character      
 6          -none-     character      
 7          -none-     character      
 4          -none-     character      
 5          -none-     character      
 vehicles.Length  vehicles.Class  vehicles.Mode
 2          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 starships.Length  starships.Class  starships.Mode
 2          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 [ reached getOption("max.print") -- omitted 82 rows ]

While this can show us number of observations, variables classes, and Min/Max/Median/Mean and quartile values for our numeric variables, sometimes we need specific summary values which are not auto-generated by such functions.

group_by & summarize()

By combining the group_by() and summarize() functions from the dplyr package, we can easily create summary information from our raw data.

Let’s use group_by() and summarize() to compute average height by species.

starwars %>% 
group_by(species) %>% 
summarize(avgheight = mean(height))

# A tibble: 38 x 2
     species avgheight
       <chr>     <dbl>
 1    Aleena   79.0000
 2  Besalisk  198.0000
 3    Cerean  198.0000
 4  Chagrian  196.0000
 5  Clawdite  168.0000
 6     Droid        NA
 7       Dug  112.0000
 8      Ewok   88.0000
 9 Geonosian  183.0000
10    Gungan  208.6667
# ... with 28 more rows

We can pass the n() argument to get a count of the rows that are collapsed into each summary row.

Let’s compute the number of characters of each species that appear in this dataset.

starwars %>% 
group_by(species) %>% 
summarize(number_of_characters = n())

# A tibble: 38 x 2
     species number_of_characters
       <chr>                <int>
 1    Aleena                    1
 2  Besalisk                    1
 3    Cerean                    1
 4  Chagrian                    1
 5  Clawdite                    1
 6     Droid                    5
 7       Dug                    1
 8      Ewok                    1
 9 Geonosian                    1
10    Gungan                    3
# ... with 28 more rows

Create New Variables

We can create new variables in a dataframe by using the mutate() function from the dplyr package.

# First, create a subset to work from.

starwars_subset <- starwars %>% 
subset(select = c("name", "height", "mass"))

# Create new variable

starwars_subset %>% 
mutate(new_variable = "hello")

# A tibble: 87 x 4
                 name height  mass new_variable
                <chr>  <int> <dbl>        <chr>
 1     Luke Skywalker    172    77        hello
 2              C-3PO    167    75        hello
 3              R2-D2     96    32        hello
 4        Darth Vader    202   136        hello
 5        Leia Organa    150    49        hello
 6          Owen Lars    178   120        hello
 7 Beru Whitesun lars    165    75        hello
 8              R5-D4     97    32        hello
 9  Biggs Darklighter    183    84        hello
10     Obi-Wan Kenobi    182    77        hello
# ... with 77 more rows

Having the same value in every single observation is usually not very helpful. New variables can be created computationally by performing mathematical operations on existing variables.

starwars_subset%>%
  
    mutate(double_height = height*2,
  
                    HeightWeightRatio = height/mass)

# A tibble: 87 x 5
                 name height  mass double_height HeightWeightRatio
                <chr>  <int> <dbl>         <dbl>             <dbl>
 1     Luke Skywalker    172    77           344          2.233766
 2              C-3PO    167    75           334          2.226667
 3              R2-D2     96    32           192          3.000000
 4        Darth Vader    202   136           404          1.485294
 5        Leia Organa    150    49           300          3.061224
 6          Owen Lars    178   120           356          1.483333
 7 Beru Whitesun lars    165    75           330          2.200000
 8              R5-D4     97    32           194          3.031250
 9  Biggs Darklighter    183    84           366          2.178571
10     Obi-Wan Kenobi    182    77           364          2.363636
# ... with 77 more rows

ifelse()

It is often useful to use the ifelse() function to make logically determined value assignments to a new variable. ifelse() requires 3 parameters: ifelse(evaluation, value if true, value if false).

Lets create a new variable using the ifelse() function to classify characters as tall or short.

starwars_subset %>% 
mutate(tallorshort = ifelse(height >= 120, "tall", "short"))

# A tibble: 87 x 4
                 name height  mass tallorshort
                <chr>  <int> <dbl>       <chr>
 1     Luke Skywalker    172    77        tall
 2              C-3PO    167    75        tall
 3              R2-D2     96    32       short
 4        Darth Vader    202   136        tall
 5        Leia Organa    150    49        tall
 6          Owen Lars    178   120        tall
 7 Beru Whitesun lars    165    75        tall
 8              R5-D4     97    32       short
 9  Biggs Darklighter    183    84        tall
10     Obi-Wan Kenobi    182    77        tall
# ... with 77 more rows

References

Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (NA). dplyr: A Grammar of Data Manipulation. http://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.

Hadley Wickham (2017). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.2.0. https://CRAN.R-project.org/package=stringr

Questions & Feedback

Please feel free to leave feedback that could help improve this site. If you have questions, please leave them below as well and I will do my best to support you as soon as possible.

Comment Box is loading comments…