Image courtesy of Allison Horst’s Twitter: @allison_horst

1 Introduction

1.1 Why dplyr?

  • Powerful but efficient

    • Consistent syntax

    • Fast

  • Function chaining

    • Works well with entire tidyverse suite
data("iris")
library(tidyverse)
## ── Attaching packages ────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

1.2 Tibbles

  • An “update” to the data.frame object class in R

  • Updates relevant for using dplyr:

    • Vectors of length 1 are automatically recycled

    • Newly created vectors can be referenced in the same line of code

  • Other perks: only the first 10 lines print to the screen so your console doesn’t get overloaded

  • Read more about tibbles here: https://r4ds.had.co.nz/tibbles.html

  • dplyr automatically converts objects to tbl_df (tibble data frame) objects

1.3 Piping

  • %>% operator from library(magitrr)

  • Use %>% to send an object (typically a dataframe) to the next function

  • The function you pipe to will use the object in front of the %>% as its first argument

iris %>% head() # equivalent to head(iris)
iris %>% head(n=3) # equivalent to head(iris, n=3)
  • If you don’t want it to be the first argument, use .
iris %>% lm(Sepal.Width ~ Sepal.Length, data=.)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = .)
## 
## Coefficients:
##  (Intercept)  Sepal.Length  
##      3.41895      -0.06188
  • Shortcut for %>% is CTRL + SHIFT + M (or CMD + SHIFT + M for OSX)

  • A function that takes a data frame as the first argument, eg. head(), is called a verb

  • The entire tidyverse suite operates under the verb function structure, making piping especially convenient

2 Main verbs

2.1 arrange()

  • Sort data frame by column(s), lowest to highest
arrange(iris, Petal.Length) 
  • If you specify a variable of class factor or character, you will rearrange the rows to alphabetical order

  • Use desc() if you want the opposite order

arrange(iris, desc(Species), # sort z to a since species is a factor
        Sepal.Width)  

2.2 mutate()

  • Add a new variable (while preserving all existing variables)
mutate(iris, logSepLength = log(Sepal.Length)) 
  • There’s also transmute() which deletes the old column(s) you use to make the new column
iris %>% transmute(Length_diff = Sepal.Length - Petal.Length,
                   Width_diff = Sepal.Width - Petal.Width) 

2.3 filter()

  • Return rows matching specified conditions

  • Use with >, >=, <, <=, |, !, %in%, ==, and !=. Separating conditions by , represents the & argument.

iris %>% filter(Sepal.Length >= 2) 
iris %>% filter(Petal.Length >= mean(Petal.Length)) 
iris %>% filter(Species != "versicolor") 
iris %>% filter(Species %in% c("versicolor", "setosa"))

2.4 select()

  • Keep only specified variables
select(iris, Sepal.Length, Sepal.Width) 
  • Specify variables to exclude with -
select(iris, -Species) 
  • Select a range of variables with :
select(iris, Sepal.Length:Petal.Length) 
  • If you select just one column, you will still get a dataframe. If you need a vector, use pull()
pull(iris, Sepal.Length) 
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

2.5 summarise()

  • Condenses data down to one value per group
summarise(iris, mean(Petal.Length))
summarise(iris, sd_pl = sd(Petal.Length), var_pl = sd(Petal.Length)^2)

2.6 group_by()

  • Invisibly groups data by specified column(s)

  • Use with other verbs to get grouped information

iris %>% group_by(Species) 
iris %>%
  group_by(Species) %>%
  summarise(mean(Petal.Length), sd(Petal.Length))
  • Data will remain grouped until you use ungroup()

2.7 rename()

  • Give your columns new names. Syntax is newColumn = oldColumn.
iris %>% rename(sep_len = Sepal.Length) 

3 Helper functions

3.1 everything()

  • Move columns to the front of your data
select(iris, Species, everything()) 

3.2 starts_with()

select(iris, starts_with("Petal")) 

3.3 ends_with()

select(iris, ends_with("Length")) 

3.4 contains()

  • Searches column names for a specified string
select(iris, contains("Wid")) 

3.5 matches()

  • Searches column names for a specified regular expression
select(iris, matches("wid|spec")) 

For more info on regex see here

3.6 row_number()

  • Specify which row number you want for your verb
iris %>%
  group_by(Species) %>%
  filter(row_number() == 1)

3.7 n()

  • References the number of rows in your data frame (or for each group in a 'grouped_df')
iris %>%
  group_by(Species) %>%
  filter(row_number() == n())

4 Fancy verbs

  • Scoped verbs take the additional arguments .predicate, .funs, and .vars

  • They end in _at(), _if() and _all()

  • Signify what function (.funs) should be applied too all variables, only at certain variables (.vars), or only if variables meet a certain condition (.predicate)

  • ie. mutate_at(), mutate_if(), mutate_all(), summarise_if(), summarise_at(), summarise_all(), select_if(), select_at(), rename_if(), filter_all(), arrange_all(), group_by_at() …the list goes on!

  • A few examples (we will see more later)

iris %>% select_if(.predicate = is.numeric)
iris %>% select_if(.predicate = is.numeric, .funs=funs(paste0("num_",.))) # can also be used to rename
iris %>% summarise_if(.predicate = is.numeric, .funs = funs(mean(., na.rm=T)))
iris %>% mutate_at(.vars = vars(c("Sepal.Length","Petal.Length")),
                                         .funs = funs(scale))
# z scores by Species for all numeric variables
iris %>% group_by(Species) %>%
  mutate_if(.predicate = is.numeric, .funs = funs(scale))
  • I have never used any of the group_by_*() but I imagine they’re useful when you have a large selection of identifiers/grouping variables that you can call easily with a predicate (for example, is.factor). They can also be used with a .funs argument as a shortcut to group_by() %>% mutate().

5 Miscellaneous verbs

5.1 lag()

  • Makes a new column with the value of one row previously

5.2 lead()

  • Makes a new column with the value of one row ahead
x <- runif(5)
cbind(ahead = lead(x), x, behind = lag(x))
##           ahead          x     behind
## [1,] 0.01750923 0.69201174         NA
## [2,] 0.26307797 0.01750923 0.69201174
## [3,] 0.04004479 0.26307797 0.01750923
## [4,] 0.41882181 0.04004479 0.26307797
## [5,]         NA 0.41882181 0.04004479
  • Can choose a time column to order your new values by with the argument 'order_by'
  • Lag and lead are very useful for longitudinal models

5.3 complete()

incomplete_df <- data.frame(day = c(1,3,7,9), dose = c(0, 25, 40, 30))
incomplete_df
complete(incomplete_df, day = full_seq(1:max(day), 1))

5.4 fill()

  • Fill in missing values with values before or after
incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  fill(dose, .direction = "down")

5.5 drop_na()

incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  drop_na() # when you have more variables, specify which columns you care about dropping NAs from

5.6 sample_frac()

  • Randomly sample a specified fraction of rows of a data frame
mydat <- data.frame(id = sample(1:100, 20), meas = rnorm(20))
mydat
mydat %>% sample_frac(size = .5)
mydat %>% sample_frac(size= .5, replace = T) # you can also add sampling weights

5.7 sample_n()

  • Randomly sample a specified number of rows of a data frame
mydat %>% sample_n(3)

6 Joining functions

x <- data.frame(id = c("A","B","C"), val = 1:3)
y <- data.frame(id = c("A","B","D"), val = c("T","F","T"))
x
##   id val
## 1  A   1
## 2  B   2
## 3  C   3
y
##   id val
## 1  A   T
## 2  B   F
## 3  D   T
  • “Mutating” joins combine variables from the left and right hand sides ie. full_join(), inner_join(), right_join(), and left_join()

6.1 full_join()

  • Return all rows and columns
full_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>
## 4  D    NA     T

6.2 inner_join()

  • Return all rows from x that have a match in y, and all columns from x and y
inner_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F

6.3 left_join()

  • Return all rows from x and all columns from x and y
left_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>

6.4 right_join()

  • Return all rows from x and all columns from x and y
right_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  D    NA     T
  • “Filtering” joins keep cases from the left hand side, ie. semi_join() and anti_join()

6.5 semi_join()

  • Return all rows in x that have a match in y, keeping only columns from y
semi_join(x, y, by = "id")
##   id val
## 1  A   1
## 2  B   2

6.6 anti_join()

  • Return all rows from x where there are not matching values in y, keeping just the columns from x
anti_join(x, y, by = "id")
##   id val
## 1  C   3

7 Incorporating dplyr into your workflow

7.1 Frequency Tables

library(kableExtra)
mtcars %>%
  group_by(vs, am) %>%
  summarise(n=n(), freq=n()/nrow(.)) %>%
  kable(caption="Frequency Table of vs and am", format="html") %>%
  kable_styling(c("condensed","responsive","striped","hover"), full_width = F)
Frequency Table of vs and am
vs am n freq
0 0 12 0.37500
0 1 6 0.18750
1 0 7 0.21875
1 1 7 0.21875

7.2 compareGroups

library(compareGroups)
iris %>%
  mutate(Sepal.Length.Sq = Sepal.Length^2) %>%
  compareGroups(Species ~ ., data = .) %>%
  createTable() %>%
  export2md() %>%
  kable_styling(c("condensed","responsive","striped","hover"))
Summary descriptives table by groups of `Species’
setosa versicolor virginica p.overall
N=50 N=50 N=50
Sepal.Length 5.01 (0.35) 5.94 (0.52) 6.59 (0.64) <0.001
Sepal.Width 3.43 (0.38) 2.77 (0.31) 2.97 (0.32) <0.001
Petal.Length 1.46 (0.17) 4.26 (0.47) 5.55 (0.55) <0.001
Petal.Width 0.25 (0.11) 1.33 (0.20) 2.03 (0.27) <0.001
Sepal.Length.Sq 25.2 (3.55) 35.5 (6.16) 43.8 (8.44) <0.001

7.3 ggplotting

iris %>%
  filter(Species != "setosa") %>%
  ggplot(aes(Sepal.Length, Petal.Width, col=Species)) +
  geom_point()

7.4 Finding columns of interest

  • When you have too many column names to look through manually, search for a string or pattern of strings
load("dplyr_dat.Rdata")
length(names(brain))
## [1] 1000
select(brain, matches("pib.*parietal.*")) %>% names() # regex is not case sensitive
## [1] "PIB.PET_AAL__Parietal_Inf_L"        
## [2] "PIB.PET_AAL__Parietal_Inf_R"        
## [3] "PIB.PET_AAL__Parietal_Sup_L"        
## [4] "PIB.PET_AAL__Parietal_Sup_R"        
## [5] "PIB.PET_FS__ctx.lh.inferiorparietal"
## [6] "PIB.PET_FS__ctx.lh.superiorparietal"
## [7] "PIB.PET_FS__ctx.rh.inferiorparietal"
## [8] "PIB.PET_FS__ctx.rh.superiorparietal"

7.5 Missing data wrangling

  • Add a column flagging a value as missing and then replace the missing values with the mean of the other values
misdat <- data.frame(x1 = sample(c(1:3, NA), 13, replace=T),
                     x2 = sample(c(-5:-2,NA), 13, replace=T))
misdat
misdat %>%
  mutate_all(.funs = funs(miss = ifelse(is.na(.), 1, 0))) %>%
  mutate_all(.funs = funs(replace_na(., mean(., na.rm=T))))

7.6 Longitudinal data wrangling A

  • Calculate the time since a patient was first admitted to the hospital
id <- c(1,1,1,2,2,2,2,3)
admit <- as.Date(c("2017-06-22", "2017-07-13", "2017-08-29",
                   "2017-04-01", "2017-05-02", "2017-11-14", "2018-01-14",
                   "2019-01-01"))
discharge <- as.Date(c("2017-06-25", "2017-07-31", "2017-10-13",
                   "2017-04-02", "2017-05-10", "2017-11-18", "2018-02-12",
                   "2019-01-05"))
hosp_dat <- data.frame(id, admit, discharge)
hosp_dat
hosp_dat %>%
  group_by(id) %>%
  mutate(days_since_init_admit = discharge - admit[1])

7.7 Longitudinal data wrangling B

  • Make a row for every patient for every month from the start of follow up to the end of follow up (get equally spaced time intervals)

  • Make another column containing the drugs the patient was on previously (for prediction or longitudinal models)

load("meds.Rdata")
meds
meds %>%
    group_by(PatientID) %>%
    complete(Months_ImplantToVisit = full_seq(1:max(Months_ImplantToVisit), 1)) %>%
    fill(ends_with("_yn")) %>%
    fill(ends_with("_yn"), .direction="up") -> meds1
meds1
meds1 %>%
  mutate_at(.vars = vars(ends_with("_yn")),
              .funs = funs(prev = lag(., order_by = Months_ImplantToVisit))) %>%
    fill(ends_with("_prev"), .direction="up")

7.8 Survival data wrangling

  • Map four columns “days to…outcome” to a composite endpoint for a survival model

  • Record when the event occurred as Days.to.first.event and which event it was in FE.status

dems %>%
  select(Days.to.lastFU, Days.to.death, Days.to.stroke, Days.to.GIB, Days.to.PT) %>%
  # which.min cannot handle NA, so we'll make NA's infinity for now
  replace(is.na(.), Inf) %>%  
  # to allow which.min to search along rows (dplyr naturally looks down columns)
  rowwise() %>%
  mutate(
    FUorFEtime = pmin(Days.to.lastFU, Days.to.death, Days.to.stroke,
                      Days.to.GIB, Days.to.PT,
                      na.rm = T),
    # numbers correspond to order of the Days* columns
    FUorFEstatus = which.min(c(Days.to.lastFU, Days.to.death, Days.to.stroke,
                                    Days.to.GIB, Days.to.PT)),
    # condensed variable for survival model, 1 if any event
  Event_yn = ifelse(FUorFEstatus == 1, 0, 1)) -> dems_int

dems_int # look at the intermediate output
dems_int %>%
  # column names correspond to the order of the columns, rename
  mutate(FUorFEstatus = case_when(FUorFEstatus == 1 ~ "censored",
                                  FUorFEstatus == 2 ~ "death",
                                  FUorFEstatus == 3 ~ "stroke",
                                  FUorFEstatus == 4 ~ "gib",
                                  FUorFEstatus == 5 ~ "pt",
                                  TRUE ~ "error")) %>%
  # not case sensitive
  select(contains("fe"), Event_yn) %>%
  # allow for joining with no duplicate cols
  rownames_to_column() %>% 
  full_join(dems %>% rownames_to_column()) %>%
  select(-rowname) -> dems_clean

# check to make sure everything worked
dems_clean %>% filter(FUorFEstatus == "error")
dems_clean

7.9 Making functions

  • Problem: dplyr doesn’t know what to do with quotes around the variable name, but you can’t put the column name into a function without quotes because R will try to find it as an object in your environment…

  • Solution: relying on the rlang package (sym, !!, !!!, etc)
  • A good tutorial: http://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/

irisSummary <- function(group){
    iris %>%
      group_by(!!sym(group)) %>%
      summarise(mean(Sepal.Length), sd(Sepal.Length))
}

irisSummary(group = "Species")
irisScatPlot <- function(x, y){
    iris %>%
    ggplot(aes_string(x, y, col="Species")) +
    geom_point() -> p
  return(p)
}

irisScatPlot(x="Sepal.Length",y="Petal.Length")