Image courtesy of Allison Horst’s Twitter: @allison_horst

1 Introduction

1.1 Why dplyr?

Powerful but efficient
- Consistent syntax
- Fast
Function chaining
- Works well with entire tidyverse suite

data("iris")
library(tidyverse)

## ── Attaching packages ────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

1.2 Tibbles

An “update” to the data.frame object class in R
Updates relevant for using dplyr:
- Vectors of length 1 are automatically recycled
- Newly created vectors can be referenced in the same line of code
Other perks: only the first 10 lines print to the screen so your console doesn’t get overloaded
Read more about tibbles here: https://r4ds.had.co.nz/tibbles.html
dplyr automatically converts objects to tbl_df (tibble data frame) objects

1.3 Piping

%>% operator from library(magitrr)
Use %>% to send an object (typically a dataframe) to the next function
The function you pipe to will use the object in front of the %>% as its first argument

iris %>% head() # equivalent to head(iris)

iris %>% head(n=3) # equivalent to head(iris, n=3)

If you don’t want it to be the first argument, use .

iris %>% lm(Sepal.Width ~ Sepal.Length, data=.)

## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = .)
## 
## Coefficients:
##  (Intercept)  Sepal.Length  
##      3.41895      -0.06188

Shortcut for %>% is CTRL + SHIFT + M (or CMD + SHIFT + M for OSX)
A function that takes a data frame as the first argument, eg. head(), is called a verb
The entire tidyverse suite operates under the verb function structure, making piping especially convenient

2 Main verbs

2.1 `arrange()`

Sort data frame by column(s), lowest to highest

arrange(iris, Petal.Length)

If you specify a variable of class factor or character, you will rearrange the rows to alphabetical order
Use desc() if you want the opposite order

arrange(iris, desc(Species), # sort z to a since species is a factor
        Sepal.Width)

2.2 `mutate()`

Add a new variable (while preserving all existing variables)

mutate(iris, logSepLength = log(Sepal.Length))

There’s also transmute() which deletes the old column(s) you use to make the new column

iris %>% transmute(Length_diff = Sepal.Length - Petal.Length,
                   Width_diff = Sepal.Width - Petal.Width)

2.3 `filter()`

Return rows matching specified conditions
Use with >, >=, <, <=, |, !, %in%, ==, and !=. Separating conditions by , represents the & argument.

iris %>% filter(Sepal.Length >= 2)

iris %>% filter(Petal.Length >= mean(Petal.Length))

iris %>% filter(Species != "versicolor")

iris %>% filter(Species %in% c("versicolor", "setosa"))

2.4 `select()`

Keep only specified variables

select(iris, Sepal.Length, Sepal.Width)

Specify variables to exclude with -

select(iris, -Species)

Select a range of variables with :

select(iris, Sepal.Length:Petal.Length)

If you select just one column, you will still get a dataframe. If you need a vector, use pull()

pull(iris, Sepal.Length)

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
##  [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
##  [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
##  [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
##  [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
##  [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

2.5 `summarise()`

Condenses data down to one value per group

summarise(iris, mean(Petal.Length))

summarise(iris, sd_pl = sd(Petal.Length), var_pl = sd(Petal.Length)^2)

2.6 `group_by()`

Invisibly groups data by specified column(s)
Use with other verbs to get grouped information

iris %>% group_by(Species)

iris %>%
  group_by(Species) %>%
  summarise(mean(Petal.Length), sd(Petal.Length))

Data will remain grouped until you use ungroup()

2.7 `rename()`

Give your columns new names. Syntax is newColumn = oldColumn.

iris %>% rename(sep_len = Sepal.Length)

3 Helper functions

3.1 `everything()`

Move columns to the front of your data

select(iris, Species, everything())

3.2 `starts_with()`

select(iris, starts_with("Petal"))

3.3 `ends_with()`

select(iris, ends_with("Length"))

3.4 `contains()`

Searches column names for a specified string

select(iris, contains("Wid"))

3.5 `matches()`

Searches column names for a specified regular expression

select(iris, matches("wid|spec"))

For more info on regex see here

3.6 `row_number()`

Specify which row number you want for your verb

iris %>%
  group_by(Species) %>%
  filter(row_number() == 1)

3.7 `n()`

References the number of rows in your data frame (or for each group in a 'grouped_df')

iris %>%
  group_by(Species) %>%
  filter(row_number() == n())

4 Fancy verbs

Scoped verbs take the additional arguments .predicate, .funs, and .vars
They end in _at(), _if() and _all()
Signify what function (.funs) should be applied too all variables, only at certain variables (.vars), or only if variables meet a certain condition (.predicate)
ie. mutate_at(), mutate_if(), mutate_all(), summarise_if(), summarise_at(), summarise_all(), select_if(), select_at(), rename_if(), filter_all(), arrange_all(), group_by_at() …the list goes on!
A few examples (we will see more later)

iris %>% select_if(.predicate = is.numeric)

iris %>% select_if(.predicate = is.numeric, .funs=funs(paste0("num_",.))) # can also be used to rename

iris %>% summarise_if(.predicate = is.numeric, .funs = funs(mean(., na.rm=T)))

iris %>% mutate_at(.vars = vars(c("Sepal.Length","Petal.Length")),
                                         .funs = funs(scale))

# z scores by Species for all numeric variables
iris %>% group_by(Species) %>%
  mutate_if(.predicate = is.numeric, .funs = funs(scale))

I have never used any of the group_by_*() but I imagine they’re useful when you have a large selection of identifiers/grouping variables that you can call easily with a predicate (for example, is.factor). They can also be used with a .funs argument as a shortcut to group_by() %>% mutate().

5 Miscellaneous verbs

5.1 `lag()`

Makes a new column with the value of one row previously

5.2 `lead()`

Makes a new column with the value of one row ahead

x <- runif(5)
cbind(ahead = lead(x), x, behind = lag(x))

##           ahead          x     behind
## [1,] 0.01750923 0.69201174         NA
## [2,] 0.26307797 0.01750923 0.69201174
## [3,] 0.04004479 0.26307797 0.01750923
## [4,] 0.41882181 0.04004479 0.26307797
## [5,]         NA 0.41882181 0.04004479

Can choose a time column to order your new values by with the argument 'order_by'

Lag and lead are very useful for longitudinal models

5.3 `complete()`

incomplete_df <- data.frame(day = c(1,3,7,9), dose = c(0, 25, 40, 30))
incomplete_df

complete(incomplete_df, day = full_seq(1:max(day), 1))

5.4 `fill()`

Fill in missing values with values before or after

incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  fill(dose, .direction = "down")

5.5 `drop_na()`

incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  drop_na() # when you have more variables, specify which columns you care about dropping NAs from

5.6 `sample_frac()`

Randomly sample a specified fraction of rows of a data frame

mydat <- data.frame(id = sample(1:100, 20), meas = rnorm(20))
mydat

mydat %>% sample_frac(size = .5)

mydat %>% sample_frac(size= .5, replace = T) # you can also add sampling weights

5.7 `sample_n()`

Randomly sample a specified number of rows of a data frame

mydat %>% sample_n(3)

6 Joining functions

x <- data.frame(id = c("A","B","C"), val = 1:3)
y <- data.frame(id = c("A","B","D"), val = c("T","F","T"))
x

##   id val
## 1  A   1
## 2  B   2
## 3  C   3

##   id val
## 1  A   T
## 2  B   F
## 3  D   T

“Mutating” joins combine variables from the left and right hand sides ie. full_join(), inner_join(), right_join(), and left_join()

6.1 `full_join()`

Return all rows and columns

full_join(x, y, by = "id")

##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>
## 4  D    NA     T

6.2 `inner_join()`

Return all rows from x that have a match in y, and all columns from x and y

inner_join(x, y, by = "id")

##   id val.x val.y
## 1  A     1     T
## 2  B     2     F

6.3 `left_join()`

Return all rows from x and all columns from x and y

left_join(x, y, by = "id")

##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>

6.4 `right_join()`

Return all rows from x and all columns from x and y

right_join(x, y, by = "id")

##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  D    NA     T

“Filtering” joins keep cases from the left hand side, ie. semi_join() and anti_join()

6.5 `semi_join()`

Return all rows in x that have a match in y, keeping only columns from y

semi_join(x, y, by = "id")

##   id val
## 1  A   1
## 2  B   2

6.6 `anti_join()`

Return all rows from x where there are not matching values in y, keeping just the columns from x

anti_join(x, y, by = "id")

##   id val
## 1  C   3

7 Incorporating dplyr into your workflow

7.1 Frequency Tables

library(kableExtra)
mtcars %>%
  group_by(vs, am) %>%
  summarise(n=n(), freq=n()/nrow(.)) %>%
  kable(caption="Frequency Table of vs and am", format="html") %>%
  kable_styling(c("condensed","responsive","striped","hover"), full_width = F)

Frequency Table of vs and am
vs	am	n	freq
0	0	12	0.37500
0	1	6	0.18750
1	0	7	0.21875
1	1	7	0.21875

7.2 compareGroups

library(compareGroups)
iris %>%
  mutate(Sepal.Length.Sq = Sepal.Length^2) %>%
  compareGroups(Species ~ ., data = .) %>%
  createTable() %>%
  export2md() %>%
  kable_styling(c("condensed","responsive","striped","hover"))

Summary descriptives table by groups of `Species’
	setosa	versicolor	virginica	p.overall
	N=50	N=50	N=50
Sepal.Length	5.01 (0.35)	5.94 (0.52)	6.59 (0.64)	<0.001
Sepal.Width	3.43 (0.38)	2.77 (0.31)	2.97 (0.32)	<0.001
Petal.Length	1.46 (0.17)	4.26 (0.47)	5.55 (0.55)	<0.001
Petal.Width	0.25 (0.11)	1.33 (0.20)	2.03 (0.27)	<0.001
Sepal.Length.Sq	25.2 (3.55)	35.5 (6.16)	43.8 (8.44)	<0.001

7.3 ggplotting

iris %>%
  filter(Species != "setosa") %>%
  ggplot(aes(Sepal.Length, Petal.Width, col=Species)) +
  geom_point()

7.4 Finding columns of interest

When you have too many column names to look through manually, search for a string or pattern of strings

load("dplyr_dat.Rdata")
length(names(brain))

## [1] 1000

select(brain, matches("pib.*parietal.*")) %>% names() # regex is not case sensitive

## [1] "PIB.PET_AAL__Parietal_Inf_L"        
## [2] "PIB.PET_AAL__Parietal_Inf_R"        
## [3] "PIB.PET_AAL__Parietal_Sup_L"        
## [4] "PIB.PET_AAL__Parietal_Sup_R"        
## [5] "PIB.PET_FS__ctx.lh.inferiorparietal"
## [6] "PIB.PET_FS__ctx.lh.superiorparietal"
## [7] "PIB.PET_FS__ctx.rh.inferiorparietal"
## [8] "PIB.PET_FS__ctx.rh.superiorparietal"

7.5 Missing data wrangling

Add a column flagging a value as missing and then replace the missing values with the mean of the other values

misdat <- data.frame(x1 = sample(c(1:3, NA), 13, replace=T),
                     x2 = sample(c(-5:-2,NA), 13, replace=T))
misdat

misdat %>%
  mutate_all(.funs = funs(miss = ifelse(is.na(.), 1, 0))) %>%
  mutate_all(.funs = funs(replace_na(., mean(., na.rm=T))))

7.6 Longitudinal data wrangling A

Calculate the time since a patient was first admitted to the hospital

id <- c(1,1,1,2,2,2,2,3)
admit <- as.Date(c("2017-06-22", "2017-07-13", "2017-08-29",
                   "2017-04-01", "2017-05-02", "2017-11-14", "2018-01-14",
                   "2019-01-01"))
discharge <- as.Date(c("2017-06-25", "2017-07-31", "2017-10-13",
                   "2017-04-02", "2017-05-10", "2017-11-18", "2018-02-12",
                   "2019-01-05"))
hosp_dat <- data.frame(id, admit, discharge)
hosp_dat

hosp_dat %>%
  group_by(id) %>%
  mutate(days_since_init_admit = discharge - admit[1])

7.7 Longitudinal data wrangling B

Make a row for every patient for every month from the start of follow up to the end of follow up (get equally spaced time intervals)
Make another column containing the drugs the patient was on previously (for prediction or longitudinal models)

load("meds.Rdata")

meds

meds %>%
    group_by(PatientID) %>%
    complete(Months_ImplantToVisit = full_seq(1:max(Months_ImplantToVisit), 1)) %>%
    fill(ends_with("_yn")) %>%
    fill(ends_with("_yn"), .direction="up") -> meds1
meds1

meds1 %>%
  mutate_at(.vars = vars(ends_with("_yn")),
              .funs = funs(prev = lag(., order_by = Months_ImplantToVisit))) %>%
    fill(ends_with("_prev"), .direction="up")

7.8 Survival data wrangling

Map four columns “days to…outcome” to a composite endpoint for a survival model
Record when the event occurred as Days.to.first.event and which event it was in FE.status

dems %>%
  select(Days.to.lastFU, Days.to.death, Days.to.stroke, Days.to.GIB, Days.to.PT) %>%
  # which.min cannot handle NA, so we'll make NA's infinity for now
  replace(is.na(.), Inf) %>%  
  # to allow which.min to search along rows (dplyr naturally looks down columns)
  rowwise() %>%
  mutate(
    FUorFEtime = pmin(Days.to.lastFU, Days.to.death, Days.to.stroke,
                      Days.to.GIB, Days.to.PT,
                      na.rm = T),
    # numbers correspond to order of the Days* columns
    FUorFEstatus = which.min(c(Days.to.lastFU, Days.to.death, Days.to.stroke,
                                    Days.to.GIB, Days.to.PT)),
    # condensed variable for survival model, 1 if any event
  Event_yn = ifelse(FUorFEstatus == 1, 0, 1)) -> dems_int

dems_int # look at the intermediate output

dems_int %>%
  # column names correspond to the order of the columns, rename
  mutate(FUorFEstatus = case_when(FUorFEstatus == 1 ~ "censored",
                                  FUorFEstatus == 2 ~ "death",
                                  FUorFEstatus == 3 ~ "stroke",
                                  FUorFEstatus == 4 ~ "gib",
                                  FUorFEstatus == 5 ~ "pt",
                                  TRUE ~ "error")) %>%
  # not case sensitive
  select(contains("fe"), Event_yn) %>%
  # allow for joining with no duplicate cols
  rownames_to_column() %>% 
  full_join(dems %>% rownames_to_column()) %>%
  select(-rowname) -> dems_clean

# check to make sure everything worked
dems_clean %>% filter(FUorFEstatus == "error")

dems_clean

7.9 Making functions

Problem: dplyr doesn’t know what to do with quotes around the variable name, but you can’t put the column name into a function without quotes because R will try to find it as an object in your environment…
Solution: relying on the rlang package (sym, !!, !!!, etc)
A good tutorial: http://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/

irisSummary <- function(group){
    iris %>%
      group_by(!!sym(group)) %>%
      summarise(mean(Sepal.Length), sd(Sepal.Length))
}

irisSummary(group = "Species")

irisScatPlot <- function(x, y){
    iris %>%
    ggplot(aes_string(x, y, col="Species")) +
    geom_point() -> p
  return(p)
}

irisScatPlot(x="Sepal.Length",y="Petal.Length")

Data Wrangling with dplyr

WCM Biostatistics Computing Club

Katherine Hoffman, MS

March 5, 2019

1 Introduction

1.1 Why dplyr?

1.2 Tibbles

1.3 Piping

2 Main verbs

2.1 arrange()

2.2 mutate()

2.3 filter()

2.4 select()

2.5 summarise()

2.6 group_by()

2.7 rename()

3 Helper functions

3.1 everything()

3.2 starts_with()

3.3 ends_with()

3.4 contains()

3.5 matches()

3.6 row_number()

3.7 n()

4 Fancy verbs

5 Miscellaneous verbs

5.1 lag()

5.2 lead()

5.3 complete()

5.4 fill()

5.5 drop_na()

5.6 sample_frac()

5.7 sample_n()

6 Joining functions

6.1 full_join()

6.2 inner_join()

6.3 left_join()

6.4 right_join()

6.5 semi_join()

6.6 anti_join()

7 Incorporating dplyr into your workflow

7.1 Frequency Tables

7.2 compareGroups

7.3 ggplotting

7.4 Finding columns of interest

7.5 Missing data wrangling

7.6 Longitudinal data wrangling A

7.7 Longitudinal data wrangling B

7.8 Survival data wrangling

7.9 Making functions

2.1 `arrange()`

2.2 `mutate()`

2.3 `filter()`

2.4 `select()`

2.5 `summarise()`

2.6 `group_by()`

2.7 `rename()`

3.1 `everything()`

3.2 `starts_with()`

3.3 `ends_with()`

3.4 `contains()`

3.5 `matches()`

3.6 `row_number()`

3.7 `n()`

5.1 `lag()`

5.2 `lead()`

5.3 `complete()`

5.4 `fill()`

5.5 `drop_na()`

5.6 `sample_frac()`

5.7 `sample_n()`

6.1 `full_join()`

6.2 `inner_join()`

6.3 `left_join()`

6.4 `right_join()`

6.5 `semi_join()`

6.6 `anti_join()`