Introduction
Why dplyr?
Powerful but efficient
Function chaining
- Works well with entire tidyverse suite
data("iris")
library(tidyverse)
## ── Attaching packages ────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Tibbles
An “update” to the data.frame object class in R
Updates relevant for using dplyr:
Other perks: only the first 10 lines print to the screen so your console doesn’t get overloaded
Read more about tibbles here: https://r4ds.had.co.nz/tibbles.html
dplyr automatically converts objects to tbl_df (tibble data frame) objects
Piping
%>%
operator from library(magitrr)
Use %>%
to send an object (typically a dataframe) to the next function
The function you pipe to will use the object in front of the %>%
as its first argument
iris %>% head() # equivalent to head(iris)
iris %>% head(n=3) # equivalent to head(iris, n=3)
- If you don’t want it to be the first argument, use
.
iris %>% lm(Sepal.Width ~ Sepal.Length, data=.)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = .)
##
## Coefficients:
## (Intercept) Sepal.Length
## 3.41895 -0.06188
Shortcut for %>%
is CTRL + SHIFT + M (or CMD + SHIFT + M for OSX)
A function that takes a data frame as the first argument, eg. head()
, is called a verb
The entire tidyverse suite operates under the verb function structure, making piping especially convenient
Main verbs
arrange()
- Sort data frame by column(s), lowest to highest
arrange(iris, Petal.Length)
If you specify a variable of class factor or character, you will rearrange the rows to alphabetical order
Use desc()
if you want the opposite order
arrange(iris, desc(Species), # sort z to a since species is a factor
Sepal.Width)
mutate()
- Add a new variable (while preserving all existing variables)
mutate(iris, logSepLength = log(Sepal.Length))
- There’s also
transmute()
which deletes the old column(s) you use to make the new column
iris %>% transmute(Length_diff = Sepal.Length - Petal.Length,
Width_diff = Sepal.Width - Petal.Width)
filter()
Return rows matching specified conditions
Use with >
, >=
, <
, <=
, |
, !
, %in%
, ==
, and !=
. Separating conditions by ,
represents the &
argument.
iris %>% filter(Sepal.Length >= 2)
iris %>% filter(Petal.Length >= mean(Petal.Length))
iris %>% filter(Species != "versicolor")
iris %>% filter(Species %in% c("versicolor", "setosa"))
select()
- Keep only specified variables
select(iris, Sepal.Length, Sepal.Width)
- Specify variables to exclude with
-
select(iris, -Species)
- Select a range of variables with
:
select(iris, Sepal.Length:Petal.Length)
- If you select just one column, you will still get a dataframe. If you need a vector, use
pull()
pull(iris, Sepal.Length)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
summarise()
- Condenses data down to one value per group
summarise(iris, mean(Petal.Length))
summarise(iris, sd_pl = sd(Petal.Length), var_pl = sd(Petal.Length)^2)
group_by()
iris %>% group_by(Species)
iris %>%
group_by(Species) %>%
summarise(mean(Petal.Length), sd(Petal.Length))
- Data will remain grouped until you use
ungroup()
rename()
- Give your columns new names. Syntax is newColumn = oldColumn.
iris %>% rename(sep_len = Sepal.Length)
Helper functions
everything()
- Move columns to the front of your data
select(iris, Species, everything())
starts_with()
select(iris, starts_with("Petal"))
ends_with()
select(iris, ends_with("Length"))
contains()
- Searches column names for a specified string
select(iris, contains("Wid"))
matches()
- Searches column names for a specified regular expression
select(iris, matches("wid|spec"))
For more info on regex see here
row_number()
- Specify which row number you want for your verb
iris %>%
group_by(Species) %>%
filter(row_number() == 1)
n()
- References the number of rows in your data frame (or for each group in a
'grouped_df'
)
iris %>%
group_by(Species) %>%
filter(row_number() == n())
Fancy verbs
Scoped verbs take the additional arguments .predicate, .funs, and .vars
They end in _at()
, _if()
and _all()
Signify what function (.funs) should be applied too all
variables, only at
certain variables (.vars), or only if
variables meet a certain condition (.predicate)
ie. mutate_at()
, mutate_if()
, mutate_all()
, summarise_if()
, summarise_at()
, summarise_all()
, select_if()
, select_at()
, rename_if()
, filter_all()
, arrange_all()
, group_by_at()
…the list goes on!
A few examples (we will see more later)
iris %>% select_if(.predicate = is.numeric)
iris %>% select_if(.predicate = is.numeric, .funs=funs(paste0("num_",.))) # can also be used to rename
iris %>% summarise_if(.predicate = is.numeric, .funs = funs(mean(., na.rm=T)))
iris %>% mutate_at(.vars = vars(c("Sepal.Length","Petal.Length")),
.funs = funs(scale))
# z scores by Species for all numeric variables
iris %>% group_by(Species) %>%
mutate_if(.predicate = is.numeric, .funs = funs(scale))
- I have never used any of the
group_by_*()
but I imagine they’re useful when you have a large selection of identifiers/grouping variables that you can call easily with a predicate (for example, is.factor
). They can also be used with a .funs argument as a shortcut to group_by() %>% mutate()
.
Miscellaneous verbs
lag()
- Makes a new column with the value of one row previously
lead()
- Makes a new column with the value of one row ahead
x <- runif(5)
cbind(ahead = lead(x), x, behind = lag(x))
## ahead x behind
## [1,] 0.01750923 0.69201174 NA
## [2,] 0.26307797 0.01750923 0.69201174
## [3,] 0.04004479 0.26307797 0.01750923
## [4,] 0.41882181 0.04004479 0.26307797
## [5,] NA 0.41882181 0.04004479
- Can choose a time column to order your new values by with the argument
'order_by'
- Lag and lead are very useful for longitudinal models
complete()
incomplete_df <- data.frame(day = c(1,3,7,9), dose = c(0, 25, 40, 30))
incomplete_df
complete(incomplete_df, day = full_seq(1:max(day), 1))
fill()
- Fill in missing values with values before or after
incomplete_df %>%
complete(day = full_seq(1:max(day), 1)) %>%
fill(dose, .direction = "down")
drop_na()
incomplete_df %>%
complete(day = full_seq(1:max(day), 1)) %>%
drop_na() # when you have more variables, specify which columns you care about dropping NAs from
sample_frac()
- Randomly sample a specified fraction of rows of a data frame
mydat <- data.frame(id = sample(1:100, 20), meas = rnorm(20))
mydat
mydat %>% sample_frac(size = .5)
mydat %>% sample_frac(size= .5, replace = T) # you can also add sampling weights
sample_n()
- Randomly sample a specified number of rows of a data frame
mydat %>% sample_n(3)
Joining functions
x <- data.frame(id = c("A","B","C"), val = 1:3)
y <- data.frame(id = c("A","B","D"), val = c("T","F","T"))
x
## id val
## 1 A 1
## 2 B 2
## 3 C 3
y
## id val
## 1 A T
## 2 B F
## 3 D T
- “Mutating” joins combine variables from the left and right hand sides ie.
full_join()
, inner_join()
, right_join()
, and left_join()
full_join()
- Return all rows and columns
full_join(x, y, by = "id")
## id val.x val.y
## 1 A 1 T
## 2 B 2 F
## 3 C 3 <NA>
## 4 D NA T
inner_join()
- Return all rows from
x
that have a match in y
, and all columns from x
and y
inner_join(x, y, by = "id")
## id val.x val.y
## 1 A 1 T
## 2 B 2 F
left_join()
- Return all rows from
x
and all columns from x
and y
left_join(x, y, by = "id")
## id val.x val.y
## 1 A 1 T
## 2 B 2 F
## 3 C 3 <NA>
right_join()
- Return all rows from
x
and all columns from x
and y
right_join(x, y, by = "id")
## id val.x val.y
## 1 A 1 T
## 2 B 2 F
## 3 D NA T
- “Filtering” joins keep cases from the left hand side, ie.
semi_join()
and anti_join()
semi_join()
- Return all rows in
x
that have a match in y
, keeping only columns from y
semi_join(x, y, by = "id")
## id val
## 1 A 1
## 2 B 2
anti_join()
- Return all rows from
x
where there are not matching values in y
, keeping just the columns from x
anti_join(x, y, by = "id")
## id val
## 1 C 3
Incorporating dplyr into your workflow
Frequency Tables
library(kableExtra)
mtcars %>%
group_by(vs, am) %>%
summarise(n=n(), freq=n()/nrow(.)) %>%
kable(caption="Frequency Table of vs and am", format="html") %>%
kable_styling(c("condensed","responsive","striped","hover"), full_width = F)
Frequency Table of vs and am
vs
|
am
|
n
|
freq
|
0
|
0
|
12
|
0.37500
|
0
|
1
|
6
|
0.18750
|
1
|
0
|
7
|
0.21875
|
1
|
1
|
7
|
0.21875
|
compareGroups
library(compareGroups)
iris %>%
mutate(Sepal.Length.Sq = Sepal.Length^2) %>%
compareGroups(Species ~ ., data = .) %>%
createTable() %>%
export2md() %>%
kable_styling(c("condensed","responsive","striped","hover"))
Summary descriptives table by groups of `Species’
|
setosa
|
versicolor
|
virginica
|
p.overall
|
|
N=50
|
N=50
|
N=50
|
|
Sepal.Length
|
5.01 (0.35)
|
5.94 (0.52)
|
6.59 (0.64)
|
<0.001
|
Sepal.Width
|
3.43 (0.38)
|
2.77 (0.31)
|
2.97 (0.32)
|
<0.001
|
Petal.Length
|
1.46 (0.17)
|
4.26 (0.47)
|
5.55 (0.55)
|
<0.001
|
Petal.Width
|
0.25 (0.11)
|
1.33 (0.20)
|
2.03 (0.27)
|
<0.001
|
Sepal.Length.Sq
|
25.2 (3.55)
|
35.5 (6.16)
|
43.8 (8.44)
|
<0.001
|
ggplotting
iris %>%
filter(Species != "setosa") %>%
ggplot(aes(Sepal.Length, Petal.Width, col=Species)) +
geom_point()
Finding columns of interest
- When you have too many column names to look through manually, search for a string or pattern of strings
load("dplyr_dat.Rdata")
length(names(brain))
## [1] 1000
select(brain, matches("pib.*parietal.*")) %>% names() # regex is not case sensitive
## [1] "PIB.PET_AAL__Parietal_Inf_L"
## [2] "PIB.PET_AAL__Parietal_Inf_R"
## [3] "PIB.PET_AAL__Parietal_Sup_L"
## [4] "PIB.PET_AAL__Parietal_Sup_R"
## [5] "PIB.PET_FS__ctx.lh.inferiorparietal"
## [6] "PIB.PET_FS__ctx.lh.superiorparietal"
## [7] "PIB.PET_FS__ctx.rh.inferiorparietal"
## [8] "PIB.PET_FS__ctx.rh.superiorparietal"
Missing data wrangling
- Add a column flagging a value as missing and then replace the missing values with the mean of the other values
misdat <- data.frame(x1 = sample(c(1:3, NA), 13, replace=T),
x2 = sample(c(-5:-2,NA), 13, replace=T))
misdat
misdat %>%
mutate_all(.funs = funs(miss = ifelse(is.na(.), 1, 0))) %>%
mutate_all(.funs = funs(replace_na(., mean(., na.rm=T))))
Longitudinal data wrangling A
- Calculate the time since a patient was first admitted to the hospital
id <- c(1,1,1,2,2,2,2,3)
admit <- as.Date(c("2017-06-22", "2017-07-13", "2017-08-29",
"2017-04-01", "2017-05-02", "2017-11-14", "2018-01-14",
"2019-01-01"))
discharge <- as.Date(c("2017-06-25", "2017-07-31", "2017-10-13",
"2017-04-02", "2017-05-10", "2017-11-18", "2018-02-12",
"2019-01-05"))
hosp_dat <- data.frame(id, admit, discharge)
hosp_dat
hosp_dat %>%
group_by(id) %>%
mutate(days_since_init_admit = discharge - admit[1])
Longitudinal data wrangling B
Make a row for every patient for every month from the start of follow up to the end of follow up (get equally spaced time intervals)
Make another column containing the drugs the patient was on previously (for prediction or longitudinal models)
load("meds.Rdata")
meds
meds %>%
group_by(PatientID) %>%
complete(Months_ImplantToVisit = full_seq(1:max(Months_ImplantToVisit), 1)) %>%
fill(ends_with("_yn")) %>%
fill(ends_with("_yn"), .direction="up") -> meds1
meds1
meds1 %>%
mutate_at(.vars = vars(ends_with("_yn")),
.funs = funs(prev = lag(., order_by = Months_ImplantToVisit))) %>%
fill(ends_with("_prev"), .direction="up")
Survival data wrangling
dems %>%
select(Days.to.lastFU, Days.to.death, Days.to.stroke, Days.to.GIB, Days.to.PT) %>%
# which.min cannot handle NA, so we'll make NA's infinity for now
replace(is.na(.), Inf) %>%
# to allow which.min to search along rows (dplyr naturally looks down columns)
rowwise() %>%
mutate(
FUorFEtime = pmin(Days.to.lastFU, Days.to.death, Days.to.stroke,
Days.to.GIB, Days.to.PT,
na.rm = T),
# numbers correspond to order of the Days* columns
FUorFEstatus = which.min(c(Days.to.lastFU, Days.to.death, Days.to.stroke,
Days.to.GIB, Days.to.PT)),
# condensed variable for survival model, 1 if any event
Event_yn = ifelse(FUorFEstatus == 1, 0, 1)) -> dems_int
dems_int # look at the intermediate output
dems_int %>%
# column names correspond to the order of the columns, rename
mutate(FUorFEstatus = case_when(FUorFEstatus == 1 ~ "censored",
FUorFEstatus == 2 ~ "death",
FUorFEstatus == 3 ~ "stroke",
FUorFEstatus == 4 ~ "gib",
FUorFEstatus == 5 ~ "pt",
TRUE ~ "error")) %>%
# not case sensitive
select(contains("fe"), Event_yn) %>%
# allow for joining with no duplicate cols
rownames_to_column() %>%
full_join(dems %>% rownames_to_column()) %>%
select(-rowname) -> dems_clean
# check to make sure everything worked
dems_clean %>% filter(FUorFEstatus == "error")
dems_clean
Making functions
Problem: dplyr doesn’t know what to do with quotes around the variable name, but you can’t put the column name into a function without quotes because R will try to find it as an object in your environment…
- Solution: relying on the rlang package (sym, !!, !!!, etc)
A good tutorial: http://jonthegeek.com/2018/06/04/writing-custom-tidyverse-functions/
irisSummary <- function(group){
iris %>%
group_by(!!sym(group)) %>%
summarise(mean(Sepal.Length), sd(Sepal.Length))
}
irisSummary(group = "Species")
irisScatPlot <- function(x, y){
iris %>%
ggplot(aes_string(x, y, col="Species")) +
geom_point() -> p
return(p)
}
irisScatPlot(x="Sepal.Length",y="Petal.Length")