What is tidymodels?

(Thanks Kat!)

A meta-package composed of sub-packages that are desinged to make clean up the data wrangling and models-fitting processes
Consistent with the tidyverse syntax

Why should I use tidymodels?

It’s tidy! Clean code (less bugs)
It’s designed to make your life easier!
Fewer packages to find and remember
Less syntax to look up/understand (if you learn the tidymodels syntax)
It’s desinged to alleviate the programming burden of complex, iterative modeling and data wrangling tasks

Is it actually going to make my life easier?

Maybe…

If you know you’re going to be doing a very iterative modeling process
- If you’re using lots of different types of models
- If you’re tuning model parameters
- If you’re working through a rigorous variable selection process
- If you’re going to be using cross-validation to select a best fit
If you’re using machine learning/non-parametric models
If the purpose of your model is prediction
If you’re working with a large dataset
Any time you’re fitting the same model over and over again
Any time you’re conducting the same data wrangling process over and over again

Some key packages

library(tidymodels)
library(stringr)

The dataset we’ll be working with is found at http://insideairbnb.com/get-the-data.html.

temp = tempfile()
download.file("http://data.insideairbnb.com/united-states/ny/new-york-city/2019-06-02/data/listings.csv.gz", temp)

airbnb_data = gzfile(temp, 'rt')%>%
  read.csv()%>%
  as_tibble()

nrow(airbnb_data)

## [1] 48801

head(airbnb_data)

recipes

Creating a recipe

Step 1: Specify your model and dataset.

rec = recipe(price ~  host_since + 
               host_response_time + 
               host_response_rate + 
               host_is_superhost +
               host_listings_count + 
               host_has_profile_pic + 
               neighbourhood_group_cleansed + 
               zipcode + 
               room_type +
               accommodates + 
               bathrooms +
               bedrooms + 
               beds + 
               bed_type + 
               square_feet + 
               security_deposit +
               cleaning_fee + 
               guests_included +
               extra_people + 
               maximum_nights + 
               minimum_nights + 
               availability_365 +
               number_of_reviews + 
               last_review +
               review_scores_rating +
               review_scores_accuracy +
               review_scores_cleanliness +
               review_scores_checkin +
               review_scores_communication +
               review_scores_location +
               review_scores_value + 
               instant_bookable + 
               is_business_travel_ready +
               cancellation_policy + 
               require_guest_profile_picture + 
               require_guest_phone_verification + 
               calculated_host_listings_count + 
               reviews_per_month,
             data = airbnb_data)

rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38

Recipe attributes:

- `var_info`
- `term_info`
- `steps`
- `template`
- `levels`
- `retained`

rec$var_info

rec$template

Step 2: Specify some data cleaning steps (write the recipe)

rec = rec %>%
  step_filter(neighbourhood_group_cleansed != "Staten Island")%>%
  
  step_mutate(price = as.character(price)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              cleaning_fee = as.character(cleaning_fee)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              security_deposit = as.character(security_deposit)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              extra_people = as.character(extra_people)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              host_since = (Sys.Date()-as.Date(host_since))%>%
                as.numeric(),
              
              host_response_rate = as.character(host_response_rate)%>%
                stringr::str_remove("%")%>%
                as.numeric(),
              
              host_response_time = ifelse(host_response_time == "N/A", NA, host_response_time))

rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Operations:
## 
## Row filtering
## Variable mutation for  price, cleaning_fee, security_deposit, extra_people, host_since, host_response_rate, host_response_time

Step 3: Actually run the data cleaning steps (prep the recipe ingredients)

retain = TRUE overwrites the previous template

rec = prep(rec, retain = TRUE)

## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

Step 4: Return a cleaned and ready to use tibble (bake!)

cleaned_airbnb_data = bake(rec, new_data = airbnb_data)
cleaned_airbnb_data

Some usefull features

Imputation

rec1 = rec%>%
  step_bagimpute(host_response_rate,
                 impute_with = imp_vars(host_since,
                                        host_response_time,
                                        host_is_superhost,
                                        host_listings_count,
                                        host_has_profile_pic),
                 options = list(nbagg = 5, keepX = FALSE))

rec1

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 48801 data points and 48439 incomplete rows. 
## 
## Operations:
## 
## Row filtering [trained]
## Variable mutation for  price, cleaning_fee, security_deposit, extra_people, host_since, host_response_rate, host_response_time [trained]
## Bagged tree imputation for host_response_rate

prep(rec1)$template

Other imputation methods:

step_knnimput()
step_lowerimpute()
step_meanimpute()
step_medianimpute()
step_modeimpute()
step_rollimpute()

Selector functions

rec2 = rec%>%
  step_bagimpute(host_response_rate,
                 impute_with = imp_vars(starts_with("host")))

Other selector functions

has_role()
has_type()
all_numeric()
all_predictors()
all_outcomes()
everything()
ends_with()
contains()
matches()
num_range()
one_of()

rec3 = rec%>%
  step_bagimpute(host_response_rate,
                 impute_with = imp_vars(starts_with("host") - all_nominal()))

Variable roles

rec1

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 48801 data points and 48439 incomplete rows. 
## 
## Operations:
## 
## Row filtering [trained]
## Variable mutation for  price, cleaning_fee, security_deposit, extra_people, host_since, host_response_rate, host_response_time [trained]
## Bagged tree imputation for host_response_rate

rec1 = rec1 %>%
  add_role(neighbourhood_group_cleansed, new_role = "strata")

rec1

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
##     strata          1
## 
## Training data contained 48801 data points and 48439 incomplete rows. 
## 
## Operations:
## 
## Row filtering [trained]
## Variable mutation for  price, cleaning_fee, security_deposit, extra_people, host_since, host_response_rate, host_response_time [trained]
## Bagged tree imputation for host_response_rate

Principle component analysis

rec1 = rec1 %>%
  step_meanimpute(starts_with("review"))%>%
  step_center(starts_with("review"))%>%
  step_scale(starts_with("review"))%>%
  step_pca(starts_with("review"), threshold = 0.95)%>%
  prep()


rec1$term_info

Other dimensionality reduction methods:

kernel PCA
ICA
Isomap
data depth features
class distances

Some other useful steps

`step_naomit()`

rec1 = rec1%>%
  step_naomit(all_predictors())

`step_BoxCox()`

rec1 = rec1%>%
  step_BoxCox(price)

`step_corr()`

rec1 = rec1%>%
  step_corr(all_numeric(), threshold = 0.5)

prep(rec1)

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
##     strata          1
## 
## Training data contained 48445 data points and 48232 incomplete rows. 
## 
## Operations:
## 
## Row filtering [trained]
## Variable mutation for  price, cleaning_fee, security_deposit, extra_people, host_since, host_response_rate, host_response_time [trained]
## Bagged tree imputation for host_response_rate [trained]
## Mean Imputation for 8 items [trained]
## Centering for 8 items [trained]
## Scaling for 8 items [trained]
## PCA extraction with 8 items [trained]
## Removing rows with NA values in all_predictors()
## Box-Cox transformation on price [trained]
## Correlation filter removed bedrooms, ... [trained]

`add_step()`

For for an example walking through how to create your own step, see https://tidymodels.github.io/recipes/articles/Custom_Steps.html

Our fully baked recipe

What we accomplished:

filtering dataset
cleaning up variables
imputed missing data for several variables
performed PCA
removed correlated data

cleaned_airbnb_data = recipe(price ~  host_since + 
               host_response_time + 
               host_response_rate + 
               host_is_superhost +
               host_listings_count + 
               host_has_profile_pic + 
               neighbourhood_group_cleansed + 
               zipcode + 
               room_type +
               accommodates + 
               bathrooms +
               bedrooms + 
               beds + 
               bed_type + 
               square_feet + 
               security_deposit +
               cleaning_fee + 
               guests_included +
               extra_people + 
               maximum_nights + 
               minimum_nights + 
               availability_365 +
               number_of_reviews + 
               last_review +
               review_scores_rating +
               review_scores_accuracy +
               review_scores_cleanliness +
               review_scores_checkin +
               review_scores_communication +
               review_scores_location +
               review_scores_value + 
               instant_bookable + 
               is_business_travel_ready +
               cancellation_policy + 
               require_guest_profile_picture + 
               require_guest_phone_verification + 
               calculated_host_listings_count + 
               reviews_per_month,
             data = airbnb_data)%>%
  

  step_filter(neighbourhood_group_cleansed != "Staten Island")%>%
  
  step_mutate(price = as.character(price)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              cleaning_fee = as.character(cleaning_fee)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              security_deposit = as.character(security_deposit)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              extra_people = as.character(extra_people)%>%
                stringr::str_remove("\\$")%>%
                as.numeric(),
              
              host_since = (Sys.Date()-as.Date(host_since))%>%
                as.numeric(),
              
              host_response_rate = as.character(host_response_rate)%>%
                stringr::str_remove("%")%>%
                as.numeric(),
              
              host_response_time = ifelse(host_response_time == "N/A", NA, host_response_time))%>%
  
  
  step_bagimpute(host_response_rate,
                 impute_with = imp_vars(host_since,
                                        host_response_time,
                                        host_is_superhost,
                                        host_listings_count,
                                        host_has_profile_pic),
                 options = list(nbagg = 5, keepX = FALSE))%>%
  
  add_role(neighbourhood_group_cleansed, new_role = "strata")%>%
  
  step_meanimpute(starts_with("review"))%>%
  
  step_center(starts_with("review"))%>%
  
  step_scale(starts_with("review"))%>%
  
  step_pca(starts_with("review"), threshold = 0.95)%>%
  
  step_naomit(everything())%>%
  
  step_corr(all_numeric(), threshold = 0.5)%>%
  
  step_mutate(price_greater_than_100 = factor(ifelse(price > 100, 1, 0)))%>%
  
  prep()%>%
  
  bake(new_data = airbnb_data)
    
cleaned_airbnb_data

parsnip

Syntax

Fit some simple models

logistic_regression = logistic_reg()%>% #define the model you're going to fit 
  set_engine("glm")%>% #choose the package that you want to use to fit the model 
  fit(price_greater_than_100 ~  host_is_superhost+
        host_response_rate+
        room_type+
        host_listings_count+
        number_of_reviews+
        PC1, data = cleaned_airbnb_data) #make a call to the data


linear_regression = linear_reg()%>% 
  set_engine("lm")%>% 
  fit_xy(x = cleaned_airbnb_data %>% 
           select(host_is_superhost,
                  host_response_rate,
                  room_type,
                  host_listings_count,
                  number_of_reviews,
                  PC1), 
         y = cleaned_airbnb_data$price)

tidy(linear_regression, conf.int = TRUE)

Works like recipes() and ggplot() - you can start specifying a model outline (“recipe”) before you make any calls to a dataset.

logistic_regression = logistic_reg()%>% 
  set_engine("glm")

logistic_regression

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

translate() function lets you view the syntax of the underlying modeling function.

logistic_regression%>%
  translate()

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm 
## 
## Model fit template:
## stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(), 
##     family = stats::binomial)

Methods implemented in `parsnip`

linear regression glmnet keras lm spark stan
logistic regression glm glmnet keras lm spark stan
multivariate adaptive regression earth
multinomial regression glmnet keras spark
parametric survival regression flexsurv survreg
boosted trees C5.0 spark xgboost
random forests randomForest ranger spark
decision trees C5.0 spark rpart
neural nets keras nnet
k nearest neighbors kknn
support vector machines kernlab

A note on survival modeling.

parsnip is only implemented for parametric time-to-event models (Weibull, exponential, etc.)

Important features

Making predictions without `parsnip` :(

airbnb_data = airbnb_data %>%
  mutate(price = as.character(price)%>%
           stringr::str_remove("\\$")%>%
           as.numeric())

nrow(airbnb_data)

## [1] 48801

predict_with_lm = lm(price %>% as.character()%>%
                       stringr::str_remove("\\$")%>%
                       as.numeric() ~ review_scores_accuracy+
                       review_scores_cleanliness+
                       review_scores_checkin+
                       review_scores_communication+
                       review_scores_location+
                       review_scores_value,
                     data = airbnb_data)

length(predict(predict_with_lm))

## [1] 37489

Making predictions WITH `parsnip` :)

predict_with_parsnip = linear_reg()%>% 
  set_engine("lm")%>% 
  fit(price ~ review_scores_accuracy+
                       review_scores_cleanliness+
                       review_scores_checkin+
                       review_scores_communication+
                       review_scores_location+
                       review_scores_value,
                     data = airbnb_data)%>%
  predict(airbnb_data)

predict_with_parsnip

Making prediction with `parsnip` when there are multiple predictions at each observation

penalized_regression = linear_reg()%>%
  set_engine("glmnet")%>%
  fit(price ~ review_scores_accuracy+
                       review_scores_cleanliness+
                       review_scores_checkin+
                       review_scores_communication+
                       review_scores_location+
                       review_scores_value, airbnb_data)

predicted_values = penalized_regression%>%
  multi_predict(airbnb_data)

predicted_values

predicted_values$.pred[2]

## [[1]]
## # A tibble: 64 x 2
##    penalty .pred
##      <dbl> <dbl>
##  1  0.0416  149.
##  2  0.0457  149.
##  3  0.0502  149.
##  4  0.0550  149.
##  5  0.0604  149.
##  6  0.0663  149.
##  7  0.0728  149.
##  8  0.0799  149.
##  9  0.0876  149.
## 10  0.0962  149.
## # ... with 54 more rows

unnested = predicted_values%>%
  unnest()

unnested

nrow(unnested)

## [1] 3123264

An Example: Stratified analysis

Let’s use `parsnip` and `recipes` together to conduct a stratified analysis

Step 1: Define a simple recipe

new_recipe = recipe(price ~ number_of_reviews +
                      review_scores_rating +
                      review_scores_accuracy +
                      review_scores_cleanliness +
                      review_scores_checkin +
                      review_scores_communication +
                      review_scores_location +
                      review_scores_value+
                      neighbourhood_group_cleansed, 
                    data = airbnb_data)%>%
  add_role(neighbourhood_group_cleansed, new_role = "strata")%>%
  step_mutate(price = as.character(price)%>%
                stringr::str_remove("\\$")%>%
                as.numeric())%>%
  prep()

Step 2: Write a wrapper function for your preprocessing and modeling recipes

get_linear_reg = function( recipe, data, strata, ...){
  prepped_data = recipe%>% 
    step_filter(neighbourhood_group_cleansed == strata, role = "strata")%>%
    prep()
  
  linear_regression = linear_reg()%>% 
  set_engine("lm")%>% 
  fit_xy(x = bake(prepped_data, data, all_predictors(), - has_role("strata")), 
         y = bake(prepped_data, data, all_outcomes()))
  
}

Step 3: Conduct your analysis

get_linear_reg(new_recipe, airbnb_data, "Manhattan")%>%
  tidy(conf.int = T)

Using `recipes` and `parsnip` for a nested stratified analysis

Step 1: Modify our wrapper

get_linear_reg = function(data, recipe,  ...){

  linear_regression = linear_reg()%>% 
  set_engine("lm")%>% 
  fit_xy(x = bake(recipe, data, all_predictors(), - has_role("strata")), 
         y = bake(recipe, data, all_outcomes()))
  
}

Nest the data

strat_analysis = airbnb_data%>%
  group_by(neighbourhood_group_cleansed)%>%
  nest()
strat_analysis

Step 2: Map the model to the strata

strat_analysis = airbnb_data%>%
  group_by(neighbourhood_group_cleansed)%>%
  nest()%>%
  mutate(stratified_models = map(data, get_linear_reg, new_recipe, airbnb_data))
strat_analysis

Step 3: Take a look at each model

strat_analysis$stratified_models[[1]]%>%
  tidy(conf.int = T)

rsample

Syntax

Creating a bootstrap

set.seed(2342)
airbnb_bootstraps = bootstraps(cleaned_airbnb_data, times = 10, strata = NULL)
airbnb_bootstraps

In the first item in the list, 213 is the size of the bootstrap sample, 82 is the number of out-of-bag observations, and 213 is the size of the original dataset

airbnb_bootstraps$splits

## [[1]]
## <213/82/213>
## 
## [[2]]
## <213/76/213>
## 
## [[3]]
## <213/84/213>
## 
## [[4]]
## <213/73/213>
## 
## [[5]]
## <213/79/213>
## 
## [[6]]
## <213/80/213>
## 
## [[7]]
## <213/83/213>
## 
## [[8]]
## <213/79/213>
## 
## [[9]]
## <213/76/213>
## 
## [[10]]
## <213/82/213>

Each element is an rsplit object

class(airbnb_bootstraps$splits[[1]])

## [1] "rsplit"     "boot_split"

analysis() gives the bootstraped set as a dataframe

analysis(airbnb_bootstraps$splits[[1]])

assessment() gives the out-of-bag set as a dataframe

assessment(airbnb_bootstraps$splits[[1]])

Creating a bootstrap with strata

set.seed(346234)

strat_airbnb_boots =  bootstraps(cleaned_airbnb_data, times = 100, strata = "neighbourhood_group_cleansed")
strat_airbnb_boots

Creating data folds (for cross validation)

set.seed(86869)
airbnb_folds = vfold_cv(cleaned_airbnb_data, times = 10)
airbnb_folds

Some other re-sampling methods

Rolling origin forecast resampling for time to event data rolling_origin()
Mone Carlo cross validation mc_cv()
Nested cross validation nested_cv()

Useful functions

rsample2caret()
caret2rsample()
as.data.frame.rsplit()

An example: bootstrapped confidence intervals

Step 1: Create `fit_model()` as a wrapper for `lm()`

fit_model = function(data, formula){
  lm(formula, data)%>%
    tidy()
}

fit_model(cleaned_airbnb_data, 
          formula(price ~ host_is_superhost+
                    host_response_rate+
                    room_type+
                    host_listings_count+
                    number_of_reviews+
                    PC1))

Step 2: Map `fit_model()` onto bootstrap sets

boot_cis = strat_airbnb_boots %>%
  mutate(fitted_models = map(splits, fit_model, 
                             formula = formula(price ~ host_is_superhost+
                                                 host_response_rate+
                                                 room_type+
                                                 host_listings_count+
                                                 number_of_reviews+
                                                 PC1)))
boot_cis

Step 3: Unnest the model fits

boot_cis = boot_cis %>%
  select(-c(splits, id))%>%
  unnest()
boot_cis

Step 4: Select the parameter of interest and find upper and lower quantiles

boot_cis_private_room = boot_cis%>%
  filter(term == "room_typePrivate room")

quantile(boot_cis_private_room$estimate, probs = c(0.025, 0.975))

##       2.5%      97.5% 
## -136.41397  -88.85459

How do the bootstrapped estimates compare to the parametric estimates?

lm(price ~ host_is_superhost+
                  host_response_rate+
                  room_type+
                  host_listings_count+
                  number_of_reviews+
                  PC1,
   data = cleaned_airbnb_data)%>%
  tidy(conf.int = T)%>%
  select(term, conf.low, conf.high)

Intro to tidymodels

Imaani Easthausen

July 16, 2019

What is tidymodels?

Why should I use tidymodels?

Is it actually going to make my life easier?

Some key packages

recipes

Creating a recipe

Step 1: Specify your model and dataset.

Recipe attributes:

Step 2: Specify some data cleaning steps (write the recipe)

Step 3: Actually run the data cleaning steps (prep the recipe ingredients)

Step 4: Return a cleaned and ready to use tibble (bake!)

Some usefull features

Imputation

Other imputation methods:

Selector functions

Other selector functions

Variable roles

Principle component analysis

Other dimensionality reduction methods:

Some other useful steps

step_naomit()

step_BoxCox()

step_corr()

add_step()

Our fully baked recipe

What we accomplished:

parsnip

Syntax

Fit some simple models

Methods implemented in parsnip

A note on survival modeling.

Important features

Making predictions without parsnip :(

Making predictions WITH parsnip :)

Making prediction with parsnip when there are multiple predictions at each observation

An Example: Stratified analysis

Let’s use parsnip and recipes together to conduct a stratified analysis

Using recipes and parsnip for a nested stratified analysis

rsample

Syntax

Creating a bootstrap

Creating a bootstrap with strata

Creating data folds (for cross validation)

Some other re-sampling methods

Useful functions

An example: bootstrapped confidence intervals

Step 1: Create fit_model() as a wrapper for lm()

Step 2: Map fit_model() onto bootstrap sets

Step 3: Unnest the model fits

Step 4: Select the parameter of interest and find upper and lower quantiles

How do the bootstrapped estimates compare to the parametric estimates?

References

`step_naomit()`

`step_BoxCox()`

`step_corr()`

`add_step()`

Methods implemented in `parsnip`

Making predictions without `parsnip` :(

Making predictions WITH `parsnip` :)

Making prediction with `parsnip` when there are multiple predictions at each observation

Let’s use `parsnip` and `recipes` together to conduct a stratified analysis

Using `recipes` and `parsnip` for a nested stratified analysis

Step 1: Create `fit_model()` as a wrapper for `lm()`

Step 2: Map `fit_model()` onto bootstrap sets