tidymodels - Feature engineering

RaukR 2023 • Advanced R for Bioinformatics

Max Kuhn

19-Jun-2023

Previously…

library(tidymodels)
library(doParallel)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

data(cells, package = "modeldata")
cells$case <- NULL

set.seed(123)
cell_split <- initial_split(cells, prop = 0.8, strata = class)
cell_tr <- training(cell_split)
cell_te <- testing(cell_split)

set.seed(123)
cell_rs <- vfold_cv(cell_tr, v = 10, strata = class)

cls_metrics <- metric_set(brier_class, roc_auc, kap)

Working with our predictors

We might want to modify our predictor columns for a few reasons:

  • The model requires them in a different format (e.g. dummy variables for lm()).
  • The model needs certain data qualities (e.g. same units for K-NN).
  • The outcome is better predicted when one or more columns are transformed in some way (a.k.a. “feature engineering”).

The first two reasons are fairly predictable (next page).

The last one depends on your modeling problem.

What is feature engineering?

Think of a feature as some representation of a predictor that will be used in a model.

Example representations:

  • Interactions
  • Polynomial expansions/splines
  • PCA feature extraction

There are a lot of examples in Feature Engineering and Selection.
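
As a rough illustration of the first two representations, here is a minimal sketch using the built-in mtcars data (an assumption for illustration; it is not the cells data used in these slides). The column choices are arbitrary.

```r
library(recipes)

# Hypothetical example on mtcars, chosen only because it ships with R
feat_rec <-
  recipe(mpg ~ wt + hp + disp, data = mtcars) %>%
  # Interaction: adds a column holding the product of wt and hp
  step_interact(terms = ~ wt:hp) %>%
  # Natural spline basis: expands disp into 3 basis columns
  step_ns(disp, deg_free = 3)

# Estimate the recipe and return the processed training data
feat_data <- feat_rec %>% prep() %>% bake(new_data = NULL)
names(feat_data)
```

The new columns are features derived from the predictors; the model only ever sees the features.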

Example: Dates

How can we represent date columns for our model?

When a date column is used in its native format, it is usually converted by an R model to an integer (for a Date column, the number of days since 1970-01-01).

It can be re-engineered as:

  • Days since a reference date
  • Day of the week
  • Month
  • Year
  • Indicators for holidays
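
One way these representations might be built with recipes is sketched below. The tiny data set, the column name `when`, the reference date, and the holiday choices are all made up for illustration.

```r
library(recipes)

# Hypothetical data: three dates and an arbitrary numeric outcome
dates <- data.frame(
  when = as.Date(c("2023-06-19", "2023-12-25", "2024-01-01")),
  y    = c(1.2, 3.4, 2.2)
)

date_rec <-
  recipe(y ~ when, data = dates) %>%
  # Days since an (arbitrary) reference date
  step_mutate(days_since = as.numeric(when - as.Date("2023-01-01"))) %>%
  # Day of the week, month, and year as separate columns
  step_date(when, features = c("dow", "month", "year")) %>%
  # 0/1 indicators for selected holidays
  step_holiday(when, holidays = c("ChristmasDay", "NewYearsDay"))

date_data <- date_rec %>% prep() %>% bake(new_data = NULL)
names(date_data)
```

Which of these features helps depends on the problem; the recipe only makes them cheap to try.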

General definitions

  • Data preprocessing steps allow your model to fit.

  • Feature engineering steps help the model do the least work to predict the outcome as well as possible.

The recipes package can handle both!

In a little bit, we’ll see successful (and unsuccessful) feature engineering methods for our example data.

Prepare your data for modeling

  • The recipes package is an extensible framework for pipeable sequences of feature engineering steps that provide preprocessing tools to be applied to data.
  • Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
  • The resulting processed output can be used as inputs for statistical or machine learning models.

A first recipe

cell_rec <-
  recipe(class ~ ., data = cell_tr)

  • The recipe() function assigns columns to roles of “outcome” or “predictor” using the formula.

summary(cell_rec)
#> # A tibble: 57 × 4
#>    variable                     type      role      source  
#>    <chr>                        <list>    <chr>     <chr>   
#>  1 angle_ch_1                   <chr [2]> predictor original
#>  2 area_ch_1                    <chr [2]> predictor original
#>  3 avg_inten_ch_1               <chr [2]> predictor original
#>  4 avg_inten_ch_2               <chr [2]> predictor original
#>  5 avg_inten_ch_3               <chr [2]> predictor original
#>  6 avg_inten_ch_4               <chr [2]> predictor original
#>  7 convex_hull_area_ratio_ch_1  <chr [2]> predictor original
#>  8 convex_hull_perim_ratio_ch_1 <chr [2]> predictor original
#>  9 diff_inten_density_ch_1      <chr [2]> predictor original
#> 10 diff_inten_density_ch_3      <chr [2]> predictor original
#> # ℹ 47 more rows

Transforming individual predictors

cell_rec <- 
  recipe(class ~ ., data = cell_tr) %>% 
  step_YeoJohnson(all_predictors())

The Yeo-Johnson (YJ) transformation can be used to produce more symmetric distributions for predictors. It is very similar to the Box-Cox transformation but also works with zero and negative values.
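
To see the effect, here is a small sketch on simulated right-skewed data (the data and the `skewness()` helper are assumptions made for illustration, not part of the slides).

```r
library(recipes)

set.seed(1)
# Simulated data: x is log-normal, so strongly right-skewed
skewed <- data.frame(x = rlnorm(500), y = rnorm(500))

# Simple moment-based skewness helper (hypothetical, for this example only)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

yj_prep <-
  recipe(y ~ x, data = skewed) %>%
  step_YeoJohnson(all_predictors()) %>%
  prep()

tidy(yj_prep, number = 1)   # the estimated transformation parameter for x

x_yj <- bake(yj_prep, new_data = NULL)$x

skew_before <- skewness(skewed$x)
skew_after  <- skewness(x_yj)
c(before = skew_before, after = skew_after)
```

The transformed column should be noticeably closer to symmetric than the original.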

Standardize predictors

pca_rec <- 
  recipe(class ~ ., data = cell_tr) %>% 
  step_YeoJohnson(all_predictors()) %>% 
  step_normalize(all_predictors())

  • This centers and scales the numeric predictors.

  • The recipe will use the training set to estimate the means and standard deviations of the data.

  • All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
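
The "no re-estimation" point can be checked by hand. This is a minimal sketch with a hypothetical split of the built-in mtcars data, not the cells data.

```r
library(recipes)

# Hypothetical train/test split, for illustration only
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

norm_prep <-
  recipe(mpg ~ wt + hp, data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()   # means and standard deviations are estimated from `train` here

tidy(norm_prep, number = 1)   # the stored training-set statistics

test_baked <- bake(norm_prep, new_data = test)

# The same values computed manually, using only training-set statistics
manual_wt <- (test$wt - mean(train$wt)) / sd(train$wt)
all.equal(test_baked$wt, manual_wt)
```

The test set contributes nothing to the means or standard deviations; it is only transformed with them.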

Convert the data to PCA components

pca_rec <- 
  recipe(class ~ ., data = cell_tr) %>% 
  step_YeoJohnson(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 10)
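
When choosing num_comp, it can help to look at how much variation the components capture. A small sketch on the built-in mtcars data (an assumption for illustration; the number of components is arbitrary):

```r
library(recipes)

pca_prep <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 3) %>%
  prep()

# Variance summaries for the PCA step (step number 2 in this recipe)
pca_var <- tidy(pca_prep, number = 2, type = "variance")
cum_pct <- pca_var$value[pca_var$terms == "cumulative percent variance"]
cum_pct

# The original predictors are replaced by the component scores
pca_data <- bake(pca_prep, new_data = NULL)
names(pca_data)
```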

Convert the data to PLS components

# step_pls() requires the mixOmics package (from Bioconductor)
pls_rec <- 
  recipe(class ~ ., data = cell_tr) %>% 
  step_YeoJohnson(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_pls(all_predictors(), outcome = vars(class), num_comp = 10)

Since PLS is supervised, we have to use the outcome argument.

Reduce correlation

filter_rec <- 
  recipe(class ~ ., data = cell_tr) %>% 
  step_YeoJohnson(all_predictors()) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, step_corr() tries to remove the smallest set of predictor columns such that all remaining pairwise absolute correlations are below the threshold.
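
A small simulated example (made-up data, for illustration only) shows the filter in action: one of two nearly identical columns should be dropped, while an unrelated column is kept.

```r
library(recipes)

set.seed(2)
n <- 200
a <- rnorm(n)
sim <- data.frame(
  y      = rnorm(n),
  a      = a,
  a_copy = a + rnorm(n, sd = 0.05),  # nearly a duplicate of `a`
  b      = rnorm(n)                  # unrelated to `a`
)

corr_prep <-
  recipe(y ~ ., data = sim) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  prep()

tidy(corr_prep, number = 1)   # lists the column(s) selected for removal

# Keep only the predictors and check the remaining correlations
kept <- bake(corr_prep, new_data = NULL, all_predictors())
kept_cor <- cor(kept)
max_abs_cor <- max(abs(kept_cor[upper.tri(kept_cor)]))
max_abs_cor
```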

Using a workflow

cell_pca_wflow <-
  workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(logistic_reg())
 
ctrl <- control_resamples(save_pred = TRUE)

set.seed(9)
cell_glm_res <-
  cell_pca_wflow %>%
  fit_resamples(cell_rs, control = ctrl, metrics = cls_metrics)

collect_metrics(cell_glm_res)
#> # A tibble: 3 × 6
#>   .metric     .estimator  mean     n std_err .config             
#>   <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 brier_class binary     0.138    10 0.00443 Preprocessor1_Model1
#> 2 kap         binary     0.559    10 0.0171  Preprocessor1_Model1
#> 3 roc_auc     binary     0.871    10 0.00877 Preprocessor1_Model1

Recipes are estimated

Preprocessing steps in a recipe use the training set to compute quantities.

What kind of quantities are computed for preprocessing?

  • Levels of a factor
  • Whether a column has zero variance
  • Normalization
  • Feature extraction
  • Effect encodings

When a recipe is part of a workflow, this estimation occurs when fit() is called.

The recipe is estimated within each resample.
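
Conceptually, this resembles the loop sketched below. It is a simplified illustration on the built-in mtcars data, not what fit_resamples() literally runs: the recipe is re-estimated on the analysis portion of each resample, so the stored statistics differ from fold to fold.

```r
library(recipes)
library(rsample)

set.seed(3)
folds <- vfold_cv(mtcars, v = 4)

rec <-
  recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# For each fold, estimate the recipe on the analysis set only and
# pull out the training mean that step_normalize() stored for wt
fold_means <- vapply(folds$splits, function(split) {
  fold_prep <- prep(rec, training = analysis(split))
  stats <- tidy(fold_prep, number = 1)
  stats$value[stats$terms == "wt" & stats$statistic == "mean"]
}, numeric(1))

fold_means   # one estimate per resample
```

Estimating inside each resample keeps the assessment data out of the preprocessing, which gives more honest performance estimates.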

Getting specific results

cell_pca_fit <-
  cell_pca_wflow %>% 
  fit(data = cell_tr)

cell_pca_fit %>% 
  extract_recipe() %>% 
  tidy(number = 1)
#> # A tibble: 52 × 3
#>    terms                    value id              
#>    <chr>                    <dbl> <chr>           
#>  1 angle_ch_1               0.787 YeoJohnson_J3XdN
#>  2 area_ch_1               -0.923 YeoJohnson_J3XdN
#>  3 avg_inten_ch_1          -0.337 YeoJohnson_J3XdN
#>  4 avg_inten_ch_2           0.425 YeoJohnson_J3XdN
#>  5 avg_inten_ch_3           0.200 YeoJohnson_J3XdN
#>  6 avg_inten_ch_4           0.220 YeoJohnson_J3XdN
#>  7 diff_inten_density_ch_1 -0.937 YeoJohnson_J3XdN
#>  8 diff_inten_density_ch_3  0.103 YeoJohnson_J3XdN
#>  9 diff_inten_density_ch_4  0.123 YeoJohnson_J3XdN
#> 10 entropy_inten_ch_1      -0.440 YeoJohnson_J3XdN
#> # ℹ 42 more rows

cell_pca_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()
#> # A tibble: 11 × 5
#>    term        estimate std.error statistic  p.value
#>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)  -1.08      0.0804   -13.4   9.76e-41
#>  2 PC01          0.426     0.0245    17.4   6.02e-68
#>  3 PC02          0.202     0.0228     8.85  8.61e-19
#>  4 PC03          0.362     0.0277    13.1   4.44e-39
#>  5 PC04         -0.103     0.0334    -3.07  2.16e- 3
#>  6 PC05         -0.242     0.0422    -5.72  1.04e- 8
#>  7 PC06         -0.145     0.0443    -3.27  1.08e- 3
#>  8 PC07          0.132     0.0539     2.45  1.42e- 2
#>  9 PC08         -0.0348    0.0499    -0.699 4.85e- 1
#> 10 PC09         -0.0438    0.0610    -0.718 4.72e- 1
#> 11 PC10          0.104     0.0676     1.53  1.25e- 1

Debugging a recipe

  • Typically, you will want to use a workflow to estimate and apply a recipe.
  • If you have an error and need to debug your recipe, the original recipe object (e.g. pca_rec) can be estimated manually with a function called prep(). It is analogous to fit(). See Tidy Modeling with R (TMwR), section 16.4.
  • Another function (bake()) is analogous to predict(), and gives you the processed data back.
  • The tidy() function can be used to get specific results from the recipe.
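
A minimal sketch of that manual workflow, using the built-in iris data as a stand-in (an assumption for illustration; the steps mirror pca_rec but with fewer components):

```r
library(recipes)

debug_rec <-
  recipe(Species ~ ., data = iris) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

# prep() estimates every step from the training data (analogous to fit())
debug_prep <- prep(debug_rec, training = iris)

# bake() applies the estimated steps to data (analogous to predict())
debug_data <- bake(debug_prep, new_data = head(iris))
debug_data

tidy(debug_prep)               # one row per step
tidy(debug_prep, number = 2)   # statistics estimated by step_normalize()
```

Running prep() step by step (adding one step at a time) is often the quickest way to find which step fails.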

More on recipes

  • Once fit() is called on a workflow, changing the model does not re-fit the recipe.
  • Some steps can be skipped when using predict() (via a step's skip argument), e.g. steps that subsample rows of the training set.
  • The order of the steps matters.
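
To illustrate why order matters, the sketch below (built-in iris data, chosen only for illustration) swaps two steps. When dummy variables are created first, they count as numeric predictors and get normalized too.

```r
library(recipes)

base_rec <- recipe(Sepal.Length ~ ., data = iris)

# Dummy variables first: the indicator columns are then normalized as well
dummy_first <-
  base_rec %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Normalize first: the indicator columns are created afterwards and stay 0/1
normalize_first <-
  base_rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

a <- dummy_first %>% prep() %>% bake(new_data = NULL)
b <- normalize_first %>% prep() %>% bake(new_data = NULL)

unique(a$Species_versicolor)   # no longer 0/1 indicators
unique(b$Species_versicolor)   # still 0/1 indicators
```

Neither order is wrong in general; the point is that the same steps in a different order give different features.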