Putting everything together

Machine learning pipelines


  • Throughout the course, we have seen steps that are common in machine learning workflows, such as data cleaning, feature selection, data splitting, model training, tuning, and evaluation.

  • It is valuable to know how to code each step manually, using basic functions or selected R packages.

  • An alternative approach is to follow a structured pipeline using established frameworks.
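The manual route can be sketched in a few lines of base R. This is a minimal illustration, not the course's own code: it assumes the built-in `mtcars` data as a stand-in, `mpg` as the outcome, and a hand-picked set of predictors. Tuning is left out because a plain linear model has no hyperparameters.

```r
# Assumption: built-in mtcars stands in for the course data; mpg is the outcome.
set.seed(123)

# 1. Data cleaning: keep complete rows only
dat <- na.omit(mtcars)

# 2. Feature selection: a hand-picked subset of predictors (illustrative choice)
dat <- dat[, c("mpg", "wt", "hp", "disp")]

# 3. Data splitting: 75% training, 25% testing
n <- nrow(dat)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# 4. Model training: ordinary linear regression on the training rows
fit <- lm(mpg ~ ., data = train)

# 5. Evaluation: root mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Every step is explicit here, which is the appeal of the manual approach. It also means that any preprocessing must be repeated by hand on new data.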

Base R vs. Frameworks


🛠️ Base R / Custom Code

Pros
- Full control over each step
- Deep understanding of the process
- Flexible for non-standard workflows
- Easier to debug and customize

Cons
- More code to write
- Harder to maintain
- Manual error checking
- Less reproducible

⚙️ Framework

Pros
- Faster prototyping
- Cleaner, modular syntax
- Consistent and reproducible pipelines
- Easier collaboration and sharing

Cons
- Less transparency (black-box risk)
- Steeper learning curve at first
- May feel restrictive for custom tasks

ML frameworks


  • To help streamline the ML process, several frameworks have been developed in R.
  • One of the earlier initiatives to create a framework for ML tasks in R was the caret package, led by Max Kuhn.
  • caret (2007) unified many modeling tools behind a single interface and became a widely used framework, providing tools for preprocessing, resampling, and cross-validation.
  • Building on this foundation, Kuhn partnered with Hadley Wickham, the creator of the tidyverse.
  • Tidymodels was launched in 2020 as a modern, modular collection of R packages that applies tidyverse principles to make machine learning workflows more intuitive, readable, and consistent.
  • Other mentions: mlr3 offers a highly modular, object-oriented design suited for advanced tasks like benchmarking and custom pipelines. For deep learning, torch and its high-level interface luz bring PyTorch-style modeling to R without requiring a Python installation.

Tidymodels


Tidymodels is a collection of packages for modeling and statistical analysis in R.

  • Unified Framework: a suite of packages that share a common design philosophy and are built to streamline ML tasks.

  • Extensible and Flexible: allows users to easily integrate with other R packages and frameworks; supports a wide range of methods.

  • Emphasis on Tidy Data Principles: adheres to the principles of “tidy data” set by the tidyverse, so that data manipulation and analysis tasks are approachable and intuitive.

Minimum example


Let’s build a predictive model for BMI from our diabetes data set, using a base R approach and/or the tidymodels framework.
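As a rough sketch of what the tidymodels version of such a workflow can look like, the example below splits the data, defines preprocessing, specifies a model, and evaluates it. The diabetes data set is not reproduced here, so this assumes the built-in `mtcars` data with `mpg` standing in for BMI. The predictors and the choice of a linear model are illustrative assumptions as well.

```r
library(tidymodels)

# Assumption: mtcars and mpg stand in for the diabetes data and BMI.
set.seed(123)

# Data splitting (rsample)
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

# Preprocessing (recipes): centre and scale the numeric predictors
rec <- recipe(mpg ~ wt + hp + disp, data = train) %>%
  step_normalize(all_numeric_predictors())

# Model specification (parsnip): linear regression fitted with lm
mod <- linear_reg() %>%
  set_engine("lm")

# Bundle preprocessing and model into one object (workflows)
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

# Fit on the training set; the recipe is estimated on these rows only
wf_fit <- fit(wf, data = train)

# Predict on the test set and compute RMSE, R-squared, and MAE (yardstick)
preds <- predict(wf_fit, new_data = test) %>%
  bind_cols(test)
res <- metrics(preds, truth = mpg, estimate = .pred)
res
```

Because the recipe and the model live in one workflow object, the same preprocessing is applied automatically when predicting on new data.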