Let’s consider some common cases…
Case 1: Mouse Knockout Experiment
Does the LDL receptor (LDLR) gene affect total plasma cholesterol?
Setup
We have access to data from an experiment:
- 10 wild type (WT) mice
- 10 LDLR knockout (KO) mice
in which the plasma concentration of total cholesterol was measured at a single time point after feeding on a high-fat diet
Visualize results
Improved visualization
- Plot KO and WT as separate groups in a box plot
- Summarize the distribution of sample values with descriptive statistics such as the median and quartiles
We will talk more about:
- exploratory data analysis
- summarizing data with descriptive statistics
- visualizing data
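As a small sketch (with simulated cholesterol values, since the actual measurements are not shown here), the descriptive statistics behind a box plot can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical plasma total cholesterol (mg/dl), 10 mice per group
wt = rng.normal(loc=120, scale=15, size=10)  # wild type
ko = rng.normal(loc=180, scale=20, size=10)  # LDLR knockout

for name, x in [("WT", wt), ("KO", ko)]:
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(f"{name}: median = {med:.1f}, quartiles = [{q1:.1f}, {q3:.1f}]")
```

The median and quartiles printed here are exactly what the box and its center line display; e.g. `matplotlib`'s `plt.boxplot([wt, ko])` would draw the corresponding side-by-side box plot.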
Interpretation
- Sometimes there is intuitively a clear difference between KO and WT based on the box plot
- Sometimes there is intuitively no difference
- And sometimes we are uncertain…
- Probability gives us a scale for measuring uncertainty
- Probability theory is fundamentally important to inferential statistical analysis
We will talk more about:
- probability theory
- discrete and continuous random variables
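As a toy illustration (the numbers are assumed, not taken from the experiment): if a WT mouse’s cholesterol level is modeled as a continuous random variable \(X \sim N(120, 15^2)\), probabilities of ranges of values follow from its distribution function:

```python
from math import erf, sqrt

def normal_cdf(x, mu=120.0, sigma=15.0):
    """CDF of a Normal(mu, sigma^2) random variable."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(90 < X < 150): probability of a value within two standard deviations
print(normal_cdf(150) - normal_cdf(90))  # ≈ 0.954

# P(X > 160): probability of a fairly extreme value under this model
print(1 - normal_cdf(160))
```

This is the same logic a statistical test uses: it quantifies how surprising an observed value is under an assumed distribution.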
Inference
- Select an appropriate test statistic \(T\) and calculate the corresponding p-value
- Draw a conclusion on whether there is enough evidence to reject \(H_0\)
We will talk more about:
- statistical tests
- using permutations, parametric tests and non-parametric tests
- multiple testing
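A permutation test for the KO vs WT comparison can be sketched as follows (simulated data; the difference in group means serves as the test statistic \(T\)):

```python
import numpy as np

rng = np.random.default_rng(0)
wt = rng.normal(120, 15, size=10)   # hypothetical WT cholesterol values
ko = rng.normal(180, 20, size=10)   # hypothetical KO cholesterol values

t_obs = ko.mean() - wt.mean()       # observed test statistic T
pooled = np.concatenate([wt, ko])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)             # relabel mice at random: valid under H0
    t_perm = pooled[10:].mean() - pooled[:10].mean()
    if abs(t_perm) >= abs(t_obs):
        count += 1

p_value = (count + 1) / (n_perm + 1)   # add-one correction
print(f"T = {t_obs:.1f}, p = {p_value:.4f}")
```

A small p-value says that random relabeling alone rarely produces a difference as large as \(T\), which is evidence against \(H_0\).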
We’ll learn that statistical tests are like criminal trials.
Two possible true states:
- Defendant committed the crime
- Defendant did not commit the crime
Two possible verdicts: guilty or not guilty
Initially the defendant is assumed to be not guilty!
- Prosecution must present evidence “beyond reasonable doubt” for a guilty verdict
- We never prove that someone is not guilty.
Same with statistical tests:
- we can reject the null hypothesis \(H_0\) if there is enough evidence given the data
- or we conclude that there is not enough evidence to reject \(H_0\)
Case 2: Protein expression
Is there a relationship between BRCA1 protein expression and mRNA expression in breast tissue?
Setup
We have access to data from a breast cancer study:
- BRCA1 protein expression based on immunohistochemical staining
- mRNA expression from RNA-seq
for 10 000 study participants
We will talk more about:
- fitting a linear model: \([Prot] = \alpha +\beta [mRNA] + \epsilon\)
- hypothesis testing
- using the model for predictions
- extending the linear model to logistic regression via GLMs, generalized linear models
- multivariate regression: \([Prot] = \alpha +\beta_1[mRNA] + \beta_2[age] + \cdots + \epsilon\)
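A least-squares fit of the linear model above can be sketched with simulated data (the \(\alpha\), \(\beta\) and noise values are illustrative assumptions, not the study’s results):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
mrna = rng.uniform(0, 10, size=n)                   # [mRNA]
prot = 1.5 + 0.8 * mrna + rng.normal(0, 1, size=n)  # [Prot] = alpha + beta*[mRNA] + eps

# Least-squares estimates of alpha and beta
X = np.column_stack([np.ones(n), mrna])             # design matrix: intercept + [mRNA]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, prot, rcond=None)
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")

# Using the fitted model for prediction
new_mrna = 5.0
print(f"predicted [Prot] at [mRNA] = {new_mrna}: {alpha_hat + beta_hat * new_mrna:.2f}")
```

Hypothesis testing then asks whether \(\beta\) differs from 0 (no relationship); the multivariate case simply adds more columns, such as age, to the design matrix.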
Case 3: predicting depression
Can we predict depression from methylation, and if so, methylation in which regions can be linked with depression?
Setup
We have access to data from a depression study:
- methylation measurements for > 850 000 sites (EPIC array)
- clinical data
for 100 participants diagnosed with major depressive disorder and 50 healthy controls
We will talk more about:
- dimensionality reduction
- clustering
- supervised learning
- regularization and Random Forests
- overfitting
- tidymodels
flowchart TD
A(Data) --> B(Data splitting)
B --> C(Feature engineering & selection)
C --> D[Choosing ML algorithms]
D --> E[Tuning & evaluating]
E --> F[Final prediction model]
E --> G[Top ranked features]
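The pipeline in the flowchart can be sketched end to end. The course itself uses tidymodels in R; this is a Python/scikit-learn sketch on simulated data (150 participants, 1 000 sites standing in for the > 850 000 EPIC sites, with signal planted in the first 10):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)

# Hypothetical data: 100 cases, 50 controls; first 10 sites informative
n, p = 150, 1000
y = np.array([1] * 100 + [0] * 50)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.0                       # shift informative sites in cases

# Dimensionality reduction: project participants onto 2 principal components
X2 = PCA(n_components=2).fit_transform(X)
# Unsupervised clustering of participants in PC space
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)

# Data splitting (tuning & evaluating would use the training part only)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Supervised learning with a Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
print("test accuracy:", acc)

# Top ranked features: candidate depression-linked methylation sites
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top sites:", top)
```

Evaluating on a held-out test set, rather than the training data, is what guards against overfitting with far more features than participants.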
Monday
- descriptive statistics
- feature engineering
- probability theory
Tuesday
Wednesday
- supervised learning
- linear models
- feature selection
Thursday
- dimensionality reduction
- clustering
Friday
- ML pipeline with tidymodels
- Random Forest
Presentations vs. chapters
Welcome