Introduction to Biostatistics and Machine Learning

22nd - 26th April 2024

Welcome

Introduction to Biostatistics and Machine Learning



  • 22nd - 26th April 2024
  • BMC, Husargatan 3, Uppsala
  • Room: Trippelrummet: E10:1307/8/9

Image by Alex Knight

About us







What about you?




First Name University
Bangzhuo Uppsala Universitet
Elzbieta Naturhistoriska Riksmuseet
Anusha Thakkutte Uppsala Universitet
Sara Umeå Universitet
Yassine Uppsala Universitet
Maximilian Kungliga Tekniska Högskolan, KTH
Stefan Chalmers
Linnéa Uppsala Universitet
Yonglong Karolinska Institutet
Subhamita Umeå Universitet
Erik Chalmers
Agata Chalmers
Vishnu Umeå Universitet
Sheng Leslie Luleå University of Technology
Anna Radboud University, Nijmegen, NL
Raphaël Kungliga Tekniska Högskolan, KTH
Gustav Karolinska Institutet
Nils Sveriges lantbruksuniversitet, SLU
Yuan Lund Universitet

Practicalities

Room



  • same room entire week
  • please drink coffee and eat snacks outside
  • bathroom locations
  • no access cards to the building (apologies!)
  • we will lock the room when going for lunch
  • lunch: Monday, Tuesday, Thursday and Friday at Bikupan
  • lunch: Wednesday at Sven Dufva

Image by Andrea de Santis

Internet



  • Eduroam
  • WiFi network UU-Guest

Course website



Canvas Demo



  • Schedule
  • Modules
  • Chat
  • Discussion
  • Files
  • Announcements
  • Quiz

Please change your name under Profile.



Certificate requirements



  1. presence in all sessions during the week
    • we may allow you to miss up to 4 h during the week
  2. completing the “Daily challenge” quiz
    • opens daily at 15:00
    • closes at 09:00 the following day
  3. active participation during the week

Note that we cannot award formal university credits (högskolepoäng). Many universities, however, recognize attendance in our courses and award 1.5 HP, corresponding to 40 h of study. It is up to participants to clarify and arrange credit transfer with the relevant university department.



Payam’s Active Participation rule



  1. there are no wrong answers
  2. except: “I do not know” or its equivalents
  3. so when asked a question, say something, because of rule 1

Image by @thisisengineering

Help us build a stimulating learning environment by actively participating. We love it when you talk and ask questions!

Not enough time



There is (probably) not enough time during the course to complete all the exercises.

  • It may be good to prioritize.
  • We can schedule a follow-up online session.
  • We will come back to this on Friday.

Generated by DALL·E 2

Course: background & aim

Background



“If you torture the data long enough, it will confess to anything”

Ronald Coase, British Economist

image by Richa Bhatia

Background



Some common problems we have been observing are due to:

  • incorrect study design, e.g. absence of adequate controls
  • incorrectly formulated null and alternative hypotheses
  • applying statistical methods without understanding them
  • misinterpreting the output of statistical methods
  • circular analysis, e.g. testing hypotheses on the same sample that led to the hypotheses in the first place

image by Richa Bhatia

Aim



We aim to focus on fundamentals since we believe that getting the basics right will help with:

  • avoiding common errors and
  • studying more advanced topics.

Aim

What do we want you to gain from this course?

  • Framework for statistical learning: learn a selection of methods and develop the understanding needed to apply them correctly and explore them independently
  • Appreciation of theory, including not being afraid of equations

How?

  • We will look at selected methods in detail
  • Over the week we will build up our understanding of different aspects of a typical data analysis project and demonstrate how to combine all the steps to build a predictive model.

Course Content

Content



Let’s consider some common cases…

Content



Case 1: Mouse Knockout Experiment

Content: Mouse Knockout Experiment

Case 1: Mouse Knockout Experiment

Does the LDL receptor gene affect total plasma cholesterol?



Setup

We have access to data from an experiment:

  • 10 wild type (WT) mice
  • 10 LDLR knockout (KO) mice

where the plasma concentration of total cholesterol was measured at one time point after feeding on a high-fat diet

Generated by DALL·E 2

Content: Mouse Knockout Experiment



Visualize results

  • Not so informative.
  • Let’s improve!

Content: Mouse Knockout Experiment



Improved visualization

  • Show KO and WT as separate boxes in a box plot
  • Visualize the distribution of sample values using descriptive statistics such as the median and quartiles

We will talk more about:

  • exploratory data analysis
  • summarizing data with descriptive statistics
  • visualizing data
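
As a minimal sketch of the descriptive statistics behind a box plot, the Python snippet below computes the minimum, quartiles, median and maximum for each group. The values in `wt` and `ko` are invented for illustration and are not the course data, and `five_number_summary` is a hypothetical helper name.

```python
import statistics

# Hypothetical total cholesterol values (arbitrary units), invented for illustration
wt = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.0, 2.7, 2.4]
ko = [3.9, 4.4, 4.1, 5.0, 4.6, 4.2, 4.8, 3.8, 4.5, 4.3]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a box plot draws."""
    # n=4 splits the data into quartiles; the middle cut point is the median
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


for label, values in (("WT", wt), ("KO", ko)):
    lowest, q1, median, q3, highest = five_number_summary(values)
    # The interquartile range (IQR) is the height of the box
    print(f"{label}: min={lowest} Q1={q1:.2f} median={median:.2f} "
          f"Q3={q3:.2f} max={highest} IQR={q3 - q1:.2f}")
```

Different software packages use slightly different quartile definitions, so small differences between tools are expected.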

Content: Mouse Knockout Experiment



Interpretation

  • Intuitively, there is a clear difference between KO and WT based on the box plot

Content: Mouse Knockout Experiment



Interpretation

  • Intuitively, there is no difference between KO and WT based on the box plot

Content: Mouse Knockout Experiment



Interpretation

  • And now we get uncertain…
  • Probability gives us a scale for measuring uncertainty
  • Probability theory is fundamentally important to inferential statistical analysis

We will talk more about:

  • probability theory
  • discrete and continuous random variables

Content: Mouse Knockout Experiment



Inference

  • Formulate the null hypothesis \(H_0\) and the alternative hypothesis \(H_a\):

    • \(H_0\): \(\mu_1 = \mu_2\)
    • \(H_a\): \(\mu_1 \neq \mu_2\)
  • Select an appropriate test statistic \(T\) and calculate the corresponding p-value
  • Conclude whether there is enough evidence to reject \(H_0\)

We will talk more about:

  • statistical tests
  • permutation tests, parametric tests and non-parametric tests
  • multiple testing
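
The inference steps above can be sketched with a two-sided permutation test, here using the difference in group means as the test statistic \(T\). This is only one possible choice of test, the data are invented for illustration, and `permutation_test` is a hypothetical helper name.

```python
import random

# Hypothetical total cholesterol values (arbitrary units), invented for illustration
wt = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.0, 2.7, 2.4]
ko = [3.9, 4.4, 4.1, 5.0, 4.6, 4.2, 4.8, 3.8, 4.5, 4.3]


def mean(values):
    return sum(values) / len(values)


def permutation_test(x, y, n_perm=10_000, seed=1):
    """Two-sided permutation test of H0: mu_1 = mu_2.

    Returns the observed difference in means and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so shuffle them
        rng.shuffle(pooled)
        diff = mean(pooled[:n_x]) - mean(pooled[n_x:])
        # Small tolerance guards against floating-point rounding in ties
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Adding one to both counts keeps the estimated p-value above zero
    return observed, (extreme + 1) / (n_perm + 1)


observed, p_value = permutation_test(ko, wt)
print(f"observed difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value means that a difference this large is rarely seen when group labels are assigned at random, which is evidence against \(H_0\). A large p-value does not prove \(H_0\).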

We’ll learn that statistical tests are like criminal trials

Two possible true states:

  1. Defendant committed the crime
  2. Defendant did not commit the crime

Two possible verdicts: guilty or not guilty

Initially the defendant is assumed to be not guilty!

  • Prosecution must present evidence “beyond reasonable doubt” for a guilty verdict
  • We never prove that someone is not guilty.

Same with statistical tests:

  • we can reject \(H_0\) if the data provide enough evidence against it
  • or we conclude that there is not enough evidence to reject \(H_0\)

Content



Case 2: Protein expression

Content: Protein expression

Case 2: Protein expression

Is there a relationship between BRCA1 protein expression and mRNA expression in breast tissue?



Setup

We have access to data from a breast cancer study:

  • BRCA1 protein expression based on immunohistochemical staining
  • mRNA expression from RNA-seq

for 10 000 study participants

Content: Protein expression

Case 2: Protein expression

Is there a relationship between BRCA1 protein expression and mRNA expression in breast tissue?



We will talk more about:

  • fitting linear model: \([Prot] = \alpha +\beta [mRNA] + \epsilon\)
  • hypothesis testing
  • using model for predictions
  • extending the linear model to logistic regression with generalized linear models (GLM)
  • multiple regression: \([Prot] = \alpha +\beta_1[mRNA] + \beta_2[age] + \cdots + \epsilon\)
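
To make the simple linear model \([Prot] = \alpha +\beta [mRNA] + \epsilon\) concrete, here is a minimal sketch that estimates \(\alpha\) and \(\beta\) by ordinary least squares and uses them for a prediction. The `mrna` and `prot` values are invented for illustration, and `fit_simple_linear_model` and `predict` are hypothetical helper names.

```python
# Hypothetical paired measurements (arbitrary units), invented for illustration
mrna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prot = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1]


def fit_simple_linear_model(x, y):
    """Ordinary least squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def predict(alpha, beta, x_new):
    """Predicted response for a new value of the predictor."""
    return alpha + beta * x_new


alpha, beta = fit_simple_linear_model(mrna, prot)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
print(f"predicted [Prot] at [mRNA] = 7: {predict(alpha, beta, 7.0):.3f}")
```

Testing whether \(\beta\) differs from zero additionally requires its standard error, which this sketch does not compute.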

Content



Case 3: predicting depression

Content: predicting depression

Case 3: predicting depression

Can we predict depression based on methylation and, if so, methylation in which regions is linked with depression?



Setup

We have access to data from a depression study:

  • methylation measurements for > 850 000 sites (EPIC array)
  • clinical data

for 100 participants diagnosed with major depressive disorder and 50 healthy controls

Content: predicting depression

Case 3: predicting depression

Can we predict depression based on methylation and, if so, methylation in which regions is linked with depression?



We will talk more about:

  • dimensionality reduction
  • clustering
  • supervised learning
    • regularization, Random Forest
    • overfitting
    • tidymodels
flowchart TD
  A(Data) --> B(Data splitting)
  B --> C(Feature engineering & selection)
  C --> D[Choosing ML algorithms]
  D --> E[Tuning & evaluating]
  E --> F[Final prediction model]
  E --> G[Top ranked features]
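
The data splitting step in the flowchart above can be sketched as a stratified split, which keeps the proportion of cases and controls similar in the training and test sets. The group sizes (100 and 50) come from the study setup. The labels `"MDD"` and `"control"`, the 20% test fraction and the helper name `stratified_split` are assumptions made for illustration. The course itself uses tidymodels in R, so this Python version only illustrates the idea.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=1):
    """Return (train_idx, test_idx), sampling the test set within each class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 100 participants with major depressive disorder and 50 healthy controls
labels = ["MDD"] * 100 + ["control"] * 50
train_idx, test_idx = stratified_split(labels)

n_test_mdd = sum(labels[i] == "MDD" for i in test_idx)
print(f"train: {len(train_idx)}, test: {len(test_idx)} "
      f"({n_test_mdd} MDD, {len(test_idx) - n_test_mdd} control)")
```

The test set is set aside before feature selection and tuning so that the final evaluation is not biased by choices made on the same samples.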
  

Content



Monday

  • descriptive statistics
  • feature engineering
  • probability theory

Tuesday

  • inferential statistics

Wednesday

  • supervised learning
  • linear models
  • feature selection

Thursday

  • dimensionality reduction
  • clustering


Friday

  • ML pipeline with tidymodels
  • Random Forest

Content



Presentations vs. chapters

Welcome

Questions?

Group discussion

Group discussion



  1. How did you find the pre-course math foundations? Try to help each other out if there are any questions.

  2. What is the main thing you’re interested in learning this week? Can you agree on one priority topic per group?

Questions?