Let’s consider some common cases…
Case 1: Mouse Knockout Experiment
Does the LDL receptor (LDLR) gene affect total plasma cholesterol?
Setup
We have access to data from an experiment:
- 10 wild type (WT) mice
- 10 LDLR knockout (KO) mice
in which the plasma concentration of total cholesterol was measured at a single time point after feeding on a high-fat diet
Visualize results
Improved visualization
- Plot KO and WT as separate groups in a box plot
- Summarize the distribution of sample values with descriptive statistics such as the median and quartiles
We will talk more about:
- exploratory data analysis
- summarizing data with descriptive statistics
- visualizing data
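As a small sketch (with simulated cholesterol values, since the actual measurements are not shown here), the descriptive statistics behind a box plot can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical plasma total cholesterol (mg/dl), 10 mice per group
wt = rng.normal(loc=120, scale=15, size=10)  # wild type
ko = rng.normal(loc=180, scale=20, size=10)  # LDLR knockout

for name, x in [("WT", wt), ("KO", ko)]:
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(f"{name}: median = {med:.1f}, quartiles = [{q1:.1f}, {q3:.1f}]")
```

The median and quartiles printed here are exactly what the box and its center line display; e.g. `matplotlib`'s `plt.boxplot([wt, ko])` would draw the corresponding side-by-side box plot.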
Interpretation
- Sometimes there is intuitively a clear difference between KO and WT based on the box plot
- Sometimes there is intuitively no difference
- And sometimes we are uncertain…
- Probability gives us a scale for measuring uncertainty
- Probability theory is fundamentally important to inferential statistical analysis
We will talk more about:
- probability theory
- discrete and continuous random variables
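As a toy illustration (the numbers are assumed, not taken from the experiment): if a WT mouse’s cholesterol level is modeled as a continuous random variable \(X \sim N(120, 15^2)\), probabilities of ranges of values follow from its distribution function:

```python
from math import erf, sqrt

def normal_cdf(x, mu=120.0, sigma=15.0):
    """CDF of a Normal(mu, sigma^2) random variable."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(90 < X < 150): probability of a value within two standard deviations
print(normal_cdf(150) - normal_cdf(90))  # ≈ 0.954

# P(X > 160): probability of a fairly extreme value under this model
print(1 - normal_cdf(160))
```

This is the same logic a statistical test uses: it quantifies how surprising an observed value is under an assumed distribution.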
Inference
- Select an appropriate test statistic \(T\) and calculate the corresponding p-value
- Draw a conclusion on whether there is enough evidence to reject \(H_0\)
We will talk more about:
- statistical tests
- using permutations, parametric tests and non-parametric tests
- multiple testing
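A permutation test for the KO vs WT comparison can be sketched as follows (simulated data; the difference in group means serves as the test statistic \(T\)):

```python
import numpy as np

rng = np.random.default_rng(0)
wt = rng.normal(120, 15, size=10)   # hypothetical WT cholesterol values
ko = rng.normal(180, 20, size=10)   # hypothetical KO cholesterol values

t_obs = ko.mean() - wt.mean()       # observed test statistic T
pooled = np.concatenate([wt, ko])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)             # relabel mice at random: valid under H0
    t_perm = pooled[10:].mean() - pooled[:10].mean()
    if abs(t_perm) >= abs(t_obs):
        count += 1

p_value = (count + 1) / (n_perm + 1)   # add-one correction
print(f"T = {t_obs:.1f}, p = {p_value:.4f}")
```

A small p-value says that random relabeling alone rarely produces a difference as large as \(T\), which is evidence against \(H_0\).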
We’ll learn that statistical tests are like criminal trials.
Two possible true states:
- Defendant committed the crime
- Defendant did not commit the crime
Two possible verdicts: guilty or not guilty
Initially the defendant is assumed to be not guilty!
- Prosecution must present evidence “beyond reasonable doubt” for a guilty verdict
- We never prove that someone is not guilty.
Same with statistical tests:
- we can reject the null hypothesis \(H_0\) if there is enough evidence given the data
- or we conclude that there is not enough evidence to reject \(H_0\)
Case 2: Protein expression
Is there a relationship between BRCA1 protein expression and mRNA expression in breast tissue?
Setup
We have access to data from a breast cancer study:
- BRCA1 protein expression based on immunohistochemical staining
- mRNA expression from RNA-seq
for 10 000 study participants
We will talk more about:
- fitting a linear model: \([Prot] = \alpha +\beta [mRNA] + \epsilon\)
- hypothesis testing
- using the model for predictions
- extending the linear model to logistic regression via GLMs, generalized linear models
- multivariate regression: \([Prot] = \alpha +\beta_1[mRNA] + \beta_2[age] + \cdots + \epsilon\)
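A least-squares fit of the linear model above can be sketched with simulated data (the \(\alpha\), \(\beta\) and noise values are illustrative assumptions, not the study’s results):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
mrna = rng.uniform(0, 10, size=n)                   # [mRNA]
prot = 1.5 + 0.8 * mrna + rng.normal(0, 1, size=n)  # [Prot] = alpha + beta*[mRNA] + eps

# Least-squares estimates of alpha and beta
X = np.column_stack([np.ones(n), mrna])             # design matrix: intercept + [mRNA]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, prot, rcond=None)
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")

# Using the fitted model for prediction
new_mrna = 5.0
print(f"predicted [Prot] at [mRNA] = {new_mrna}: {alpha_hat + beta_hat * new_mrna:.2f}")
```

Hypothesis testing then asks whether \(\beta\) differs from 0 (no relationship); the multivariate case simply adds more columns, such as age, to the design matrix.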
Case 3: predicting depression
Can we predict depression from methylation, and if so, methylation in which regions can be linked with depression?
Setup
We have access to data from a depression study:
- methylation measurements for > 850 000 sites (EPIC array)
- clinical data
for 100 participants diagnosed with major depressive disorder and 50 healthy controls
We will talk more about:
- dimensionality reduction
- clustering
- supervised learning
- regularization and Random Forests
- overfitting
- tidymodels
flowchart TD
A(Data) --> B(Data splitting)
B --> C(Feature engineering & selection)
C --> D[Choosing ML algorithms]
D --> E[Tuning & evaluating]
E --> F[Final prediction model]
E --> G[Top ranked features]
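The pipeline in the flowchart can be sketched end to end. The course itself uses tidymodels in R; this is a Python/scikit-learn sketch on simulated data (150 participants, 1 000 sites standing in for the > 850 000 EPIC sites, with signal planted in the first 10):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)

# Hypothetical data: 100 cases, 50 controls; first 10 sites informative
n, p = 150, 1000
y = np.array([1] * 100 + [0] * 50)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.0                       # shift informative sites in cases

# Dimensionality reduction: project participants onto 2 principal components
X2 = PCA(n_components=2).fit_transform(X)
# Unsupervised clustering of participants in PC space
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)

# Data splitting (tuning & evaluating would use the training part only)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Supervised learning with a Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
print("test accuracy:", acc)

# Top ranked features: candidate depression-linked methylation sites
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top sites:", top)
```

Evaluating on a held-out test set, rather than the training data, is what guards against overfitting with far more features than participants.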
Monday
- descriptive statistics
- feature engineering
- probability theory
Tuesday
Wednesday
- supervised learning
- linear models
- feature selection
Thursday
- dimensionality reduction
- clustering
Friday
- ML pipeline with tidymodels
- Random Forest
Presentations vs. chapters
Welcome