class: center, middle, inverse, title-slide

.title[
# Brief introduction to statistics
]
.subtitle[
## Statistics
]
.author[
### Nima Rafati
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ------------ Only edit title, subtitle & author above this ------------ -->

---
name: intro

# Introduction

**Why do we need statistics in our analysis?**

- Make data understandable and insightful.
- Evaluate patterns and trends.
- Identify and quantify differences/similarities between groups.

--

**Types of statistics:**

- Descriptive statistics: To summarize and describe the main features of a dataset (mean, median, ...).
- Inferential statistics: To make predictions or inferences about a population using a sample of data (hypothesis testing, regression analysis, ...).
- Predictive statistics: To make predictions about future outcomes based on collected data (regression models, time series forecasting, machine learning, ...).
- ......

---
name: Descriptive

# Types of Descriptive Statistics

Descriptive statistics help to:

- Summarize and describe the data.
- Visualize the data.
- Identify patterns (trends) and outliers in the data.
- Provide insights for downstream analysis.

---
name: SomeStats

# Some of the basic descriptive statistics

1. **Measures of Central Tendency**
    - Mean, Median, Mode.
2. **Measures of Spread**
    - Range, Interquartile Range, Standard Deviation, Variance.
3. **Correlation**
    - Relation between two variables (e.g. Pearson's correlation).

---
name: Mean

# Central Tendency: Mean

- Mean: The average value of the data, i.e. the sum of all values divided by the number of values.

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

<img src="slide_r_basic_statistic_files/figure-html/Mean-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

``` r
mean(data$var1)
mean(data$var2)
```

```
## [1] 47.46736
## [1] 100.081
```

---
name: Median

# Central Tendency: Median

- Median: The middle value when the data are sorted.

<img src="slide_r_basic_statistic_files/figure-html/Median-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

``` r
median(data$var1)
median(data$var2)
```

```
## [1] 40.06529
## [1] 100.0426
```

---
name: Mode

# Central Tendency: Mode

- Mode: The most frequently occurring value. For continuous data it can be estimated as the peak of the density curve.

<img src="slide_r_basic_statistic_files/figure-html/Mode-plot-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

``` r
# Estimate the mode as the x value at which the kernel density peaks
density_data1 <- density(data$var1)
density_data1$x[which.max(density_data1$y)]
density_data2 <- density(data$var2)
density_data2$x[which.max(density_data2$y)]
```

```
## [1] 26.8621
## [1] 100.6456
```

---
name: Spread

# Measures of spread: Range and Interquartile Range

- Range: The difference between the maximum (`max(data$var2)`) and the minimum (`min(data$var2)`).
- Interquartile Range: The **Quartiles** split the sorted data into four equally sized groups (bins). The distance between the first quartile (Q1) and the third quartile (Q3) is called the **Interquartile Range** (IQR); it covers the middle 50% of the data.

<img src="slide_r_basic_statistic_files/figure-html/range-1.png" width="504" style="display: block; margin: auto auto auto 0;" />

---
name: Variance

# Measures of spread: Variance

- Variance: The average squared distance of the data points from the mean. Its unit is the square of the data's unit (e.g. `\(cm^2\)`).
- Note: `var()` in R returns the sample variance, which divides by `\(n - 1\)` instead of `\(n\)`.

$$ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

``` r
var(data$var2)
```

```
## [1] 395.8689
```

---
name: Stdev

# Measures of spread: Standard deviation

- Standard deviation (sd): The square root of the variance, which provides a more intuitive measure of spread. Unlike the variance, the sd has the same unit as the data (e.g. cm).

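
A minimal R sketch of how variance and standard deviation relate, using a hypothetical toy vector `x` (an assumption for illustration, not the `data` object plotted in these slides). It contrasts the population variance (dividing by n) with the sample variance (dividing by n - 1) that `var()` and `sd()` use.

```r
# Hypothetical toy vector, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Population variance: sum of squared deviations divided by n
pop_var <- sum((x - mean(x))^2) / n

# Sample variance: same sum divided by n - 1, which is what var() returns
samp_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance
pop_sd  <- sqrt(pop_var)
samp_sd <- sqrt(samp_var)

# Compare the manual results with the built-in functions
all.equal(samp_var, var(x))
all.equal(samp_sd, sd(x))
```

Both comparisons should return `TRUE`, since `sd(x)` is defined as `sqrt(var(x))`.
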

$$ \sigma = \sqrt{\sigma^2} $$

<img src="slide_r_basic_statistic_files/figure-html/sd-plot-1.png" width="504" style="display: block; margin: auto auto auto 0;" />

---
name: correlation

# Correlation

- Measures the strength and direction of the **linear** relationship between two variables.
- Positive Correlation: As one variable increases, the other also increases.
- Negative Correlation: As one variable increases, the other decreases.
- No Correlation: No directional relationship between the variables.

---
name: Pearson

# Types of correlation

- Pearson's correlation coefficient: Measures the linear correlation between two **continuous** variables.
- Assumptions:
    - Linear relationship.
    - Normally distributed variables.

<img src="slide_r_basic_statistic_files/figure-html/pearson-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$

---
name: Spearman

# Types of correlation

- Spearman's rank correlation coefficient: Measures the monotonic relationship between two **ranked** variables.
- Assumptions:
    - It is a non-parametric approach: the relationship must be monotonic but does not need to be linear.
    - It does not require the data to be normally distributed.
    - It can be used for both continuous and ordinal (ranked categorical) variables.

<img src="slide_r_basic_statistic_files/figure-html/spearman-1.png" width="576" style="display: block; margin: auto auto auto 0;" />

$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

Here `\(d_i\)` is the difference between the two ranks of observation `\(i\)` and `\(n\)` is the number of observations; this form of the formula applies when there are no tied ranks.

---
name: closing

# More on statistics?

- We discussed some very basic descriptive statistics.
- You can read more [here](https://nbisweden.github.io/workshop-mlbiostatistics/session-descriptive/docs/index.html).

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end_slide
class: end-slide, middle
count: false

# See you at the next lecture!

.end-text[ <p class="smaller"> <span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br> Created: 31-Oct-2024 • <a href="https://www.scilifelab.se/">SciLifeLab</a> • <a href="https://nbis.se/">NBIS</a> </p> ]