Brief introduction to statistics

class: center, middle, inverse, title-slide

.title[
# Brief introduction to statistics
]
.subtitle[
## Statistics
]
.author[
### Nima Rafati
]

---

exclude: true
count: false

---
name: intro

# Introduction

**Why do we need statistics in our analysis?**

- Make data understandable and insightful.

- Evaluate patterns and trends.

- Identify and quantify differences/similarities between groups.

**Types of statistics:**

- Descriptive statistics: To summarize and describe main features of a dataset (Mean, median,...).

- Inferential statistics: To make predictions or inferences about a population using a sample of data (Hypothesis testing, regression analysis,...).

- Predictive statistics: To make predictions about future outcomes based on collected data (Regression models, time series forecasting, machine learning,...).

- ......

---
name: Descriptive
# Types of Descriptive Statistics

Descriptive statistics help to:

- Summarize and describe the data.

- Visualize the data.

- Identify patterns (trends) and outliers in the data.

- Provide insights for downstream analysis.

---
name: SomeStats
# Some of the basic descriptive statistics

1. **Measures of Central Tendency**
    - Mean, Median, Mode.
2. **Measures of Spread**
    - Range, Interquartile Range, Standard Deviation, Variance.
3. **Correlation**
    - Relation between two variables (e.g. Pearson's correlation).

---
name: Mean 
# Central Tendency: Mean
- Mean: The average value of data.  
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

``` r
mean(data$var1)
mean(data$var2)
```

```
## [1] 47.46736
## [1] 100.081
```
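
As a small sketch, the formula can be checked by hand on a made-up vector (`x` is hypothetical and not part of the course data):

``` r
# Hypothetical toy data
x <- c(2, 4, 4, 5, 10)

# Mean by the formula: sum of the values divided by their count
manual_mean <- sum(x) / length(x)
manual_mean   # 5

# Built-in equivalent
mean(x)       # 5
```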

---
name: Median

# Central Tendency: Median

- Median: The middle value when the data are sorted.
<img src="slide_r_basic_statistic_files/figure-html/Median-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

``` r
median(data$var1)
median(data$var2)
```

```
## [1] 40.06529
## [1] 100.0426
```
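
A minimal sketch on made-up data (`x` is hypothetical) that shows the "middle value" idea and why the median is less sensitive to extreme values than the mean:

``` r
# Hypothetical toy data with one extreme value
x <- c(2, 4, 4, 5, 100)

# Sort the data and pick the middle element (this shortcut holds for odd n;
# for even n, median() averages the two middle values)
sorted_x <- sort(x)
n <- length(sorted_x)
manual_median <- sorted_x[(n + 1) / 2]
manual_median  # 4

median(x)      # 4
mean(x)        # 23, pulled upwards by the extreme value
```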

---
name: Mode
# Central Tendency: Mode

- Mode: The most frequently occurring value. 
<img src="slide_r_basic_statistic_files/figure-html/Mode-plot-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

``` r
# For continuous data, estimate the mode as the peak of a kernel density estimate
density_data1 <- density(data$var1)
density_data1$x[which.max(density_data1$y)]
density_data2 <- density(data$var2)
density_data2$x[which.max(density_data2$y)]
```

```
## [1] 26.8621
## [1] 100.6456
```
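
For discrete data, the mode can instead be read from a frequency table. A minimal sketch on made-up data (`x` is hypothetical); note that `which.max()` returns only the first value if several are tied:

``` r
# Hypothetical toy data
x <- c(2, 4, 4, 5, 10)

# Count how often each value occurs
counts <- table(x)

# Value with the highest count
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value  # 4
```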
---
name: Spread
# Measures of spread: Range and Interquartile Range
- Range: Difference between the maximum (`max(data$var2)`) and the minimum (`min(data$var2)`).  
- Interquartile Range: The **Quartiles** (Q1, Q2, Q3) split the sorted data into four equally sized groups (bins). The **Interquartile Range** (IQR) is the distance between the third and the first quartile (Q3 - Q1) and covers the middle 50% of the data.
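
A short sketch on made-up data (`x` is hypothetical), using R's default quantile definition:

``` r
# Hypothetical toy data
x <- 1:9

# Range: maximum minus minimum
x_range <- diff(range(x))
x_range  # 8

# Quartiles Q1, Q2 (median) and Q3
quartiles <- quantile(x, c(0.25, 0.5, 0.75))
quartiles  # 3, 5, 7

# Interquartile range: Q3 - Q1
IQR(x)   # 4
```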

---
name: Variance
# Measures of spread: Variance

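- Variance: The average squared distance of the data points from the mean. The unit is the square of the data's unit (e.g. `$cm^2$`).
- Note: The formula divides by `$n$` (population variance), whereas R's `var()` divides by `$n - 1$` (sample variance).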

$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

``` r
var(data$var2)
```

```
## [1] 395.8689
```
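
A minimal sketch on made-up data (`x` is hypothetical) comparing the formula with denominator `n` to R's `var()`, which uses `n - 1`:

``` r
# Hypothetical toy data
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)
ss  # 36

pop_var <- ss / n           # denominator n
pop_var     # 7.2

sample_var <- ss / (n - 1)  # denominator n - 1
sample_var  # 9

var(x)      # 9, matches the n - 1 version
```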
---
name: Stdev
# Measures of spread: Standard deviation

- Standard deviation (sd): The square root of the variance, which provides a more intuitive measure of spread. Unlike variance, sd has the same unit as the data (e.g. cm).

$$
\sigma =\sqrt{\sigma^2}
$$
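
As a quick check on made-up data (`x` is hypothetical), `sd()` returns the square root of `var()`:

``` r
# Hypothetical toy data
x <- c(2, 4, 4, 5, 10)

# Square root of the (sample) variance
manual_sd <- sqrt(var(x))
manual_sd  # 3

sd(x)      # 3
```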

<img src="slide_r_basic_statistic_files/figure-html/sd-plot-1.png" width="504" style="display: block; margin: auto auto auto 0;" />
---
name: correlation
# Correlation

- Correlation measures the strength and direction of the **linear** relationship between two variables.

  - Positive Correlation: As one variable increases, the other also increases.
  
  - Negative Correlation: As one variable increases, the other decreases.
  
  - No Correlation: No linear relationship between the variables.
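
The three cases can be sketched with made-up data (`x` and the derived variables are hypothetical). The last case shows that a correlation of zero only rules out a linear relationship:

``` r
# Hypothetical toy data
x <- -5:5

# Perfect positive linear relationship
cor_pos <- cor(x, 2 * x + 1)
cor_pos   # 1

# Perfect negative linear relationship
cor_neg <- cor(x, -3 * x)
cor_neg   # -1

# y depends on x, but not linearly
cor_none <- cor(x, x^2)
cor_none  # 0
```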
  
---
name: Pearson
# Types of correlation 
- Pearson's correlation coefficient: Correlation of two **continuous** variables.
- Assumptions:  
   - Linear relationship. 
   - Normally distributed variables.

$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$
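
A minimal sketch on made-up data (`x` and `y` are hypothetical) that evaluates the formula term by term and compares it with `cor()`, whose default method is Pearson:

``` r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# Numerator and denominator of the formula
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual   # 0.7745967

cor(x, y)  # 0.7745967
```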

---
name: Spearman
# Types of correlation 
- Spearman's rank correlation coefficient: Measures the monotonic relationship between two variables based on their **ranks**. 
- Assumptions: 
 - It is a non-parametric approach and does not require a linear relationship. 
 - It does not require normally distributed data. 
 - It works for both continuous and ordinal (ordered categorical) variables. 
<img src="slide_r_basic_statistic_files/figure-html/spearman-1.png" width="576" style="display: block; margin: auto auto auto 0;" />

$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$
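
In the formula, `$d_i$` is the difference between the two ranks of observation `$i$` and `$n$` is the number of observations. The formula is exact when there are no tied ranks. A minimal sketch on made-up data without ties (`x` and `y` are hypothetical):

``` r
# Hypothetical toy data without ties
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, 4)

# Rank differences
d <- rank(x) - rank(y)
n <- length(x)

# Spearman's rho by the formula
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_manual                      # 0.8

cor(x, y, method = "spearman")  # 0.8
```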
---
name: closing
# More on statistics?
- We discussed some very basic descriptive statistics. 
- You can read more [here](https://nbisweden.github.io/workshop-mlbiostatistics/session-descriptive/docs/index.html).

---
name: end_slide
class: end-slide, middle
count: false

# See you at the next lecture!

.end-text[

Graphics from <img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"> 
Created: 07-Oct-2025 • <a href="https://www.scilifelab.se/">SciLifeLab</a> • <a href="https://nbis.se/">NBIS</a> 

]