Two main types of statistics
Descriptive statistics
Descriptive statistics is a term for simple analyses of data that help us get to know the data by:
Beyond getting to know the data, descriptive statistics is used to:
numerical & categorical data types
One of the first things we tend to notice about the data is the data type. We differentiate between categorical (qualitative) and numerical (quantitative) data types.
flowchart LR A(Data types) --> B(Categorical) A --> C(Numerical)
Depending on the data type we use different methods to describe, summarize and visualize the data. Beyond descriptive statistics, we even use different methods to analyse the data.
Categorical data
Categorical data can be further divided into:
flowchart LR A(Data types) --> B(Categorical) A --> C(Numerical) B(Categorical) --> D(Nominal i.e. named) B --> E(Ordinal i.e. named and ordered)
Numerical data
Numerical data can be further divided into:
flowchart LR A(Data types) --> B(Categorical) A(Data types) --> C(Numerical) C(Numerical) --> D(Discrete i.e. finite or countably infinite values) C(Numerical) --> E(Continuous i.e. uncountably many values)
Example data set
Data set from the `faraway` package.

Abbreviation | Description |
---|---|
id | Subject ID |
chol | Total Cholesterol [mg/dL] |
stab.glu | Stabilized Glucose [mg/dL] |
hdl | High Density Lipoprotein [mg/dL] |
ratio | Cholesterol / HDL Ratio |
glyhb | Glycosylated Hemoglobin [%] |
location | County: Buckingham or Louisa |
age | age [years] |
gender | gender |
height | height [in] |
weight | weight [lb] |
frame | frame: small, medium or large |
bp.1s | First Systolic Blood Pressure |
bp.1d | First Diastolic Blood Pressure |
bp.2s | Second Systolic Blood Pressure |
bp.2d | Second Diastolic Blood Pressure |
waist | waist [in] |
hip | hip [in] |
time.ppn | Postprandial Time [min] when labs were drawn |
Example data set
`diabetic` (yes/no) reflecting this information.

`obese` variable (yes/no).

Rows: 6
Columns: 22
$ id <int> 1002, 1011, 1016, 1024, 1036, 1252
$ chol <int> 228, 195, 177, 242, 213, 186
$ stab.glu <int> 92, 92, 87, 82, 83, 97
$ hdl <int> 37, 41, 49, 54, 47, 50
$ ratio <dbl> 6.2, 4.8, 3.6, 4.5, 4.5, 3.7
$ glyhb <dbl> 4.64, 4.84, 4.84, 4.77, 3.41, 6.49
$ location <fct> Buckingham, Buckingham, Buckingham, Louisa, Louisa, Buckingham
$ age <int> 58, 30, 45, 60, 33, 70
$ gender <fct> female, male, male, female, female, male
$ height <dbl> 1.55, 1.75, 1.75, 1.65, 1.65, 1.70
$ weight <dbl> 115.20, 85.95, 74.70, 70.20, 70.65, 80.10
$ frame <fct> large, medium, large, medium, medium, large
$ bp.1s <int> 190, 161, 160, 130, 130, 148
$ bp.1d <int> 92, 112, 80, 90, 90, 88
$ bp.2s <int> 185, 161, 128, 130, 120, 148
$ bp.2d <int> 92, 112, 86, 90, 96, 84
$ waist <dbl> 1.2446, 1.1684, 0.8636, 0.9906, 0.9398, 1.0668
$ hip <int> 57, 49, 40, 45, 41, 41
$ time.ppn <int> 180, 720, 300, 300, 240, 1020
$ BMI <dbl> 47.95, 28.07, 24.39, 25.79, 25.95, 27.72
$ obese <fct> Yes, No, No, No, No, No
$ diabetic <fct> No, No, No, No, No, No
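As a rough check of how the derived columns relate to the raw ones, the sketch below recomputes `BMI` from the previewed `height` and `weight` values. It assumes the standard definition BMI = weight [kg] / height [m]², which the slides do not spell out, and a BMI \(\ge\) 30 cut-off for `obese`. The variable names are illustrative, and the code is Python rather than the R used elsewhere in the course.

```python
# Height [m] and weight [kg] of the six previewed participants (copied from the preview).
ids = [1002, 1011, 1016, 1024, 1036, 1252]
height = [1.55, 1.75, 1.75, 1.65, 1.65, 1.70]
weight = [115.20, 85.95, 74.70, 70.20, 70.65, 80.10]

# Assumed definition: BMI = weight in kg divided by squared height in m.
bmi = [round(w / h**2, 2) for w, h in zip(weight, height)]

# Assumed cut-off: a participant is labelled obese when BMI >= 30.
obese = ["Yes" if b >= 30 else "No" for b in bmi]

for row in zip(ids, bmi, obese):
    print(row)
```

Under these assumptions the recomputed values agree with the `BMI` and `obese` columns in the preview.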
Summarizing categorical data
Let’s preview the first few measurements of the diabetes data set again, focusing on the `gender` and `obese` variables.
id | gender | obese |
---|---|---|
1002 | female | Yes |
1011 | male | No |
1016 | male | No |
1024 | female | No |
1036 | female | No |
1252 | male | No |
1253 | male | Yes |
1256 | female | Yes |
1271 | female | No |
1285 | male | No |
`gender` and `obese` status fall under the categorical data type.

Frequency table. Bar and pie charts.
A frequency table shows the number, percentage and proportion of study participants with BMI \(\ge\) 30 and with BMI \(<\) 30.
obese | n | percent (%) | proportion |
---|---|---|---|
No | 72 | 55.4 | 0.6 |
Yes | 58 | 44.6 | 0.4 |
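A frequency table like the one above can be built by counting each category and dividing by the total. The following is a minimal Python sketch that rebuilds the `obese` column from the counts in the table (72 and 58); the names `freq_table` and `counts` are illustrative. Proportions are kept to two decimals here, whereas the table rounds them to one.

```python
from collections import Counter

# Obesity status for all 130 participants, rebuilt from the counts in the table.
obese = ["No"] * 72 + ["Yes"] * 58

counts = Counter(obese)
total = sum(counts.values())

# One row per category: count, percentage (1 decimal) and proportion (2 decimals).
freq_table = {
    level: {
        "n": n,
        "percent": round(100 * n / total, 1),
        "proportion": round(n / total, 2),
    }
    for level, n in counts.items()
}

for level, row in freq_table.items():
    print(level, row)
```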
Bar chart
Pie chart
Summary and contingency table: 2 categorical variables
When we are interested in how one categorical variable is related to another categorical variable, we can use a summary table. For instance, we can look at the relationship between obesity (yes/no) and diabetes (yes/no).
obese | Total | Diabetic | Diabetic (%) |
---|---|---|---|
No | 72 | 15 | 20.83 |
Yes | 58 | 15 | 25.86 |
A contingency table, sometimes called a two-way frequency table, shows the joint frequency distribution of two categorical variables.
| Non-diabetic | Diabetic | Sum |
---|---|---|---|
Non-obese | 57 | 15 | 72 |
Obese | 43 | 15 | 58 |
Sum | 100 | 30 | 130 |
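The margins and row percentages of a contingency table follow directly from its cell counts. The Python sketch below uses the four cell counts from the contingency table; the dictionary layout and the names `row_sums`, `col_sums` and `diabetic_pct` are illustrative choices, not part of the original material.

```python
# Cell counts from the contingency table: rows are obesity status,
# columns are diabetes status.
table = {
    "Non-obese": {"Non-diabetic": 57, "Diabetic": 15},
    "Obese": {"Non-diabetic": 43, "Diabetic": 15},
}

# Row and column margins (the "Sum" entries).
row_sums = {row: sum(cells.values()) for row, cells in table.items()}
col_sums = {
    col: sum(table[row][col] for row in table)
    for col in ["Non-diabetic", "Diabetic"]
}
grand_total = sum(row_sums.values())

# Share of diabetic participants within each obesity group (row percentages).
diabetic_pct = {
    row: round(100 * table[row]["Diabetic"] / row_sums[row], 2) for row in table
}

print(row_sums, col_sums, grand_total)
print(diabetic_pct)
```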
Bar charts: 2 categorical variables
Bar charts can be used to visualize two or more categorical variables, e.g. by using stacked bars, side-by-side bars or colors.
Mosaic plot
Summarizing numerical data
Numerical data can be visualized and summarized in many ways. Common plots include histograms, density plots and scatter plots. Summary statistics include measures of location such as mode and median and measures of spread such as variance or median absolute deviation. It is also common to visualize summary statistics, e.g. in a box plot.
flowchart TD A(Numerical data) --> B(Numerical summary) A(Numerical data) --> C(Graphical summary) B(Numerical summary) --> D(Measures of location <br/> e.g. mode, average, median) B --> E(Measures of spread <br/> e.g. quartiles, variance, standard deviation) C(Graphical summary) --> F(Histogram <br/> Density plot <br/> Box plot <br/> ...)
Strip plot, Jittered strip plot & Beeswarm plot
If it is technically feasible, it is recommended to visually assess all measurements on a plot.
Histogram & density plot
A histogram bins the data and counts the number of observations that fall into each bin. A density plot is like a smoothed histogram where the total area under the curve is set to 1. A density plot is an approximation of a distribution.
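The binning step can be sketched in a few lines of Python. The example uses the ages of ten study participants and a bin width of 10 years, which is an arbitrary choice made for illustration. Dividing each count by \(n \times\) bin width rescales the bars so that their total area is 1, the scale on which a density plot is drawn.

```python
# Ages of ten study participants from the diabetes data set.
ages = [58, 30, 45, 60, 33, 70, 47, 66, 24, 40]

bin_width = 10  # arbitrary choice for this illustration

# Map each age to the lower edge of its bin, e.g. 47 -> 40 for the bin [40, 50).
counts = {}
for age in ages:
    lower = (age // bin_width) * bin_width
    counts[lower] = counts.get(lower, 0) + 1

# Rescale counts so that the total area of the bars equals 1.
n = len(ages)
density = {lower: c / (n * bin_width) for lower, c in counts.items()}

for lower in sorted(counts):
    bar = "#" * counts[lower]
    print(f"[{lower}, {lower + bin_width}): {bar}  density={density[lower]:.2f}")
```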
Scatter plot: 2 numerical variables
Scatter plots are useful when studying a relationship (association) between two numerical variables.
Scatter plot: 2 numerical variables cont.
Sometimes, it is useful to connect the observations in the order in which they appear, e.g. when analyzing time series data. The diabetes data set does not contain any measurements over time but we can simulate some BMI values over time for demonstration purposes.
flowchart LR A(Representative value) --> C(Image of data) B(Spread) --> C(Image of data)
Mode
We find the `age` mode value by counting how many times we observe each age value among the study participants. The top three counts are below and the mode is thus 63.

age | n |
---|---|
63 | 9 |
50 | 6 |
40 | 5 |
We find the mode of `diabetic` status by counting how many of the participants are diabetic and how many are not.

diabetic | n |
---|---|
No | 100 |
Yes | 30 |
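Finding the mode amounts to picking the value with the highest count, which works for numerical and categorical data alike. The Python sketch below starts from the counts in the two tables (only the top three age counts are available, so the `age` counter is incomplete by construction).

```python
from collections import Counter

# Counts copied from the two tables; the age counter holds the top three values only.
age_counts = Counter({63: 9, 50: 6, 40: 5})
diabetic_counts = Counter({"No": 100, "Yes": 30})

# most_common(1) returns the (value, count) pair with the highest count.
age_mode, age_n = age_counts.most_common(1)[0]
diabetic_mode, diabetic_n = diabetic_counts.most_common(1)[0]

print(age_mode, age_n)
print(diabetic_mode, diabetic_n)
```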
Median
Median value divides the ordered data values into two equally sized groups so that 50% of the values are below and 50% are above the median value.
\[
\mathrm{Median} =
\begin{cases}
\left(\frac{n+1}{2}\right)^{th} \mathrm{\ term} & \mathrm{if\ } n \mathrm{\ is\ odd} \\
\frac{1}{2}\times \left( \left(\frac{n}{2}\right)^{th} \mathrm{\ term} + \left(\frac{n}{2}+1\right)^{th} \mathrm{\ term} \right) & \mathrm{if\ } n \mathrm{\ is\ even}
\end{cases}
\]

For instance, the median value of `age` for the first 10 study participants can be found by ordering the observations:
1271 | 1011 | 1036 | 1285 | 1016 | 1253 | 1002 | 1024 | 1256 | 1252 |
---|---|---|---|---|---|---|---|---|---|
24 | 30 | 33 | 40 | 45 | 47 | 58 | 60 | 66 | 70 |
and averaging the \(5^{th}\) and \(6^{th}\) terms of the ordered observations, giving a median value of:
[1] 46
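The odd/even rule can be written out directly. The Python sketch below implements it for the ten ages in the example and compares the result with `statistics.median` from the standard library; the helper name `median_by_definition` is made up for this illustration.

```python
from statistics import median

# Ages of the first 10 study participants, in their original order.
ages = [58, 30, 45, 60, 33, 70, 47, 66, 24, 40]

def median_by_definition(values):
    """Median following the odd/even case definition."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th term (1-based), i.e. index n // 2 (0-based).
        return ordered[n // 2]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th terms (1-based).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_definition(ages))
print(median(ages))
```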
Mean & weighted mean
The arithmetic mean, also commonly referred to as the mean, is calculated by adding up all the values and dividing the sum by the number of values in the data set.
Mathematically, for \(n\) observations \(x_1, x_2, \dots, x_n\), the arithmetic mean value is calculated as: \[\bar x = \frac{x_1+x_2+\dots+x_n}{n} = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i \qquad(1)\]
The weighted mean lets us give some values of the variable of interest more influence than others. We attach a weight, \(w_i\), to each observed value, \(x_i\), in our sample to reflect its importance and define the weighted mean as: \[\bar{x} = \frac{w_1x_1 + w_2x_2 + \ldots + w_nx_n}{w_1 + w_2 + \ldots + w_n} = \frac{\displaystyle\sum_{i=1}^{n}w_ix_i}{\displaystyle\sum_{i=1}^{n}w_i} \qquad(2)\]
Example 1 For instance, we may be interested in knowing the average `BMI` value, irrespective of `gender`. It happens that among our study participants women are overrepresented:
gender | n |
---|---|
male | 57 |
female | 73 |
Assuming BMI measurements for men and women should have equal influence (50/50), and knowing the average BMI for men and women separately:
gender | mean_BMI |
---|---|
male | 27.77 |
female | 31.71 |
What is the weighted BMI mean?
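One way to read this question is to give the two group means equal weights in equation (2). The Python sketch below does that and, for comparison, also weights the group means by group size, which approximately recovers the ordinary mean over all participants. The function name `weighted_mean` is illustrative, and treating the group means as the observations \(x_i\) is an assumption about what the exercise intends.

```python
# Group sizes and group means copied from the two tables.
n = {"male": 57, "female": 73}
mean_bmi = {"male": 27.77, "female": 31.71}

def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

groups = ["male", "female"]
values = [mean_bmi[g] for g in groups]

# Equal influence (50/50): both group means get the same weight.
equal_weight = weighted_mean(values, [0.5, 0.5])

# For comparison: weighting by group size pulls the result
# towards the larger (female) group.
size_weight = weighted_mean(values, [n[g] for g in groups])

print(round(equal_weight, 2), round(size_weight, 2))
```

With equal weights the result is 29.74, slightly lower than the size-weighted value of 29.98, because the larger female group no longer dominates.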
Mean, median & outliers
The median is usually preferred when the data contain outliers: it follows from its definition that the median is less sensitive to outliers. The mean, on the other hand, can be distorted when outliers are present.
Example 2 Let’s add an outlying value of age (110) to the first 11 study participants, and re-calculate mean and median.
| mean | median |
---|---|---|
without outlier | 46.82 | 45 |
with outlier | 52.08 | 46 |
Adding a single outlying age value shifted the mean age from 46.82 to 52.08, while the median barely moved: it was 45 before and 46 after adding the outlying value.
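The same effect can be reproduced with a short Python sketch. Note that it uses the ten ages listed in the median example rather than the eleven participants behind the table, so the numbers differ slightly; the pattern (mean moves a lot, median moves little) is the same.

```python
from statistics import mean, median

# Ages of the first ten participants; the table uses eleven participants,
# so its numbers are not reproduced exactly here.
ages = [24, 30, 33, 40, 45, 47, 58, 60, 66, 70]
ages_with_outlier = ages + [110]

summary = {
    "without outlier": (mean(ages), median(ages)),
    "with outlier": (mean(ages_with_outlier), median(ages_with_outlier)),
}

for label, (m, med) in summary.items():
    print(f"{label}: mean={m:.2f}, median={med}")
```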
In addition, it is good to remember that several very different distributions can still have the same mean value.
Range, quartiles and IQR
Variance and standard deviation
The variance of a set of observations is their mean squared distance from the mean value:
\[\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 \qquad(3)\]
Standard deviation is defined as the square root of the variance:
\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2} \qquad(4)\]
Sample variance and standard deviation
Typically, we regard the collection of observations \(x_1, \dots, x_n\) as a sample drawn from a large population of possible observations. It has been shown that we obtain a better sample estimate of the population variance and standard deviation if we divide by \((n-1)\). So the denominator \(n\) is commonly replaced by \(n-1\) and the sample variance is calculated instead as:
\[s^2 = {\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}. \qquad(5)\]
and the sample standard deviation is calculated as:
\[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}. \qquad(6)\]
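The only difference between equations (3)-(4) and (5)-(6) is the denominator. The Python sketch below computes both versions for the ten ages from the median example and checks them against `pvariance` and `variance` from the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, variance

ages = [24, 30, 33, 40, 45, 47, 58, 60, 66, 70]
n = len(ages)
x_bar = sum(ages) / n

# Sum of squared distances from the mean, shared by both formulas.
ss = sum((x - x_bar) ** 2 for x in ages)

pop_var = ss / n             # divide by n, as in equation (3)
sample_var = ss / (n - 1)    # divide by n - 1, as in equation (5)

pop_sd = sqrt(pop_var)         # equation (4)
sample_sd = sqrt(sample_var)   # equation (6)

print(round(pop_var, 2), round(sample_var, 2))
print(round(pop_sd, 2), round(sample_sd, 2))

# The standard library implements the same two definitions.
print(pvariance(ages), variance(ages))
```

Because \(n-1 < n\), the sample variance is always somewhat larger than the version that divides by \(n\); the gap shrinks as \(n\) grows.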
Boxplot
Descriptive stats in context
flowchart LR A(Define problem) --> B(Collect data) B --> C(Clean data) C --> D(Explore data) D --> E(Inferential statistics) D --> F(Predictive modelling) E --> G(Communicate results) F --> G(Communicate results)
Descriptive stats in context
flowchart LR A(Define problem) --> B(Collect data) B --> C(Clean data) C --> D(Explore data) D --> E(Inferential statistics) D --> F(Predictive modelling) E --> G(Communicate results) F --> G(Communicate results) G --> A G --> B G --> C G --> D G --> E G --> F E --> A E --> B E --> C E --> D E --> F F --> B F --> C F --> D D --> G D --> C D --> A C --> A
And descriptive statistics
flowchart LR A(Data) --> B(Descriptive stats.) B --> C(Feature eng.) C --> D(ML model) C --> B
scaling & normalization
common transformations
square-root for moderate skew
log for greater skew
inverse for severe skew
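As a small illustration of the three transformations, the Python sketch below applies each of them to a made-up, right-skewed sample and prints a moment-based skewness value before and after. The data and the helper `skewness` are hypothetical and not taken from the diabetes data set; which transformation is appropriate in practice depends on the data. Note that the logarithm requires strictly positive values and the inverse requires non-zero values.

```python
from math import log, sqrt
from statistics import fmean, pstdev

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    m = fmean(values)
    s = pstdev(values)
    return fmean([((x - m) / s) ** 3 for x in values])

# Hypothetical right-skewed, strictly positive measurements.
raw = [1, 2, 3, 4, 5, 10, 100]

transformed = {
    "raw": raw,
    "square-root": [sqrt(x) for x in raw],
    "log": [log(x) for x in raw],
    "inverse": [1 / x for x in raw],  # also reverses the ordering of the values
}

for name, values in transformed.items():
    print(f"{name}: skewness = {skewness(values):.2f}")
```

For this sample the square root reduces the skewness a little and the logarithm reduces it more.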
dummy variables
`obese` with three possible values (underweight, healthy, overweight) can be transformed into two binary variables: “is_healthy” and “is_overweight”, where the value of each variable is 1 if the observation belongs to that category and 0 otherwise. Only \(k-1\) binary variables are needed to encode \(k\) categories.
missing data
Rubin’s (1976) missing data classification system
“Missing Data: A Gentle Introduction by Patrick E. McKnight, Katherine M. McKnight, Souraya Sidani, and Aurelio Jose Figueredo” (2008)
handling imbalanced data
misc
`BMI` variables based on `weight` and `height`
It is always a good idea to look at the raw measurements, printing them all for smaller data sets or printing randomly selected measurements from bigger data sets.
There are many R packages that calculate descriptive statistics and present the results in customizable, publication-ready summary tables, e.g. `gtsummary` or `arsenal`.
Similarly, there are many ways of visualizing the data. For inspiration, see The R Graph Gallery: https://r-graph-gallery.com
We will later learn how to use the `tidymodels` framework to run ML modelling tasks, including feature engineering steps with the `recipes` tidy interface.
Characteristic | No, N = 253¹ | Yes, N = 144¹ |
---|---|---|
age | 47 (17) | 47 (16) |
chol | 206 (43) | 212 (46) |
Unknown | 1 | 0 |
gender | | |
male | 128 (51%) | 40 (28%) |
female | 125 (49%) | 104 (72%) |

¹ Mean (SD); n (%)
questions?