6  Measures of location

Code
# load libraries
library(tidyverse)
library(magrittr)
library(kableExtra)
library(faraway)
library(ggbeeswarm)
library(gridExtra)
library(reshape2)

# define generic ggplot theme
font.size <- 12
col.blue.light <- "#a6cee3"
col.blue.dark <- "#1f78b4"
my.ggtheme <- theme(axis.title = element_text(size = font.size), 
        axis.text = element_text(size = font.size), 
        legend.text = element_text(size = font.size), 
        legend.title = element_blank(), 
        legend.position = "top", 
        axis.title.y = element_text(angle = 0)) 

# add obesity and diabetes status to diabetes faraway data
inch2m <- 2.54/100
pound2kg <- 0.45
data_diabetes <- diabetes %>%
  mutate(height  = height * inch2m, height = round(height, 2)) %>% 
  mutate(waist = waist * inch2m) %>%  
  mutate(weight = weight * pound2kg, weight = round(weight, 2)) %>%
  mutate(BMI = weight / height^2, BMI = round(BMI, 2)) %>% 
  mutate(obese= cut(BMI, breaks = c(0, 29.9, 100), labels = c("No", "Yes"))) %>% 
  mutate(diabetic = ifelse(glyhb > 7, "Yes", "No"), diabetic = factor(diabetic, levels = c("No", "Yes"))) %>%
  na.omit()

It is not always easy to get a “feeling” for a set of numerical measurements unless we summarize the data in a meaningful way. Diagrams, as shown in the previous chapter, are often a good starting point. We can further condense the information by reporting what constitutes a representative value. If we also know how widely scattered the observations are around it, we can formulate an image of data. The average is a general term for a measure of location and some common ways of calculating the average are mode, mean and median.

6.1 Mode

Mode values is the value that most common occurs across the measurements. It can be found for numerical and categorical data types.

For instance, we can find age mode value by counting how many times we observe each age value among the study participants. The mode values is the most commonly occurring one, here 63, with 9 participants being 63 at the time of the study.

Code
data_diabetes %>%
  group_by(age) %>%
  tally() %>%
  arrange(desc(n)) %>%
  slice(1:3) %>%
  kbl() %>%
  kable_styling()
Table 6.1: Top three most observed age values among the study participants.
age n
63 9
50 6
40 5

Analogously, we can find mode value for the categorical diabetic status by counting how many of the participants are diabetic and how many are not. Here, the mode is No with 100 study participants having glycosolated hemoglobin level lower than 7.0 and 30 above.

Code
data_diabetes %>%
  group_by(diabetic) %>%
  tally() %>%
  arrange(desc(n)) %>%
  slice(1:3) %>%
  kbl %>%
  kable_styling()
Table 6.2: Number of times diabetic status of No and Yes is observed for the study participants.
diabetic n
No 100
Yes 30

6.2 Median

Median value divides the ordered data values into two equally sized groups so that 50% of the values are below and 50% are above the median value. For the odd number of observations, the middle value is \((n+1)/2\)-th term of the ordered observations. For even number of observations, it is the average of the middle two terms of the ordered observations.

\[\begin{equation} Median = \left\{ \begin{array}{cc} \frac{(n+1)}{2}^{th} term & \mathrm{if\ } n \mathrm{\ is\ odd} \\ \frac{1}{2}\times \left (\frac{n}{2}^{th} term + (\frac{n}{2}+1)^{th} term \right) & \mathrm{if\ } n \mathrm{\ is\ even} \\ \end{array} \right. \end{equation}\]

For instance, the median value for age for the first 10 study participants:

Code
age_10 <- data_diabetes %>%
  slice(1:10)  # select first 10 participants

# show age for 10 first participants
age_10 %>%
  select(id, age) %>% 
  pivot_wider(names_from = id, values_from = age) %>%
  kbl() %>%
  kable_styling() 
1002 1011 1016 1024 1036 1252 1253 1256 1271 1285
58 30 45 60 33 70 47 66 24 40
Age values for the first 10 study participants.

can be found by ordering observations:

Code
age_10_ordered <- data_diabetes %>%
  slice(1:10) %>% # select first 10 participants
  arrange(age) %>% # order by age
  pull(age) # extract ordered age observations

data_diabetes %>%
  slice(1:10) %>% # select first 10 participants
  arrange(age) %>%
  select(id, age) %>% 
  pivot_wider(names_from = id, values_from = age) %>%
  kbl() %>%
  kable_styling() %>%
  column_spec(c(5), background = "gold") %>%
  column_spec(c(6), background = "gold")
1271 1011 1036 1285 1016 1253 1002 1024 1256 1252
24 30 33 40 45 47 58 60 66 70

and averaging \(5^{th}\) and \(6^{th}\) term in the ordered observations giving a median value of:

1/2*(age_10_ordered[5] + age_10_ordered[6])
[1] 46

The median value for age for the first 11 study participants:

Code
age_11 <- data_diabetes %>%
  slice(1:11)  # select first 11 participants

# show age for 10 first participants
age_11 %>%
  select(id, age) %>% 
  pivot_wider(names_from = id, values_from = age) %>%
  kbl() %>%
  kable_styling() 
1002 1011 1016 1024 1036 1252 1253 1256 1271 1285 1301
58 30 45 60 33 70 47 66 24 40 42
Age values for the first 10 study participants.

can be found by taking \(6^{th}\) term from the ordered observations:

Code
age_11_ordered <- data_diabetes %>%
  slice(1:11) %>% # select first 11 participants
  arrange(age) %>% # order by age
  pull(age) # extract ordered age observations

data_diabetes %>%
  slice(1:11) %>% # select first 11 participants
  arrange(age) %>%
  select(id, age) %>% 
  pivot_wider(names_from = id, values_from = age) %>%
  kbl() %>%
  kable_styling() %>%
  column_spec(c(6), background = "gold")
1271 1011 1036 1285 1301 1016 1253 1002 1024 1256 1252
24 30 33 40 42 45 47 58 60 66 70

giving a median value of:

age_11_ordered[6]
[1] 45

Alternatively, we can use median() function to check our calculations:

data_diabetes %>%
  dplyr::slice(1:10) %$%
  median(age)
## [1] 46

data_diabetes %>%
  dplyr::slice(1:11) %$%
  median(age)
## [1] 45

6.3 Arthimetic mean

The arithmetic mean, also commonly referred to as mean, is calculated by adding up all the values and diving the sum by the number of values in the data set.

Mathematically, for \(n\) observations \(x_1, x_2, \dots, x_n\), the arithmetic mean value is calculated as: \[\bar x = \frac{x_1+x_2+\dots+x_n}{n} = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i \tag{6.1}\]

To calculate the arithmetic mean for BMI given our 130 study participants we can follow Equation 6.1

# calculate arithmetic mean following the equation
x <- data_diabetes %>%
  pull(BMI) # extract BMI observations
n <- length(x) # number of observations
x.bar <- sum(x) / n # calculate mean
print(x.bar)
[1] 29.97962
BMI_mean <- round(x.bar, 2)

or use basic mean() function in R:

data_diabetes %$%
  mean(BMI) %>%
  print()
[1] 29.97962

6.4 Weighted mean

As all the values equally contribute to the calculations, the arithmetic mean value is easily affected by outliers and is distorted by skewed distributions. Sometimes, the weighted mean may be more useful, as it allows add weights to certain values of the variable of interest. We attach a weight, \(w_i\) to each of the observed values, \(x_i\), in our sample, to reflect this importance and define the weighted mean as: \[\bar{x} = \frac{w_1x_1 + w_2x_2 + \ldots + w_nx_n}{w_1 + w_2 + \ldots + w_n} = \frac{\displaystyle\sum_{i=1}^{n}w_ix_i}{\displaystyle\sum_{i=1}^{n}w_i} \tag{6.2}\]

For instance, we may be interested in knowing an average BMI value, irrespective of gender. It happens that among our study participants women are over represented:

Code
data_diabetes %>%
  group_by(gender) %>%
  tally() %>%
  kbl() %>%
  kable_styling()
gender n
male 57
female 73
Number of male and female study participans.

Assuming BMI measurements for men and women should have equal influence (50/50), we can calculate weighted BMI mean to account for group sizes. We assign weights to BMI observations for men and women so that they sum up to 100. Since we have 73 measurements for women, the corresponding weights are \(w_f = 50 / 73 = 0.6849315\) and \(w_m = 50 / 57 = 0.877193\) for measurements reported for men. The weighted mean can now be calculated following Equation 6.2 and is equal to:

# number of women
n_w <- data_diabetes%>%
  filter(gender == "female") %>%
  nrow()

# number of men
n_m <- data_diabetes%>%
  filter(gender == "male") %>%
  nrow()

# add weights to observations
data_diabetes_addweights <- data_diabetes %>%
  mutate(w = ifelse(gender == "male", 50/n_m, 50/n_w)) %>% # assign weights
  mutate(wx = BMI * w)  # multiply weight by their weights values
  
numerator <- data_diabetes_addweights %$%
  sum(wx)

denominator <- data_diabetes_addweights %$%
  sum(w)

BMI_weighted_mean <- numerator / denominator
  
BMI_weighted_mean <- BMI_weighted_mean %>%
  round(2) %>%
  print()
[1] 29.74

We previously calculated BMI mean of 29.98 and mean BMI for men and women are:

Code
data_diabetes %>%
  group_by(gender) %>%
  summarize(mean_BMI = mean(BMI)) %>%
  kbl(digits = 2) %>%
  kable_styling()
gender mean_BMI
male 27.77
female 31.71
Mean BMI values for men and women in the study.

We can note, as expected, that the weighted mean of BMI 29.74 is shifted slightly towards mean BMI for men since the BMI measurements for men have been assigned higher weights to account for women being overrepresented in the study.

6.5 Mean, median & outliers

Median is usually preferred when data has outliers as it follows from median definition that is less sensitive to outliers. On the other hand, mean value can be distorted when outliers are present. Let’s add an outlying value of age (110) to the first 11 study participants, and re-calculate mean and median.

Code
# pull age values for the first 11 study participants
age_11 <- data_diabetes %>%
  slice(1:11) %>%
  select(id, age) %>%
  pull(age)
  
# add outlier value of 110
age_11_with_outlier <- c(age_11, 110)

# calculate mean and median, with and without outlier

# without outlier
age_mean_without <- mean(age_11) %>% round(2)
age_median_without <- median(age_11)

# with outlier
age_mean_with <- mean(age_11_with_outlier) %>% round(2)
age_median_with <- median(age_11_with_outlier)

res <- data.frame(mean = c(age_mean_without, age_mean_with), 
                  median = c(age_median_without, age_median_with), 
                  row.names = c("without outlier", "with outlier"))

res %>%
  kbl() %>%
  kable_styling()
mean median
without outlier 46.82 45
with outlier 52.08 46
Comparision of mean and median age values before and after adding outlying age value of 110 to the first 11 study participants.

We can see that adding one outlying age value shifted mean age from 46.82 to 52.08 while median age value did not change that much with original median value being 45 and 46 after adding the outlying value.

In addition, it is good to remember that several very different distributions can still have the same mean value.

Figure 6.1: Examples of various distributions having the same mean value of 3.5

6.6 More averages

Beyond the common averages measure above, there is many more that one may encounter. The trimmed mean reduces the influence of outliers by removing a specified percentage of extreme values before calculating the mean, making it valuable in clinical trials where extreme results may skew data. The geometric mean is crucial in pharmacokinetics for calculating average rates of drug absorption or clearance, as it computes the nth root of the product of all data points. The moving average is commonly used in epidemiology to smooth out daily case reports of diseases, helping to visualize trends by averaging data points within a sliding window over time. The Hodges-Lehmann estimator, a robust statistic for central tendency, is less affected by outliers. It calculates the median of all possible averages of sample pairs, ideal for non-parametric analyses in studies where data may not be normally distributed, such as environmental exposure assessments

Trimmed Mean

\[\text{Trimmed Mean} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}\] Here, \(x_{(i)}\) are the ordered observations from smallest to largest, \(n\) is the total number of observations, and \(p\) is the number of extreme values removed from each end of the dataset.

Geometric Mean \[\text{Geometric Mean (GM)} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}\] The geometric mean is the nth root of the product of all \(n\) data points, \(x_i\).

Moving Average \[\text{Moving Average} = \frac{1}{k} \sum_{i=j}^{j+k-1} x_i\] This averages \(k\) consecutive data points in a series, starting from point \(j\).

Hodges-Lehmann Estimator \[\text{Hodges-Lehmann Estimator} = \text{Median} \left( \frac{x_i + x_j}{2} \right)\] This estimator is calculated by taking all possible pairs \((x_i, x_j)\) of sample observations, computing their average, and then finding the median of these averages.