# load libraries
library(tidyverse)
library(kableExtra)
library(ggplot2)
library(ggbeeswarm)
library(gridExtra)
Exercise 11.1 (Summarize diabetes data) Use the code below to load the diabetes data set, calculate BMI and add a categorical variable obese, equal to Yes if \(BMI \ge 30\) and No otherwise. Summarize the variables obese, age and gender, reporting mean and sample standard deviation for numerical variables, and counts and percentages per group for categorical variables. Can you figure out how to use the arsenal and/or gtsummary packages to check your results and generate a publication-ready table?
library(faraway)
library(tidyverse)

# conversion factors: inches to metres and pounds to kilograms
inch2m <- 2.54/100
pound2kg <- 0.45

data_diabetes <- diabetes %>%
  mutate(height = height * inch2m, height = round(height, 2)) %>%
  mutate(waist = waist * inch2m) %>%
  mutate(weight = weight * pound2kg, weight = round(weight, 2)) %>%
  mutate(BMI = weight / height^2, BMI = round(BMI, 2)) %>%
  mutate(obese = cut(BMI, breaks = c(0, 29.9, 100), labels = c("No", "Yes")))
Exercise 11.2 (Plot diabetes data) Try various plots to visualize the diabetes variables BMI and gender. Start by making a histogram and a density plot of BMI, and a box plot of BMI stratified by gender. Can you think of any other plots that may be useful for visualizing the relationship between BMI and gender?
Solution. Exercise 11.1
Age is a numerical variable, so we can calculate its mean and sample standard deviation, for example as below:
data_diabetes %>%
  select(age) %>%
  summarize(age_mean = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE)) %>%
  print()
age_mean age_sd
1 46.85112 16.31233
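The sd() call above returns the sample standard deviation, which divides by n - 1 rather than n. As a minimal sketch on a small made-up vector (the values in `age_toy` are hypothetical and not taken from the diabetes data), the same number can be computed by hand:

```r
# hypothetical toy ages with one missing value (not the diabetes data)
age_toy <- c(23, 35, 47, 51, 64, NA)

# drop missing values, which is what na.rm = TRUE does inside mean() and sd()
x <- age_toy[!is.na(age_toy)]
n <- length(x)

# sample standard deviation: squared deviations from the mean, divided by n - 1
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_builtin <- sd(age_toy, na.rm = TRUE)

# the two values should agree up to floating point error
all.equal(sd_manual, sd_builtin)
```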
Gender and obesity status are categorical variables, so we can calculate counts and percentages per group as below:
summary_gender <- data_diabetes %>%
  select(gender) %>%
  group_by(gender) %>%
  summarize(n = n()) %>%
  mutate(percent = n * 100 / nrow(data_diabetes)) %>%
  print()
# A tibble: 2 × 3
gender n percent
<fct> <int> <dbl>
1 male 169 41.9
2 female 234 58.1
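As a cross-check that does not need dplyr, the same counts and percentages can be sketched with base R table() and prop.table(). The vector `gender_toy` below is not the real data; it is rebuilt from the counts printed above (169 males, 234 females) purely for illustration.

```r
# stand-in for data_diabetes$gender, rebuilt from the printed counts
gender_toy <- factor(rep(c("male", "female"), times = c(169, 234)),
                     levels = c("male", "female"))

# counts per group and percentages of the total
counts <- table(gender_toy)
percent <- round(100 * prop.table(counts), 1)

summary_gender_base <- data.frame(gender = names(counts),
                                  n = as.vector(counts),
                                  percent = as.vector(percent))
print(summary_gender_base)
```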
summary_obese <- data_diabetes %>%
  select(obese) %>%
  group_by(obese) %>%
  summarize(n = n()) %>%
  mutate(percent = n * 100 / nrow(data_diabetes)) %>%
  print()
# A tibble: 3 × 3
obese n percent
<fct> <int> <dbl>
1 No 253 62.8
2 Yes 144 35.7
3 <NA> 6 1.49
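The obese variable was created with cut(), whose intervals are right-closed by default, so breaks = c(0, 29.9, 100) assigns every BMI in (29.9, 100] to the Yes group. Because BMI is rounded to two decimals, a value such as 29.95 would be labelled Yes although it is below 30. The sketch below, on made-up BMI values, compares this with one possible alternative (not what the exercise code uses) that sets right = FALSE and places the break at 30. A missing BMI stays NA in both versions, which is how an <NA> group like the one above can arise.

```r
# hypothetical BMI values chosen to sit around the threshold
bmi_toy <- c(22.50, 29.90, 29.95, 30.00, 41.20, NA)

# as in the exercise code: intervals (0, 29.9] and (29.9, 100]
obese_orig <- cut(bmi_toy, breaks = c(0, 29.9, 100), labels = c("No", "Yes"))

# alternative sketch: intervals [0, 30) and [30, Inf), i.e. Yes exactly when BMI >= 30
obese_alt <- cut(bmi_toy, breaks = c(0, 30, Inf), labels = c("No", "Yes"),
                 right = FALSE)

print(data.frame(BMI = bmi_toy, obese_orig, obese_alt))
```

The two versions differ only for values strictly between 29.9 and 30, so whether the counts change depends on whether the data contain such values.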
Alternatively, we can use one of the many R packages for data summaries, for instance arsenal, to summarize gender and age by obesity status.
library(arsenal)
tab1 <- tableby(obese ~ gender + age, data = data_diabetes)
summary(tab1)
| | No (N=253) | Yes (N=144) | Total (N=397) | p value |
|---|---|---|---|---|
| gender | | | | < 0.001 |
| male | 128 (50.6%) | 40 (27.8%) | 168 (42.3%) | |
| female | 125 (49.4%) | 104 (72.2%) | 229 (57.7%) | |
| age | | | | 0.734 |
| Mean (SD) | 47.103 (16.745) | 46.521 (15.831) | 46.892 (16.402) | |
| Range | 19.000 - 91.000 | 20.000 - 92.000 | 19.000 - 92.000 | |
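To see where a p value like the one in the gender row can come from, the 2 x 2 table of counts can be tested directly. This is a sketch under the assumption that the categorical p value is based on Pearson's chi-squared test without continuity correction; that is my understanding of the arsenal defaults, so check the tableby.control() documentation before relying on it. The counts are typed in from the table above.

```r
# gender by obesity counts, copied from the arsenal table (filled column by column)
gender_by_obese <- matrix(c(128, 125, 40, 104), nrow = 2,
                          dimnames = list(gender = c("male", "female"),
                                          obese = c("No", "Yes")))

# column percentages, i.e. the share of males and females within each obesity group
col_pct <- round(100 * prop.table(gender_by_obese, margin = 2), 1)
print(col_pct)

# Pearson's chi-squared test of independence, no continuity correction (assumed)
res <- chisq.test(gender_by_obese, correct = FALSE)
print(res$p.value < 0.001)
```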
Another popular package is gtsummary, which calculates descriptive statistics for continuous, categorical and dichotomous variables and presents the results in a customizable, publication-ready summary table.
library(gtsummary)
data_diabetes %>%
  select(age, gender, obese) %>%
  tbl_summary(by = obese,
              statistic = list(all_continuous() ~ "{mean} ({sd})"))
| Characteristic | No, N = 253¹ | Yes, N = 144¹ |
|---|---|---|
| age | 47 (17) | 47 (16) |
| gender | | |
| male | 128 (51%) | 40 (28%) |
| female | 125 (49%) | 104 (72%) |

¹ Mean (SD); n (%)
Solution. Exercise 11.2
font.size <- 12
col.blue.light <- "#a6cee3"
col.blue.dark <- "#1f78b4"

# start from theme_bw() and then apply the custom settings;
# adding a complete theme such as theme_bw() last would overwrite them
my.ggtheme <- theme_bw() +
  theme(axis.title = element_text(size = font.size),
        axis.text = element_text(size = font.size),
        legend.text = element_text(size = font.size),
        legend.title = element_blank(),
        legend.position = "top",
        axis.title.y = element_text(angle = 0))

# histogram of BMI
plt_hist <- data_diabetes %>%
  ggplot(aes(x = BMI)) +
  geom_histogram() +
  my.ggtheme

# density plot of BMI
plt_density <- data_diabetes %>%
  ggplot(aes(x = BMI)) +
  geom_density() +
  my.ggtheme

# box plot of BMI stratified by gender
plt_boxplot <- data_diabetes %>%
  ggplot(aes(x = gender, y = BMI)) +
  geom_boxplot() +
  my.ggtheme

plt_hist
plt_density
plt_boxplot
In addition, we could for instance try a beeswarm plot and/or a histogram stratified by gender. We could also overlay box plots on a jitter plot, either for all BMI values together or separately for males and females. Sometimes it is also a good idea to plot summary statistics, e.g. a bar plot with bar heights at the group means and error bars representing the standard deviation, standard error or confidence intervals. See this post for inspiration if you'd like to try plotting summary statistics instead: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
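A few of these ideas can be sketched as follows. The data frame `toy` is simulated with arbitrary means and standard deviations so that the block runs on its own; it is not the diabetes data, and with the real data one would use data_diabetes in its place (after handling missing BMI values). Only ggplot2 functions are used here; geom_beeswarm() from ggbeeswarm could be swapped in for geom_jitter().

```r
library(ggplot2)

# simulated stand-in data (arbitrary parameters, for illustration only)
set.seed(1)
toy <- data.frame(
  gender = factor(rep(c("male", "female"), each = 50)),
  BMI = c(rnorm(50, mean = 27, sd = 5), rnorm(50, mean = 30, sd = 7))
)

# box plot with the individual observations overlaid as jittered points
plt_box_jitter <- ggplot(toy, aes(x = gender, y = BMI)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.5)

# histogram stratified by gender, one panel per group
plt_hist_gender <- ggplot(toy, aes(x = BMI)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ gender)

# group means and standard deviations for a summary plot
bmi_mean <- tapply(toy$BMI, toy$gender, mean)
bmi_sd <- tapply(toy$BMI, toy$gender, sd)
toy_summary <- data.frame(gender = names(bmi_mean),
                          mean = as.numeric(bmi_mean),
                          sd = as.numeric(bmi_sd))

# bars at the group means with error bars of plus/minus one standard deviation
plt_mean_sd <- ggplot(toy_summary, aes(x = gender, y = mean)) +
  geom_col(fill = "#a6cee3") +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2) +
  labs(y = "mean BMI")

plt_box_jitter
plt_hist_gender
plt_mean_sd
```

Showing the raw points alongside the box plot keeps the sample size and any outliers visible, which a bar plot of means alone would hide.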