Hands-on analysis of real data is the best way to learn R programming. This page lists some data sets that you can use to practice what you are learning in this course and to try out new things. Each data set comes with a brief description and download instructions. We have included some suggested analyses for each data set, but you can do whatever you want with the data. The only goal is to exercise your R muscles!
If your group prefers, you are welcome to use a different data set that is not included on this page. It’s up to you!
Try to focus on using the tools from the course to explore the data, rather than worrying about producing a perfect report.
On the last day you will share your Rmd file (or rather, the resulting HTML report) with the class so we can discuss what your data was about.
penguins <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/heplots/peng.csv", header = TRUE, sep = ",")
str(penguins)
## 'data.frame': 333 obs. of 9 variables:
## $ rownames : int 1 2 3 4 5 6 7 8 9 10 ...
## $ species : chr "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length: int 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : chr "m" "f" "f" "f" ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Species Differentiation: How do the mean and distribution of the three key body measurements (bill length, flipper length, and body mass) compare across the three penguin species? (Focus on boxplots and descriptive statistics.)
Bivariate Relationships: Is there a linear relationship (correlation) between flipper length and body mass? Does the strength or direction of this relationship appear to be different for male versus female penguins?
Grouping/Summary: What is the average body mass for each combination of species and island? Which species/island combination has the largest or smallest average mass?
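As a starting point for the grouping and correlation questions above, here is a minimal base R sketch. It runs on a tiny stand-in data frame, `toy_penguins`, whose values are made up for illustration; only the column names match the `penguins` data loaded above. Swap in `penguins` to work with the real data.

```r
# Toy stand-in for the penguins data frame.
# The values are made up; only the column names match the real data.
toy_penguins <- data.frame(
  species        = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  island         = c("Torgersen", "Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe"),
  sex            = c("m", "f", "f", "m", "f", "m"),
  flipper_length = c(190, 181, 186, 220, 210, 230),
  body_mass      = c(3800, 3400, 3500, 5500, 4700, 5900)
)

# Average body mass for each species/island combination
mass_by_group <- aggregate(body_mass ~ species + island, data = toy_penguins, FUN = mean)

# Sort so the heaviest combination comes first
mass_by_group[order(-mass_by_group$body_mass), ]

# Correlation between flipper length and body mass, separately for each sex
cor_by_sex <- sapply(
  split(toy_penguins, toy_penguins$sex),
  function(d) cor(d$flipper_length, d$body_mass)
)
cor_by_sex
```

The same two steps can also be written with `group_by()` and `summarise()` from dplyr if your group prefers that style.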
library(dplyr)
# this will download the csv file directly from the web
drinks <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/nesarc_drinkspd.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(seed = 2)
drinks <- sample_n(drinks, size = 3000, replace = FALSE)
# and here we check the structure of the data
str(drinks)
## 'data.frame': 3000 obs. of 9 variables:
## $ rownames : int 11014 36044 15657 11851 14800 4914 25399 9864 32033 23401 ...
## $ idnum : int 11014 36044 15657 11851 14800 4914 25399 9864 32033 23401 ...
## $ ethrace2a: int 5 1 1 1 1 2 1 5 1 1 ...
## $ region : int 3 1 2 1 2 3 3 2 4 1 ...
## $ age : int 56 55 42 84 18 42 72 39 23 61 ...
## $ sex : int 1 1 1 1 1 1 1 1 0 1 ...
## $ marital : int 1 4 4 3 6 1 3 6 1 5 ...
## $ educ : int 4 4 3 1 2 2 2 2 2 3 ...
## $ s2aq8b : int NA 1 3 NA NA NA NA NA 4 NA ...
Drinking Frequency: What is the distribution of the drinking measure in the data (s2aq8b) across major demographic categories like sex, age group, or race/ethnicity? (Focus on bar charts or grouped histograms.)
Comparison of Years: Has the proportion of heavy drinkers (or a specific drinking pattern) changed significantly between the 2001 and 2002 survey years (if a survey-year variable is available; none appears in the columns shown above)?
Socioeconomic Factors: Are specific socioeconomic indicators (e.g., education level in educ, or income if available) associated with a higher likelihood of being a drinker or a heavy drinker?
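The `s2aq8b` column in the output above contains many `NA` values, so any summary has to handle them explicitly. The sketch below shows one way to do that in base R. The data frame `toy_drinks` is a made-up stand-in that only reuses the column names; check the data set documentation for the meaning of the integer codes before interpreting real results.

```r
# Toy stand-in for the drinks data frame.
# The values are made up; only the column names match the real data.
toy_drinks <- data.frame(
  sex    = c(1, 1, 0, 0, 1, 0),
  educ   = c(4, 2, 2, 3, 1, 4),
  s2aq8b = c(NA, 1, 3, NA, 2, 4)
)

# Share of missing values in s2aq8b, overall and by sex
missing_overall <- mean(is.na(toy_drinks$s2aq8b))
missing_by_sex  <- tapply(is.na(toy_drinks$s2aq8b), toy_drinks$sex, mean)

# Mean of s2aq8b by sex, ignoring the missing values
mean_by_sex <- tapply(toy_drinks$s2aq8b, toy_drinks$sex, mean, na.rm = TRUE)

missing_overall
missing_by_sex
mean_by_sex
```

Replacing `sex` with `educ` in the `tapply()` calls gives the same summary by education code.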
library(dplyr)
# this will download the csv file directly from the web
crashes <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/nassCDS.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(seed = 2)
crashes <- sample_n(crashes, size = 3000, replace = FALSE)
# and here we check the structure of the data
str(crashes)
## 'data.frame': 3000 obs. of 16 variables:
## $ rownames : int 12117 13263 4806 11014 8465 21853 3276 3453 15657 17074 ...
## $ dvcat : chr "10-24" "25-39" "10-24" "10-24" ...
## $ weight : num 5363.9 29.5 107.4 194.2 98.9 ...
## $ dead : chr "alive" "alive" "alive" "alive" ...
## $ airbag : chr "none" "none" "none" "none" ...
## $ seatbelt : chr "belted" "belted" "none" "belted" ...
## $ frontal : int 0 1 0 0 1 1 1 1 0 0 ...
## $ sex : chr "f" "m" "f" "f" ...
## $ ageOFocc : int 20 38 20 59 40 19 30 43 39 31 ...
## $ yearacc : int 1999 2000 1998 1999 1999 2002 1997 1997 2000 2000 ...
## $ yearVeh : int 1984 1990 1991 1985 1996 2001 1989 1994 1990 1983 ...
## $ abcat : chr "unavail" "unavail" "unavail" "unavail" ...
## $ occRole : chr "driver" "driver" "pass" "driver" ...
## $ deploy : int 0 0 0 0 1 1 0 1 0 0 ...
## $ injSeverity: int 0 2 0 3 1 0 3 2 3 3 ...
## $ caseid : chr "75:85:2" "4:115:1" "9:1:2" "48:61:1" ...
Trend Analysis: How has the total number of sampled car accidents or fatalities (if severity is recorded) changed year-over-year from 1997 to 2002?
Causal Factors: What is the relationship between seatbelt use and the severity of injury (if available)? Does this relationship hold true across different driver age groups?
Temporal Patterns: Are there specific days of the week or times of day (if available) that have a disproportionately high number of accidents or fatalities?
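For the trend and seatbelt questions above, counts and cross-tabulations are a reasonable first step. The following sketch uses a made-up stand-in, `toy_crashes`, that reuses the `yearacc`, `seatbelt`, and `dead` column names from the output above. The real data also has a `weight` column, which this simple sketch ignores.

```r
# Toy stand-in for the crashes data frame.
# The values are made up; only the column names match the real data.
toy_crashes <- data.frame(
  yearacc  = c(1997, 1997, 1998, 1999, 1999, 1999),
  seatbelt = c("belted", "none", "belted", "belted", "none", "none"),
  dead     = c("alive", "dead", "alive", "alive", "alive", "dead")
)

# Number of sampled records per accident year
crashes_per_year <- table(toy_crashes$yearacc)
crashes_per_year

# Cross-tabulate seatbelt use against outcome
belt_by_outcome <- table(seatbelt = toy_crashes$seatbelt, dead = toy_crashes$dead)

# Row proportions: within each seatbelt group, the share alive vs. dead
row_shares <- prop.table(belt_by_outcome, margin = 1)
row_shares
```

To repeat the cross-tabulation within age groups, one option is to create a grouped age variable with `cut()` on `ageOFocc` and add it as a third argument to `table()`.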
library(dplyr)
# this will download the csv file directly from the web
gapminder <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/dslabs/gapminder.csv", header = TRUE, sep = ",")
# here we filter the data to remove anything before the year 2000
gapminder <- gapminder |> filter(year >= 2000)
# and here we check the structure of the data
str(gapminder)
## 'data.frame': 1520 obs. of 10 variables:
## $ rownames : int 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 ...
## $ country : chr "Albania" "Algeria" "Angola" "Antigua and Barbuda" ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ infant_mortality: num 23.2 33.9 128.3 13.8 18 ...
## $ life_expectancy : num 74.7 73.3 52.3 73.8 74.2 ...
## $ fertility : num 2.38 2.51 6.84 2.32 2.48 1.3 1.87 1.76 1.37 2.05 ...
## $ population : int 3121965 31183658 15058638 77648 37057453 3076098 90858 19107251 8050884 8117742 ...
## $ gdp : num 3.69e+09 5.48e+10 9.13e+09 8.03e+08 2.84e+11 ...
## $ continent : chr "Europe" "Africa" "Africa" "Americas" ...
## $ region : chr "Southern Europe" "Northern Africa" "Middle Africa" "Caribbean" ...
Wealth and Health Correlation: What is the nature and strength of the relationship between a country’s GDP per capita (gdp divided by population, since the data has no per-capita column) and its life expectancy? Visualize this relationship, perhaps grouping or coloring by continent.
Regional Change Over Time: Which continent has seen the most significant average increase in life expectancy between the years 2000 and 2016?
Outlier Identification: Identify the top 5 countries with the highest life expectancy and the top 5 countries with the lowest GDP per capita in the most recent year (2016).
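The data has `gdp` and `population` columns but no GDP per capita, so that variable has to be computed first. The sketch below does this and also computes the change in mean life expectancy per continent. It runs on a made-up stand-in, `toy_gap`, with country names "A", "B", and "C" invented for illustration; the missing `gdp` values are included on purpose to show how to handle them.

```r
# Toy stand-in for the gapminder data frame.
# The values are made up; only the column names match the real data.
toy_gap <- data.frame(
  country         = c("A", "B", "C", "A", "B", "C"),
  continent       = c("Europe", "Africa", "Africa", "Europe", "Africa", "Africa"),
  year            = c(2000, 2000, 2000, 2016, 2016, 2016),
  life_expectancy = c(75, 50, 55, 79, 60, 61),
  population      = c(1e6, 2e6, 4e6, 1e6, 3e6, 5e6),
  gdp             = c(2e10, 2e9, 8e9, NA, NA, NA)
)

# GDP per capita is not in the data, so compute it
toy_gap$gdp_per_capita <- toy_gap$gdp / toy_gap$population

# Correlation with life expectancy, using only rows where both values are present
wealth_health_cor <- cor(toy_gap$gdp_per_capita, toy_gap$life_expectancy,
                         use = "complete.obs")

# Mean life expectancy per continent and year (rows = continent, columns = year)
le <- tapply(toy_gap$life_expectancy, list(toy_gap$continent, toy_gap$year), mean)

# Change between the first and last year, per continent
le_change <- le[, "2016"] - le[, "2000"]
le_change
names(which.max(le_change))
```

In the real data, it is worth checking how many `gdp` values are missing in the year you pick before ranking countries by GDP per capita.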
library(dplyr)
# this will download the csv file directly from the web
stackoverflow <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/stackoverflow.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(2)
stackoverflow <- sample_n(stackoverflow, size = 3000)
# and here we check the structure of the data
str(stackoverflow)
## 'data.frame': 3000 obs. of 22 variables:
## $ rownames : int 3925 5071 4806 2822 4512 4488 273 5469 3276 3453 ...
## $ Country : chr "Germany" "United States" "United States" "United States" ...
## $ Salary : num 80645 135000 85000 127000 4405 ...
## $ YearsCodedJob : int 19 20 5 20 2 4 3 3 5 3 ...
## $ OpenSource : int 1 1 1 1 0 0 0 1 0 0 ...
## $ Hobby : int 1 1 1 1 0 1 1 1 0 1 ...
## $ CompanySizeNumber : int 10000 10000 5000 20 1 100 1000 20 1000 10000 ...
## $ Remote : chr "Not remote" "Not remote" "Not remote" "Remote" ...
## $ CareerSatisfaction : int 10 8 7 7 8 6 8 5 8 8 ...
## $ Data_scientist : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Database_administrator : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Desktop_applications_developer : int 0 0 0 0 0 0 0 0 1 1 ...
## $ Developer_with_stats_math_background: int 1 0 0 0 0 0 0 0 0 0 ...
## $ DevOps : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Embedded_developer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Graphic_designer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Graphics_programming : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Machine_learning_specialist : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Mobile_developer : int 1 0 0 0 1 0 0 0 0 0 ...
## $ Quality_assurance_engineer : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Systems_administrator : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Web_developer : int 1 1 0 0 1 1 1 1 1 0 ...
Salary Determinants: How does the distribution of salary vary based on the respondent’s years of professional coding experience and/or their highest level of formal education (if available)?
Language Popularity: Which primary programming languages (if listed) are most commonly used by the respondents? Do the salaries earned by developers who use these languages differ significantly?
Job Satisfaction: Is there a correlation between job satisfaction/happiness (if available) and company size or the amount of remote work (if available)?
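To compare salaries across experience levels, it can help to bin `YearsCodedJob` into groups first. The following sketch shows one way to do that with `cut()`, on a made-up stand-in data frame `toy_so`. The break points (5 and 10 years) are arbitrary choices for illustration, not something defined by the data set.

```r
# Toy stand-in for the stackoverflow data frame.
# The values are made up; only the column names match the real data.
toy_so <- data.frame(
  Salary             = c(50000, 60000, 80000, 90000, 120000, 100000),
  YearsCodedJob      = c(1, 3, 6, 8, 15, 12),
  Remote             = c("Not remote", "Remote", "Not remote",
                         "Not remote", "Remote", "Not remote"),
  CareerSatisfaction = c(6, 7, 8, 7, 9, 8)
)

# Bin years of experience into three groups (break points are arbitrary)
toy_so$experience <- cut(
  toy_so$YearsCodedJob,
  breaks = c(0, 5, 10, Inf),
  labels = c("0-5", "6-10", "11+"),
  include.lowest = TRUE
)

# Median salary per experience group
median_salary <- tapply(toy_so$Salary, toy_so$experience, median)
median_salary

# Mean career satisfaction for remote vs. non-remote respondents
satisfaction_by_remote <- tapply(toy_so$CareerSatisfaction, toy_so$Remote, mean)
satisfaction_by_remote
```

A boxplot such as `boxplot(Salary ~ experience, data = toy_so)` shows the full distribution per group rather than just the median.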
library(dplyr)
# this will download the csv file directly from the web
doctor <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(2)
doctor <- sample_n(doctor, size = 3000)
# and here we check the structure of the data
str(doctor)
## 'data.frame': 3000 obs. of 13 variables:
## $ rownames : int 3925 5071 4806 2822 4512 4488 273 3276 3453 690 ...
## $ visits : int 0 0 0 0 0 0 1 0 0 1 ...
## $ gender : chr "male" "female" "male" "male" ...
## $ age : num 0.27 0.67 0.32 0.19 0.22 0.22 0.22 0.37 0.67 0.62 ...
## $ income : num 0.15 0.25 0.9 0.65 0.35 0.75 0.75 0.25 0.35 0.25 ...
## $ illness : int 0 1 0 1 0 1 0 1 1 5 ...
## $ reduced : int 0 0 0 0 0 0 0 14 0 2 ...
## $ health : int 0 0 0 0 0 0 1 4 0 2 ...
## $ private : chr "no" "no" "yes" "yes" ...
## $ freepoor : chr "no" "no" "no" "no" ...
## $ freerepat: chr "no" "yes" "no" "no" ...
## $ nchronic : chr "yes" "yes" "no" "no" ...
## $ lchronic : chr "no" "no" "no" "no" ...
Frequency and Demographics: What is the average number of doctor visits and how does this average compare across different income levels or health insurance statuses (if available)?
Predictive Factors: What demographic and health-related variables (e.g., chronic conditions, age, sex) appear to be the strongest predictors of a higher frequency of doctor visits?
Time Comparison: Was there a noticeable change in the average number of doctor visits or the proportion of people who visited a doctor between the survey years 1977 and 1978 (if a survey-year variable is available; none appears in the columns shown above)?
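Because most respondents in the output above have zero visits, it can be useful to look at both the average number of visits and the share of people with at least one visit. Here is a minimal sketch on a made-up stand-in, `toy_doctor`, that reuses the `visits`, `private`, and `illness` column names.

```r
# Toy stand-in for the doctor data frame.
# The values are made up; only the column names match the real data.
toy_doctor <- data.frame(
  visits  = c(0, 1, 1, 2, 0, 3),
  private = c("no", "no", "yes", "yes", "no", "yes"),
  illness = c(0, 1, 1, 3, 0, 5)
)

# Average number of visits by private insurance status
mean_visits <- tapply(toy_doctor$visits, toy_doctor$private, mean)
mean_visits

# Share of respondents with at least one visit, by insurance status
share_any_visit <- tapply(toy_doctor$visits > 0, toy_doctor$private, mean)
share_any_visit

# Rank correlation between number of illnesses and number of visits
visits_illness_cor <- cor(toy_doctor$visits, toy_doctor$illness, method = "spearman")
visits_illness_cor
```

The `age` and `income` values in the output above are small decimals such as 0.27, so check the data set documentation for their units before grouping by them.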
library(dplyr)
library(lubridate)
# this will download the file to your working directory
download.file(url = "https://maven-datasets.s3.amazonaws.com/Video+Game+Sales/Video+Game+Sales.zip", destfile = "video_game_sales.zip")
# this will unzip the file and read it into R
videogames <- read.table(unz(description = "video_game_sales.zip", filename = "vgchartz-2024.csv"), header = TRUE, sep = ",", quote = "\"", fill = TRUE)
# this will select rows corresponding to years 2001 and 2002
videogames <- filter(videogames, year(as_date(release_date)) %in% c(2001,2002))
# and here we check the structure of the data
str(videogames)
## 'data.frame': 3201 obs. of 14 variables:
## $ img : chr "/games/boxart/827563ccc.jpg" "/games/boxart/3570928ccc.jpg" "/games/boxart/7583871ccc.jpg" "/games/boxart/9261584ccc.jpg" ...
## $ title : chr "Grand Theft Auto: Vice City" "Grand Theft Auto III" "Medal of Honor: Frontline" "Crash Bandicoot: The Wrath of Cortex" ...
## $ console : chr "PS2" "PS2" "PS2" "PS2" ...
## $ genre : chr "Action" "Action" "Shooter" "Platform" ...
## $ publisher : chr "Rockstar Games" "Rockstar Games" "Electronic Arts" "Universal Interactive" ...
## $ developer : chr "Rockstar North" "DMA Design" "EA Los Angeles" "Traveller's Tales" ...
## $ critic_score: num 9.6 9.5 9 6.9 8.3 8.2 9.1 NA 9.4 7.3 ...
## $ total_sales : num 16.15 13.1 6.83 5.42 4.67 ...
## $ na_sales : num 8.41 6.99 2.93 2.07 1.94 2.71 2.66 3 3.36 2.03 ...
## $ jp_sales : num 0.47 0.3 0.17 0.24 0.08 0.03 0.01 0.05 0.01 NA ...
## $ pal_sales : num 5.49 4.51 2.75 2.29 1.95 1.51 1.29 1.11 0.21 1.56 ...
## $ other_sales : num 1.78 1.3 0.99 0.82 0.7 0.23 0.46 0.07 0.56 0.17 ...
## $ release_date: chr "2002-10-28" "2001-10-23" "2002-05-28" "2001-10-29" ...
## $ last_update : chr "" "" "" "" ...
Genre/Platform Dominance: Which video game genre and which console platform had the highest global sales across 2001 and 2002? Was there a significant shift in the leading genre/platform from 2001 to 2002?
Performance Metrics: Is there a correlation between a game’s sales and its critic score (the only score in the data)? Visualize this relationship with a scatter plot.
Publisher Success: Which publishers released the most games in the 2001-2002 period, and which publisher generated the highest total global sales?
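The questions above need a release year, which has to be extracted from `release_date`, and the sales and score columns contain `NA` values. The sketch below handles both in base R, using a made-up stand-in data frame `toy_games` whose titles and publishers ("G1", "P1", and so on) are invented for illustration.

```r
# Toy stand-in for the videogames data frame.
# The values are made up; only the column names match the real data.
toy_games <- data.frame(
  title        = c("G1", "G2", "G3", "G4", "G5"),
  genre        = c("Action", "Action", "Shooter", "Platform", "Shooter"),
  publisher    = c("P1", "P1", "P2", "P3", "P2"),
  critic_score = c(9.6, 9.5, 9.0, NA, 7.0),
  total_sales  = c(16, 13, 7, NA, 2),
  release_date = c("2002-10-28", "2001-10-23", "2002-05-28",
                   "2001-10-29", "2001-03-01")
)

# Extract the release year from the date string
toy_games$release_year <- as.integer(format(as.Date(toy_games$release_date), "%Y"))

# Total sales per genre and year (rows with missing sales are dropped by default)
sales_by_genre_year <- aggregate(total_sales ~ genre + release_year,
                                 data = toy_games, FUN = sum)
sales_by_genre_year

# Number of games per publisher, most prolific first
games_per_publisher <- sort(table(toy_games$publisher), decreasing = TRUE)
games_per_publisher

# Correlation between critic score and sales, using complete pairs only
score_sales_cor <- cor(toy_games$critic_score, toy_games$total_sales,
                       use = "complete.obs")
score_sales_cor
```

Replacing `genre` with `console` in the `aggregate()` call gives the same summary per platform.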
library(dplyr)
# this will download the file to your working directory
download.file(url = "https://maven-datasets.s3.amazonaws.com/LEGO+Sets/LEGO+Sets.zip", destfile = "lego.csv.zip")
# this will unzip the file and read it into R
lego <- read.table(unz(description = "lego.csv.zip", filename = "lego_sets.csv"), header = TRUE, sep = ",", quote = "\"", fill = TRUE)
# this will select rows corresponding to years 2000-2009
lego <- filter(lego, year %in% 2000:2009)
# and here we check the structure of the data
str(lego)
## 'data.frame': 4304 obs. of 14 variables:
## $ set_id : chr "1086-1" "1177-1" "1196-1" "1197-1" ...
## $ name : chr "Bulk Bucket" "Santa's Truck" "Telekom Race Cyclist" "Telekom Race Cyclist and Television Motorbike" ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ theme : chr "Duplo" "Town" "Town" "Town" ...
## $ subtheme : chr "" "Special" "Telekom" "Telekom" ...
## $ themeGroup : chr "Pre-school" "Modern day" "Modern day" "Modern day" ...
## $ category : chr "Normal" "Normal" "Normal" "Normal" ...
## $ pieces : int 48 27 7 26 81 129 10 27 26 23 ...
## $ minifigs : int NA 1 1 3 3 8 2 NA NA 1 ...
## $ agerange_min : int NA NA NA NA NA NA NA NA NA NA ...
## $ US_retailPrice: num NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL : chr "https://brickset.com/sets/1086-1" "https://brickset.com/sets/1177-1" "https://brickset.com/sets/1196-1" "https://brickset.com/sets/1197-1" ...
## $ thumbnailURL : chr "https://images.brickset.com/sets/small/1086-1.jpg" "https://images.brickset.com/sets/small/1177-1.jpg" "https://images.brickset.com/sets/small/1196-1.jpg" "https://images.brickset.com/sets/small/1197-1.jpg" ...
## $ imageURL : chr "https://images.brickset.com/sets/images/1086-1.jpg" "https://images.brickset.com/sets/images/1177-1.jpg" "https://images.brickset.com/sets/images/1196-1.jpg" "https://images.brickset.com/sets/images/1197-1.jpg" ...
Product Trends: How has the average price and average number of pieces of LEGO sets changed over the decade, year-by-year, from 2000 to 2009?
Price-to-Piece Ratio: What is the average price-per-piece (a measure of value) for sets within the different LEGO Themes? Which themes represent the highest and lowest value?
Theme Popularity: Which LEGO Themes (e.g., Star Wars, City) were the most popular (released the highest number of unique sets) during this 10-year period?
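Price per piece is not a column in the data, so it has to be derived from `US_retailPrice` and `pieces`, both of which contain `NA` values in the output above. A minimal sketch, assuming a made-up stand-in data frame `toy_lego` that only reuses the real column names:

```r
# Toy stand-in for the lego data frame.
# The values are made up; only the column names match the real data.
toy_lego <- data.frame(
  year           = c(2000, 2000, 2001, 2001, 2002),
  theme          = c("Town", "Duplo", "Town", "Star Wars", "Star Wars"),
  pieces         = c(100, 50, 200, 300, NA),
  US_retailPrice = c(10, NA, 30, 30, 20)
)

# Price per piece; NA whenever the price or the piece count is missing
toy_lego$price_per_piece <- toy_lego$US_retailPrice / toy_lego$pieces

# Average price per piece by theme, ignoring missing values.
# A theme with no usable rows comes out as NaN.
ppp_by_theme <- tapply(toy_lego$price_per_piece, toy_lego$theme, mean, na.rm = TRUE)
ppp_by_theme

# Number of sets per theme, most sets first
sets_per_theme <- sort(table(toy_lego$theme), decreasing = TRUE)
sets_per_theme

# Average price and piece count per year
yearly <- aggregate(cbind(US_retailPrice, pieces) ~ year, data = toy_lego,
                    FUN = mean, na.action = na.pass, na.rm = TRUE)
yearly
```

In the real data, sets with zero pieces would produce infinite ratios, so it may be worth filtering those out before averaging.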