Datasets

Hands-on analysis of real data is the best way to learn R programming. This page lists data sets that you can use to practice what you are learning in this course and to try out new things. Each data set comes with download instructions and a look at its structure. We have included some analysis suggestions for each data set, but you can do whatever you want with the data. The only goal is to exercise your R muscles!

If your group prefers, you are welcome to use a different data set that is not listed on this page. It’s up to you!

Try to focus on using the tools from the course to explore the data, rather than worrying about producing a perfect report.

On the last day, you will share your Rmd file (or rather, the resulting HTML report) with the class so we can discuss what you found in your data.


Palmer penguins 🐧

Download instructions
# this will download the csv file directly from the web
penguins <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/heplots/peng.csv", header = TRUE, sep = ",")
str(penguins)
## 'data.frame':    333 obs. of  9 variables:
##  $ rownames      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ species       : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island        : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass     : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex           : chr  "m" "f" "f" "f" ...
##  $ year          : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Analysis suggestions
  • Species Differentiation: How do the mean and distribution of the three key body measurements (bill length, flipper length, and body mass) compare across the three penguin species? (Focus on boxplots and descriptive statistics).

  • Bivariate Relationships: Is there a linear relationship (correlation) between flipper length and body mass? Does the strength or direction of this relationship appear to be different for male versus female penguins?

  • Grouping/Summary: What is the average body mass for each combination of species and island? Which species/island combination has the largest or smallest average mass?
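
As a starting point for the grouping suggestion, here is a minimal sketch. It assumes `penguins` was loaded as above and has the `species`, `island` and `body_mass` columns shown by `str(penguins)`. The helper name `mean_mass_by_group` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: average body mass for each species/island combination.
# Assumes the columns `species`, `island` and `body_mass` exist in `data`.
mean_mass_by_group <- function(data) {
  data |>
    group_by(species, island) |>
    summarise(
      n = n(),                                        # penguins in the group
      mean_body_mass = mean(body_mass, na.rm = TRUE), # ignore missing values
      .groups = "drop"
    ) |>
    arrange(desc(mean_body_mass))                     # heaviest group first
}

# Usage, once `penguins` has been loaded as above:
# mean_mass_by_group(penguins)
```

The first and last rows of the result answer the "largest or smallest average mass" question. Keeping `n` in the table shows how many penguins each average is based on.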


Drinking habits 🍷

Download instructions
library(dplyr)
# this will download the csv file directly from the web
drinks <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/nesarc_drinkspd.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(seed = 2)
drinks <- sample_n(drinks, size = 3000, replace = FALSE)
# and here we check the structure of the data
str(drinks)
## 'data.frame':    3000 obs. of  9 variables:
##  $ rownames : int  11014 36044 15657 11851 14800 4914 25399 9864 32033 23401 ...
##  $ idnum    : int  11014 36044 15657 11851 14800 4914 25399 9864 32033 23401 ...
##  $ ethrace2a: int  5 1 1 1 1 2 1 5 1 1 ...
##  $ region   : int  3 1 2 1 2 3 3 2 4 1 ...
##  $ age      : int  56 55 42 84 18 42 72 39 23 61 ...
##  $ sex      : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ marital  : int  1 4 4 3 6 1 3 6 1 5 ...
##  $ educ     : int  4 4 3 1 2 2 2 2 2 3 ...
##  $ s2aq8b   : int  NA 1 3 NA NA NA NA NA 4 NA ...
Analysis suggestions
  • Drinking Frequency: What is the distribution of drinking frequency (or a similar intensity measure) across major demographic categories like sex, age group, or race/ethnicity? (Focus on bar charts or grouped histograms).

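  • Comparison of Years: Has the proportion of heavy drinkers (or a specific drinking pattern) changed significantly between the 2001 and 2002 survey years? Note that the columns shown above do not include a survey-year variable, so this question requires additional data.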

  • Socioeconomic Factors: Are specific socioeconomic indicators (e.g., education level, recorded in educ; income is not among the columns shown above) associated with a higher likelihood of being a drinker or a heavy drinker?
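
A minimal sketch for the first suggestion, assuming `drinks` was loaded as above. It treats `s2aq8b` as the drinking measure only because it is the one non-demographic column in the `str(drinks)` output; check the data set documentation for its exact meaning and coding before interpreting the numbers. The helper name `summarise_drinks_by_sex` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: summarise `s2aq8b` within each level of `sex`.
# Assumes the columns `sex` and `s2aq8b` exist in `data`.
summarise_drinks_by_sex <- function(data) {
  data |>
    group_by(sex) |>
    summarise(
      n_total = n(),
      n_missing = sum(is.na(s2aq8b)),                          # many NAs in the output above
      median_s2aq8b = median(as.numeric(s2aq8b), na.rm = TRUE),
      .groups = "drop"
    )
}

# Usage, once `drinks` has been loaded as above:
# summarise_drinks_by_sex(drinks)
```

Reporting `n_missing` next to the median makes it visible how much of each group the summary is based on.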


Car crashes 🚗

Download instructions
library(dplyr)
# this will download the csv file directly from the web
crashes <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/nassCDS.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(seed = 2)
crashes <- sample_n(crashes, size = 3000, replace = FALSE)
# and here we check the structure of the data
str(crashes)
## 'data.frame':    3000 obs. of  16 variables:
##  $ rownames   : int  12117 13263 4806 11014 8465 21853 3276 3453 15657 17074 ...
##  $ dvcat      : chr  "10-24" "25-39" "10-24" "10-24" ...
##  $ weight     : num  5363.9 29.5 107.4 194.2 98.9 ...
##  $ dead       : chr  "alive" "alive" "alive" "alive" ...
##  $ airbag     : chr  "none" "none" "none" "none" ...
##  $ seatbelt   : chr  "belted" "belted" "none" "belted" ...
##  $ frontal    : int  0 1 0 0 1 1 1 1 0 0 ...
##  $ sex        : chr  "f" "m" "f" "f" ...
##  $ ageOFocc   : int  20 38 20 59 40 19 30 43 39 31 ...
##  $ yearacc    : int  1999 2000 1998 1999 1999 2002 1997 1997 2000 2000 ...
##  $ yearVeh    : int  1984 1990 1991 1985 1996 2001 1989 1994 1990 1983 ...
##  $ abcat      : chr  "unavail" "unavail" "unavail" "unavail" ...
##  $ occRole    : chr  "driver" "driver" "pass" "driver" ...
##  $ deploy     : int  0 0 0 0 1 1 0 1 0 0 ...
##  $ injSeverity: int  0 2 0 3 1 0 3 2 3 3 ...
##  $ caseid     : chr  "75:85:2" "4:115:1" "9:1:2" "48:61:1" ...
Analysis suggestions
  • Trend Analysis: How has the number of sampled car accidents or fatalities (see the dead and injSeverity columns) changed year over year from 1997 to 2002?

  • Causal Factors: What is the relationship between seatbelt use (seatbelt) and the severity of injury (injSeverity)? Does this relationship hold true across different occupant age groups (ageOFocc)?

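  • Temporal Patterns: Are there specific days of the week or times of day that have a disproportionately high number of accidents or fatalities? Note that the columns shown above only record the accident year (yearacc), so this question requires additional data.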
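
A minimal sketch for the seatbelt suggestion, assuming `crashes` was loaded as above. It counts every row equally and ignores the `weight` column, so treat the shares as a description of the sample only. The helper name `fatality_share_by_seatbelt` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: share of occupants not recorded as "alive",
# split by seatbelt use. Assumes the columns `seatbelt` and `dead` exist.
fatality_share_by_seatbelt <- function(data) {
  data |>
    group_by(seatbelt) |>
    summarise(
      n = n(),
      share_not_alive = mean(dead != "alive"), # unweighted proportion
      .groups = "drop"
    )
}

# Usage, once `crashes` has been loaded as above:
# fatality_share_by_seatbelt(crashes)
```

Adding an age group to `group_by()` (for example one built from `ageOFocc` with `cut()`) would extend the same table to the age comparison.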


Gapminder health and wealth 📈

Download instructions
library(dplyr)
# this will download the csv file directly from the web
gapminder <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/dslabs/gapminder.csv", header = TRUE, sep = ",")
# here we filter the data to remove anything before the year 2000
gapminder <- gapminder |> filter(year >= 2000)
# and here we check the structure of the data
str(gapminder)
## 'data.frame':    1520 obs. of  10 variables:
##  $ rownames        : int  7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 ...
##  $ country         : chr  "Albania" "Algeria" "Angola" "Antigua and Barbuda" ...
##  $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ infant_mortality: num  23.2 33.9 128.3 13.8 18 ...
##  $ life_expectancy : num  74.7 73.3 52.3 73.8 74.2 ...
##  $ fertility       : num  2.38 2.51 6.84 2.32 2.48 1.3 1.87 1.76 1.37 2.05 ...
##  $ population      : int  3121965 31183658 15058638 77648 37057453 3076098 90858 19107251 8050884 8117742 ...
##  $ gdp             : num  3.69e+09 5.48e+10 9.13e+09 8.03e+08 2.84e+11 ...
##  $ continent       : chr  "Europe" "Africa" "Africa" "Americas" ...
##  $ region          : chr  "Southern Europe" "Northern Africa" "Middle Africa" "Caribbean" ...
Analysis suggestions
  • Wealth and Health Correlation: What is the nature and strength of the relationship between a country’s GDP per capita (gdp divided by population, since the data has no per-capita column) and its life expectancy? Visualize this relationship, perhaps grouping or coloring by continent.

  • Regional Change Over Time: Which continent has seen the most significant average increase in life expectancy between the years 2000 and 2016?

  • Outlier Identification: Identify the top 5 countries with the highest life expectancy and the top 5 countries with the lowest GDP per capita in the most recent year (2016). If gdp is missing for that year, use the most recent year that has values.
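
The data has `gdp` and `population` but no per-capita column, so GDP per capita has to be computed first. Here is a minimal sketch, assuming `gapminder` was loaded and filtered as above. The helper names `add_gdp_per_capita` and `lowest_gdp_per_capita` are made up for this example. The second helper uses the most recent year that actually has GDP values, which may or may not be 2016.

```r
library(dplyr)

# Hypothetical helper: add GDP per capita as a new column.
add_gdp_per_capita <- function(data) {
  mutate(data, gdp_per_capita = gdp / population)
}

# Hypothetical helper: the `n` countries with the lowest GDP per capita
# in the most recent year that has non-missing values.
lowest_gdp_per_capita <- function(data, n = 5) {
  with_gdp <- data |>
    add_gdp_per_capita() |>
    filter(!is.na(gdp_per_capita))   # drop rows where gdp or population is NA

  with_gdp |>
    filter(year == max(year)) |>     # latest year that still has data
    slice_min(gdp_per_capita, n = n, with_ties = FALSE) |>
    select(country, year, gdp_per_capita)
}

# Usage, once `gapminder` has been loaded as above:
# lowest_gdp_per_capita(gapminder)
```

Swapping `slice_min()` for `slice_max()` and `gdp_per_capita` for `life_expectancy` gives the other half of the outlier question.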


StackOverflow survey 🖥️

Download instructions
library(dplyr)
# this will download the csv file directly from the web
stackoverflow <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/stackoverflow.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(2)
stackoverflow <- sample_n(stackoverflow, size = 3000)
# and here we check the structure of the data
str(stackoverflow)
## 'data.frame':    3000 obs. of  22 variables:
##  $ rownames                            : int  3925 5071 4806 2822 4512 4488 273 5469 3276 3453 ...
##  $ Country                             : chr  "Germany" "United States" "United States" "United States" ...
##  $ Salary                              : num  80645 135000 85000 127000 4405 ...
##  $ YearsCodedJob                       : int  19 20 5 20 2 4 3 3 5 3 ...
##  $ OpenSource                          : int  1 1 1 1 0 0 0 1 0 0 ...
##  $ Hobby                               : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ CompanySizeNumber                   : int  10000 10000 5000 20 1 100 1000 20 1000 10000 ...
##  $ Remote                              : chr  "Not remote" "Not remote" "Not remote" "Remote" ...
##  $ CareerSatisfaction                  : int  10 8 7 7 8 6 8 5 8 8 ...
##  $ Data_scientist                      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Database_administrator              : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Desktop_applications_developer      : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ Developer_with_stats_math_background: int  1 0 0 0 0 0 0 0 0 0 ...
##  $ DevOps                              : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Embedded_developer                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Graphic_designer                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Graphics_programming                : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Machine_learning_specialist         : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Mobile_developer                    : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ Quality_assurance_engineer          : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Systems_administrator               : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Web_developer                       : int  1 1 0 0 1 1 1 1 1 0 ...
Analysis suggestions
  • Salary Determinants: How does the distribution of salary vary with the respondent’s years of professional coding experience (YearsCodedJob)? Formal education level is not among the columns shown above.

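  • Language Popularity: Which primary programming languages (if listed) are most commonly used by the respondents? Do the salaries earned by developers who use these languages differ significantly? The columns shown above record developer roles (Data_scientist through Web_developer) rather than languages, so the same questions can be asked about roles.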

  • Job Satisfaction: Is there a correlation between career satisfaction (CareerSatisfaction) and company size (CompanySizeNumber) or remote work (Remote)?
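
A minimal sketch comparing remote and non-remote respondents, assuming `stackoverflow` was loaded as above with the `Remote`, `Salary` and `CareerSatisfaction` columns shown by `str(stackoverflow)`. The helper name `summarise_by_remote` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: salary and satisfaction summaries by remote status.
# Assumes the columns `Remote`, `Salary` and `CareerSatisfaction` exist.
summarise_by_remote <- function(data) {
  data |>
    group_by(Remote) |>
    summarise(
      n = n(),
      median_salary = median(Salary, na.rm = TRUE),
      mean_satisfaction = mean(CareerSatisfaction, na.rm = TRUE),
      .groups = "drop"
    )
}

# Usage, once `stackoverflow` has been loaded as above:
# summarise_by_remote(stackoverflow)
```

The median is used for salary because salaries are often skewed; replacing `Remote` with one of the role columns gives the same comparison for a role.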


Doctor visits 🤒

Download instructions
library(dplyr)
# this will download the csv file directly from the web
doctor <- read.table("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv", header = TRUE, sep = ",")
# the lines below will take a sample from the full data set
set.seed(2)
doctor <- sample_n(doctor, size = 3000)
# and here we check the structure of the data
str(doctor)
## 'data.frame':    3000 obs. of  13 variables:
##  $ rownames : int  3925 5071 4806 2822 4512 4488 273 3276 3453 690 ...
##  $ visits   : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ gender   : chr  "male" "female" "male" "male" ...
##  $ age      : num  0.27 0.67 0.32 0.19 0.22 0.22 0.22 0.37 0.67 0.62 ...
##  $ income   : num  0.15 0.25 0.9 0.65 0.35 0.75 0.75 0.25 0.35 0.25 ...
##  $ illness  : int  0 1 0 1 0 1 0 1 1 5 ...
##  $ reduced  : int  0 0 0 0 0 0 0 14 0 2 ...
##  $ health   : int  0 0 0 0 0 0 1 4 0 2 ...
##  $ private  : chr  "no" "no" "yes" "yes" ...
##  $ freepoor : chr  "no" "no" "no" "no" ...
##  $ freerepat: chr  "no" "yes" "no" "no" ...
##  $ nchronic : chr  "yes" "yes" "no" "no" ...
##  $ lchronic : chr  "no" "no" "no" "no" ...
Analysis suggestions
  • Frequency and Demographics: What is the average number of doctor visits, and how does this average compare across different income levels or health insurance statuses (private, freepoor, freerepat)?

  • Predictive Factors: What demographic and health-related variables (e.g., chronic conditions, age, sex) appear to be the strongest predictors of a higher frequency of doctor visits?

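  • Time Comparison: Was there a noticeable change in the average number of doctor visits or the proportion of people who visited a doctor between the survey years 1977 and 1978? Note that the columns shown above do not include a survey-year variable, so this question requires additional data.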
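
A minimal sketch for the first suggestion, assuming `doctor` was loaded as above with the `visits` and `private` columns shown by `str(doctor)`. The helper name `visits_by_insurance` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: doctor visits by private insurance status.
# Assumes the columns `private` and `visits` exist in `data`.
visits_by_insurance <- function(data) {
  data |>
    group_by(private) |>
    summarise(
      n = n(),
      mean_visits = mean(visits, na.rm = TRUE),
      share_any_visit = mean(visits > 0, na.rm = TRUE), # at least one visit
      .groups = "drop"
    )
}

# Usage, once `doctor` has been loaded as above:
# visits_by_insurance(doctor)
```

Most values of `visits` in the output above are 0, so the share with at least one visit is reported alongside the mean.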


Video Game Sales 🎮

Download instructions
library(dplyr)
library(lubridate)
# this will download the file to your working directory
download.file(url = "https://maven-datasets.s3.amazonaws.com/Video+Game+Sales/Video+Game+Sales.zip", destfile = "video_game_sales.zip")
# this will unzip the file and read it into R
videogames <- read.table(unz(description = "video_game_sales.zip", filename = "vgchartz-2024.csv"), header = TRUE, sep = ",", quote = "\"", fill = TRUE)
# this will select rows corresponding to years 2001 and 2002
videogames <- filter(videogames, year(as_date(release_date)) %in% c(2001,2002))
# and here we check the structure of the data
str(videogames)
## 'data.frame':    3201 obs. of  14 variables:
##  $ img         : chr  "/games/boxart/827563ccc.jpg" "/games/boxart/3570928ccc.jpg" "/games/boxart/7583871ccc.jpg" "/games/boxart/9261584ccc.jpg" ...
##  $ title       : chr  "Grand Theft Auto: Vice City" "Grand Theft Auto III" "Medal of Honor: Frontline" "Crash Bandicoot: The Wrath of Cortex" ...
##  $ console     : chr  "PS2" "PS2" "PS2" "PS2" ...
##  $ genre       : chr  "Action" "Action" "Shooter" "Platform" ...
##  $ publisher   : chr  "Rockstar Games" "Rockstar Games" "Electronic Arts" "Universal Interactive" ...
##  $ developer   : chr  "Rockstar North" "DMA Design" "EA Los Angeles" "Traveller's Tales" ...
##  $ critic_score: num  9.6 9.5 9 6.9 8.3 8.2 9.1 NA 9.4 7.3 ...
##  $ total_sales : num  16.15 13.1 6.83 5.42 4.67 ...
##  $ na_sales    : num  8.41 6.99 2.93 2.07 1.94 2.71 2.66 3 3.36 2.03 ...
##  $ jp_sales    : num  0.47 0.3 0.17 0.24 0.08 0.03 0.01 0.05 0.01 NA ...
##  $ pal_sales   : num  5.49 4.51 2.75 2.29 1.95 1.51 1.29 1.11 0.21 1.56 ...
##  $ other_sales : num  1.78 1.3 0.99 0.82 0.7 0.23 0.46 0.07 0.56 0.17 ...
##  $ release_date: chr  "2002-10-28" "2001-10-23" "2002-05-28" "2001-10-29" ...
##  $ last_update : chr  "" "" "" "" ...
Analysis suggestions
  • Genre/Platform Dominance: Which video game genre and which console platform had the highest global sales across 2001 and 2002? Was there a significant shift in the leading genre/platform from 2001 to 2002?

  • Performance Metrics: Is there a correlation between a game’s sales (total_sales) and its critic score (critic_score)? Visualize this relationship with a scatter plot.

  • Publisher Success: Which publishers released the most games in the 2001-2002 period, and which publisher generated the highest total global sales?
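
A minimal sketch for the genre comparison, assuming `videogames` was loaded and filtered as above, so that `release_date` holds dates in year-month-day form. The helper name `sales_by_genre_year` is made up for this example.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: number of games and total sales per release year and genre.
# Assumes the columns `release_date`, `genre` and `total_sales` exist in `data`.
sales_by_genre_year <- function(data) {
  data |>
    mutate(release_year = year(as_date(release_date))) |>
    group_by(release_year, genre) |>
    summarise(
      n_games = n(),
      total_sales = sum(total_sales, na.rm = TRUE), # games without sales count as 0
      .groups = "drop"
    ) |>
    arrange(release_year, desc(total_sales))        # best-selling genre first per year
}

# Usage, once `videogames` has been loaded as above:
# sales_by_genre_year(videogames)
```

Replacing `genre` with `console` or `publisher` in `group_by()` answers the platform and publisher questions with the same pattern.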


LEGO Sets 🏗️

Download instructions
library(dplyr)
# this will download the file to your working directory
download.file(url = "https://maven-datasets.s3.amazonaws.com/LEGO+Sets/LEGO+Sets.zip", destfile = "lego.csv.zip")
# this will unzip the file and read it into R
lego <- read.table(unz(description = "lego.csv.zip", filename = "lego_sets.csv"), header = TRUE, sep = ",", quote = "\"", fill = TRUE)
# this will select rows corresponding to years 2000-2009
lego <- filter(lego, year %in% 2000:2009)
# and here we check the structure of the data
str(lego)
## 'data.frame':    4304 obs. of  14 variables:
##  $ set_id        : chr  "1086-1" "1177-1" "1196-1" "1197-1" ...
##  $ name          : chr  "Bulk Bucket" "Santa's Truck" "Telekom Race Cyclist" "Telekom Race Cyclist and Television Motorbike" ...
##  $ year          : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ theme         : chr  "Duplo" "Town" "Town" "Town" ...
##  $ subtheme      : chr  "" "Special" "Telekom" "Telekom" ...
##  $ themeGroup    : chr  "Pre-school" "Modern day" "Modern day" "Modern day" ...
##  $ category      : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ pieces        : int  48 27 7 26 81 129 10 27 26 23 ...
##  $ minifigs      : int  NA 1 1 3 3 8 2 NA NA 1 ...
##  $ agerange_min  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ US_retailPrice: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ bricksetURL   : chr  "https://brickset.com/sets/1086-1" "https://brickset.com/sets/1177-1" "https://brickset.com/sets/1196-1" "https://brickset.com/sets/1197-1" ...
##  $ thumbnailURL  : chr  "https://images.brickset.com/sets/small/1086-1.jpg" "https://images.brickset.com/sets/small/1177-1.jpg" "https://images.brickset.com/sets/small/1196-1.jpg" "https://images.brickset.com/sets/small/1197-1.jpg" ...
##  $ imageURL      : chr  "https://images.brickset.com/sets/images/1086-1.jpg" "https://images.brickset.com/sets/images/1177-1.jpg" "https://images.brickset.com/sets/images/1196-1.jpg" "https://images.brickset.com/sets/images/1197-1.jpg" ...
Analysis suggestions
  • Product Trends: How has the average price and average number of pieces of LEGO sets changed over the decade, year-by-year, from 2000 to 2009?

  • Price-to-Piece Ratio: What is the average price-per-piece (a measure of value) for sets within the different LEGO Themes? Which themes represent the highest and lowest value?

  • Theme Popularity: Which LEGO Themes (e.g., Star Wars, City) were the most popular (released the highest number of unique sets) during this 10-year period?
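
A minimal sketch for the price-per-piece suggestion, assuming `lego` was loaded and filtered as above. The output above shows missing values in `US_retailPrice`, so sets without a price or without a positive piece count are dropped first. The helper name `price_per_piece_by_theme` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: average price per piece for each theme.
# Assumes the columns `theme`, `pieces` and `US_retailPrice` exist in `data`.
price_per_piece_by_theme <- function(data) {
  data |>
    filter(!is.na(US_retailPrice), !is.na(pieces), pieces > 0) |>
    mutate(price_per_piece = US_retailPrice / pieces) |>
    group_by(theme) |>
    summarise(
      n_sets = n(),                               # sets that have both values
      mean_price_per_piece = mean(price_per_piece),
      .groups = "drop"
    ) |>
    arrange(mean_price_per_piece)                 # cheapest per piece first
}

# Usage, once `lego` has been loaded as above:
# price_per_piece_by_theme(lego)
```

Themes with very few priced sets can dominate either end of the ranking, so it may be worth filtering on `n_sets` before comparing themes.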


Shark attacks 🦈

Download instructions
library(dplyr)
# this will download the file to your working directory
download.file(url = "https://maven-datasets.s3.amazonaws.com/Shark+Attacks/attacks.csv.zip", destfile = "attacks.csv.zip")
# this will unzip the file and read it into R
sharks <- read.table(unz(description = "attacks.csv.zip", filename = "attacks.csv"), header = TRUE, sep = ",", quote = "\"", fill = TRUE)
# the lines below will take a sample from the full data set
set.seed(seed = 2)
sharks <- sample_n(sharks, size = 3000, replace = FALSE)
str(sharks)
## 'data.frame':    3000 obs. of  22 variables:
##  $ Case.Number           : chr  "" "" "1934.01.08.R" "" ...
##  $ Date                  : chr  "" "" "Reported 08-Feb-1934" "" ...
##  $ Year                  : int  NA NA 1934 NA NA NA 1969 1966 NA NA ...
##  $ Type                  : chr  "" "" "Boating" "" ...
##  $ Country               : chr  "" "" "TURKEY" "" ...
##  $ Area                  : chr  "" "" "Istanbul" "" ...
##  $ Location              : chr  "" "" "Haydarpasa jetty, Istanbul" "" ...
##  $ Activity              : chr  "" "" "Fishing" "" ...
##  $ Name                  : chr  "" "" "2 males" "" ...
##  $ Sex                   : chr  "" "" "M" "" ...
##  $ Age                   : chr  "" "" "" "" ...
##  $ Injury                : chr  "" "" "No injury" "" ...
##  $ Fatal..Y.N.           : chr  "" "" "N" "" ...
##  $ Time                  : chr  "" "" "" "" ...
##  $ Species               : chr  "" "" "" "" ...
##  $ Investigator.or.Source: chr  "" "" "C. Moore, GSAF" "" ...
##  $ pdf                   : chr  "" "" "1924.02.08.R-Turkey.pdf" "" ...
##  $ href.formula          : chr  "" "" "http://sharkattackfile.net/spreadsheets/pdf_directory/1924.02.08.R-Turkey.pdf" "" ...
##  $ href                  : chr  "" "" "http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/192"| __truncated__ "" ...
##  $ Case.Number.1         : chr  "" "" "1934.02.08.R" "" ...
##  $ Case.Number.2         : chr  "" "" "1934.02.08.R" "" ...
##  $ original.order        : int  NA NA 1290 NA NA NA 2820 2643 NA NA ...
Analysis suggestions
  • Geographical Hotspots: Which countries or regions have the highest recorded number of shark attacks? Visualize the top 10 locations.

  • Context and Activity: What are the most common activities being performed by victims at the time of the attack (e.g., swimming, surfing)? Do the outcomes/severity of the attack differ based on the victim’s activity?

  • Temporal Patterns: Is there a noticeable pattern in the number of attacks across the months or seasons (e.g., is there a “shark season”)?
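
The `str(sharks)` output above shows many rows in which every field is empty, so it is worth dropping those rows before counting anything. Here is a minimal sketch for the hotspots suggestion, assuming `sharks` was loaded as above with a `Country` column. The helper name `top_attack_countries` is made up for this example.

```r
library(dplyr)

# Hypothetical helper: the `n` countries with the most recorded attacks.
# Assumes the column `Country` exists; blank and missing values are dropped.
top_attack_countries <- function(data, n = 10) {
  data |>
    filter(!is.na(Country), Country != "") |>       # remove empty rows
    count(Country, sort = TRUE, name = "n_attacks") |>
    head(n)
}

# Usage, once `sharks` has been loaded as above:
# top_attack_countries(sharks)
```

The same filter-then-count pattern works for `Activity`. If you want a sample made up only of non-empty rows, one option is to apply the filter before `sample_n()`.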


Sample project report