A data set with more than one dimension is awkward to store as a vector. For two-dimensional data sets the solution is to use matrices or data frames instead. As with vectors, all values in a matrix must be of the same type: if you mix, for example, characters and numerics in the same matrix, the numerics are coerced to characters. Data frames do not require this homogeneity, so different columns can have different data types, but all columns in a data frame must have the same number of entries. In addition to these, R also has objects called lists that can store any type of data and are not restricted by type or dimension.
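A minimal sketch of the difference in type handling between a matrix and a data frame. The object names m and d are made up for this illustration.

```r
# Mixing numbers and characters in one matrix coerces everything to character.
m <- matrix(c(1, "a", 2, "b"), nrow = 2)
typeof(m)

# A data frame keeps a separate type for each column.
d <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
sapply(d, class)
```

Calling typeof() on the matrix reports a single type for all cells, while sapply() over the data frame reports one class per column.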
In this exercise you will learn how to:
The command to create a matrix in R is matrix(). As input it takes a vector of values, the number of rows and/or the number of columns.
X <- matrix(1:12, nrow = 4, ncol = 3)
X
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
Note that if you specify only the number of rows or only the number of columns, R will infer the other dimension from the length of the input vector. By default the matrix is filled column-wise, so the first values from the vector end up in column 1 of the matrix. If you instead want to fill the matrix row by row, set the byrow argument to TRUE.
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
X
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
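When only one of nrow and ncol is given, the other dimension follows from the length of the input vector. A short sketch, with Y and Z as made-up object names:

```r
# Only nrow is given: ncol is derived as length(1:12) / 4.
Y <- matrix(1:12, nrow = 4)
dim(Y)

# Only ncol is given: nrow is derived as length(1:12) / 2.
Z <- matrix(1:12, ncol = 2)
dim(Z)
```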
Subsetting a matrix is done the same way as for vectors, but you have two dimensions to specify. So you specify the rows and columns you want.
X[1,2]
## [1] 2
If you want all values in a column or a row, leave the other dimension empty. For example, this code will print all values in the second column.
X[,2]
## [1] 2 5 8 11
Note that if the retrieved part of a matrix can be represented as a vector (i.e. it has a single dimension), R will convert it to a vector. Otherwise it will still be a matrix.
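If you need the result to stay a matrix even when it has a single row or column, the subsetting operator accepts the argument drop = FALSE. A small sketch, where col2 and col2m are made-up names:

```r
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

# Default behaviour: a single column is simplified to a vector.
col2 <- X[, 2]
is.matrix(col2)

# With drop = FALSE the result remains a matrix with one column.
col2m <- X[, 2, drop = FALSE]
dim(col2m)
```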
Create a matrix containing the numbers 1 through 12 with 4 rows and 3 columns, similar to the matrix X shown above.
length(X)
## [1] 12
X[X>6]
## [1] 7 10 8 11 9 12
X[,c(3,2,1)]
## [,1] [,2] [,3]
## [1,] 3 2 1
## [2,] 6 5 4
## [3,] 9 8 7
## [4,] 12 11 10
You can use rbind to add rows to a matrix, or cbind to add columns. How would you add a vector with three zeros as a fifth row to the matrix?
X.2 <- rbind(X, rep(0, 3))
X.2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
## [5,] 0 0 0
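Adding a column works the same way with cbind, except that the new vector needs one value per row. A sketch assuming the row-wise matrix X from above; the name X.3 is made up here:

```r
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

# Append a column of four zeros as a fourth column.
X.3 <- cbind(X, rep(0, 4))
dim(X.3)
```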
Replace all values in the first two columns of the matrix with NA.
X[,1:2] <- NA
X
## [,1] [,2] [,3]
## [1,] NA NA 3
## [2,] NA NA 6
## [3,] NA NA 9
## [4,] NA NA 12
X[] <- 0
as.vector(X)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
R has a function outer() that generates matrices based on the combination of two data sets. Try to generate the same vector as before, but this time using outer(). This function is very powerful, but can be hard to wrap your head around, so try to follow the logic, perhaps by creating a simple example to start with.
letnum <- outer(paste("Geno",letters[1:19], sep = "_"), 1:3, paste, sep = "_")
class(letnum)
sort(as.vector(letnum))
## [1] "matrix" "array"
## [1] "Geno_a_1" "Geno_a_2" "Geno_a_3" "Geno_b_1" "Geno_b_2" "Geno_b_3"
## [7] "Geno_c_1" "Geno_c_2" "Geno_c_3" "Geno_d_1" "Geno_d_2" "Geno_d_3"
## [13] "Geno_e_1" "Geno_e_2" "Geno_e_3" "Geno_f_1" "Geno_f_2" "Geno_f_3"
## [19] "Geno_g_1" "Geno_g_2" "Geno_g_3" "Geno_h_1" "Geno_h_2" "Geno_h_3"
## [25] "Geno_i_1" "Geno_i_2" "Geno_i_3" "Geno_j_1" "Geno_j_2" "Geno_j_3"
## [31] "Geno_k_1" "Geno_k_2" "Geno_k_3" "Geno_l_1" "Geno_l_2" "Geno_l_3"
## [37] "Geno_m_1" "Geno_m_2" "Geno_m_3" "Geno_n_1" "Geno_n_2" "Geno_n_3"
## [43] "Geno_o_1" "Geno_o_2" "Geno_o_3" "Geno_p_1" "Geno_p_2" "Geno_p_3"
## [49] "Geno_q_1" "Geno_q_2" "Geno_q_3" "Geno_r_1" "Geno_r_2" "Geno_r_3"
## [55] "Geno_s_1" "Geno_s_2" "Geno_s_3"
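A simpler way to see what outer() does is to start with small inputs. In this sketch (tab and lab are made-up names) every element of the first vector is combined with every element of the second, first with the default function (multiplication) and then with paste.

```r
# Default FUN is "*": cell [i, j] holds i * j.
tab <- outer(1:3, 1:4)
tab[2, 3]

# With paste as the function, cell [i, j] holds the pasted pair.
lab <- outer(c("a", "b"), 1:2, paste, sep = "_")
lab
```

The result has one row per element of the first argument and one column per element of the second, which is why the larger example above returns a matrix.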
A. A * B
B. A / B
C. A + B
D. A - B
E. A == B
A <- matrix(1:4, ncol = 2, nrow = 2)
B <- matrix(5:8, ncol = 2, nrow = 2)
A
B
A * B
A / B
A %x% B
A + B
A - B
A == B
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
## [,1] [,2]
## [1,] 0.2000000 0.4285714
## [2,] 0.3333333 0.5000000
## [,1] [,2] [,3] [,4]
## [1,] 5 7 15 21
## [2,] 6 8 18 24
## [3,] 10 14 20 28
## [4,] 12 16 24 32
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
## [,1] [,2]
## [1,] -4 -4
## [2,] -4 -4
## [,1] [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
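Note that the arithmetic operators above all work element by element. For comparison, here is a brief sketch of the matrix product operator %*%, which multiplies rows by columns instead; the name P is made up for this example.

```r
A <- matrix(1:4, ncol = 2, nrow = 2)
B <- matrix(5:8, ncol = 2, nrow = 2)

# Element-wise product: cell [i, j] is A[i, j] * B[i, j].
A * B

# Matrix product: cell [i, j] is the sum of row i of A times column j of B.
P <- A %*% B
P
```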
e <- rnorm(n = 100)
E <- matrix(e, nrow = 10, ncol = 10)
colnames(E) <- LETTERS[1:10]
rownames(E) <- colnames(E)
E.means <- rowMeans(E)
E.medians <- apply(E, MARGIN = 1, median)
E.mm <- rbind(E.means, E.medians)
E.mm
## A B C D E F
## E.means 0.1434262 0.2091703 0.5589836 -0.04060708 -0.6131914 -0.08229978
## E.medians 0.2233098 0.4382664 0.3811558 0.10249377 -0.4915868 0.02414141
## G H I J
## E.means -0.6098872 0.1120556 -0.2061786 0.09129080
## E.medians -0.6181357 0.1695135 -0.4958519 -0.02914336
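The MARGIN argument of apply decides whether the function runs over rows (1) or columns (2). A small sketch; the seed and the names row.means and col.means are assumptions added for reproducibility and illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
E <- matrix(rnorm(100), nrow = 10, ncol = 10)

# MARGIN = 1 applies the function to each row, MARGIN = 2 to each column.
row.means <- apply(E, MARGIN = 1, mean)
col.means <- apply(E, MARGIN = 2, mean)

# For means, the dedicated functions give the same values.
all.equal(row.means, rowMeans(E))
all.equal(col.means, colMeans(E))
```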
Even though vectors are the basic data structures of R, data frames are very central as they are the most common way to import data into R (e.g. read.table() will create a data frame). A data frame consists of a set of equally long vectors. As data frames can contain several different data types, the command str() is very useful for getting an overview of a data frame.
vector1 <- 1:10
vector2 <- letters[1:10]
vector3 <- rnorm(10, sd = 10)
dfr <- data.frame(vector1, vector2, vector3)
str(dfr)
## 'data.frame': 10 obs. of 3 variables:
## $ vector1: int 1 2 3 4 5 6 7 8 9 10
## $ vector2: chr "a" "b" "c" "d" ...
## $ vector3: num -13.4257 -1.256 -6.4979 11.1638 -0.0387 ...
In the above example, we can see that the data frame dfr contains 10 observations of three variables that all have different classes: column 1 is an integer vector, column 2 a character vector, and column 3 a numeric vector.
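If you only want the class of each column rather than the full str() overview, one option is to apply class to every column with sapply. A sketch that rebuilds a data frame like dfr; the name col.classes is made up.

```r
dfr <- data.frame(vector1 = 1:10,
                  vector2 = letters[1:10],
                  vector3 = rnorm(10, sd = 10),
                  stringsAsFactors = FALSE)

# One class per column, returned as a named character vector.
col.classes <- sapply(dfr, class)
col.classes
```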
dim(dfr)
# or
ncol(dfr)
nrow(dfr)
## [1] 10 3
## [1] 3
## [1] 10
dfr[,2:3]
dfr[,c("vector2", "vector3")]
## vector2 vector3
## 1 a -13.42565884
## 2 b -1.25597985
## 3 c -6.49789570
## 4 d 11.16377816
## 5 e -0.03869153
## 6 f -6.60817816
## 7 g 1.37472456
## 8 h -6.92990598
## 9 i 4.66631109
## 10 j 20.20510919
## vector2 vector3
## 1 a -13.42565884
## 2 b -1.25597985
## 3 c -6.49789570
## 4 d 11.16377816
## 5 e -0.03869153
## 6 f -6.60817816
## 7 g 1.37472456
## 8 h -6.92990598
## 9 i 4.66631109
## 10 j 20.20510919
dfr[dfr$vector3>0,2]
dfr$vector2[dfr$vector3>0]
## [1] "d" "g" "i" "j"
## [1] "d" "g" "i" "j"
paste(dfr$vector1, dfr$vector2, dfr$vector3, sep = "_")
## [1] "1_a_-13.4256588384262" "2_b_-1.25597984776342"
## [3] "3_c_-6.4978956986819" "4_d_11.1637781560534"
## [5] "5_e_-0.0386915314866101" "6_f_-6.60817815789251"
## [7] "7_g_1.37472455682968" "8_h_-6.92990597543987"
## [9] "9_i_4.66631109202722" "10_j_20.2051091866208"
Inspect the built-in data set mtcars. How many rows and columns does it have?
dim(mtcars)
ncol(mtcars)
nrow(mtcars)
## [1] 32 11
## [1] 11
## [1] 32
car.names <- sample(row.names(mtcars))
random1 <- rnorm(length(car.names))
random2 <- rnorm(length(car.names))
mtcars2 <- data.frame(car.names, random1, random2)
mtcars2
## car.names random1 random2
## 1 Datsun 710 -0.65503687 0.40459965
## 2 AMC Javelin 1.43242935 0.65408022
## 3 Lincoln Continental -0.35590318 0.24293414
## 4 Merc 450SL 0.34840088 -1.29578549
## 5 Merc 280C 0.50193686 -1.78425498
## 6 Honda Civic 0.22541096 -0.15556577
## 7 Maserati Bora 0.47106078 -1.14714275
## 8 Mazda RX4 Wag -0.99375806 -0.55287633
## 9 Merc 280 -0.08950694 0.01759031
## 10 Valiant 0.49443472 1.67907465
## 11 Fiat 128 -0.49862921 1.01026506
## 12 Hornet Sportabout 1.80969189 1.49405582
## 13 Cadillac Fleetwood -0.21836316 -0.77846158
## 14 Merc 240D -0.71631390 1.22859733
## 15 Volvo 142E -0.94051005 1.44508490
## 16 Toyota Corolla -0.08698966 -0.31116887
## 17 Hornet 4 Drive 0.84100793 -0.03466209
## 18 Porsche 914-2 0.10822416 0.64162530
## 19 Mazda RX4 1.75357850 0.96582653
## 20 Camaro Z28 -0.84979956 0.54241824
## 21 Duster 360 1.75334057 -0.28640116
## 22 Lotus Europa -3.54636132 0.33586350
## 23 Toyota Corona 0.06196593 0.93037512
## 24 Ferrari Dino -1.22240344 -1.17627798
## 25 Pontiac Firebird -1.72047620 -0.90435780
## 26 Dodge Challenger 0.47009754 0.55711990
## 27 Fiat X1-9 -0.02832867 1.17904950
## 28 Merc 450SE 0.91586850 -0.20102645
## 29 Merc 230 -0.14118661 -0.13863472
## 30 Ford Pantera L -1.55288580 -0.41300498
## 31 Chrysler Imperial -1.99340531 0.81737316
## 32 Merc 450SLC 1.52472164 -1.16172991
mt.merged <- merge(mtcars, mtcars2, by.x = "row.names", by.y = "car.names")
mt.merged
## Row.names mpg cyl disp hp drat wt qsec vs am gear carb
## 1 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 2 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 3 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 5 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 6 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 9 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 10 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 11 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 12 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 13 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 14 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 16 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 17 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 18 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 19 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 20 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 21 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 22 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 23 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 24 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 25 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 26 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 27 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 28 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 29 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 30 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 31 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## random1 random2
## 1 1.43242935 0.65408022
## 2 -0.21836316 -0.77846158
## 3 -0.84979956 0.54241824
## 4 -1.99340531 0.81737316
## 5 -0.65503687 0.40459965
## 6 0.47009754 0.55711990
## 7 1.75334057 -0.28640116
## 8 -1.22240344 -1.17627798
## 9 -0.49862921 1.01026506
## 10 -0.02832867 1.17904950
## 11 -1.55288580 -0.41300498
## 12 0.22541096 -0.15556577
## 13 0.84100793 -0.03466209
## 14 1.80969189 1.49405582
## 15 -0.35590318 0.24293414
## 16 -3.54636132 0.33586350
## 17 0.47106078 -1.14714275
## 18 1.75357850 0.96582653
## 19 -0.99375806 -0.55287633
## 20 -0.14118661 -0.13863472
## 21 -0.71631390 1.22859733
## 22 -0.08950694 0.01759031
## 23 0.50193686 -1.78425498
## 24 0.91586850 -0.20102645
## 25 0.34840088 -1.29578549
## 26 1.52472164 -1.16172991
## 27 -1.72047620 -0.90435780
## 28 0.10822416 0.64162530
## 29 -0.08698966 -0.31116887
## 30 0.06196593 0.93037512
## 31 0.49443472 1.67907465
## 32 -0.94051005 1.44508490
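The same kind of merge is easier to follow on a tiny example. In this sketch the data frames left and right and their columns are invented for illustration; rows are matched on the row names of the first data frame and the id column of the second, regardless of their original order.

```r
left <- data.frame(value = c(10, 20, 30), row.names = c("b", "a", "c"))
right <- data.frame(id = c("c", "a", "b"), score = c(3, 1, 2),
                    stringsAsFactors = FALSE)

# Match row names of 'left' against the 'id' column of 'right'.
m <- merge(left, right, by.x = "row.names", by.y = "id")
m
```

The key ends up in a column called Row.names, and by default the result is sorted on that key.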
Calculate the mean of the two random columns using colMeans().
colMeans(mtcars2[, c("random1", "random2")])
## random1 random2
## -0.09055274 0.11889320
The last data structure that we will explore is the list, which is very flexible. Lists can combine elements of different types, and the elements do not have to be of equal dimensions. The elements of a list can be pretty much anything, including vectors, matrices, data frames, and even other lists. The drawback of such a flexible structure is that it requires a bit more work to interact with.
The syntax to create a list is similar to creation of the other data structures in R.
l <- list(1, 2, 3)
As with data frames, the str() command is very useful for inspecting lists, which can be fairly complex.
str(l)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
This example of a list containing only three single numbers is not very exciting, so let’s create a more complex example.
vec1 <- letters
vec2 <- 1:4
mat1 <- matrix(1:100, nrow = 5)
df1 <- as.data.frame(cbind(10:1, 91:100))
mylist <- list(vec1, vec2, mat1, df1, l)
As you can see a list can not only contain other data structures, but can also contain other lists.
The output of str() reveals much of the detail of a list:
str(mylist)
## List of 5
## $ : chr [1:26] "a" "b" "c" "d" ...
## $ : int [1:4] 1 2 3 4
## $ : int [1:5, 1:20] 1 2 3 4 5 6 7 8 9 10 ...
## $ :'data.frame': 10 obs. of 2 variables:
## ..$ V1: int [1:10] 10 9 8 7 6 5 4 3 2 1
## ..$ V2: int [1:10] 91 92 93 94 95 96 97 98 99 100
## $ :List of 3
## ..$ : num 1
## ..$ : num 2
## ..$ : num 3
With this more complex object, subsetting/selecting is slightly trickier than with the other more homogeneous objects we have looked at so far.
You can think of an R list as a pea pod.
The core of list subsetting is understanding the difference between the two operators, [] and [[]]. The single square bracket operator [] is like using a kitchen knife to cut a piece off the pod.
[] always returns a new, smaller pea pod (a new list) containing the elements you selected. The integrity of the container is preserved.
| R Code | Analogy | Result |
|---|---|---|
| my_list[c(1, 3)] | Cutting out the 1st and 3rd sections of the pod. | A new list containing only the 1st and 3rd elements. |
| my_list["B"] | Cutting out the section labeled “B”. | A new list containing just the element named “B”. |
The output of this operation is still a list, which means it retains the structure and associated attributes of the original list.
mylist[1]
str(mylist[1])
## [[1]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
##
## List of 1
## $ : chr [1:26] "a" "b" "c" "d" ...
The double square bracket operator [[]] (or the dollar sign $) is like opening the pod and taking the pea out with your hand. [[]] allows you to access the contents of a single element, returning the pea itself (the actual data stored in that element).
mylist[[1]]
str(mylist[[1]])
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
## chr [1:26] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" ...
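The difference between the two operators is easiest to see by checking the class of what comes back. A small sketch using a made-up named list called pod:

```r
pod <- list(A = 1:3, B = letters[1:2])

# Single brackets: still a list (a smaller pod).
class(pod["B"])

# Double brackets: the element itself (the pea).
class(pod[["B"]])

# For named elements, $ is shorthand for [[ ]].
identical(pod$B, pod[["B"]])
```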
This means that the syntax to extract a specific value from a data structure stored in a list can be daunting. Below we extract the second column of a data frame stored at position 4 in the list mylist.
mylist[[4]][,2]
## [1] 91 92 93 94 95 96 97 98 99 100
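When the stored object has names, the chained extraction can be written with $ instead of a positional index, which some find easier to read. A sketch with a shortened version of the list; by.position and by.name are made-up names, and V2 is the default column name that as.data.frame assigns here.

```r
df1 <- as.data.frame(cbind(10:1, 91:100))
mylist <- list(letters, 1:4, df1)

# First pull the data frame out of the list, then select its second column.
by.position <- mylist[[3]][, 2]
by.name <- mylist[[3]]$V2

identical(by.position, by.name)
```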
list.2 <- list(vec1 = c("hi", "ho", "merry", "christmas"),
vec2 = 4:19,
mat1 = matrix(as.character(100:81),nrow = 4))
list.2
## $vec1
## [1] "hi" "ho" "merry" "christmas"
##
## $vec2
## [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
##
## $mat1
## [,1] [,2] [,3] [,4] [,5]
## [1,] "100" "96" "92" "88" "84"
## [2,] "99" "95" "91" "87" "83"
## [3,] "98" "94" "90" "86" "82"
## [4,] "97" "93" "89" "85" "81"
dfr <- data.frame(letters, LETTERS, letters == LETTERS)
Add this data frame to the list created above.
list.2[[4]] <- dfr
list.2[-2]
## $vec1
## [1] "hi" "ho" "merry" "christmas"
##
## $mat1
## [,1] [,2] [,3] [,4] [,5]
## [1,] "100" "96" "92" "88" "84"
## [2,] "99" "95" "91" "87" "83"
## [3,] "98" "94" "90" "86" "82"
## [4,] "97" "93" "89" "85" "81"
##
## [[3]]
## letters LETTERS letters....LETTERS
## 1 a A FALSE
## 2 b B FALSE
## 3 c C FALSE
## 4 d D FALSE
## 5 e E FALSE
## 6 f F FALSE
## 7 g G FALSE
## 8 h H FALSE
## 9 i I FALSE
## 10 j J FALSE
## 11 k K FALSE
## 12 l L FALSE
## 13 m M FALSE
## 14 n N FALSE
## 15 o O FALSE
## 16 p P FALSE
## 17 q Q FALSE
## 18 r R FALSE
## 19 s S FALSE
## 20 t T FALSE
## 21 u U FALSE
## 22 v V FALSE
## 23 w W FALSE
## 24 x X FALSE
## 25 y Y FALSE
## 26 z Z FALSE
list.a <- list(1:10, letters[1:5], c(TRUE, FALSE, TRUE, FALSE))
For matrices you can run a function over every row or column with the apply function. You can do the same for lists using lapply.
length(list.a)
lapply(list.a, FUN = "length")
## [1] 3
## [[1]]
## [1] 10
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] 4
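Since lapply always returns a list, the result can be bulky when each element yields a single value. One alternative is sapply, which simplifies the result to a vector when possible. A brief sketch; the name lens is made up.

```r
list.a <- list(1:10, letters[1:5], c(TRUE, FALSE, TRUE, FALSE))

# Same computation as lapply(list.a, length), simplified to a vector.
lens <- sapply(list.a, length)
lens
```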