class: center, middle, inverse, title-slide

.title[
# Matrices, Data Frames, and Lists
]
.subtitle[
## R Foundations for Data Analysis
]
.author[
### Marcin Kierczak, Guilherme Dias
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

---
name: contents

# Contents of the lecture

- variables and their types
- operators
- vectors
- numbers as vectors
- strings as vectors
- **matrices**
- **data frames**
- **lists**
<!-- - **objects** -->
- repeating actions: iteration and recursion
- decision taking: control structures
- functions in general
- variable scope
- core functions

---
name: matrices

# Matrices

A **matrix** is a 2-dimensional data structure. Like vectors, it consists of elements of the same type. A matrix has *rows* and *columns*.

Say, we want to construct this matrix in R:

`$$\mathbf{X} = \left[\begin{array} {rrr} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{array}\right]$$`

``` r
X <- matrix(1:9,      # a sequence of numbers to fill in
            nrow=3,   # three rows (alt. ncol=3)
            byrow=T)  # populate matrix by row
X
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
```

---
name: matrices_dim

# Matrices — dimensions

To check the dimensions of a matrix, use `dim()`:

``` r
X
dim(X)  # 3 rows and 3 columns
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [1] 3 3
```

---
name: matrices_indexing

# Matrices — indexing

Elements of a matrix are retrieved using the `[]` notation.
We have to specify 2 dimensions -- the rows and the columns:

`$$\mathbf{X} = \left[\begin{array} {rrr} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{array}\right]$$`

``` r
X[1,2]  # Retrieve element from the 1st row, 2nd column
X[3,]   # Retrieve the entire 3rd row
X[,2]   # Retrieve the 2nd column
```

```
## [1] 2
## [1] 7 8 9
## [1] 2 5 8
```

---
name: matrices_indexing_2
exclude: true

# Matrices — indexing ctd.

`$$\mathbf{X} = \left[\begin{array} {rrr} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{array}\right]$$`

``` r
X[c(1,3),]        # Retrieve rows 1 and 3
X[c(1,3),c(3,1)]  # Retrieve rows 1 and 3, columns 3 and 1 (in that order)
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    7    8    9
##      [,1] [,2]
## [1,]    3    1
## [2,]    9    7
```

---
name: matrices_oper_1

# Matrices — operations

Usually the functions that work for a vector also work for matrices. To order a matrix with respect to, say, the 2nd column:

``` r
X <- matrix(sample(1:9,size = 9), nrow = 3)
ord <- order(X[,2])
X[ord,]
```

```
##      [,1] [,2] [,3]
## [1,]    8    3    7
## [2,]    1    4    2
## [3,]    6    9    5
```

---
name: matrices_t

# Matrices — transposition

To **transpose** a matrix use `t()`:

``` r
X
t(X)
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
##      [,1] [,2] [,3]
## [1,]    1    8    6
## [2,]    4    3    9
## [3,]    2    7    5
```

---
name: matrices_oper_2
exclude: true

# Matrices — operations 2

To get the diagonal of the matrix:

``` r
X
diag(X)  # get values on the diagonal
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
## [1] 1 3 5
```

---
name: matrices_tri
exclude: true

# Matrices — operations, triangles

To get the upper or the lower triangle use `upper.tri()` and `lower.tri()` respectively:

``` r
X                     # print X
upper.tri(X)          # which elements form the upper triangle
X[upper.tri(X)] <- 0  # set them to 0
X                     # print the new matrix
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
##       [,1]  [,2]  [,3]
## [1,] FALSE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE
## [3,] FALSE FALSE FALSE
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
```

---
name: matrices_multi

# Matrices — multiplication

Different types of matrix multiplication exist:

``` r
A <- matrix(1:4, nrow = 2, byrow=T)
B <- matrix(5:8, nrow = 2, byrow=T)
A * B      # Hadamard (element-wise) product
A %*% B    # Matrix multiplication
# A %x% B  # Kronecker product
# A %o% B  # Outer product (tensor product)
```

```
##      [,1] [,2]
## [1,]    5   12
## [2,]   21   32
##      [,1] [,2]
## [1,]   19   22
## [2,]   43   50
```

---
name: matrices_outer
exclude: true

# Matrices — outer

The outer product can be useful for generating names:

``` r
outer(letters[1:4], LETTERS[1:4], paste, sep="-")
```

```
##      [,1]  [,2]  [,3]  [,4]
## [1,] "a-A" "a-B" "a-C" "a-D"
## [2,] "b-A" "b-B" "b-C" "b-D"
## [3,] "c-A" "c-B" "c-C" "c-D"
## [4,] "d-A" "d-B" "d-C" "d-D"
```

---
name: matrices_expand_grid
exclude: true

# Expand grid

But `expand.grid()` is more convenient when you want to, e.g., generate combinations of variable values:

``` r
expand.grid(height = seq(120, 121),
            weight = c('1-50', '51+'),
            sex = c("Male","Female"))
```

```
##   height weight    sex
## 1    120   1-50   Male
## 2    121   1-50   Male
## 3    120    51+   Male
## 4    121    51+   Male
## 5    120   1-50 Female
## 6    121   1-50 Female
## 7    120    51+ Female
## 8    121    51+ Female
```

---
name: matrices_apply

# Matrices — apply

The function `apply()` applies a given function either to each value of a matrix or in a column/row-wise manner. Say, we want the mean of the values in each column:

``` r
X
apply(X, MARGIN=2, mean)  # MARGIN=1 would do it for rows
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
## [1] 5.000000 4.000000 1.666667
```

---
name: matrices_apply_2
exclude: true

# Matrices — apply ctd.
And now we will use `apply()` to calculate, for each element of a matrix, its squared deviation from the mean:

``` r
X
my.mean <- mean(X)
apply(X, MARGIN=c(1,2),
      function(x, my.mean) (x - my.mean)^2,
      my.mean)
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
##           [,1]      [,2]     [,3]
## [1,]  6.530864 12.641975 12.64198
## [2,] 19.753086  0.308642 12.64198
## [3,]  5.975309 29.641975  2.08642
```

---
name: matrices_colSums

# Matrices — useful fns.

While `apply()` is handy, it is a bit slow. For the most common statistics there are dedicated functions: `colSums()`, `rowSums()`, `colMeans()` and `rowMeans()`:

``` r
X
colMeans(X)
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
## [1] 5.000000 4.000000 1.666667
```

These functions are faster!

---
name: matrices_add_row_col

# Matrices — adding rows/columns

To add rows or columns to a matrix, or to make a matrix out of two or more vectors of equal length, use `cbind()` and `rbind()`:

``` r
x <- c(1,1,1)
y <- c(2,2,2)
cbind(x,y)
rbind(x,y)
```

```
##      x y
## [1,] 1 2
## [2,] 1 2
## [3,] 1 2
##   [,1] [,2] [,3]
## x    1    1    1
## y    2    2    2
```

---
name: matrices_arrays
exclude: true

# Matrices — more dimensions

``` r
dim(Titanic)
```

```
## [1] 4 2 2 2
```

--
exclude: true

``` r
library(vcd)
```

```
## Loading required package: grid
```

``` r
mosaic(Titanic, gp_labels=gpar(fontsize=7))
```

<img src="slide_r_elements_3_files/figure-html/matrix.Titanic.plot-1.png" width="720" style="display: block; margin: auto auto auto 0;" />

---
name: data_frames_1

# Data frames

- **Data frames** are also two-dimensional data structures.
- Different columns can have different data types!
- Technically, a data frame is just a list of vectors of equal length.

--

<br/><br/>
<center>
.size-70[ ![](images/data_frame.png) ]
</center>

---
name: data_frames_create

# Data frames — creating a data frame

``` r
df <- data.frame(c(1:5), LETTERS[1:5], c(T,F,F,T,T))
df
```

```
##   c.1.5. LETTERS.1.5. c.T..F..F..T..T.
## 1      1            A             TRUE
## 2      2            B            FALSE
## 3      3            C            FALSE
## 4      4            D             TRUE
## 5      5            E             TRUE
```

---
name: data_frames_columns

# Data frames — name your columns!

- Always try to give meaningful names to your columns

``` r
df <- data.frame(numbers=c(1:5),
                 letters=c('a','b','c','d','e'),
                 logical=c(T,F,F,T,T))
df
```

```
##   numbers letters logical
## 1       1       a    TRUE
## 2       2       b   FALSE
## 3       3       c   FALSE
## 4       4       d    TRUE
## 5       5       e    TRUE
```

---
name: data_frames_accessing

# Data frames — accessing values

- We can always use the `[row,column]` notation to access values inside data frames.

``` r
df[1,]              # get the first row
df[,2]              # the second column
df[2:3, 'letters']  # get rows 2-3 from the 'letters' column
```

```
##   numbers letters logical
## 1       1       a    TRUE
## [1] "a" "b" "c" "d" "e"
## [1] "b" "c"
```

---
name: data_frames_dollar

# Data frames — accessing values

- We can also use the dollar sign `$` to access columns

``` r
df$letters       # get the column named 'letters'
df$letters[2:3]  # get the second and third elements of the column named 'letters'
```

```
## [1] "a" "b" "c" "d" "e"
## [1] "b" "c"
```

---
name: data_frames_factors_1
exclude: true

# Data frames — factors

An interesting observation: since R 4.0, `data.frame()` keeps strings as characters by default, so converting the column with `as.character()` changes nothing:

``` r
df$letters
df$letters <- as.character(df$letters)
df$letters
```

```
## [1] "a" "b" "c" "d" "e"
## [1] "a" "b" "c" "d" "e"
```

---
name: data_frames_factors_2
exclude: true

# Data frames — factors ctd.

To convert characters to factors at data frame creation time, set the **stringsAsFactors** option to TRUE:

``` r
df <- data.frame(no=c(1:5),
                 letter=c("a","b","c","d","e"),
                 isBrown=sample(c(TRUE, FALSE), size = 5, replace=T),
                 stringsAsFactors = TRUE)
df$letter
```

```
## [1] a b c d e
## Levels: a b c d e
```

As you see, the `letter` column is now a factor. To treat characters as characters, set `stringsAsFactors = FALSE` (the default since R 4.0).
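---
name: data_frames_factors_sketch

# Data frames: strings or factors?

A minimal sketch of how `stringsAsFactors` changes the type of a character column. The option is passed explicitly in both calls, so the result should not depend on the R version; the names `df_chr` and `df_fct` are made up for this illustration.

```r
# Same input, two settings of stringsAsFactors
df_chr <- data.frame(letter = c("a", "b", "c"),
                     stringsAsFactors = FALSE)  # keep strings as characters
df_fct <- data.frame(letter = c("a", "b", "c"),
                     stringsAsFactors = TRUE)   # convert strings to factors

class(df_chr$letter)   # "character"
class(df_fct$letter)   # "factor"
levels(df_fct$letter)  # the distinct values: "a" "b" "c"

# A factor can be turned back into a character vector at any time
letter_back <- as.character(df_fct$letter)
identical(letter_back, df_chr$letter)  # TRUE
```

Keeping strings as characters is usually the safer choice for free text; a factor tends to make more sense for a categorical variable with a fixed set of values.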
---
name: data_frames_names

# Data frames — names

To get or change row/column names:

``` r
colnames(df)                           # get column names
colnames(df) <- c('num','let','logi')  # assign column names
colnames(df)
rownames(df)                           # get row names
rownames(df) <- letters[1:5]           # assign row names
rownames(df)
```

```
## [1] "no"      "letter"  "isBrown"
## [1] "num"  "let"  "logi"
## [1] "1" "2" "3" "4" "5"
## [1] "a" "b" "c" "d" "e"
```

---
name: data_frames_merging

# Data frames — merging

We can merge two data frames on a certain key using `merge()`:

``` r
age <- data.frame(ID=c(1:4), age=c(37,48,22,NA))
clinical <- data.frame(ID=c(1:4), status=c("sick","healthy","healthy","sick"))
patients <- merge(age, clinical, by='ID')
patients
```

```
##   ID age  status
## 1  1  37    sick
## 2  2  48 healthy
## 3  3  22 healthy
## 4  4  NA    sick
```

---
name: data_frames_summarizing

# Data frames — summarising

To get an overview of the data in each column, use `summary()`:

``` r
summary(patients)
```

```
##        ID            age           status
##  Min.   :1.00   Min.   :22.00   Length:4
##  1st Qu.:1.75   1st Qu.:29.50   Class :character
##  Median :2.50   Median :37.00   Mode  :character
##  Mean   :2.50   Mean   :35.67
##  3rd Qu.:3.25   3rd Qu.:42.50
##  Max.   :4.00   Max.   :48.00
##                 NA's   :1
```

---
name: data_frames_missing

# Data frames — missing data

We can use functions to deal with missing values:

``` r
is.na(patients)                          # check where the NAs are
na.omit(patients)                        # remove all rows containing NAs
patients[rowSums(is.na(patients)) > 0,]  # select rows containing NAs
```

```
##         ID   age status
## [1,] FALSE FALSE  FALSE
## [2,] FALSE FALSE  FALSE
## [3,] FALSE FALSE  FALSE
## [4,] FALSE  TRUE  FALSE
##   ID age  status
## 1  1  37    sick
## 2  2  48 healthy
## 3  3  22 healthy
##   ID age status
## 4  4  NA   sick
```

---
name: lists_1

# Lists — collections of various data types

A list is a collection of elements that, unlike the elements of a vector, can be of different types:

``` r
bedr <- data.frame(product = c("POANG", "MALM", "RENS"),
                   type = c("chair", "bed", "rug"),
                   price = c(1200, 2300, 899))
rest <- data.frame(dish = c("kottbullar", "daimtarta"),
                   price = c(89, 32))
park <- 162
ikea_uppsala <- list(bedroom = bedr, restaurant = rest, parking = park)
str(ikea_uppsala)  # str (structure) of an object
```

```
## List of 3
##  $ bedroom   :'data.frame': 3 obs. of  3 variables:
##   ..$ product: chr [1:3] "POANG" "MALM" "RENS"
##   ..$ type   : chr [1:3] "chair" "bed" "rug"
##   ..$ price  : num [1:3] 1200 2300 899
##  $ restaurant:'data.frame': 2 obs. of  2 variables:
##   ..$ dish : chr [1:2] "kottbullar" "daimtarta"
##   ..$ price: num [1:2] 89 32
##  $ parking   : num 162
```

---
name: lists_subsetting_double

# Subsetting lists

We can access elements of a list using the `[[]]` notation.

``` r
ikea_uppsala[[2]]
class(ikea_uppsala[[2]])
```

```
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
## [1] "data.frame"
```

---
name: lists_subsetting_single

# Subsetting lists — ctd.

What if we use `[]`? We get a list back!

``` r
ikea_uppsala[2]
class(ikea_uppsala[2])
```

```
## $restaurant
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
##
## [1] "list"
```

--

- A piece of a list is still a list! Use `[[]]` to pull out the actual data.
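---
name: lists_subsetting_sketch

# Subsetting lists: `[]` or `[[]]`?

A minimal sketch of the difference between the two notations. It uses a small made-up list `shop` (not part of the lecture data) so that it runs on its own:

```r
# Hypothetical list holding a data frame and a single number
shop <- list(menu = data.frame(dish = c("soup", "cake"),
                               price = c(45, 30)),
             parking = 10)

sub_list <- shop["menu"]     # single brackets: a list of length 1
element  <- shop[["menu"]]   # double brackets: the data frame itself

class(sub_list)              # "list"
class(element)               # "data.frame"

# Only the extracted element can be used directly as a data frame
mean(element$price)          # 37.5
# The sub-list has an element 'menu' but no element 'price'
is.null(sub_list$price)      # TRUE

# Single brackets are useful for keeping several elements at once
names(shop[c("menu", "parking")])  # "menu" "parking"
```

As a rule of thumb, `[]` seems the natural choice when the result should stay a list, and `[[]]` when the next step needs the element itself.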
---
name: lists_subsetting_names

# Subsetting lists — using names

If the elements of a list are named, we can also use the `$` notation:

``` r
ikea_uppsala$restaurant
ikea_uppsala$restaurant$price
```

```
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
## [1] 89 32
```

---
name: lists_nested

# Lists inside lists

We can use lists to store hierarchies of data:

``` r
ikea_lund <- list(parking = 125)
ikea_sweden <- list(ikea_lund = ikea_lund,
                    ikea_uppsala = ikea_uppsala)
# use names to navigate inside the hierarchy
ikea_sweden$ikea_lund$parking
ikea_sweden$ikea_uppsala$parking
```

```
## [1] 125
## [1] 162
```

---
name: objects_type_class
exclude: true

# Objects — type vs. class

An object of class **factor** is internally represented by numbers:

``` r
size <- factor('small')
class(size)   # Class 'factor'
mode(size)    # Is represented by 'numeric'
typeof(size)  # Of integer type
```

```
## [1] "factor"
## [1] "numeric"
## [1] "integer"
```

---
name: objects_str
exclude: true

# Objects — structure

Many functions return **objects**.
We can easily examine their **structure**:

``` r
his <- hist(1:5, plot=F)
str(his)
object.size(hist)  # Memory used by the function hist itself; use object.size(his) for the histogram object
```

```
## List of 6
##  $ breaks  : int [1:5] 1 2 3 4 5
##  $ counts  : int [1:4] 2 1 1 1
##  $ density : num [1:4] 0.4 0.2 0.2 0.2
##  $ mids    : num [1:4] 1.5 2.5 3.5 4.5
##  $ xname   : chr "1:5"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"
## 1240 bytes
```

<img src="slide_r_elements_3_files/figure-html/obj.str-1.png" width="216" style="display: block; margin: auto auto auto 0;" />

---
name: objects_fix
exclude: true

# Objects — fix

We can easily inspect and modify the values of an object's **attributes**:

``` r
attributes(his)
attr(his, "names")
#fix(his)  # Opens an object editor
```

```
## $names
## [1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
##
## $class
## [1] "histogram"
##
## [1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
```

---
name: objects_lists_as_S3
exclude: true

# Lists as S3 classes

A list whose `class` attribute has been set becomes an object of an S3 class:

``` r
my.list <- list(numbers = c(1:5),
                letters = letters[1:5])
class(my.list)
class(my.list) <- 'my.list.class'
class(my.list)  # Now the list is of S3 class
```

```
## [1] "list"
## [1] "my.list.class"
```

However, that was it. We cannot enforce that *numbers* will contain numeric values and that *letters* will contain only characters. S3 is a very primitive class system.

---
name: objects_S3
exclude: true

# S3 classes

For an S3 class we can define a method of a *generic function* (here `print`) that applies to all objects of this class.

``` r
print.my.list.class <- function(x) {
  cat('Numbers:', x$numbers, '\n')
  cat('Letters:', x$letters)
}
print(my.list)
```

```
## Numbers: 1 2 3 4 5
## Letters: a b c d e
```

But here, we have no error-proofing. If the object lacks *numbers*, the function will still be called:

``` r
class(his) <- 'my.list.class'  # alter class
print(his)                     # Gibberish but no error...
```

```
## Numbers:
## Letters:
```

---
name: objects_generics
exclude: true

# S3 classes — still useful?

Well, the S3 class mechanism is still in use, especially when writing **generic** functions, the most common examples being *print* and *plot*. For example, if you plot an object of class *manhattan*, you write *plot(gwas.result)* but the call that is actually dispatched is *plot.manhattan(gwas.result)*. This makes life easier as it requires less writing, but it is up to the function developers to make sure everything works!

---
name: objects_S4
exclude: true

# S4 class mechanism

S4 classes are more advanced, as you actually define the structure of the data within an object of your particular class:

``` r
setClass('gene',
         representation(name='character',
                        coords='numeric')
         )
my.gene <- new('gene', name='ANK3',
               coords=c(1.4e6, 1.412e6))
```

---
name: objects_S4_slots
exclude: true

# S4 class — slots

The variables within an S4 class are stored in so-called **slots**. In the above example, we have 2 such slots: *name* and *coords*. Here is how to access them:

``` r
my.gene@name       # access using @ operator
my.gene@coords[2]  # access the 2nd element in slot coords
```

```
## [1] "ANK3"
## [1] 1412000
```

---
name: objects_S4_methods
exclude: true

# S4 class — methods

The power of classes lies in the fact that they define both the data types in particular slots and the operations (functions) we can perform on them. Let us define a *generic print function* for an S4 class:

``` r
setMethod('print', 'gene',
          function(x) {
            cat('GENE: ', x@name, ' --> ')
            cat('[', x@coords, ']')
          })
print(my.gene)  # and we use the newly defined print
```

```
## GENE:  ANK3  --> [ 1400000 1412000 ]
```

<!-- --------------------- Do not edit this and below --------------------- -->

---
name: end_slide
class: end-slide, middle
count: false

# See you at the next lecture!
.end-text[ <p class="smaller"> <span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br> Created: 31-Oct-2024 • <a href="https://www.scilifelab.se/">SciLifeLab</a> • <a href="https://nbis.se/">NBIS</a> </p> ]