Matrices, Data Frames, and Lists

class: center, middle, inverse, title-slide

.title[
# Matrices, Data Frames, and Lists
]
.subtitle[
## R Foundations for Data Analysis
]
.author[
### Marcin Kierczak, Guilherme Dias
]

---

exclude: true
count: false

---
name: contents

# Contents of the lecture

- variables and their types
- operators
- vectors
- numbers as vectors
- strings as vectors
- **matrices**
- **data frames**
- **lists**

- repeating actions: iteration and recursion
- decision taking: control structures
- functions in general
- variable scope
- core functions

---
name: matrices

# Matrices

A **matrix** is a 2-dimensional data structure. Like vectors, it consists of elements of the same type. A matrix has *rows* and *columns*.

Say, we want to construct this matrix in R:
`$$\mathbf{X} = \left[\begin{array}
{rrr}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{array}\right]$$`

``` r
X <- matrix(1:9, # a sequence of numbers to fill in
 nrow=3, # three rows (alt. ncol=3)
 byrow=T) # populate matrix by row
X
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
```
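
For comparison, here is a minimal sketch of what happens without `byrow`: by default R fills a matrix column by column. The name `Y` is only illustrative.

``` r
# Default is byrow = FALSE: values fill the matrix column by column
Y <- matrix(1:9, nrow = 3)
Y[1, ]  # first row: 1 4 7
Y[, 1]  # first column: 1 2 3
```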

---
name: matrices_dim

# Matrices &mdash; dimensions

To check the dimensions of a matrix, use `dim()`:

``` r
X
dim(X) # 3 rows and 3 columns
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [1] 3 3
```

---
name: matrices_indexing

# Matrices &mdash; indexing

Elements of a matrix are retrieved using the `[]` notation.
We have to specify 2 dimensions -- the rows and the columns:

`$$\mathbf{X} = \left[\begin{array}
{rrr}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{array}\right]$$`

``` r
X[1,2] # Retrieve element from the 1st row, 2nd column
X[3,] # Retrieve the entire 3rd row
X[,2] # Retrieve the 2nd column
```

```
## [1] 2
## [1] 7 8 9
## [1] 2 5 8
```
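
Selecting a single row or column returns a plain vector. A small sketch, assuming we want to keep the matrix shape instead, uses the `drop = FALSE` argument of `[`. `X` is re-created here so the block runs on its own, and `row3` is an illustrative name.

``` r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
dim(X[3, ])                  # NULL: the result is a vector
row3 <- X[3, , drop = FALSE] # keep the result as a 1 x 3 matrix
dim(row3)                    # 1 3
```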

---
name: matrices_indexing_2
exclude: true

# Matrices &mdash; indexing cont.

`$$\mathbf{X} = \left[\begin{array}
{rrr}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{array}\right]$$`

``` r
X[c(1,3),] # Retrieve rows 1 and 3
X[c(1,3),c(3,1)]
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    7    8    9
##      [,1] [,2]
## [1,]    3    1
## [2,]    9    7
```

---
name: matrices_oper_1

# Matrices &mdash; operations

Functions that work on vectors usually also work on matrices.
To order the rows of a matrix by, say, the 2nd column:

``` r
X <- matrix(sample(1:9,size = 9), nrow = 3)
ord <- order(X[,2])
X[ord,]
```

```
##      [,1] [,2] [,3]
## [1,]    8    3    7
## [2,]    1    4    2
## [3,]    6    9    5
```
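
Because `sample()` is random, the matrix differs between runs. The sketch below repeats the idea with a fixed matrix and sorts in decreasing order instead; `M` and `M_desc` are illustrative names, and `decreasing = TRUE` is a standard argument of `order()`.

``` r
M <- matrix(c(1, 8, 6, 4, 3, 9, 2, 7, 5), nrow = 3) # filled by column
ord <- order(M[, 2], decreasing = TRUE) # row order by 2nd column, largest first
M_desc <- M[ord, ]
M_desc[, 2] # 9 4 3
```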

---
name: matrices_t

# Matrices &mdash; transposition

To **transpose** a matrix use `t()`:

``` r
X
t(X)
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
##      [,1] [,2] [,3]
## [1,]    1    8    6
## [2,]    4    3    9
## [3,]    2    7    5
```

---
name: matrices_oper_2
exclude: true

# Matrices &mdash; operations 2

To get the diagonal of the matrix:

``` r
X
diag(X) # get values on the diagonal
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
## [1] 1 3 5
```

---
name: matrices_tri
exclude: true

# Matrices &mdash; operations, triangles

To get the upper or the lower triangle use `upper.tri()` and `lower.tri()` respectively:

``` r
X # print X
upper.tri(X) # which elements form the upper triangle
X[upper.tri(X)] <- 0 # set them to 0
X # print the new matrix
```

```
##      [,1] [,2] [,3]
## [1,]    1    4    2
## [2,]    8    3    7
## [3,]    6    9    5
##       [,1]  [,2]  [,3]
## [1,] FALSE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE
## [3,] FALSE FALSE FALSE
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
```

---
name: matrices_multi

# Matrices &mdash; multiplication

Different types of matrix multiplication exist:

``` r
A <- matrix(1:4, nrow = 2, byrow=T)
B <- matrix(5:8, nrow = 2, byrow=T)
A * B # Hadamard product
A %*% B # Matrix multiplication
# A %x% B # Kronecker product
# A %o% B # Outer product (tensor product)
```

```
##      [,1] [,2]
## [1,]    5   12
## [2,]   21   32
##      [,1] [,2]
## [1,]   19   22
## [2,]   43   50
```
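
To see the difference between the two products, this sketch recomputes the top-left element of each result by hand, using the same `A` and `B` as above.

``` r
A <- matrix(1:4, nrow = 2, byrow = TRUE)
B <- matrix(5:8, nrow = 2, byrow = TRUE)
# Hadamard product: multiply elements at the same position
A[1, 1] * B[1, 1]    # 1 * 5 = 5
# Matrix product: row 1 of A times column 1 of B, summed
sum(A[1, ] * B[, 1]) # 1*5 + 2*7 = 19
(A %*% B)[1, 1]      # 19
```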

---
name: matrices_outer
exclude: true

# Matrices &mdash; outer

The `outer()` function can be useful for generating names:

``` r
outer(letters[1:4], LETTERS[1:4], paste, sep="-")
```

```
##      [,1]  [,2]  [,3]  [,4] 
## [1,] "a-A" "a-B" "a-C" "a-D"
## [2,] "b-A" "b-B" "b-C" "b-D"
## [3,] "c-A" "c-B" "c-C" "c-D"
## [4,] "d-A" "d-B" "d-C" "d-D"
```

---
name: matrices_expand_grid
exclude: true

# Expand grid

But `expand.grid()` is more convenient when you want to, e.g., generate all combinations of variable values:

``` r
expand.grid(height = seq(120, 121),
            weight = c('1-50', '51+'),
            sex = c("Male","Female"))
```

```
##   height weight    sex
## 1    120   1-50   Male
## 2    121   1-50   Male
## 3    120    51+   Male
## 4    121    51+   Male
## 5    120   1-50 Female
## 6    121   1-50 Female
## 7    120    51+ Female
## 8    121    51+ Female
```

---
name: matrices_apply

# Matrices &mdash; apply

The `apply()` function applies a given function to a matrix either element by element or in a row-/column-wise manner. Say we want the mean of the values in each column:

``` r
X
apply(X, MARGIN=2, mean) # MARGIN=1 would do it for rows
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
## [1] 5.000000 4.000000 1.666667
```
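
A minimal sketch of the row-wise variant, with `X` re-created so the block runs on its own (`row_means` is an illustrative name):

``` r
X <- rbind(c(1, 0, 0),
           c(8, 3, 0),
           c(6, 9, 5))
row_means <- apply(X, MARGIN = 1, mean) # one mean per row
row_means
apply(X, MARGIN = 2, max) # any function works, e.g. column maxima: 8 9 5
```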

---
name: matrices_apply_2
exclude: true

# Matrices &mdash; apply cont.

Now we will use `apply()` to calculate, for each element of a matrix, its squared deviation from the mean:

``` r
X
my.mean <- mean(X)
apply(X, MARGIN=c(1,2),
 function(x, my.mean) (x - my.mean)^2, my.mean)
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
##           [,1]      [,2]     [,3]
## [1,]  6.530864 12.641975 12.64198
## [2,] 19.753086  0.308642 12.64198
## [3,]  5.975309 29.641975  2.08642
```

---
name: matrices_colSums

# Matrices &mdash; useful functions

While `apply()` is handy, it is a bit slow. For the most common statistics there are dedicated functions: `colSums()`, `rowSums()`, `colMeans()` and `rowMeans()`:

``` r
X
colMeans(X)
```

```
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    8    3    0
## [3,]    6    9    5
## [1] 5.000000 4.000000 1.666667
```
These functions are faster!

---
name: matrices_add_row_col

# Matrices &mdash; adding rows/columns

Use `cbind()` and `rbind()` to add columns or rows to a matrix, or to build a matrix out of two or more vectors of equal length:

``` r
x <- c(1,1,1)
y <- c(2,2,2)
cbind(x,y)
rbind(x,y)
```

```
##      x y
## [1,] 1 2
## [2,] 1 2
## [3,] 1 2
##   [,1] [,2] [,3]
## x    1    1    1
## y    2    2    2
```
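
The same functions also extend an existing matrix. A short sketch with illustrative names (`M`, `M_wide`, `M_long`):

``` r
M <- matrix(1:6, nrow = 3)     # 3 rows, 2 columns
M_wide <- cbind(M, c(7, 8, 9)) # add a column: needs 3 values
M_long <- rbind(M, c(10, 20))  # add a row: needs 2 values
dim(M_wide) # 3 3
dim(M_long) # 4 2
```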

---
name: matrices_arrays
exclude: true

# Matrices &mdash; more dimensions

``` r
dim(Titanic)
```

```
## [1] 4 2 2 2
```

--
exclude: true

``` r
library(vcd)
```

```
## Loading required package: grid
```

``` r
mosaic(Titanic, gp_labels=gpar(fontsize=7))
```

---
name: data_frames_1

# Data frames

- **Data frames** are also two-dimensional data structures.
- Different columns can have different data types!
- Technically, a data frame is just a list of vectors of equal length.

--

<center>
.size-70[
![](images/data_frame.png)
]
</center>
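
The last point can be checked directly. A minimal sketch with an illustrative data frame `d` (column classes as in R 4.0.0 or later, where strings are not converted to factors):

``` r
d <- data.frame(id = 1:3, name = c("a", "b", "c"))
is.list(d)       # TRUE: a data frame is a list underneath
length(d)        # 2: one list element per column
sapply(d, class) # each column keeps its own type
```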

---
name: data_frames_create

# Data frames &mdash; creating a data frame

``` r
df <- data.frame(c(1:5),
 LETTERS[1:5],
 c(T,F,F,T,T))
df
```

```
##   c.1.5. LETTERS.1.5. c.T..F..F..T..T.
## 1      1            A             TRUE
## 2      2            B            FALSE
## 3      3            C            FALSE
## 4      4            D             TRUE
## 5      5            E             TRUE
```

---
name: data_frames_columns

# Data frames &mdash; name your columns!

- Always try to give meaningful names to your columns

``` r
df <- data.frame(numbers=c(1:5),
 letters=c('a','b','c','d','e'),
 logical=c(T,F,F,T,T))
df
```

```
##   numbers letters logical
## 1       1       a    TRUE
## 2       2       b   FALSE
## 3       3       c   FALSE
## 4       4       d    TRUE
## 5       5       e    TRUE
```

---
name: data_frames_accessing

# Data frames &mdash; accessing values

- We can always use the `[row,column]` notation to access values inside data frames.

``` r
df[1,]  # get the first row
df[,2]  # the second column
df[2:3, 'letters'] # get rows 2-3 from the 'letters' column
```

```
##   numbers letters logical
## 1       1       a    TRUE
## [1] "a" "b" "c" "d" "e"
## [1] "b" "c"
```
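
Rows can also be selected with a logical condition instead of row numbers. A small sketch, with `df` re-created so the block runs on its own:

``` r
df <- data.frame(numbers = 1:5,
                 letters = c("a", "b", "c", "d", "e"),
                 logical = c(TRUE, FALSE, FALSE, TRUE, TRUE))
df[df$numbers > 2, ]          # rows where 'numbers' exceeds 2
df[df$logical, "letters"]     # letters in rows where 'logical' is TRUE
df[, "letters", drop = FALSE] # keep a single column as a data frame
```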

---
name: data_frames_dollar

# Data frames &mdash; accessing values

- We can also use dollar sign `$` to access columns

``` r
df$letters # get the column named 'letters'
df$letters[2:3] # get the second and third elements of the column named 'letters'
```

```
## [1] "a" "b" "c" "d" "e"
## [1] "b" "c"
```
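
The `$` notation also works on the left-hand side of an assignment. A minimal sketch, where the column name `squared` is illustrative:

``` r
df <- data.frame(numbers = 1:5,
                 letters = c("a", "b", "c", "d", "e"))
df$squared <- df$numbers^2 # add a new column
df$numbers[1] <- 10L       # change a single value in an existing column
df$squared                 # 1 4 9 16 25 (computed before the change)
```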

---
name: data_frames_factors_1
exclude: true

# Data frames &mdash; factors

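Since R 4.0.0, `data.frame()` no longer converts strings to factors by default, so the explicit conversion below leaves the column unchanged:

``` r
df$letters
df$letters <- as.character(df$letters)
df$letters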
```

```
## [1] "a" "b" "c" "d" "e"
## [1] "a" "b" "c" "d" "e"
```
---
name: data_frames_factors_2
exclude: true

# Data frames &mdash; factors cont.

To convert character columns to factors at data frame creation time, set the **stringsAsFactors** option to `TRUE`:

``` r
df <- data.frame(no=c(1:5),
 letter=c("a","b","c","d","e"),
 isBrown=sample(c(TRUE, FALSE),
 size = 5,
 replace=T),
 stringsAsFactors = TRUE)
df$letter
```

```
## [1] a b c d e
## Levels: a b c d e
```
The `letter` column is now a factor with five levels. With `stringsAsFactors = FALSE` (the default since R 4.0.0) it would stay a character vector.

---
name: data_frames_names

# Data frames &mdash; names

To get or change row/column names:

``` r
colnames(df) # get column names
colnames(df) <- c('num','let','logi') # assign column names
colnames(df)
rownames(df) # get row names
rownames(df) <- letters[1:5] # assign row names
rownames(df)
```

```
## [1] "no"      "letter"  "isBrown"
## [1] "num"  "let"  "logi"
## [1] "1" "2" "3" "4" "5"
## [1] "a" "b" "c" "d" "e"
```

---
name: data_frames_merging

# Data frames &mdash; merging

We can merge two data frames on a certain key using `merge()`:

``` r
age <- data.frame(ID=c(1:4),
 age=c(37,48,22,NA))
clinical <- data.frame(ID=c(1:4),
 status=c("sick","healthy","healthy","sick"))
patients <- merge(age, clinical, by='ID')
patients
```

```
##   ID age  status
## 1  1  37    sick
## 2  2  48 healthy
## 3  3  22 healthy
## 4  4  NA    sick
```
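
By default `merge()` keeps only the keys present in both data frames. The sketch below uses a hypothetical `clinical2` table with a non-matching ID to show how the `all` argument changes this:

``` r
age <- data.frame(ID = 1:4, age = c(37, 48, 22, NA))
clinical2 <- data.frame(ID = c(1, 2, 5),
                        status = c("sick", "healthy", "sick"))
inner <- merge(age, clinical2, by = "ID")            # only IDs 1 and 2
full <- merge(age, clinical2, by = "ID", all = TRUE) # IDs 1 to 5, gaps filled with NA
nrow(inner) # 2
nrow(full)  # 5
```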

---
name: data_frames_summarizing

# Data frames &mdash; summarising

To get an overview of the data in each column, use `summary()`:

``` r
summary(patients)
```

```
##        ID            age           status         
##  Min.   :1.00   Min.   :22.00   Length:4          
##  1st Qu.:1.75   1st Qu.:29.50   Class :character  
##  Median :2.50   Median :37.00   Mode  :character  
##  Mean   :2.50   Mean   :35.67                     
##  3rd Qu.:3.25   3rd Qu.:42.50                     
##  Max.   :4.00   Max.   :48.00                     
##                 NA's   :1
```

---
name: data_frames_missing

# Data frames &mdash; missing data

Several functions help to find and handle missing values (`NA`):

``` r
is.na(patients) # check where the NAs are
na.omit(patients) # remove all rows containing NAs
patients[rowSums(is.na(patients)) > 0,] # select rows containing NAs
```

```
##         ID   age status
## [1,] FALSE FALSE  FALSE
## [2,] FALSE FALSE  FALSE
## [3,] FALSE FALSE  FALSE
## [4,] FALSE  TRUE  FALSE
##   ID age  status
## 1  1  37    sick
## 2  2  48 healthy
## 3  3  22 healthy
##   ID age status
## 4  4  NA   sick
```
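
Missing values also propagate through calculations. A minimal sketch, with `patients` re-created so the block runs on its own, of the `na.rm` argument and `complete.cases()`:

``` r
patients <- data.frame(ID = 1:4,
                       age = c(37, 48, 22, NA),
                       status = c("sick", "healthy", "healthy", "sick"))
mean(patients$age)               # NA: one missing value makes the mean unknown
mean(patients$age, na.rm = TRUE) # mean of the three known ages
complete.cases(patients)         # TRUE TRUE TRUE FALSE
colSums(is.na(patients))         # number of NAs per column
```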

---
name: lists_1

# Lists &mdash; collections of various data types

A list is an ordered collection of elements, which may differ in type and size:

``` r
bedr <- data.frame(product = c("POANG", "MALM", "RENS"),
 type = c("chair", "bed", "rug"),
 price = c(1200, 2300, 899))
rest <- data.frame(dish = c("kottbullar", "daimtarta"),
 price = c(89, 32))
park <- 162

ikea_uppsala <- list(bedroom = bedr, 
 restaurant = rest, 
 parking = park)
str(ikea_uppsala) # str (structure) of an object
```

```
## List of 3
##  $ bedroom   :'data.frame':	3 obs. of  3 variables:
##   ..$ product: chr [1:3] "POANG" "MALM" "RENS"
##   ..$ type   : chr [1:3] "chair" "bed" "rug"
##   ..$ price  : num [1:3] 1200 2300 899
##  $ restaurant:'data.frame':	2 obs. of  2 variables:
##   ..$ dish : chr [1:2] "kottbullar" "daimtarta"
##   ..$ price: num [1:2] 89 32
##  $ parking   : num 162
```

---
name: lists_subsetting_double

# Subsetting lists

We can access elements of a list using the `[[]]` notation.

``` r
ikea_uppsala[[2]]
class(ikea_uppsala[[2]])
```

```
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
## [1] "data.frame"
```

---
name: lists_subsetting_single

# Subsetting lists &mdash; cont.

What if we use `[]`? We get a list back!

``` r
ikea_uppsala[2]
class(ikea_uppsala[2])
```

```
## $restaurant
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
## 
## [1] "list"
```

- A piece of a list is still a list! Use `[[]]` to pull out the actual data.
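
Why the distinction matters, as a small sketch with an illustrative list `shop`:

``` r
shop <- list(parking = 162, dishes = c("kottbullar", "daimtarta"))
shop[["parking"]] + 1        # works: [[ ]] returns the number itself
class(shop["parking"])       # "list": [ ] returns a smaller list
shop[c("parking", "dishes")] # [ ] can select several elements at once
# shop["parking"] + 1        # would fail: arithmetic is not defined for lists
```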

---
name: lists_subsetting_names

# Subsetting lists &mdash; using names

If the elements of a list are named, we can also use the `$` notation:

``` r
ikea_uppsala$restaurant
ikea_uppsala$restaurant$price
```

```
##         dish price
## 1 kottbullar    89
## 2  daimtarta    32
## [1] 89 32
```
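
The `$` notation can also add or remove list elements. A minimal sketch with an illustrative list `shop`:

``` r
shop <- list(parking = 162)
shop$city <- "Uppsala" # add a new named element
names(shop)            # "parking" "city"
shop$parking <- NULL   # assigning NULL removes an element
names(shop)            # "city"
```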

---
name: lists_nested

# Lists inside lists

We can use lists to store hierarchies of data:

``` r
ikea_lund <- list(parking = 125)
ikea_sweden <- list(ikea_lund = ikea_lund, 
 ikea_uppsala = ikea_uppsala)
# use names to navigate inside the hierarchy
ikea_sweden$ikea_lund$parking
ikea_sweden$ikea_uppsala$parking
```

```
## [1] 125
## [1] 162
```
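
To collect the same piece of information from every branch of the hierarchy, one option is `sapply()`. A sketch with a simplified version of the list (only the `parking` elements are kept; `parking` and `store` are illustrative names):

``` r
ikea_sweden <- list(ikea_lund = list(parking = 125),
                    ikea_uppsala = list(parking = 162))
ikea_sweden[["ikea_lund"]][["parking"]] # same as $ikea_lund$parking
# collect the 'parking' value from every store
parking <- sapply(ikea_sweden, function(store) store$parking)
parking
sum(parking) # 125 + 162 = 287
```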

---
name: objects_type_class
exclude: true

# Objects &mdash; type vs. class

An object of class **factor** is internally represented by numbers:

``` r
size <- factor('small')
class(size) # Class 'factor'
mode(size) # Is represented by 'numeric'
typeof(size) # Of integer type
```

```
## [1] "factor"
## [1] "numeric"
## [1] "integer"
```

---
name: objects_str
exclude: true

# Objects &mdash; structure

Many functions return **objects**. We can easily examine their **structure**:

``` r
his <- hist(1:5, plot=F)
str(his)
object.size(his) # How much memory the object consumes
```

```
## List of 6
##  $ breaks  : int [1:5] 1 2 3 4 5
##  $ counts  : int [1:4] 2 1 1 1
##  $ density : num [1:4] 0.4 0.2 0.2 0.2
##  $ mids    : num [1:4] 1.5 2.5 3.5 4.5
##  $ xname   : chr "1:5"
##  $ equidist: logi TRUE
##  - attr(*, "class")= chr "histogram"
## 1240 bytes
```

---
name: objects_fix
exclude: true

# Objects &mdash; fix

We can easily inspect an object's **attributes**; `fix()` opens an editor to modify the object:

``` r
attributes(his)
attr(his, "names")
#fix(his) # Opens an object editor
```

```
## $names
## [1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
## 
## $class
## [1] "histogram"
## 
## [1] "breaks"   "counts"   "density"  "mids"     "xname"    "equidist"
```

---
name: objects_lists_as_S3
exclude: true

# Lists as S3 classes

A list whose `class` attribute has been set becomes an object of an S3 class:

``` r
my.list <- list(numbers = c(1:5),
 letters = letters[1:5])
class(my.list)
class(my.list) <- 'my.list.class'
class(my.list) # Now the list is of S3 class
```

```
## [1] "list"
## [1] "my.list.class"
```

However, that was it. We cannot enforce that *numbers* will contain numeric values and that *letters* will contain only characters. S3 is a very simple class system.

---
name: objects_S3
exclude: true

# S3 classes

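For an S3 class we can define a *method* of a generic function (here `print`) that is applied to all objects of this class.

``` r
print.my.list.class <- function(x, ...) {
 cat('Numbers:', x$numbers, '\n')
 cat('Letters:', x$letters)
 invisible(x) # print methods conventionally return their argument invisibly
}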
print(my.list)
```

```
## Numbers: 1 2 3 4 5 
## Letters: a b c d e
```

But here we have no error-proofing. If the object lacks *numbers*, the function will still be called:

``` r
class(his) <- 'my.list.class' # alter class
print(his) # Gibberish but no error...
```

```
## Numbers: 
## Letters:
```

---
name: objects_generics
exclude: true

# S3 classes &mdash; still useful?

The S3 class mechanism is still in use, especially when writing methods for **generic** functions, the most common examples being *print* and *plot*. For example, to plot an object of class *manhattan*, you write *plot(gwas.result)* and R dispatches the call to *plot.manhattan(gwas.result)*. This makes life easier as it requires less writing, but it is up to the function developers to make sure everything works!

---
name: objects_S4
exclude: true

# S4 class mechanism

S4 classes are more advanced as you actually define the structure of the data within the object of your particular class:

``` r
setClass('gene',
 representation(name='character',
 coords='numeric')
 )
my.gene <- new('gene', name='ANK3',
 coords=c(1.4e6, 1.412e6))
```
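
A minimal sketch of what this buys us: `new()` refuses a value of the wrong type for a slot. The class name `gene_demo` and the variables `ok` and `res` are illustrative, and `tryCatch()` is used only so the block runs without stopping.

``` r
library(methods) # usually attached already; loaded explicitly for scripts
setClass("gene_demo",
         representation(name = "character",
                        coords = "numeric"))
ok <- new("gene_demo", name = "ANK3", coords = c(1.4e6, 1.412e6))
# coords must be numeric, so a character value is rejected
res <- tryCatch(new("gene_demo", name = "ANK3", coords = "chr10"),
                error = function(e) "rejected")
res # "rejected"
```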

---
name: objects_S4_slots
exclude: true

# S4 class &mdash; slots

The variables within an S4 class are stored in the so-called **slots**. In the above example, we have 2 such slots: *name* and *coords*. Here is how to access them:

``` r
my.gene@name # access using @ operator
my.gene@coords[2] # access the 2nd element in slot coords
```

```
## [1] "ANK3"
## [1] 1412000
```

---
name: objects_S4_methods
exclude: true

# S4 class &mdash; methods

The power of classes lies in the fact that they define both the data types in particular slots and the operations (functions) we can perform on them. Let us define a *print method* for an S4 class:

``` r
setMethod('print', 'gene',
          function(x) {
              cat('GENE: ', x@name, ' --> ')
              cat('[', x@coords, ']')
          })
print(my.gene) # and we use the newly defined print
```

```
## GENE:  ANK3  --> [ 1400000 1412000 ]
```

---
name: end_slide
class: end-slide, middle
count: false

# See you at the next lecture!

.end-text[

Graphics from <img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"> 
Created: 31-Oct-2024 • <a href="https://www.scilifelab.se/">SciLifeLab</a> • <a href="https://nbis.se/">NBIS</a> 

]