14 Exercises: Principal component analysis

Author

Mun-Gwan Hong

Published

September 15, 2022

14.0.0.1 Exercise 1

Let’s start principal component analysis (PCA) with Edgar Anderson’s iris data, which is already available as iris in R.

Examine the contents and data types of the data set iris. R functions str, summary and pairs can be useful for this.

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width        Species  
 Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1   setosa    :50  
 1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3   versicolor:50  
 Median :5.80   Median :3.00   Median :4.35   Median :1.3   virginica :50  
 Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2                  
 3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8                  
 Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5

As Setosa is somewhat different from the other two species, we will limit our analysis to these two. Leave the data of Versicolor and Virginica only, and examine the contents of the data.

The first 4 columns contains 4 different measures and the last column has species. Split the data into a numeric matrix m and a character vector species.

Transform each variable to have the same mean 0 and the same standard deviation using scale.

Compute eigenvectors and eigenvalues of the variance-covariance matrix of the transformed (or scaled) m using cov and eigen.

eigen() decomposition
$values
[1] 2.958 0.555 0.406 0.081

$vectors
      [,1]  [,2]   [,3]  [,4]
[1,] -0.51  0.22  0.693  0.46
[2,] -0.43 -0.89  0.056 -0.14
[3,] -0.54  0.38 -0.019 -0.75
[4,] -0.51  0.13 -0.718  0.46

Compute the scores of the first principal component (PC). Matrix dot product in R is %*% as A %*% B for \(\mathbf{A} \mathbf{B}\).

Compute the proportion of variance explained by each PC and find the rotation matrix or loading matrix.

[1] 0.74 0.14 0.10 0.02

      [,1]  [,2]   [,3]  [,4]
[1,] -0.51  0.22  0.693  0.46
[2,] -0.43 -0.89  0.056 -0.14
[3,] -0.54  0.38 -0.019 -0.75
[4,] -0.51  0.13 -0.718  0.46

      [,1]  [,2]   [,3]   [,4]
[1,] -0.87  0.16  0.441  0.132
[2,] -0.75 -0.66  0.035 -0.039
[3,] -0.94  0.28 -0.012 -0.213
[4,] -0.87  0.10 -0.458  0.130

Using prcomp, perform PCA. Don’t forget to add scale. = TRUE in the arguments of the function. The default is scale. = FALSE though the help of the function states scaling is recommended.

Compare the scores from our manual computation using eigenvector and from prcomp.

[1] TRUE

Compute the proportion of explained variance using the output from prcomp. Compare the results with that of g).

[1] 0.74 0.14 0.10 0.02

Find the rotation matrix or loading matrix from the output of prcomp and compare it with that of g).

               PC1   PC2    PC3   PC4
Sepal.Length -0.51 -0.22  0.693 -0.46
Sepal.Width  -0.43  0.89  0.056  0.14
Petal.Length -0.54 -0.38 -0.019  0.75
Petal.Width  -0.51 -0.13 -0.718 -0.46

              [,1]  [,2]   [,3]   [,4]
Sepal.Length -0.87 -0.16  0.441 -0.132
Sepal.Width  -0.75  0.66  0.035  0.039
Petal.Length -0.94 -0.28 -0.012  0.213
Petal.Width  -0.87 -0.10 -0.458 -0.130

Draw a PCA score plot for PC1 and PC2. Add colors by species. What does the plot show?

Two species were separated by the first PC.

Draw a PCA score plot for PC2 and PC3.

Draw a scree plot.

Draw a bi-plot. Can you tell
1. which variables are the most important for PC1?
2. which variables are highly correlated?

Sepal.Length, Petal.Length and Petal.Width are the most important variables for PC1, as their loadings are far from zero.
They are strongly correlated as they are next to each other in loading plot.

You may find the loadings displayed in the biplot are in different scale from those you computed. It is primarily due to the scaling by function biplot. Please refer to https://stats.stackexchange.com/questions/276645/arrows-of-underlying-variables-in-pca-biplot-in-r for the question.

14.0.0.2 Exercise 2

NCI60 data set consists of gene expression values for 6830 genes for 64 cell lines. Using this data set, repeat the analysis using prcomp of Exercise 1. You can download the data using the command below.

m <- t(as.matrix(read.table(
  url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/nci.data.csv"),
  sep = ",", row.names = 1, header = TRUE
)))

Examine the contents and data types of the data set nci. Check the dimension of the data first. This is much larger than iris.

 num [1:64, 1:6830] 0.3 0.68 0.94 0.28 0.485 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:64] "s1" "s2" "s3" "s4" ...
  ..$ : chr [1:6830] "g1" "g2" "g3" "g4" ...

       g1                g2                g3                g4        
 Min.   :-1.0600   Min.   :-2.1900   Min.   :-1.7100   Min.   :-2.610  
 1st Qu.:-0.3725   1st Qu.:-0.4050   1st Qu.:-0.1925   1st Qu.:-1.323  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.000  
 Mean   :-0.0191   Mean   :-0.0278   Mean   :-0.0199   Mean   :-0.329  
 3rd Qu.: 0.3100   3rd Qu.: 0.3525   3rd Qu.: 0.1625   3rd Qu.: 0.693  
 Max.   : 0.9400   Max.   : 2.2400   Max.   : 1.1500   Max.   : 1.500  
       g5                g6                 g7                g8         
 Min.   :-0.8250   Min.   :-0.70000   Min.   :-0.9200   Min.   :-0.7050  
 1st Qu.:-0.2250   1st Qu.:-0.15625   1st Qu.:-0.2462   1st Qu.:-0.2050  
 Median : 0.0000   Median : 0.00000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.0261   Mean   : 0.00672   Mean   : 0.0197   Mean   :-0.0231  
 3rd Qu.: 0.2100   3rd Qu.: 0.18500   3rd Qu.: 0.2475   3rd Qu.: 0.1600  
 Max.   : 1.7150   Max.   : 1.16000   Max.   : 0.9400   Max.   : 0.7250  
       g9                g10         
 Min.   :-0.90000   Min.   :-1.6550  
 1st Qu.:-0.36000   1st Qu.:-0.4775  
 Median : 0.00000   Median : 0.0000  
 Mean   : 0.00078   Mean   : 0.0192  
 3rd Qu.: 0.29500   3rd Qu.: 0.4675  
 Max.   : 0.99000   Max.   : 1.4900

Using prcomp, perform PCA.

Compute the proportion of variance explained by each PC.

 [1] 1.1e-01 6.8e-02 5.8e-02 4.2e-02 3.7e-02 3.6e-02 3.1e-02 2.7e-02 2.5e-02
[10] 2.4e-02 2.4e-02 2.2e-02 2.0e-02 2.0e-02 1.9e-02 1.8e-02 1.7e-02 1.6e-02
[19] 1.6e-02 1.6e-02 1.5e-02 1.5e-02 1.4e-02 1.4e-02 1.3e-02 1.3e-02 1.3e-02
[28] 1.2e-02 1.2e-02 1.1e-02 1.1e-02 1.0e-02 1.0e-02 9.9e-03 9.7e-03 9.3e-03
[37] 9.1e-03 9.0e-03 8.7e-03 8.4e-03 8.1e-03 7.9e-03 7.4e-03 7.2e-03 7.1e-03
[46] 6.9e-03 6.8e-03 6.5e-03 6.4e-03 6.0e-03 5.7e-03 5.6e-03 5.4e-03 5.1e-03
[55] 5.1e-03 4.8e-03 4.4e-03 4.1e-03 3.7e-03 3.2e-03 2.6e-03 2.4e-03 2.4e-03
[64] 2.2e-32

Find the rotation matrix. Show the first 5 columns and rows only.

       PC1     PC2    PC3     PC4     PC5
g1 -0.0107  0.0013 0.0085 -0.0035 -0.0101
g2 -0.0023  0.0017 0.0103  0.0026 -0.0114
g3 -0.0059 -0.0063 0.0101 -0.0107  0.0103
g4  0.0033  0.0027 0.0084 -0.0075  0.0112
g5 -0.0077 -0.0025 0.0138  0.0095  0.0041

Draw a PCA score plot for PC1 and PC2.

Draw a PCA score plot for PC2 and PC3.

Draw a scree plot.

Draw a bi-plot. Note it can take long time.