Exercise 1
Let’s start principal component analysis (PCA) with Edgar Anderson’s iris data, which is already available as iris in R.
- Examine the contents and data types of the data set iris. R functions
str, summary and pairs can be useful for this.
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1 setosa :50
1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3 versicolor:50
Median :5.80 Median :3.00 Median :4.35 Median :1.3 virginica :50
Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
- As Setosa is somewhat different from the other two species, we will limit our analysis to these two. Leave the data of Versicolor and Virginica only, and examine the contents of the data.
- The first 4 columns contains 4 different measures and the last column has species. Split the data into a numeric matrix
m and a character vector species.
- Transform each variable to have the same mean 0 and the same standard deviation using
scale.
- Compute eigenvectors and eigenvalues of the variance-covariance matrix of the transformed (or scaled)
m using cov and eigen.
eigen() decomposition
$values
[1] 2.958 0.555 0.406 0.081
$vectors
[,1] [,2] [,3] [,4]
[1,] -0.51 0.22 0.693 0.46
[2,] -0.43 -0.89 0.056 -0.14
[3,] -0.54 0.38 -0.019 -0.75
[4,] -0.51 0.13 -0.718 0.46
- Compute the scores of the first principal component (PC). Matrix dot product in R is
%*% as A %*% B for \(\mathbf{A} \mathbf{B}\).
[,1]
51 -0.527
52 -0.056
53 -0.572
54 2.377
55 0.324
56 1.241
- Compute the proportion of variance explained by each PC and find the rotation matrix or loading matrix.
[,1] [,2] [,3] [,4]
[1,] -0.51 0.22 0.693 0.46
[2,] -0.43 -0.89 0.056 -0.14
[3,] -0.54 0.38 -0.019 -0.75
[4,] -0.51 0.13 -0.718 0.46
[,1] [,2] [,3] [,4]
[1,] -0.87 0.16 0.441 0.132
[2,] -0.75 -0.66 0.035 -0.039
[3,] -0.94 0.28 -0.012 -0.213
[4,] -0.87 0.10 -0.458 0.130
- Using
prcomp, perform PCA. Don’t forget to add scale. = TRUE in the arguments of the function. The default is scale. = FALSE though the help of the function states scaling is recommended.
- Compare the scores from our manual computation using eigenvector and from
prcomp.
- Compute the proportion of explained variance using the output from
prcomp. Compare the results with that of g).
- Find the rotation matrix or loading matrix from the output of
prcomp and compare it with that of g).
PC1 PC2 PC3 PC4
Sepal.Length -0.51 -0.22 0.693 -0.46
Sepal.Width -0.43 0.89 0.056 0.14
Petal.Length -0.54 -0.38 -0.019 0.75
Petal.Width -0.51 -0.13 -0.718 -0.46
[,1] [,2] [,3] [,4]
Sepal.Length -0.87 -0.16 0.441 -0.132
Sepal.Width -0.75 0.66 0.035 0.039
Petal.Length -0.94 -0.28 -0.012 0.213
Petal.Width -0.87 -0.10 -0.458 -0.130
- Draw a PCA score plot for PC1 and PC2. Add colors by
species. What does the plot show?
Two species were separated by the first PC.
- Draw a PCA score plot for PC2 and PC3.
- Draw a scree plot.
- Draw a bi-plot. Can you tell
- which variables are the most important for PC1?
- which variables are highly correlated?
Sepal.Length, Petal.Length and Petal.Width are the most important variables for PC1, as their loadings are far from zero.
- They are strongly correlated as they are next to each other in loading plot.
You may find the loadings displayed in the biplot are in different scale from those you computed. It is primarily due to the scaling by function biplot. Please refer to https://stats.stackexchange.com/questions/276645/arrows-of-underlying-variables-in-pca-biplot-in-r for the question.
Exercise 2
NCI60 data set consists of gene expression values for 6830 genes for 64 cell lines. Using this data set, repeat the analysis using prcomp of Exercise 1. You can download the data using the command below.
m <- t(as.matrix(read.table(
url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/nci.data.csv"),
sep = ",", row.names = 1, header = TRUE
)))
- Examine the contents and data types of the data set
nci. Check the dimension of the data first. This is much larger than iris.
num [1:64, 1:6830] 0.3 0.68 0.94 0.28 0.485 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:64] "s1" "s2" "s3" "s4" ...
..$ : chr [1:6830] "g1" "g2" "g3" "g4" ...
g1 g2 g3 g4
Min. :-1.0600 Min. :-2.1900 Min. :-1.7100 Min. :-2.610
1st Qu.:-0.3725 1st Qu.:-0.4050 1st Qu.:-0.1925 1st Qu.:-1.323
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.000
Mean :-0.0191 Mean :-0.0278 Mean :-0.0199 Mean :-0.329
3rd Qu.: 0.3100 3rd Qu.: 0.3525 3rd Qu.: 0.1625 3rd Qu.: 0.693
Max. : 0.9400 Max. : 2.2400 Max. : 1.1500 Max. : 1.500
g5 g6 g7 g8
Min. :-0.8250 Min. :-0.70000 Min. :-0.9200 Min. :-0.7050
1st Qu.:-0.2250 1st Qu.:-0.15625 1st Qu.:-0.2462 1st Qu.:-0.2050
Median : 0.0000 Median : 0.00000 Median : 0.0000 Median : 0.0000
Mean : 0.0261 Mean : 0.00672 Mean : 0.0197 Mean :-0.0231
3rd Qu.: 0.2100 3rd Qu.: 0.18500 3rd Qu.: 0.2475 3rd Qu.: 0.1600
Max. : 1.7150 Max. : 1.16000 Max. : 0.9400 Max. : 0.7250
g9 g10
Min. :-0.90000 Min. :-1.6550
1st Qu.:-0.36000 1st Qu.:-0.4775
Median : 0.00000 Median : 0.0000
Mean : 0.00078 Mean : 0.0192
3rd Qu.: 0.29500 3rd Qu.: 0.4675
Max. : 0.99000 Max. : 1.4900
- Using
prcomp, perform PCA.
- Compute the proportion of variance explained by each PC.
[1] 1.1e-01 6.8e-02 5.8e-02 4.2e-02 3.7e-02 3.6e-02 3.1e-02 2.7e-02 2.5e-02
[10] 2.4e-02 2.4e-02 2.2e-02 2.0e-02 2.0e-02 1.9e-02 1.8e-02 1.7e-02 1.6e-02
[19] 1.6e-02 1.6e-02 1.5e-02 1.5e-02 1.4e-02 1.4e-02 1.3e-02 1.3e-02 1.3e-02
[28] 1.2e-02 1.2e-02 1.1e-02 1.1e-02 1.0e-02 1.0e-02 9.9e-03 9.7e-03 9.3e-03
[37] 9.1e-03 9.0e-03 8.7e-03 8.4e-03 8.1e-03 7.9e-03 7.4e-03 7.2e-03 7.1e-03
[46] 6.9e-03 6.8e-03 6.5e-03 6.4e-03 6.0e-03 5.7e-03 5.6e-03 5.4e-03 5.1e-03
[55] 5.1e-03 4.8e-03 4.4e-03 4.1e-03 3.7e-03 3.2e-03 2.6e-03 2.4e-03 2.4e-03
[64] 2.2e-32
- Find the rotation matrix. Show the first 5 columns and rows only.
PC1 PC2 PC3 PC4 PC5
g1 -0.0107 0.0013 0.0085 -0.0035 -0.0101
g2 -0.0023 0.0017 0.0103 0.0026 -0.0114
g3 -0.0059 -0.0063 0.0101 -0.0107 0.0103
g4 0.0033 0.0027 0.0084 -0.0075 0.0112
g5 -0.0077 -0.0025 0.0138 0.0095 0.0041
- Draw a PCA score plot for PC1 and PC2.
- Draw a PCA score plot for PC2 and PC3.
- Draw a scree plot.
- Draw a bi-plot. Note it can take long time.