Exercises

You have found some older gene expression data, based on microarray technology. They contain measurements for 22215 genes in 189 samples, across 7 tissues (kidney, hippocampus, cerebellum, colon, liver, endometrium and placenta).

The data can be loaded and previewed:

[1] 22215   189
          GSM11805.CEL.gz GSM11814.CEL.gz GSM11823.CEL.gz GSM11830.CEL.gz
1007_s_at       10.191267       10.509167       10.272027       10.252952
1053_at          6.040463        6.696075        6.144663        6.575153
117_at           7.447409        7.775354        7.696235        8.478135
121_at          12.025042       12.007817       11.633279       11.075286
          GSM12067.CEL.gz
1007_s_at       10.157605
1053_at          6.606701
117_at           8.116336
121_at          10.832528
'data.frame':   189 obs. of  6 variables:
 $ filename     : chr  "GSM11805.CEL.gz" "GSM11814.CEL.gz" "GSM11823.CEL.gz" "GSM11830.CEL.gz" ...
 $ DB_ID        : chr  "GSM11805" "GSM11814" "GSM11823" "GSM11830" ...
 $ ExperimentID : chr  "GSE781" "GSE781" "GSE781" "GSE781" ...
 $ Tissue       : chr  "kidney" "kidney" "kidney" "kidney" ...
 $ SubType      : chr  "normal" "cancer" "normal" "cancer" ...
 $ ClinicalGroup: chr  NA NA NA NA ...

Exercise 1 (Partition methods) Use only the first two genes and run k-means clustering.

  1. Find the optimal value of \(k\) using the silhouette method.
  2. Plot the samples using the first two genes as x and y coordinates and visualize your cluster results on a single scatter plot.
  3. Use the first 1000 genes and the same value of \(k\). Is this clustering solution better now? How can you tell?

Exercise 2 (HCL) Select the samples corresponding to two tissues of your choice. Run hierarchical clustering (HCL) and compare the dendrograms obtained:

  1. with complete and Ward linkage, using Euclidean distance
  2. with complete and Ward linkage, using Canberra distance

Exercise 3 (Pvclust) Try running pvclust on the samples you’ve chosen above. Which clusters are supported by bootstrapping?
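
As a starting point, a pvclust call might look like the sketch below. It runs on a small simulated matrix; the object names (`x`, `tissue`, `pv`, `supported`), the linkage and distance choices, and the low `nboot` value are illustrative assumptions, and the pvclust block only runs if the package is installed.

```r
set.seed(1)

# Simulated stand-in for the selected samples (genes in rows, samples in columns).
tissue <- rep(c("A", "B"), each = 5)
x <- matrix(rnorm(100 * 10), nrow = 100,
            dimnames = list(NULL, paste0(tissue, 1:5)))
# Shift half of the genes in group B so that the two groups differ.
x[1:50, tissue == "B"] <- x[1:50, tissue == "B"] + 3

if (requireNamespace("pvclust", quietly = TRUE)) {
  # pvclust clusters the columns, so samples stay in columns.
  pv <- pvclust::pvclust(x, method.hclust = "complete",
                         method.dist = "euclidean",
                         nboot = 100)  # kept small for speed; use more in practice
  plot(pv)                             # dendrogram annotated with AU/BP values
  pvclust::pvrect(pv, alpha = 0.95)    # boxes around clusters with AU >= 0.95
  supported <- pvclust::pvpick(pv, alpha = 0.95)$clusters
}
```

Here `supported` would be a list of sample names, one element per cluster that passes the chosen AU threshold.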

Exercise 4 (Heatmap) Select the 100 genes with the highest variance. Make a heatmap using the ComplexHeatmap package. Group columns (samples) by tissue and split rows (genes) using k-means (k = 7).

  1. Do you see any interesting patterns?
  2. How would you go about extracting genes belonging to a specific cluster, if you were interested in running functional annotations on those?
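
One way to set this up is sketched below on simulated data. The objects `expr`, `tissue`, `mat` and `genes_by_cluster` are placeholder names, not objects from the course material, and the heatmap part only runs if ComplexHeatmap is installed.

```r
set.seed(1)

# Simulated stand-in for the expression matrix (genes in rows, samples in columns).
tissue <- rep(c("kidney", "liver", "colon"), each = 6)
expr <- matrix(rnorm(500 * length(tissue), mean = 8), nrow = 500,
               dimnames = list(paste0("gene", 1:500),
                               paste0("s", seq_along(tissue))))
expr[1:60, tissue == "liver"] <- expr[1:60, tissue == "liver"] + 4

# Keep the 100 genes with the highest variance across samples.
gene_var <- apply(expr, 1, var)
top <- names(sort(gene_var, decreasing = TRUE))[1:100]
mat <- expr[top, ]

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  ht <- ComplexHeatmap::Heatmap(mat, name = "expression",
                                column_split = tissue,  # group samples by tissue
                                row_km = 7,             # split genes by k-means
                                show_row_names = FALSE)
  ht <- ComplexHeatmap::draw(ht)  # drawing fixes the k-means partition
  # row_order() returns one vector of row indices per row cluster.
  row_clusters <- ComplexHeatmap::row_order(ht)
  genes_by_cluster <- lapply(row_clusters, function(i) rownames(mat)[i])
}
```

Each element of `genes_by_cluster` could then be passed on to a functional annotation tool. Because k-means starts from random centers, the seed should be set before drawing if the partition needs to be reproducible.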

Answers to exercises

Solution. Exercise 1
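
A possible approach is sketched below on simulated data, since the loading code is not shown here. The matrix `expr` is a toy stand-in for the real expression matrix, and the range of candidate `k` values and `nstart = 25` are assumptions rather than prescribed settings.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Toy stand-in for the expression matrix (genes in rows, samples in columns).
n_samples <- 60
group <- rep(1:3, each = 20)
expr <- rbind(
  gene1 = rnorm(n_samples, mean = c(6, 9, 12)[group], sd = 0.4),
  gene2 = rnorm(n_samples, mean = c(7, 11, 8)[group], sd = 0.4)
)

# kmeans() clusters rows, so samples must go in rows.
x <- t(expr[1:2, ])

# Average silhouette width for each candidate k.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})
best_k <- ks[which.max(avg_sil)]

# Final clustering and scatter plot colored by cluster.
km <- kmeans(x, centers = best_k, nstart = 25)
plot(x, col = km$cluster, pch = 19,
     xlab = rownames(expr)[1], ylab = rownames(expr)[2],
     main = paste("k-means, k =", best_k))
```

For part 3, the same code can be rerun with `x <- t(expr[1:1000, ])` and the same `k`. Comparing the average silhouette width, or cross-tabulating `km$cluster` against the `Tissue` column of the sample annotation, gives a way to judge whether the solution improved.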

Solution. Exercise 2

 cerebellum       colon endometrium hippocampus      kidney       liver 
         38          34          15          31          39          26 
   placenta 
          6
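
The table above gives the number of samples per tissue. After picking two tissues, the four dendrograms could be produced along the lines of the sketch below. It uses simulated data: `expr` and `meta` are placeholders for the real expression matrix and sample annotation, liver and placenta are an arbitrary choice, and `"ward.D2"` is assumed as the Ward variant (R also offers `"ward.D"`).

```r
set.seed(1)

# Simulated stand-ins: `expr` (genes x samples) and `meta` with a Tissue column.
tissue <- rep(c("liver", "placenta"), times = c(8, 6))
expr <- matrix(rnorm(200 * length(tissue), mean = 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0("s", seq_along(tissue))))
expr[1:100, tissue == "placenta"] <- expr[1:100, tissue == "placenta"] + 2
meta <- data.frame(filename = colnames(expr), Tissue = tissue)

# Select the samples of the two tissues; dist() works on rows.
keep <- meta$Tissue %in% c("liver", "placenta")
x <- t(expr[, keep])

d_euc <- dist(x, method = "euclidean")
d_can <- dist(x, method = "canberra")

fits <- list(
  "Euclidean, complete" = hclust(d_euc, method = "complete"),
  "Euclidean, Ward"     = hclust(d_euc, method = "ward.D2"),
  "Canberra, complete"  = hclust(d_can, method = "complete"),
  "Canberra, Ward"      = hclust(d_can, method = "ward.D2")
)

# Draw the four dendrograms side by side, labelled by tissue.
op <- par(mfrow = c(2, 2))
for (nm in names(fits)) {
  plot(fits[[nm]], labels = meta$Tissue[keep], main = nm, xlab = "", sub = "")
}
par(op)

# Cross-tabulate a two-cluster cut against the known tissue labels.
table(cutree(fits[["Euclidean, Ward"]], k = 2), meta$Tissue[keep])
```

Note that Ward linkage is defined for Euclidean distances, so the Canberra and Ward combination is better read as a heuristic comparison.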