2 Main lab
2.1 Data
In most of the exercises, we will use RNA-seq data (Illumina short reads) from the human A431 cell line. It is an epidermoid carcinoma cell line which is often used to study cancer and the cell cycle, and as a sort of positive control of epidermal growth factor receptor (EGFR) expression. A431 cells express very high levels of EGFR, in contrast to normal human fibroblasts.
The A431 cells were treated with gefitinib, which is an EGFR inhibitor and is used (under the trade name Iressa) as a drug to treat cancers with mutated and overactive EGFR. In the experiment, RNA was extracted at four time points: before the gefitinib treatment (t=0), and two, six and twenty-four hours after treatment (t=2, t=6, t=24, respectively), and sequenced using an Illumina HiSeq instrument in triplicate (thus there are 3x4=12 samples).
This data set or parts of it will be used in the labs on read mapping, transcript assembly, visualization, quality control and differential expression. There are many relevant questions that could be asked based on these measurements. In the QC exercise, we are going to examine if the RNA libraries that we work with are what we think they are or if there is any mislabelling. In the isoform exercise, we are going to look at some specific regions where the mass-spectrometry data indicated that novel exons or splice variants could be present at the protein level. We will use (part of) the RNA-seq data to examine if there is corresponding evidence on the mRNA level, and how different software tools could be used to detect novel gene variants.
2.2 Working on Uppmax
The Quality control, Mapping and Quantification steps will be run on the Rackham computing cluster, which is part of UPPMAX. A standard compute node on Rackham has 128 GB of RAM and 20 cores, so each core gives you 6.4 GB of RAM. We will use 8 cores per person for this session, which gives you about 51 GB of RAM.
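The memory arithmetic above can be checked with a small sketch. The variable names are only illustrative; the numbers are the ones stated in the text.

```shell
# Values from the text: a standard Rackham node has 128 GB RAM and 20 cores,
# and each participant uses 8 cores.
node_ram_gb=128
node_cores=20
booked_cores=8

# RAM per core, and RAM for the booked cores (awk handles the decimals).
per_core=$(awk -v r="$node_ram_gb" -v c="$node_cores" \
    'BEGIN { printf "%.1f", r / c }')
booked_ram=$(awk -v r="$node_ram_gb" -v c="$node_cores" -v n="$booked_cores" \
    'BEGIN { printf "%.1f", r / c * n }')

echo "RAM per core: ${per_core} GB"
echo "RAM for ${booked_cores} cores: ${booked_ram} GB"
```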
2.2.1 Connecting
Log in to Uppmax with X-forwarding enabled, so that any graphical interface you start on your compute node is exported over the network to your screen.
Linux and Mac users will run this on the terminal. Windows users will run this in a tool such as MobaXterm.
ssh -Y username@rackham.uppmax.uu.se
Note that Mac users will need to have XQuartz installed for X11 graphics to work.
Once connected, the console prompt should show something like username@rackham1$. This means you are on the login node. To view your current jobs in the queue, use squeue -u username. This should show a node number like r120 if you have a booking. If you have no bookings, see the booking section below.
If you have a booked job, interactive or otherwise, you can ssh to the node it is running on.
ssh -Y username@r120
Your prompt changes to username@r120$. Now you are ready to start running heavy computations.
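To avoid starting heavy work on the login node by mistake, the host name can be checked before running anything. The sketch below is an illustration only: node_kind is a hypothetical helper, and the naming patterns (rackham followed by a number for login nodes, r followed by a number for compute nodes) are assumptions based on the example prompts above.

```shell
# Hypothetical helper: classify a host name using the naming pattern
# seen in the example prompts (assumption, not an official UPPMAX rule).
node_kind() {
    case "$1" in
        rackham*) echo "login" ;;    # e.g. rackham1
        r[0-9]*)  echo "compute" ;;  # e.g. r120
        *)        echo "unknown" ;;
    esac
}

# Classify the machine you are currently on.
echo "This is a $(node_kind "$(uname -n)") node"
```

If this reports a login node, connect to your booked compute node first.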
2.2.2 Booking resources
The code below is valid to run at the start of the day. If you are running it in the middle of the day, you need to decrease the time (-t). Do not run this twice, and make sure you are not running computations on a login node.
Book resources for RNA-Seq day 2.
salloc -A g2019011 -t 08:00:00 -p core -n 8 --reservation=g2019011_2
Book resources for RNA-Seq day 3.
salloc -A g2019011 -t 08:00:00 -p core -n 8 --reservation=g2019011_3
Note that booking resources is only relevant for this course and not something you need to use when you regularly work with UPPMAX.
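When booking in the middle of the day, the reduced value for -t can be worked out from the time that is left. The following is a minimal sketch: remaining_walltime and to_minutes are hypothetical helpers, and the end time of 17:00 is an assumption, so use the actual end of your session.

```shell
# Hypothetical helper: convert HH:MM to minutes since midnight.
to_minutes() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Hypothetical helper: time left between "now" and the end of the session,
# printed as HH:MM:SS, the format used for -t in the commands above.
remaining_walltime() {
    now_min=$(to_minutes "$1")   # current time as HH:MM
    end_min=$(to_minutes "$2")   # assumed end of the session as HH:MM
    left=$(( end_min - now_min ))
    if [ "$left" -le 0 ]; then
        echo "00:00:00"
        return 1
    fi
    printf '%02d:%02d:00\n' $(( left / 60 )) $(( left % 60 ))
}

# Example: it is 13:30 and the session is assumed to end at 17:00.
remaining_walltime 13:30 17:00
```

The printed value can then replace 08:00:00 in the salloc command for the relevant day.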
2.3 Quality control
Before doing any other analysis on mapped RNA-seq reads, it is important to run quality control on them and check that there are no obvious errors in your RNA-seq data.
2.4 Mapping
This section contains information on how to map reads to a reference genome using the splice-aware aligners STAR and HISAT2.
2.7 Differential gene expression
We will identify genes that are differentially expressed between our time points.
2.8 Functional analysis
We will perform functional analysis on the differentially expressed genes to place them in a functional context and possibly explain the biological consequences of differential expression (DE). Methods covered are gene set analysis (GSA) and gene set enrichment analysis (GSEA).