This page contains links to different tutorials that are used in this course. The tutorials are well documented and should be easy to follow.

Input code blocks are displayed as shown below, with the code language above the block. Shell scripts (SH) are to be executed in a Linux terminal shell such as bash. R scripts are to be run in R, either through the terminal, RGui or RStudio.

sh
command



1 Introduction

Basic introduction to working in the Linux environment using a command-line interface.

Introduction to Linux

Handling and analysing large data may not be feasible on a local computer. We have access to a computing cluster called UPPMAX. Below are exercises to start working on UPPMAX.

Introduction to Uppmax

Most of the analyses are carried out in R, so it will be useful to learn some basic R.

Introduction to R

This topic covers retrieving supporting data needed for RNA-seq analyses. These include gene annotation IDs such as mapping between Ensembl IDs and Gene IDs, GO terms and transcript IDs. We also cover retrieving genomic data from Ensembl.

Downloading data

2 Main lab

2.1 Data

In most of the exercises, we will use RNA-seq data (Illumina short reads) from the human A431 cell line. It is an epidermoid carcinoma cell line which is often used to study cancer and the cell cycle, and as a sort of positive control of epidermal growth factor receptor (EGFR) expression. A431 cells express very high levels of EGFR, in contrast to normal human fibroblasts.

The A431 cells were treated with gefitinib, which is an EGFR inhibitor and is used (under the trade name Iressa) as a drug to treat cancers with mutated and overactive EGFR. In the experiment, RNA was extracted at four time points: before the gefitinib treatment (t=0), and two, six and twenty-four hours after treatment (t=2, t=6, t=24, respectively). Each time point was sequenced in triplicate using an Illumina HiSeq instrument (thus there are 3x4=12 samples).
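The experimental design can be enumerated with a short loop. This is only a sketch: the label format (t<hours>_rep<n>) is a hypothetical naming scheme for illustration and is not the naming used for the course files.

```sh
# Enumerate the design: 4 time points x 3 replicates.
# The label format below is hypothetical, not the course's file naming.
list_samples() {
    for t in 0 2 6 24; do
        for rep in 1 2 3; do
            echo "t${t}_rep${rep}"
        done
    done
}

# Count the labels to confirm the total number of samples.
list_samples | wc -l
```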

This data set or parts of it will be used in the labs on read mapping, transcript assembly, visualization, quality control and differential expression. There are many relevant questions that could be asked based on these measurements. In the QC exercise, we are going to examine if the RNA libraries that we work with are what we think they are or if there is any mislabelling. In the isoform exercise, we are going to look at some specific regions where the mass-spectrometry data indicated that novel exons or splice variants could be present at the protein level. We will use (part of) the RNA-seq data to examine if there is corresponding evidence on the mRNA level, and how different software tools could be used to detect novel gene variants.

2.2 Working on Uppmax

The Quality control, Mapping and Quantification steps will be run on the computing cluster Rackham, which is part of UPPMAX. A standard compute node on Rackham has 128 GB of RAM and 20 cores, so each core gives you 6.4 GB of RAM. We will use 8 cores per person for this session, which gives you about 51 GB of RAM.
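The memory figures follow from dividing the node's RAM evenly across its cores. A minimal sketch of that arithmetic, using a hypothetical helper function mem_for_cores (not an UPPMAX tool) with the node size from the text as defaults:

```sh
# Memory (GB) available for a given number of cores on one node.
# $1 = cores requested
# $2 = node RAM in GB (default 128, from the text)
# $3 = cores per node (default 20, from the text)
mem_for_cores() {
    awk -v n="$1" -v ram="${2:-128}" -v cores="${3:-20}" \
        'BEGIN { printf "%.1f\n", n * ram / cores }'
}

mem_for_cores 1   # prints 6.4
mem_for_cores 8   # prints 51.2
```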

2.2.1 Connecting

Log in to UPPMAX with X-forwarding enabled, so that any graphical interface you start on your compute node is displayed on your own computer.

Linux and Mac users will run this in the terminal. Windows users will run this in a tool such as MobaXterm.

sh
ssh -Y username@rackham.uppmax.uu.se

Note that Mac users will need to have XQuartz installed for X11 graphics to work.
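One quick way to see whether X-forwarding is working after logging in is to look at the DISPLAY environment variable, which ssh normally sets on the remote side when forwarding is enabled. The function name check_x11 is hypothetical; this is a sketch of the check, not an UPPMAX utility.

```sh
# Report whether an X11 display is available in this session.
# Returns 0 if DISPLAY is set, 1 otherwise.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
    else
        echo "DISPLAY is not set; graphical windows will not open"
        return 1
    fi
}

# Usage, after logging in with ssh -Y:
#   check_x11
```

If DISPLAY is empty, log out and connect again with the -Y option (and, on a Mac, make sure XQuartz is running).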

Once connected, the console prompt should show something like username@rackham1$. This means you are on the login node. To view your current jobs in the queue, use squeue -u username. If you have a booking, this should show a node name such as r120. If you have no bookings, see the booking section below.

If you have a running job (booked, interactive or any other kind), you can ssh to the node it is running on.
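The node name can also be pulled out of the squeue listing directly. The sketch below uses the squeue options -u (user), -h (no header) and -o (output format with job ID, state and node list); the two function names are hypothetical helpers, and the example assumes the Slurm state name RUNNING for a started job.

```sh
# List job ID, state and node list for a user's jobs (no header line).
# $1 = user name (defaults to the current user)
list_my_jobs() {
    squeue -u "${1:-$USER}" -h -o "%i %T %N"
}

# Read such a listing on stdin and print the nodes of running jobs.
# Pending jobs have no node assigned yet, so they are skipped.
running_nodes() {
    awk '$2 == "RUNNING" && NF >= 3 { print $3 }'
}

# Usage on the login node:
#   list_my_jobs | running_nodes
```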

sh
ssh -Y username@r120

Your prompt changes to username@r120$. Now you are ready to start running heavy computations.
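Before starting a heavy computation it can be worth confirming which kind of node you are on. The sketch below assumes, based on the prompts shown above, that login node names start with "rackham" while compute nodes have names like r120; the function name on_login_node is hypothetical.

```sh
# Return 0 if the given host name looks like a Rackham login node.
# $1 = host name to test (defaults to this machine's short host name)
on_login_node() {
    case "${1:-$(hostname -s)}" in
        rackham*) return 0 ;;
        *)        return 1 ;;
    esac
}

if on_login_node; then
    echo "You are on a login node: do not run heavy computations here"
fi
```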

2.2.2 Booking resources

The commands below assume that you run them at the start of the day. If you run them in the middle of the day, you need to decrease the time limit (-t). Do not run the same booking twice, and make sure you are not running computations on a login node.

Book resources for RNA-Seq day 2.

sh
salloc -A g2019011 -t 08:00:00 -p core -n 8 --reservation=g2019011_2

Book resources for RNA-Seq day 3.

sh
salloc -A g2019011 -t 08:00:00 -p core -n 8 --reservation=g2019011_3
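If you book in the middle of the day, the value for -t can be worked out from the current time. The sketch below is only an illustration: the helper name remaining_walltime is hypothetical, and the 17:00 end of day is an assumed default, not something stated by the course. The result is capped at the 8 hours used in the commands above.

```sh
# Print an HH:MM:00 time limit from "now" until the end of the day.
# $1 = current time as HH:MM
# $2 = end of day as HH:MM (17:00 is an assumed default)
remaining_walltime() {
    echo "$1 ${2:-17:00}" | awk '{
        split($1, now, ":"); split($2, fin, ":")
        mins = (fin[1] * 60 + fin[2]) - (now[1] * 60 + now[2])
        if (mins <= 0) { print "no time left today" > "/dev/stderr"; exit 1 }
        if (mins > 480) mins = 480    # never exceed the 8 h used above
        printf "%02d:%02d:00\n", int(mins / 60), mins % 60
    }'
}

remaining_walltime 13:30   # prints 03:30:00

# Usage with the day 2 booking:
#   salloc -A g2019011 -t "$(remaining_walltime "$(date +%H:%M)")" -p core -n 8 --reservation=g2019011_2
```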

Note that booking resources is only relevant for this course and not something you need to use when you regularly work with UPPMAX.

2.3 Quality control

Before doing any other analysis on mapped RNA-seq reads, it is important to run quality control on them and to check that there are no obvious errors in your RNA-seq data.

Quality control

2.4 Mapping

This section contains information on how to map reads to a reference genome using the splice-aware aligners STAR and HISAT2.

Mapping reads using STAR

2.5 IGV

Mapped reads in BAM files are visualised using the Integrative Genomics Viewer (IGV).

Using IGV

2.6 Quantification

Gene counts are quantified from BAM files using featureCounts.

Quantification

2.7 Differential gene expression

We find genes that are differentially expressed between our time points.

DGE using DESeq2

2.8 Functional analysis

We will perform functional analysis on the differentially expressed genes to place them into a functional context and possibly explain the biological consequences of differential expression (DE). Methods covered are GSA (Gene set analysis) and GSEA (Gene set enrichment analysis).

Functional analysis

3 Bonus labs

3.1 Exploratory data analyses

This section dives deeper into two exploratory analyses: PCA and hierarchical clustering.

PCA & Hierarchical clustering

3.2 Pseudoaligners

Kallisto uses FastQ reads and a reference transcriptome (cDNA+ncRNA) to quantify transcripts using rapid pseudo-alignment along with bootstrap replicates to assess quantification inaccuracy. Kallisto is significantly faster than STAR or HISAT2 and has a small memory footprint. Differential gene expression is carried out using Sleuth which utilises bootstrap replicates.

Mapping and quantification using Kallisto, DGE using Sleuth

3.3 Small RNA analyses

A differential analysis workflow for RNA-seq data on microRNAs from fruit fly.

Small RNA-seq analyses

3.4 Assembly & annotation

Raw short sequencing reads are assembled into transcripts using two approaches: genome-guided assembly using HISAT2 and StringTie, and de-novo transcriptome assembly using Trinity. Assembled transcriptomes are functionally annotated to identify genes.

Reference-guided assembly using StringTie
De-novo assembly using Trinity
Transcriptome annotation
