In this tutorial we will go through some of the key steps in performing a quality control on your samples. We will start with the read based quality control, using FastQC. This will analyse your RNA fragments and see if there are any biases

All the data you need for this lab is available in the subfolder fastq of the data folder that you have downloaded to this course .

1 FastQC on single file

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The main functions of FastQC are:

  • Import of data from BAM, SAM or FASTQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML-based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application

You can read more about the program and have a look at example reports at the FastQC website.

This program can be used for any type of NGS data, not only RNA-seq.

1.1 How to run FastQC

sh

# To see help information on the FastQC package:
fastqc --help

# Run for one FASTQ file:
fastqc -o outdir fastqfile

# Run on multiple FASTQ files:
fastqc -o outdir fastqfile1 fastqfile2 etc.

# You can use wildcards to run on all FASTQ files in a directory:
fastqc -o outdir /the/folder/where/you/have/your/data/*.fastq(.gz)

Take a look at some other file and see if it looks similar in quality.

2 MultiQC on FastQC reports

MultiQC is a program that creates summaries over all samples for several different types of QC-measures. You can read more about the program here. It will automatically look for output files from the supported tools and make a summary of them. You can either go to the folder where you have the Fastqc output or run it with the path to the folder.

sh

multiqc /folder/with/FastQC_results/

This should create a folder named multiqc_data with some general stats, links to all files etc. And one file named multiqc_report.html.

Have a look at the report you created.


3 Code to run all the steps above on data on course data

sh


###############################################################################
###          PRE-MAPPING QC ANALYSIS                                     #####
###############################################################################

################
###  FASTQC  ###
################

cd ~/RNAseq

for i in $(ls rawData/fastqFiles/*.fq.gz); do

    fastqc \
      -o results/01-preAlignmentQC/ \
      $i &

done


wait


################
###  MultiQC  ###
################

cd ~/RNAseq/results/01-preAlignmentQC/
multiqc .