Read quality

Quality control of short reads with FastQC
Author

Malin Larsson

Published

16-Nov-2023

1 FastQC

FastQC performes a series of quality control analyses, called modules. The output is a HTML report with one section for each module, and a summary evaluation of the results in the top. Entirely normal results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).

It is important to stress that although the analyses appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should treat the summary evaluations therefore as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.

2 Data

We will run FastQC on three low-coverage whole genome sequencing (WGS) samples from the public 1000 Genomes project. To speed up the analysis we will only use data from a small genomic region. These are the exact same samples as will be used in the variant-calling workflow lab on Wednesday.

Sample Description
HG00097 Low coverage WGS
HG00100 Low coverage WGS
HG00101 Low coverage WGS

3 Run FastQC

3.1 Connect to Uppmax

During this lab it is best to connect to UPPMAX via a remote desktop (ThinLinc). Instructions for this is available in Canvas under Workshop > Contents > Useful resources > Connecting to UPPMAX. Please follow the instructions in section 2 Remote desktop connection.

3.2 Book a node

To be able to run analyses in the terminal you should book a compute node (or in this case just one core of a node). Make sure you only do this once each day because we have reserved one core per student for the course. If you haven’t already reserved a core today please use this command:

salloc -A naiss2023-22-862 -t 04:00:00 -p core -n 1 --no-shell &

Once your job allocation has been granted (should not take long) you can connect to the node using ssh. To find out the name of your node, use:

squeue -u username

The node name is found under nodelist header, you should only see one. Connect to that node:

ssh -Y <nodename>

3.3 Create a workspace

You should work in your folder under the course’s nobackup folder, just like you have done during the previous labs. Start by creating a workspace for this exercise in your folder, and then move into it.

mkdir /proj/naiss2023-22-862/nobackup/username/qc
cd /proj/naiss2023-22-862/nobackup/username/qc

3.5 Accessing FastQC

FastQC is installed in the module system on UPPMAX. Modules must be loaded every time you login to Rackham, or when you connect to a new compute node.
First load the bioinfo-tools module:

module load bioinfo-tools

This makes it possible to load FastQC:

module load FastQC/0.11.8

3.6 Run FastQC

Run FastQC on all fastq files:

fastqc -q *.fq

The output is .html documents that shows quality scores along the reads, and other information. Please check what new files were generated with the command:

ls -lrt

4 Check the results

The output from FastQC is a HTML report that should be opened in a web browser. When you have connected to Uppmax via ThinLinc you can open it on Rackham with this command:

firefox --no-remote filename.html &

We have made the output files that you just created available through the links below, so that you can look at them via your local web-browser:

Sample Read 1 Read 2
HG00097 HG00097_1.fq HG00097_2.fq
HG00100 HG00100_1.fq HG00100_2.fq
HG00101 HG00101_1.fq HG00101_2.fq

4.1 Per Base Sequence Quality

This module shows the distribution of the quality scores at each position in the reads. The quality scores are represented by a Box and Whisker plot with the following elements:

  • The central red line is the median value.
  • The yellow box represents the inter-quartile range (25-75%).
  • The upper and lower whiskers represent the 10% and 90% points
  • The blue line represents the mean quality

The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).

4.1.1 Questions

  1. Which positions in the reads have a median phred-score above 28 (very good quality calls) in each sample?
  2. Do any of the samples have warnings or failures in the Per Base Sequence Quality module?
  3. Why? Please look in the documentation of this module.

4.2 Sequence Length Distribution

This module shows the length distribution of the reads in the file.

4.2.1 Questions

  1. How long are the reads?
  2. Do any of the samples have warnings or failures in the Sequence Length Distribution module?
  3. Why? Please look in the documentation of this module

5 Answers

When you have finished the exercise, please have a look at this document with answers to questions, and compare them with your answers.

6 Documentation