Read quality

Quality control of short reads with FastQC
Author

Malin Larsson

Published

14-Nov-2024

1 FastQC

FastQC performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module, and a summary evaluation of the results at the top. Entirely normal results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).

It is important to stress that although the analyses appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where to concentrate your attention, and work out why your library may not look random and diverse.

2 Data

We will run FastQC on three low-coverage whole genome sequencing (WGS) samples from the public 1000 Genomes project. To speed up the analysis we will only use data from a small genomic region. These are the same samples that will be used in the variant-calling workflow lab on Wednesday.

Sample     Description
HG00097    Low coverage WGS
HG00100    Low coverage WGS
HG00101    Low coverage WGS

3 Run FastQC

3.1 Connect to PDC

During this lab it is best to connect to Dardel via a remote desktop (ThinLinc). Please refer to Connecting to PDC, section 2 Remote desktop, for instructions.

3.2 Book a node

To be able to run analyses in the terminal you should book an interactive session on a compute node. Make sure you only do this once each day because we have reserved a limited number of cores per student for the course. If you haven’t already done this today please use this command:

salloc -A edu24.uppmax -t 04:00:00 -p shared -c 4 --no-shell &

Once your job allocation has been granted (this should not take long) you can connect to the node using ssh. To find out the name of your node, use the following command, replacing username with your own user name:

squeue -u username

The node name is found under the NODELIST header; you should only see one node. Connect to that node:

ssh -Y <nodename>
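
If you prefer not to read the node name off the squeue table by hand, the lookup can be scripted. The sketch below assumes that your user name is available in $USER and that you have a single running allocation; my_nodes is a hypothetical helper name, not part of Slurm. A job that is still pending has no node yet and gives an empty line.

```shell
# Hypothetical helper: print the node(s) allocated to your jobs.
# -h drops the header row, -o '%N' prints only the NODELIST column.
my_nodes() {
    squeue -u "$USER" -h -o '%N'
}

# Usage on Dardel (commented out so the snippet has no side effects):
#   ssh -Y "$(my_nodes | head -n 1)"
```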

3.3 Create a workspace

You should work in a subfolder in your home folder on Dardel, just like you have done during the previous labs. Start by going to your course folder using this command:

cd ~/ngsintro

Create a folder for this exercise and move into it:

mkdir qc
cd qc

3.5 Accessing FastQC

FastQC is installed in the module system on PDC. Modules must be loaded every time you log in to Dardel, or when you connect to a new compute node.
First load the bioinfo-tools module:

module load bioinfo-tools

This makes it possible to load FastQC:

module load fastqc/0.12.1
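
A quick way to confirm that the modules were loaded on the node you are working on is to check that the fastqc command can be found. This is a minimal sketch; require_tool is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found; did you load its module on this node?" >&2
        return 1
    fi
}

# Usage after loading the modules:
#   require_tool fastqc && fastqc --version
```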

3.6 Run FastQC

Run FastQC on all fastq files:

fastqc -q *.fq
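
By default FastQC writes its reports next to the input files. As an optional variation that is not required for this lab, the sketch below collects the reports in a separate folder and uses the four cores booked with salloc. The names run_fastqc and fastqc_out are hypothetical and chosen for illustration; -q, -t and -o are standard FastQC options, and the output folder has to exist before FastQC is started.

```shell
# Hypothetical wrapper: run FastQC on every .fq file in the current folder.
run_fastqc() {
    outdir=${1:-fastqc_out}          # assumed folder name, change as needed
    set -- *.fq                      # expand the glob into the argument list
    # If nothing matched, the pattern is left as-is and no such file exists.
    [ -e "$1" ] || { echo "No .fq files in $(pwd)" >&2; return 1; }
    mkdir -p "$outdir"               # FastQC does not create the folder itself
    fastqc -q -t 4 -o "$outdir" "$@"
}

# Usage:
#   run_fastqc            # reports end up in ./fastqc_out
```

If you use this variation, remember that the .html files will be in the chosen output folder rather than in the current folder.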

The output consists of .html documents that show quality scores along the reads, together with other information. Please check what new files were generated with the command:

ls -lrt
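
Alongside each .html report, FastQC writes a .zip archive that contains a tab-separated summary.txt with one PASS, WARN or FAIL line per module. The sketch below tallies these lines without opening a browser. tally_status is a hypothetical helper, and the archive name in the usage comment assumes FastQC's default naming of the output files.

```shell
# Hypothetical helper: count PASS/WARN/FAIL lines read from standard input.
# Each summary.txt line is: status <TAB> module name <TAB> file name.
tally_status() {
    awk -F'\t' '
        { n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
    '
}

# Usage: stream summary.txt out of an archive without unpacking it.
#   unzip -p HG00097_1_fastqc.zip '*/summary.txt' | tally_status
```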

4 Check the results

The output from FastQC is an HTML report that should be opened in a web browser. When you have connected to PDC via ThinLinc you can open it on Dardel with this command:

firefox --no-remote filename.html &

We have made the output files that you just created available through the links below, so that you can look at them in your local web browser:

Sample     Read 1          Read 2
HG00097    HG00097_1.fq    HG00097_2.fq
HG00100    HG00100_1.fq    HG00100_2.fq
HG00101    HG00101_1.fq    HG00101_2.fq

4.1 Per Base Sequence Quality

This module shows the distribution of the quality scores at each position in the reads. The quality scores are represented by a Box and Whisker plot with the following elements:

  • The central red line is the median value.
  • The yellow box represents the inter-quartile range (25-75%).
  • The upper and lower whiskers represent the 10% and 90% points.
  • The blue line represents the mean quality.

The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
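
To see where numbers like the mean quality come from, the per-position mean can be recomputed directly from a FASTQ file. This is a rough sketch and not FastQC's own implementation: it assumes plain four-line FASTQ records and Phred+33 encoded qualities, and per_base_mean_quality is a hypothetical helper name.

```shell
# Hypothetical helper: mean Phred score at each read position.
# Assumes 4 lines per record and Phred+33 encoding (quality = ASCII code - 33).
per_base_mean_quality() {
    awk '
    BEGIN {
        # awk has no ord(), so build a character -> ASCII code lookup table
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                      # every 4th line holds the qualities
        n = length($0)
        for (p = 1; p <= n; p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            cnt[p]++
        }
        if (n > maxlen) maxlen = n
    }
    END {
        for (p = 1; p <= maxlen; p++) printf "%d\t%.1f\n", p, sum[p] / cnt[p]
    }' "$1"
}

# Usage: position and mean quality, one line per position.
#   per_base_mean_quality HG00097_1.fq | head
```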

4.1.1 Questions

  1. Which positions in the reads have a median Phred score above 28 (very good quality calls) in each sample?
  2. Do any of the samples have warnings or failures in the Per Base Sequence Quality module?
  3. Why? Please look in the documentation of this module.

4.2 Sequence Length Distribution

This module shows the length distribution of the reads in the file.
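
The same kind of table can be produced on the command line as a cross-check. A minimal sketch, assuming plain four-line FASTQ records; read_length_table is a hypothetical helper name.

```shell
# Hypothetical helper: tabulate read lengths in a FASTQ file.
# Prints "length <TAB> number of reads", sorted by length.
read_length_table() {
    awk 'NR % 4 == 2 { n[length($0)]++ }   # every 2nd line of a record is the sequence
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}

# Usage:
#   read_length_table HG00097_1.fq
```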

4.2.1 Questions

  1. How long are the reads?
  2. Do any of the samples have warnings or failures in the Sequence Length Distribution module?
  3. Why? Please look in the documentation of this module.

5 Answers

When you have finished the exercise, please have a look at this document with answers to questions, and compare them with your answers.

6 Documentation