Read quality

1 FastQC

FastQC performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module, and a summary evaluation of the results at the top. Entirely normal results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).

It is important to stress that although the analyses appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where to concentrate your attention, and work out why your library may not look random and diverse.
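Besides the HTML report, FastQC writes the same per-module verdicts to a plain-text file called summary.txt inside the .zip archive it produces for each input file. Once you have run FastQC (section 3.6), a small helper along the following lines can list only the modules that did not pass, which is a quick way to see where to look first. This is a minimal sketch: the function name flagged_modules is made up for illustration, and the file is assumed to be tab-separated with the status (PASS, WARN or FAIL) in the first column and the module name in the second.

```shell
# Print the modules that FastQC marked as WARN or FAIL.
# Assumes a tab-separated summary.txt: status, module name, input file name.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Example use, after unpacking one of the .zip archives:
#   unzip -q HG00097_1_fastqc.zip
#   flagged_modules HG00097_1_fastqc/summary.txt
```

A WARN or FAIL line is only a pointer: whether it matters still depends on what you expect from your library.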

2 Data

We will run FastQC on three low-coverage whole genome sequencing (WGS) samples from the public 1000 Genomes project. To speed up the analysis we will only use data from a small genomic region. These are the exact same samples as will be used in the variant-calling workflow lab on Wednesday.

Sample	Description
HG00097	Low coverage WGS
HG00100	Low coverage WGS
HG00101	Low coverage WGS

3 Run FastQC

3.1 Connect to PDC

During this lab it is best to connect to Dardel via a remote desktop (ThinLinc). Please refer to Connecting to PDC, section 2 Remote desktop, for instructions.

3.2 Log on to a node

Check which node you got when you booked resources this morning (replace username with your username):

squeue -u username

The output should look something like this:

user@login1 ~ $ squeue -u user
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5583899    shared interact    user  R       2:22      1 nid001009
user@login1 ~ $

where nid001009 is the name of the node (yours will probably be different). Note the numbers in the TIME column: they show how long the job has been running. When the job reaches the time limit you requested, the session will shut down and you will lose all unsaved data.
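If you prefer not to read the node name off the table by eye, it can be picked out of the squeue output with awk. The helper below is only a sketch: running_nodes is a made-up name, and it assumes the default column layout shown above, where the fifth column is the job state (ST) and the eighth is NODELIST(REASON). A job name containing spaces would shift the columns and break it.

```shell
# Print the node name of each running job from squeue output on stdin.
# Assumes the default column layout and no spaces in job names:
# column 5 is the state (ST) and column 8 is NODELIST(REASON).
running_nodes() {
    awk 'NR > 1 && $5 == "R" { print $8 }'
}

# Example use on the login node:
#   squeue -u "$USER" | running_nodes
```

Jobs that are still pending (state PD) are skipped, since they have no node yet.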

If the list is empty, run the allocation command again; the job should then appear in the list:

salloc -A edu24.uppmax --reservation=edu24-11-26 -t 04:00:00 -p shared -c 1

Connect to your node from the login node (replace nid001009 with the name of your node):

ssh -Y nid001009

3.3 Create a workspace

You should work in a subfolder in your home folder on Dardel, just like you have done during the previous labs. Start by going to your course folder using this command:

cd ~/ngsintro

Create a folder for this exercise and move into it:

mkdir qc
cd qc

3.4 Symbolic links to data

The raw data files are located in

/sw/courses/ngsintro/vc/data/fastq

Instead of copying the files to your workspace you should create symbolic links (soft-links) to them. Soft-linking files and folders allows you to work with them as if they were in your current directory, but without duplicating them. Create symbolic links to the fastq files in your workspace:

ln -s /sw/courses/ngsintro/vc/data/fastq/HG00097_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00097_2.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00100_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00100_2.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00101_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00101_2.fq
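The six commands above follow one pattern (three samples, two read files each), so the same links can be created with a nested loop. The function below is a sketch of that idea; link_fastq is a name chosen here for illustration and is not part of the course material.

```shell
# Create a symbolic link in the current directory for both read files
# of each sample. The first argument is the directory holding the fastq files.
link_fastq() {
    data_dir=$1
    for sample in HG00097 HG00100 HG00101; do
        for read in 1 2; do
            ln -s "$data_dir/${sample}_${read}.fq" .
        done
    done
}

# Example use from inside your qc folder:
#   link_fastq /sw/courses/ngsintro/vc/data/fastq
```

If a link with the same name already exists, ln reports an error for that file and the loop continues with the next one.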

3.5 Accessing FastQC

FastQC is installed in the module system on PDC. Modules must be loaded every time you log in to Dardel, or when you connect to a new compute node.
First load the bioinfo-tools module:

module load bioinfo-tools

This makes it possible to load FastQC:

module load fastqc/0.12.1
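Because modules have to be loaded again in every new session, it can be handy to check that a program is actually available before running it. The following is a small sketch of such a check; have_tool is a made-up helper name, and it relies only on the standard shell builtin command -v.

```shell
# Return success if the named program is on PATH; otherwise print a hint.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; load its module first" >&2
    return 1
}

# Example use:
#   have_tool fastqc && fastqc --version
```

If the check fails, load the two modules above again and retry.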

3.6 Run FastQC

Run FastQC on all fastq files:

fastqc -q *.fq

The output is a set of .html documents that show quality scores along the reads, and other information. Check which new files were generated with this command:

ls -lrt
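Rather than scanning the listing by eye, you can check that every input file got a report. The sketch below assumes that FastQC names the report for an input file name.fq as name_fastqc.html in the same folder; missing_reports is a helper name invented for this example.

```shell
# List every .fq file in the current directory that has no matching report.
# Assumes FastQC names the report <name>_fastqc.html for an input <name>.fq.
missing_reports() {
    for fq in *.fq; do
        # Skip the unexpanded pattern when there are no .fq files at all.
        [ -e "$fq" ] || [ -L "$fq" ] || continue
        report="${fq%.fq}_fastqc.html"
        [ -e "$report" ] || echo "$fq"
    done
}

# Example use from inside your qc folder:
#   missing_reports
```

No output means that all reports are in place.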

4 Check the results

The output from FastQC is an HTML report that should be opened in a web browser. When you have connected to PDC via ThinLinc you can open it on Dardel with this command (replace filename.html with the name of one of your reports):

firefox --no-remote filename.html &

We have also made the same output files available through the links below, so that you can view them in your local web browser:

Sample	Read 1	Read 2
HG00097	HG00097_1.fq	HG00097_2.fq
HG00100	HG00100_1.fq	HG00100_2.fq
HG00101	HG00101_1.fq	HG00101_2.fq

4.1 Per Base Sequence Quality

This module shows the distribution of the quality scores at each position in the reads. The quality scores are represented by a Box and Whisker plot with the following elements:

The central red line is the median value.
The yellow box represents the inter-quartile range (25-75%).
The lower and upper whiskers represent the 10% and 90% points.
The blue line represents the mean quality.

The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
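To get a feeling for what the numbers on the y axis mean, recall the standard definition of a Phred quality score: Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The sketch below turns that around and converts a score into an error probability; phred_to_error is a name made up for this example.

```shell
# Convert a Phred quality score Q to the probability that the base call
# is wrong, using P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

# Example use:
#   phred_to_error 20    # prints 0.0100
#   phred_to_error 28    # prints 0.0016
```

A score of 28 thus corresponds to roughly 16 wrong calls per 10,000 bases, and every 10 points lower means ten times more errors.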

4.1.1 Questions

Which positions in the reads have a median Phred score above 28 (very good quality calls) in each sample?
Do any of the samples have warnings or failures in the Per Base Sequence Quality module?
Why? Please look in the documentation of this module.

4.2 Sequence Length Distribution

This module shows the length distribution of the reads in the file.
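The same information can be cross-checked directly from a fastq file. The sketch below assumes the usual four-line fastq records, with the sequence on the second line of each record, and prints each read length followed by the number of reads of that length. The function name read_lengths is chosen here for illustration.

```shell
# Count how many reads of each length a fastq file contains.
# Assumes four-line records, with the sequence on line 2 of each record.
read_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}

# Example use:
#   read_lengths HG00097_1.fq
```

If all reads have the same length, the output is a single line.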

4.2.1 Questions

How long are the reads?
Do any of the samples have warnings or failures in the Sequence Length Distribution module?
Why? Please look in the documentation of this module.

5 Answers

When you have finished the exercise, please have a look at this document with answers to questions, and compare them with your answers.

6 Documentation

FastQC
If you want to learn more details about FastQC please have a look at this video by the Babraham Bioinformatics Institute.