FastQC performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module and a summary evaluation of the results at the top. Entirely normal results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).
It is important to stress that although the analyses appear to give a pass/fail result, these evaluations must be interpreted in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments are expected to produce libraries that are biased in particular ways. You should therefore treat the summary evaluations as pointers to where to concentrate your attention, and try to understand why your library may not look random and diverse.
We will run FastQC on three low-coverage whole genome sequencing (WGS) samples from the public 1000 Genomes project. To speed up the analysis we will only use data from a small genomic region. These are the same samples that will be used in the variant-calling workflow lab on Wednesday.
Sample | Description |
---|---|
HG00097 | Low coverage WGS |
HG00100 | Low coverage WGS |
HG00101 | Low coverage WGS |
During this lab it is best to connect to UPPMAX via a remote desktop (ThinLinc). Instructions for this are available in Canvas under Contents > Additional content > Connecting to UPPMAX. Please follow the instructions in section 1.2 Remote desktop connection.
To be able to run analyses in the terminal you should book a compute node (or in this case just one core of a node). Make sure you only do this once each day, because we have reserved one core per student for the course. If you haven’t already reserved a core today, please use this command:
salloc -A snic2022-22-769 -t 04:00:00 -p core -n 1 --no-shell --reservation=snic2022-22-769_2 &
Once your job allocation has been granted (should not take long) you can connect to the node using ssh. To find out the name of your node, use:
squeue -u username
The node name is listed under the NODELIST header; you should see only one. Connect to that node:
ssh -Y <nodename>
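The node lookup can also be scripted rather than read off the screen. The sketch below assumes the default `squeue` column layout, with a header row that contains a `NODELIST` column and no spaces inside the job name; the helper name `nodelist_column` is made up for this illustration.

```shell
# Hypothetical helper: print the NODELIST column from squeue output on stdin.
# Assumes a header row and whitespace-separated columns.
nodelist_column() {
    awk '
        NR == 1 {
            # Locate the column whose header starts with NODELIST
            for (i = 1; i <= NF; i++) if ($i ~ /^NODELIST/) col = i
            next
        }
        col { print $col }
    '
}

# Only query the queue where squeue exists (i.e. on the cluster)
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" | nodelist_column
fi
```

The printed name is what you pass to `ssh -Y`. If more than one line is printed, you have more than one job running and should check the full `squeue` output.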
You should work in your folder under the course’s nobackup folder, just like you have done during the previous labs. Start by creating a workspace for this exercise in your folder, and then move into it.
mkdir /proj/snic2022-22-769/nobackup/username/qc
cd /proj/snic2022-22-769/nobackup/username/qc
The raw data files are located in
/sw/courses/ngsintro/reseq/data/fastq
Instead of copying the files to your workspace you should create symbolic links (soft-links) to them. Soft-linking files and folders allows you to work with them as if they were in your current directory, but without duplicating them. Create symbolic links to the fastq files in your workspace:
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00097_1.fq
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00097_2.fq
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00100_1.fq
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00100_2.fq
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00101_1.fq
ln -s /sw/courses/ngsintro/reseq/data/fastq/HG00101_2.fq
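The six commands above can also be written as a loop. This is only a sketch: the helper name `link_fastq` is invented here, and it assumes that every `.fq` file in the source directory should be linked.

```shell
# Hypothetical helper: link every .fq file from a source directory into the
# current directory, skipping names that already exist here.
link_fastq() {
    src_dir=$1
    for f in "$src_dir"/*.fq; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        name=$(basename "$f")
        [ -e "$name" ] || [ -L "$name" ] || ln -s "$f" "$name"
    done
}

# Course data directory from this lab; the call does nothing where it is absent
link_fastq /sw/courses/ngsintro/reseq/data/fastq

# A symbolic link shows up as "name -> target" in a long listing
ls -l
```

Because existing names are skipped, the loop is safe to run a second time.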
FastQC is installed in the module system on UPPMAX. Modules must be loaded every time you log in to Rackham, or when you connect to a new compute node.
First load the bioinfo-tools module:
module load bioinfo-tools
This makes it possible to load FastQC:
module load FastQC/0.11.8
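Since modules have to be reloaded in every new session, it can help to confirm that the program is actually on your `PATH` before running it. A minimal sketch follows; `require_tool` is a made-up helper name, not part of the module system.

```shell
# Hypothetical helper: succeed only if the named program is on PATH
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found; load its module first" >&2
    return 1
}

# Report rather than abort, so the check is safe to paste into any shell
if require_tool fastqc; then
    fastqc --version
fi
```

If the message "fastqc not found" appears, repeat the two `module load` commands above.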
Run FastQC on all fastq files:
fastqc -q *.fq
The output consists of .html documents that show quality scores along the reads, and other information. Please check what new files were generated with the command:
ls -lrt
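Next to each .html report, FastQC by default writes a `_fastqc.zip` archive that contains a tab-separated `summary.txt` with one PASS, WARN or FAIL line per module. The sketch below tallies those flags for each archive in the current directory, as a quick overview before opening the reports; `tally_status` is a hypothetical helper name, and `unzip` is assumed to be installed.

```shell
# Hypothetical helper: count PASS/WARN/FAIL lines in summary.txt text on stdin.
# Each line is expected to look like: STATUS<TAB>module name<TAB>file name
tally_status() {
    awk -F '\t' '
        { n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }
    '
}

# Read summary.txt straight out of each archive, if any archives are present
for z in *_fastqc.zip; do
    [ -e "$z" ] || continue
    printf '%s: ' "$z"
    unzip -p "$z" "${z%.zip}/summary.txt" | tally_status
done
```

As with the icons in the report, these counts only point to modules worth a closer look; they are not a verdict on the library.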
The output from FastQC is an HTML report that should be opened in a web browser. When you have connected to UPPMAX via ThinLinc you can open it on Rackham with this command:
firefox --no-remote filename.html &
We have made the output files that you just created available through the links below, so that you can look at them in your local web browser:
Sample | Read 1 | Read 2 |
---|---|---|
HG00097 | HG00097_1.fq | HG00097_2.fq |
HG00100 | HG00100_1.fq | HG00100_2.fq |
HG00101 | HG00101_1.fq | HG00101_2.fq |
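This module (Per base sequence quality) shows the distribution of the quality scores at each position in the reads. The quality scores are represented by a Box and Whisker plot with the following elements: the central red line is the median value, the yellow box represents the inter-quartile range (25-75%), the upper and lower whiskers represent the 10% and 90% points, and the blue line represents the mean quality.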
The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
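To get a feel for what is being plotted, the per-position mean quality can be computed by hand. The sketch below is not FastQC's own implementation, and it reproduces only the mean, not the median or quartiles. It assumes four-line FASTQ records and Phred+33 encoded qualities, and `mean_quality_by_position` is a made-up helper name.

```shell
# Hypothetical helper: mean Phred score per read position for one FASTQ file.
# Assumes four-line records and Phred+33 quality encoding.
mean_quality_by_position() {
    awk '
        # Build a character -> ASCII code lookup table for printable characters
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line of a record holds the quality string
        NR % 4 == 0 {
            len = length($0)
            if (len > maxlen) maxlen = len
            for (p = 1; p <= len; p++) {
                sum[p] += ord[substr($0, p, 1)] - 33
                cnt[p]++
            }
        }
        END {
            for (p = 1; p <= maxlen; p++) printf "%d\t%.1f\n", p, sum[p] / cnt[p]
        }
    ' "$1"
}

# Print position and mean quality for one sample, if the file is present
if [ -e HG00097_1.fq ]; then
    mean_quality_by_position HG00097_1.fq
fi
```

The first column is the position in the read and the second is the mean quality at that position, which you can compare against the mean shown in the report.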
This module (Sequence Length Distribution) shows the length distribution of the reads in the file.
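The same distribution can be tabulated directly from a FASTQ file as a cross-check. This is a rough sketch that assumes four-line records with the sequence on the second line of each; `read_length_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: tabulate read lengths in one FASTQ file.
# Output columns: read length, number of reads with that length.
read_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}

# Tabulate one sample, if the file is present
if [ -e HG00097_1.fq ]; then
    read_length_counts HG00097_1.fq
fi
```

A single output line means that all reads in the file have the same length.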
When you have finished the exercise, please have a look at this document with answers to questions, and compare them with your answers.