salloc -A edu24.uppmax -t 04:00:00 -p shared -c 4 --no-shell &
1 FastQC
FastQC performs a series of quality control analyses, called modules. The output is an HTML report with one section for each module and a summary evaluation of the results at the top. Entirely normal results are marked with a green tick, slightly abnormal results raise warnings (orange exclamation mark), and very unusual results raise failures (red cross).
It is important to stress that although the analyses appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A ‘normal’ sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where you should concentrate your attention, and work out why your library may not look random and diverse.
2 Data
We will run FastQC on three low-coverage whole genome sequencing (WGS) samples from the public 1000 Genomes project. To speed up the analysis we will only use data from a small genomic region. These are the same samples that will be used in the variant-calling workflow lab on Wednesday.
Sample | Description |
---|---|
HG00097 | Low coverage WGS |
HG00100 | Low coverage WGS |
HG00101 | Low coverage WGS |
3 Run FastQC
3.1 Connect to PDC
During this lab it is best to connect to Dardel via a remote desktop (ThinLinc). Please refer to Connecting to PDC, section 2 Remote desktop, for instructions.
3.2 Book a node
To be able to run analyses in the terminal you should book an interactive session on a compute node. Make sure you only do this once each day, because we have reserved a limited number of cores per student for the course. If you haven’t already done this today, please use this command:
salloc -A edu24.uppmax -t 04:00:00 -p shared -c 4 --no-shell &
Once your job allocation has been granted (this should not take long) you can connect to the node using ssh. To find out the name of your node, use this command, replacing <username> with your own user name:
squeue -u <username>
The node name is listed under the NODELIST header; you should only see one node. Connect to that node:
ssh -Y <nodename>
3.3 Create a workspace
You should work in a subfolder in your home folder on Dardel, just like you have done during the previous labs. Start by going to your course folder using this command:
cd ~/ngsintro
Create a folder for this exercise and move into it:
mkdir qc
cd qc
3.4 Symbolic links to data
The raw data files are located in
/sw/courses/ngsintro/vc/data/fastq
Instead of copying the files to your workspace you should create symbolic links (soft-links) to them. Soft-linking files and folders allows you to work with them as if they were in your current directory, but without multiplying them. Create symbolic links to the fastq files in your workspace:
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00097_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00097_2.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00100_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00100_2.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00101_1.fq
ln -s /sw/courses/ngsintro/vc/data/fastq/HG00101_2.fq
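The six commands above follow one pattern, so they can also be written as a loop. The following is a minimal sketch and not part of the original lab: the function name `link_fastq` is hypothetical, and it assumes the files are named `<sample>_1.fq` and `<sample>_2.fq` as listed above.

```shell
# link_fastq (hypothetical helper): create soft links in the current
# directory to the paired fastq files of each sample in a data folder.
# Usage: link_fastq <data_dir> <sample> [<sample> ...]
link_fastq() {
    data_dir=$1
    shift
    for sample in "$@"; do
        for pair in 1 2; do
            # -s makes a symbolic link; with no link name given, ln uses
            # the base name of the source file in the current directory.
            ln -s "${data_dir}/${sample}_${pair}.fq"
        done
    done
}

# Example call for the three samples used in this lab:
# link_fastq /sw/courses/ngsintro/vc/data/fastq HG00097 HG00100 HG00101
```

Running `ls -l` afterwards should show each link pointing back to the file in the data folder.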
3.5 Accessing FastQC
FastQC is installed in the module system on PDC. Modules must be loaded every time you log in to Dardel, or when you connect to a new compute node.
First load the bioinfo-tools module:
module load bioinfo-tools
This makes it possible to load FastQC:
module load fastqc/0.12.1
3.6 Run FastQC
Run FastQC on all fastq files:
fastqc -q *.fq
The output consists of .html documents that show quality scores along the reads, and other information. Check which new files were generated with the command:
ls -lrt
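Besides each .html report, FastQC writes a .zip archive per input file, which contains a tab-separated `summary.txt` with the PASS/WARN/FAIL status of every module. The sketch below is not part of the original lab. It prints only the modules that did not pass. The function name `flagged_modules` is hypothetical, and the loop assumes that `unzip` is available and that the archives follow FastQC's default `<name>_fastqc.zip` layout.

```shell
# flagged_modules (hypothetical helper): read FastQC summary lines on stdin
# (status<TAB>module<TAB>file) and print those whose status is not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $3, $1, $2 }'
}

# Apply it to every FastQC archive in the current directory.
# unzip -p streams a single archive member to standard output.
for zip in *_fastqc.zip; do
    [ -e "$zip" ] || continue     # no archives found: skip quietly
    name=${zip%.zip}              # top-level folder inside the archive
    unzip -p "$zip" "${name}/summary.txt" | flagged_modules
done
```

This gives a quick overview of where to look first in the reports; it does not replace reading them.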
4 Check the results
The output from FastQC is an HTML report that should be opened in a web browser. If you are connected to PDC via ThinLinc you can open it on Dardel with this command:
firefox --no-remote filename.html &
We have made the output files that you just created available through the links below, so that you can look at them in your local web browser:
Sample | Read 1 | Read 2 |
---|---|---|
HG00097 | HG00097_1.fq | HG00097_2.fq |
HG00100 | HG00100_1.fq | HG00100_2.fq |
HG00101 | HG00101_1.fq | HG00101_2.fq |
4.1 Per Base Sequence Quality
This module shows the distribution of the quality scores at each position in the reads. The quality scores are represented by a box-and-whisker plot with the following elements:
- The central red line is the median value.
- The yellow box represents the inter-quartile range (25-75%).
- The lower and upper whiskers represent the 10% and 90% points.
- The blue line represents the mean quality.
The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
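A Phred score Q encodes the estimated probability P that a base call is wrong as Q = -10 log10(P), so higher scores mean more reliable calls. As a small illustration that is not part of the original lab, the hypothetical helper `phred_to_prob` below converts a score to its error probability using awk.

```shell
# phred_to_prob (hypothetical helper): print the error probability
# P = 10^(-Q/10) for a Phred quality score Q.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.5f\n", 10 ^ (-q / 10) }'
}

phred_to_prob 20   # one expected error in 100 base calls
phred_to_prob 30   # one expected error in 1000 base calls
```

A score of 20 corresponds to a 1% chance of error and a score of 30 to 0.1%, which is why a drop of ten points on the y axis is a tenfold loss in reliability.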
4.1.1 Questions
- Which positions in the reads have a median phred-score above 28 (very good quality calls) in each sample?
- Do any of the samples have warnings or failures in the Per Base Sequence Quality module?
- Why? Please look in the documentation of this module.
4.2 Sequence Length Distribution
This module shows the length distribution of the reads in the file.
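The same distribution can be checked directly from a fastq file, since each record spans four lines and the sequence is the second of them. This is a minimal sketch and not part of the original lab: the helper name `read_lengths` is hypothetical, and it assumes uncompressed fastq files with exactly four lines per record.

```shell
# read_lengths (hypothetical helper): print "length<TAB>count" for the
# reads in a four-line-per-record fastq file, sorted by length.
read_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}

# Example: read_lengths HG00097_1.fq
```

A single output line means that all reads in the file have the same length.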
4.2.1 Questions
- How long are the reads?
- Do any of the samples have warnings or failures in the Sequence Length Distribution module?
- Why? Please look in the documentation of this module.
5 Answers
When you have finished the exercise, please have a look at this document with answers to questions, and compare them with your answers.
6 Documentation
- FastQC
- If you want to learn more details about FastQC please have a look at this video by the Babraham Bioinformatics Institute.