Scaffolding is usually only performed on large or complex genomes, which makes it difficult to provide a dataset that runs in a short time on the given resources. For this task we instead provide links to familiarise yourself with, so you can try these tools on your own resources at a later date (this may be a good test to determine whether your own resources are sufficient, to develop expectations, and perhaps even to write a workflow).
Firstly, datasets can be hard to find. Here are some datasets that are publicly available.
NBIS Genome Assembly Workshop Wiki - Datasets
Next, tools to process that data are required. These should give you a good starting base:
Long read scaffolding with Links
10X Genomics scaffolding with Arcs
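As an illustration of the kind of command involved, long-read scaffolding with LINKS takes a draft assembly and a file-of-filenames listing the long-read files. This is only a sketch: the file names are placeholders, and you should check the LINKS README for the current options before running it.

```shell
# reads.fof lists the long-read fasta/fastq files, one per line (placeholder name).
echo "nanopore_reads.fa" > reads.fof

# -f: draft assembly to scaffold; -s: file-of-filenames of long reads;
# -b: prefix for the output files.
LINKS -f draft_assembly.fa -s reads.fof -b scaffolded
```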
For the task, take a look at the Datasets page and follow the link to the Illumina E. coli ERX008638 dataset.
Start a download of read 1 from the ftp link directly to your workspace with the wget command (replace $URL with the link you copied).
wget "$URL"
10X Genomics is a fairly recent platform designed for assembling large genomes that also produces phased output.
Use the Tiny dataset and run the supernova pipeline.
module load bioinfo-tools bcl2fastq/2.20.0
PATH="$PATH:/proj/sllstore2017027/workshop-GA2018/tools/supernova-2.1.1"
Make a copy of the Tiny dataset files in your own working directory:
/proj/sllstore2017027/workshop-GA2018/data/10x_tiny
Extract the tar archive using the tar utility.
tar xvf tiny-bcl-2.0.0.tar.gz
Supernova needs to convert the raw Illumina data to fastq. Use the supernova mkfastq module to do this.
supernova mkfastq --run tiny-bcl-2.0.0 --id=tiny-bcl --samplesheet=tiny-bcl-samplesheet-2.1.0.csv
Now you can run Supernova on the fastq files you just generated.
supernova run --id=tiny --maxreads=1200000000 --fastqs tiny-bcl/outs/fastq_path/
Supernova offers various styles of output of the assembly. Take a look at the Supernova Support Page to understand the different kinds of output, and generate one of them.
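As a sketch, one of the output styles (pseudohaploid) can be generated with supernova mkoutput. The --asmdir path below assumes the run above used --id=tiny; adjust it to match your own run.

```shell
# Generate a pseudohaploid fasta from the assembly produced above
# (the assembly directory path assumes --id=tiny was used for supernova run).
supernova mkoutput \
    --style=pseudohap \
    --asmdir=tiny/outs/assembly \
    --outprefix=tiny_pseudohap
```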
Write a slurm script for calculating the data quantity in a fastq file. Incorporate the SLURM_ARRAY_TASK_ID variable.
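A minimal sketch of such a script, assuming gzipped fastq input; the file names and the SBATCH account are placeholders you should replace with your own. The awk one-liner sums the length of every sequence line (line 2 of each 4-line fastq record) to give the total data quantity in bases.

```shell
#!/bin/bash
#SBATCH -A snicXXXX-Y-ZZZ   # placeholder project account
#SBATCH -t 00:10:00
#SBATCH -a 0-1              # one array task per fastq file

# Placeholder file names: adjust to the fastq files you downloaded.
FASTQS=( ERX008638_1.fastq.gz ERX008638_2.fastq.gz )
FASTQ="${FASTQS[$SLURM_ARRAY_TASK_ID]}"

# Every 4-line fastq record has its sequence on line 2;
# summing the sequence lengths gives the data quantity in bases.
echo -n "$FASTQ: "
zcat "$FASTQ" | awk 'NR % 4 == 2 { total += length($0) } END { print total, "bp" }'
```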
As we are working with limited resources, run this script in the terminal on your reserved node, instead of submitting to the slurm queue.
Normally Slurm will set $SLURM_ARRAY_TASK_ID for you when you use the -a option, but since we are running in the terminal, we should set it ourselves and use the export command to make sure the script sees it.
export SLURM_ARRAY_TASK_ID=0
# Make your script executable.
chmod 755 script.sh
# Run the script using its location ( ./ ) and its name.
./script.sh
Further reading: