Scaffolding is usually only performed on large or complex genomes, which makes it difficult to provide a dataset that runs in a short time on the resources available here. For this task, we instead provide links to datasets and tools so that you can try them on your own resources at a later date (this can be a good way to check whether your own resources are sufficient, develop expectations, and perhaps even write a workflow).
Firstly, datasets can be hard to find. Here are some datasets that are publicly available.
NBIS Genome Assembly Workshop Wiki - Datasets
Next, tools to process that data are required. These should give you a good starting base:
Long read scaffolding with LINKS
10X Genomics scaffolding with ARCS
For the task, take a look at the Datasets page and follow the link to the Illumina Ecoli ERX008638 dataset.
Start a download of read 1 from the ftp link directly to your workspace with the wget command (replace $URL with the link you copied).
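As a minimal sketch of the download step, the wget call can be wrapped in a small helper. The function name fetch_reads and the optional destination directory are illustrative assumptions, not part of the exercise; --continue lets an interrupted download resume, and --directory-prefix chooses where the file is saved.

```shell
# fetch_reads is a hypothetical helper name used only for this sketch.
# $1: the ftp link copied from the dataset page
# $2: directory to save into (defaults to the current directory)
fetch_reads() {
    url=$1
    dest=${2:-.}
    # --continue resumes a partial download instead of starting over
    wget --continue --directory-prefix="$dest" "$url"
}

# Usage (with URL set to the link you copied):
#   fetch_reads "$URL"
```

Calling wget "$URL" directly is equivalent for a one-off download; the wrapper only becomes useful if you later fetch read 2 or other datasets in the same way.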
10X Genomics is a fairly recent platform designed for assembling large genomes, and it can also produce phased output.
Use the Tiny dataset and run the supernova pipeline.
Make a copy of the Tiny dataset files in your own working directory:
Extract the tar archive using the tar utility.
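A minimal sketch of the extraction step follows. The helper name extract_archive and the archive name in the usage comment are placeholders, since the actual file name depends on the copy you made; -x extracts, -f names the archive, and -C selects the directory to unpack into.

```shell
# extract_archive is a hypothetical helper name used only for this sketch.
# $1: path to the tar archive
# $2: directory to unpack into (defaults to the current directory)
extract_archive() {
    archive=$1
    dest=${2:-.}
    mkdir -p "$dest"
    # GNU tar and bsdtar detect gzip compression automatically when reading
    tar -xf "$archive" -C "$dest"
}

# Usage (placeholder archive name):
#   extract_archive tiny_dataset.tar.gz .
```

Running tar -tf on the archive first lists its contents without extracting, which is a quick way to see what directory it will create.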
Supernova needs to convert the raw Illumina data to fastq. Use the supernova mkfastq module to do this.
Now you can run Supernova on the fastq files you just generated.
Supernova offers various styles of output of the assembly. Take a look at the Supernova Support Page to understand the different kinds of output, and generate one of them.
Write a slurm script that calculates the data quantity in a fastq file. Incorporate the SLURM_ARRAY_TASK_ID variable.
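One possible shape for such a script is sketched below. It assumes that "data quantity" means the number of reads and bases, that records use the standard 4-line fastq layout, and that a file list named files.txt (a hypothetical name, one fastq path per line) maps array task N to line N. The job name, array range, and time limit are placeholder values.

```shell
#!/bin/bash
#SBATCH --job-name=fastq_quantity
#SBATCH --array=1-2            # one task per line of the file list (adjust to your data)
#SBATCH --time=00:15:00

# Print "<reads><TAB><bases>" for one fastq file (plain or gzipped).
# Assumes 4-line records, so sequences sit on lines 2, 6, 10, ...
count_fastq() {
    fastq=$1
    case "$fastq" in
        *.gz) gzip -cd "$fastq" ;;
        *)    cat "$fastq" ;;
    esac | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                END { printf "%.0f\t%.0f\n", reads, bases }'
}

# Hypothetical file list: one fastq path per line; task N processes line N.
file_list=${FILE_LIST:-files.txt}

if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f "$file_list" ]; then
    # Pick the line of the file list that matches this array task
    fastq=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$file_list")
    printf '%s\t' "$fastq"
    count_fastq "$fastq"
else
    echo "Set SLURM_ARRAY_TASK_ID and provide $file_list to run a task." >&2
fi
```

Using the task ID as a line number keeps the script identical for every task, so the same file works for one fastq or for hundreds by changing only the array range.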
As we are working with limited resources, run this script in the terminal on your reserved node, instead of submitting to the slurm queue.
Normally Slurm will set $SLURM_ARRAY_TASK_ID for you when you use the -a option, but since we are running in the terminal, we should set it ourselves and use the export command to make sure the script sees it.
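The effect of export can be seen with a short experiment. The sketch below writes a stand-in script (hypothetical, it only reports what it sees) to a temporary file and runs it twice: once with the variable set but not exported, and once after exporting it.

```shell
# Stand-in for the array script: it only reports the task ID it can see.
demo=$(mktemp)
cat > "$demo" <<'EOF'
echo "task=${SLURM_ARRAY_TASK_ID:-unset}"
EOF

unset SLURM_ARRAY_TASK_ID        # start from a clean state
SLURM_ARRAY_TASK_ID=1            # plain shell variable, not passed to child processes
before=$(sh "$demo")

export SLURM_ARRAY_TASK_ID       # now part of the environment of child processes
after=$(sh "$demo")

echo "without export: $before"   # without export: task=unset
echo "with export:    $after"    # with export:    task=1
rm -f "$demo"
```

For the real script, the same idea reduces to running export SLURM_ARRAY_TASK_ID=1 once in the terminal and then starting the script with bash.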
Further reading: