3 Analysis
Sequence data has arrived. What next?
3.1 Checklist
3.2 How to:
Structure data
All data should be structured in the data/ folder as follows to make data findable.
data/
| - deliveries
| - raw-data
| - outputs
\ - frozen
data/deliveries contains read-only folders of deliveries from the sequencing centers.
data/raw-data contains subfolders that indicate the sequence data type, e.g., PacBio-Revio-WGS, and within those folders are symlinks to data in subfolders of data/deliveries. Additional data downloaded from other locations should also be in subfolders and include a shell script that can redownload the file from the original location, e.g.,
#!/usr/bin/env bash
curl -O ftp://path/to/public/archive/file.ext
data/outputs contains the results from tools launched in the analyses folders, using the same label as the launching folder, e.g., analyses/01_assembly-workflow_initial-run_rackham puts results in data/outputs/01_assembly-workflow_initial-run_rackham.
data/frozen contains symlinks to folders in data/outputs which are stage end-points. For example, the raw reads have been processed in various ways, and after reviewing the QC results, one folder is selected to be used for assembly. That folder is symlinked in data/frozen.
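The layout described above could be bootstrapped as sketched below. This is a minimal sketch, assuming it is run from the project root. The run label is the example label used in the text, and freezing that particular folder is purely illustrative.

```shell
#!/usr/bin/env bash
set -eu

# Create the four top-level data folders (assumes the current
# directory is the project root).
mkdir -p data/deliveries data/raw-data data/outputs data/frozen

# Example run label taken from the text; substitute the folder
# that was actually selected after reviewing QC.
run_label="01_assembly-workflow_initial-run_rackham"
mkdir -p "data/outputs/$run_label"

# Freeze the selected output with a relative symlink, so the link
# still resolves if the whole project directory is moved.
ln -sfn "../outputs/$run_label" "data/frozen/$run_label"
```

The relative target is resolved from data/frozen, so `../outputs/<label>` points at the matching folder in data/outputs.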
Link data between folders
Files in data/raw-data/* should be symlinks to the corresponding files under data/deliveries/**.
E.g.,
cd data/raw-data/PacBio-Hifi
find ../../deliveries \( -name "*.bam" -o -name "*.pbi" \) -exec ln -s {} . \;
and
cd data/raw-data/Illumina-HiC
find ../../deliveries -name "*.fastq.gz" -exec ln -s {} . \;
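After linking, it can be worth confirming that every symlink resolves, since a relative link created from the wrong directory will silently dangle. The following is a minimal sketch; the function name `list_broken_links` is hypothetical and not part of the template.

```shell
#!/usr/bin/env bash

# Print every symlink under the given directory whose target does
# not exist. `test -e` follows the link, so it fails for dangling links.
list_broken_links() {
    find "$1" -type l ! -exec test -e {} \; -print
}

# Run the check only when called from a project root that has data/raw-data.
if [ -d data/raw-data ]; then
    list_broken_links data/raw-data
fi
```

No output means all links under data/raw-data point at existing files.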
Assemble sequence data
cd analyses/01_ebp-assembly-workflow
- Update the assembly_parameters.yml with the paths to input files. Check the YML for a bash snippet to fill out the section.
- Update the workflow_parameters.yml with any extra workflow parameters. In particular, check anything marked as TODO, e.g., selecting the mitochondrial code table to use.
- Run the workflow.
./run_nextflow.sh
The workflow above only runs until Hi-C mapping. Steps for manual curation onwards are still being implemented.
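Before launching, a quick search can confirm that no TODO markers remain in the parameter files. This is a minimal sketch; the function name `check_todos` is hypothetical, and it assumes unresolved entries contain the literal string TODO.

```shell
#!/usr/bin/env bash

# Print remaining TODO lines (with line numbers) from the given files.
# Returns 1 if any are found, 0 otherwise.
check_todos() {
    if grep -n "TODO" "$@"; then
        echo "Unresolved TODO entries found; edit them before running." >&2
        return 1
    fi
    return 0
}

# Run the check only when called from the workflow's analysis folder.
if [ -f workflow_parameters.yml ]; then
    check_todos workflow_parameters.yml
fi
```

The function's exit status can gate the launch, e.g., `check_todos workflow_parameters.yml && ./run_nextflow.sh`.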
Annotate assemblies
No protocols as of yet
Perform downstream analyses
No protocols as of yet
Integrate new analyses
Custom analyses might be needed. In these cases please make use of the other project folders, and do your best to version control all the steps.
- Put custom code in code/scripts, code/snakemake, and code/nextflow.
- Make sure the code uses containers or conda environments to package the software environment.
- Make an issue on the template to integrate the code into the template so that it’s shareable until it’s integrated into a workflow.
- Make an issue on the relevant workflow to integrate the tools.
Troubleshoot
If you encounter any issues using these protocols, please ask on #vr-accessibility-ebp.