Welcome to Grapevine, the Workflow Manager inspired by Natural
Language Processing!

Download the source code, user manual, and grammar specification.
1. What is it?
Grapevine is an automated workflow management system primarily
designed for implementing bioinformatics pipelines. It follows the
double abstract principle, by which the input language is
freely definable, as are all tools used in a pipeline.
2. How to install?
You will need a Linux system (MacOS X is not officially
supported, though it will likely work as well) and gcc 5.4 or
higher. Clone the repository and build it from the command line:
    git clone https://github.com/NBISweden/Grapevine.git
    cd Grapevine
    make -j 4
Then wait for the compilation to finish.
3. What can Grapevine do for you?
With the advent of high-throughput sequencing technologies,
large data sets have become available. The road from the data as
it comes off the machines to the scientific interpretation of
the experiments is often long and winding, however. Typically,
there are several computational steps that process and re-process
information: sometimes samples pass through linear chains that run
in parallel, and sometimes operations are done on all samples at
once. For reproducibility, it is not enough to document each step;
the process must also be repeatable over and over again, exactly
replicating the findings.
A simple solution is to implement a processing pipeline as a shell
script, and this works fine for simple problems. But what is ever
simple in modern Life Sciences, where samples are generated by the
hundreds or thousands? In response, workflow languages such as
Nextflow and Snakemake were implemented.
While these programs allow for handling large data sets, their use
still requires a certain level of expertise and scripting skills,
since the complexity of the data processing still needs to be
modeled in a script. Moreover, every time a program in the
pipeline changes its name or interface, all scripts have to be
adjusted.
Grapevine's goal is to completely eliminate a script's dependency
on the underlying programs or modules, while at the same time
providing a consistent front-end to the user. This design
philosophy goes as far as allowing for one script to be used on
many different data sets without requiring a single edit.
Knowledge about the inner workings of programs is no longer
required, nor are specific programming skills.
4. How did we do it?
Grapevine is a double-abstract communication
layer that translates any input syntax into any output syntax.
This paradigm is not new; in fact, it has its roots in the
natural language processing that was developed in the last
century. The concept is simple: a context-free grammar, a kind
of directed graph, defines a legal syntax, i.e. a legal command.
So rather than typing:
    bowtie2 -p 16 -x mm10/mm10 -q -1 20170828BK_MC/HA_H33TAG_ctrl_R1.dedup.fastq -2 20170828BK_MC/HA_H33TAG_ctrl_R2.dedup.fastq --fast -S 20170828BK_MC/HA_H33TAG_ctrl.sam
    samtools view -bS -F 4 20170828BK_MC/HA_H33TAG_ctrl.sam > 20170828BK_MC/HA_H33TAG_ctrl.bam
you would rather simply say: "map my reads to the mm10 reference
genome in the mouse/ folder and save the alignments as a bam
file; you can find the fastq files in a table under row
fastq_file".
And in fact you could.
While the example above certainly works, we propose a more
formalized and structured syntax for practical purposes, for
example: "map file @table.fastq_file to reference @ref > @bam".
This syntax is defined in a grammar, as is the translation to
the command above. As you can
see, the implementation of the functionality - map reads to a
reference - is abstracted away from the user, and the aligner
bowtie2 can be replaced with any other aligner without affecting
the script. The variable @ref and the data table are provided
by the user so that this command is universal for different data
sets.
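To make the idea of separating front-end syntax from back-end commands concrete, here is a minimal Python sketch. It is not Grapevine's grammar format or implementation: a single regular expression stands in for the context-free grammar, and the names PATTERN, BACKENDS and translate are hypothetical. The bowtie2 template is also simplified to single-end reads (-U) piped straight into samtools, rather than the paired-end command shown above.

```python
import re

# Front-end syntax: what the user writes. A regex is used here as a
# stand-in for a real grammar (an assumption made for this sketch).
PATTERN = re.compile(r"map file (\S+) to reference (\S+) > (\S+)")

# Back-end templates (hypothetical). Swapping the aligner means
# picking another entry here; the user's script stays unchanged.
BACKENDS = {
    "bowtie2": "bowtie2 -x {ref} -U {reads} | samtools view -bS -F 4 - > {out}",
    "bwa": "bwa mem {ref} {reads} | samtools view -bS -F 4 - > {out}",
}


def translate(line, variables, backend="bowtie2"):
    """Translate one front-end line into a shell command string."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a legal command: {line!r}")
    # Resolve each @name token from the user-supplied variables.
    reads, ref, out = (variables[tok.lstrip("@")] for tok in match.groups())
    return BACKENDS[backend].format(reads=reads, ref=ref, out=out)


script_line = "map file @table.fastq_file to reference @ref > @bam"
values = {"table.fastq_file": "ctrl.fastq", "ref": "mm10/mm10", "bam": "ctrl.bam"}
print(translate(script_line, values))
print(translate(script_line, values, backend="bwa"))
```

The same script line yields a different shell command for each back-end, which is the property the text describes: the script does not depend on the program that does the work.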
That is, of course, just the beginning. By providing the data
and additional info in a plain text table, Grapevine is aware of
multiple samples and will process them in parallel. Depending on
how the scripts are written, files can be processed individually,
or all at once.
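The difference between per-sample and all-at-once processing can be sketched in Python as follows. This is only an illustration under stated assumptions, not Grapevine's table format or scheduler: the table is assumed to be tab-separated with column names in the first line, and read_table, per_sample, all_at_once and run_parallel are hypothetical helpers. The runner is passed in as a function so the sketch can be tried as a dry run without executing anything.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def read_table(text):
    """Parse a tab-separated table whose first line holds the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def per_sample(rows, template):
    """Build one command per table row (files processed individually)."""
    return [template.format(**row) for row in rows]


def all_at_once(rows, template, column):
    """Build a single command that receives every file in one column."""
    return template.format(files=" ".join(row[column] for row in rows))


def run_parallel(commands, runner, workers=4):
    """Apply runner to each command concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


table = "sample\tfastq_file\nctrl\tctrl.fastq\ntreat\ttreat.fastq\n"
rows = read_table(table)
commands = per_sample(rows, "fastqc {fastq_file}")
# Dry run: the runner only labels each command instead of executing it.
print(run_parallel(commands, runner=lambda cmd: "would run: " + cmd))
print(all_at_once(rows, "cat {files} > merged.fastq", "fastq_file"))
```

For real execution, the runner could wrap subprocess.run; keeping it as a parameter leaves the decision about how commands are launched outside the table-handling code.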
To be continued...