Welcome to Grapevine, the Workflow Manager inspired by Natural Language Processing!
Grapevine 3

Download the source code, user manual, and grammar specification.


1. What is it?

Grapevine is an automated workflow management system primarily designed for implementing bioinformatics pipelines. It follows the double-abstract principle: the input language is freely definable, as are all tools used in a pipeline.


2. How to install?

You will need a Linux system (macOS is not officially supported, but will likely work as well) and gcc 5.4 or higher. Clone the repository and compile by typing the following in the command line, then wait for the compilation to finish:

git clone https://github.com/NBISweden/Grapevine.git
cd Grapevine
make -j 4


3. What can Grapevine do for you?

With the advent of high-throughput sequencing technologies, large data sets have become available. The road from the data as it comes off the machines to the scientific interpretation of the experiments is often long and winding, however. Typically, there are several computational steps that process and re-process information, sometimes in a linear chain per sample, sometimes operating on all samples at once. Reproducibility requires not only documenting each step, but also ensuring that the process can be repeated over and over again, exactly replicating the findings.

A simple solution is to implement a processing pipeline as a shell script, and this works fine for simple problems. But what is ever simple in modern Life Sciences, where samples are generated by the hundreds or thousands? In response, workflow languages such as Nextflow and Snakemake were created. While these programs can handle large data sets, using them still requires a certain level of expertise and scripting skills, since the complexity of the data processing still needs to be modeled in a script. Moreover, every time a program in the pipeline changes its name or interface, all scripts have to be adjusted.

Grapevine's goal is to completely eliminate a script's dependency on the underlying programs or modules, while at the same time providing a consistent front-end to the user. This design philosophy goes as far as allowing one script to be used on many different data sets without requiring a single edit. Knowledge about the inner workings of programs is no longer required, nor are specific programming skills.


4. How did we do it?

Grapevine is a double-abstract communication layer that translates any input syntax into any output syntax. This paradigm is not new, in fact, it has its roots in the natural language processing that was developed in the last century. The concept is simple: a context-free grammar, a kind of directed graph, defines a legal syntax, i.e. a legal command. So rather than typing: 

bowtie2 -p 16 -x mm10/mm10 -q -1 20170828BK_MC/HA_H33TAG_ctrl_R1.dedup.fastq -2 20170828BK_MC/HA_H33TAG_ctrl_R2.dedup.fastq --fast -S 20170828BK_MC/HA_H33TAG_ctrl.sam
samtools view -bS -F 4 20170828BK_MC/HA_H33TAG_ctrl.sam > 20170828BK_MC/HA_H33TAG_ctrl.bam

You want to simply say: "map my reads to the mm10 reference genome in the mouse/ folder and save the alignments as a bam file; you can find the fastq files in a table under the column fastq_file".

And in fact you could.

While the example above certainly works, for practical purposes we propose a more formalized and structured syntax, for example: "map file @table.fastq_file to reference @ref > @bam". This syntax is defined in a grammar, as is its translation into the command above. As you can see, the implementation of the functionality - mapping reads to a reference - is abstracted away from the user, and the aligner bowtie2 can be replaced with any other aligner without affecting the script. The variable @ref and the data table are provided by the user, so the same command works across different data sets.
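To make the idea concrete, here is a toy sketch in Python of what such a translation could look like. This is NOT Grapevine's actual grammar format - the rule pattern, the variable names, and the (simplified, single-end) bowtie2 flags are all hypothetical, chosen only to show how one grammar rule can map a sentence of the input language onto an output command:

```python
# Toy illustration of the translation idea (not Grapevine's grammar format).
import re

# One "grammar rule": a legal input sentence and its output template.
# The pattern and the bowtie2/samtools flags are hypothetical examples.
RULE_PATTERN = re.compile(
    r"map file (?P<fastq>\S+) to reference (?P<ref>\S+) > (?P<bam>\S+)"
)
OUTPUT_TEMPLATE = (
    "bowtie2 -p 16 -x {ref} -q -U {fastq} -S {sam} && "
    "samtools view -bS -F 4 {sam} > {bam}"
)

def translate(command: str) -> str:
    """Translate one sentence of the input language into shell syntax."""
    match = RULE_PATTERN.fullmatch(command.strip())
    if match is None:
        raise ValueError(f"not a legal sentence: {command!r}")
    fields = match.groupdict()
    # Derive the intermediate SAM file name from the requested BAM name.
    fields["sam"] = fields["bam"].replace(".bam", ".sam")
    return OUTPUT_TEMPLATE.format(**fields)

print(translate("map file ctrl_R1.fastq to reference mm10/mm10 > ctrl.bam"))
```

Swapping in another aligner would only mean changing OUTPUT_TEMPLATE; the input sentence, and thus the user's script, stays the same.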

That is, of course, just the beginning. By providing the data and additional info in a plain text table, Grapevine is aware of multiple samples and will process them in parallel. Depending on how the scripts are written, files can be processed individually, or all at once.
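The per-sample parallelism can be pictured with a short sketch. The table layout assumed here (tab-separated, header row with a column named fastq_file) and the helper names are invented for illustration and say nothing about Grapevine's internal implementation:

```python
# Sketch: drive one job per table row, in parallel.
# Assumption: a tab-separated sample table with a header row that names
# columns such as "fastq_file" (layout invented for this illustration).
import csv
from concurrent.futures import ThreadPoolExecutor

def read_table(path):
    """Read a plain-text sample table into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

def process_sample(row):
    # A real workflow manager would launch the translated command here;
    # this stub just reports which file would be processed.
    return f"would map {row['fastq_file']}"

def run_all(rows, workers=4):
    """Process every sample in the table concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sample, rows))
```

Because each row carries everything a command needs, the same script scales from one sample to thousands without modification.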

To be continued...