Managing your data

29-Nov-2024

Data (mis)management in practice

What happens to raw data and metadata at each stage:

Data acquisition
  Raw data: Arrives in cumbersome and proprietary format
  Metadata: In researcher’s lab journal
Analysis
  Raw data: Gets converted to format of choice. Original files (and conversion settings) are lost
  Metadata: Hard-coded in various analysis scripts
First submission
  Metadata: Mailed back and forth between collaborators in ever-changing (but nicely coloured) Excel sheets
Review
  Raw data: Leads a quiet life on the HPC cluster, until the project expires and the data has to be urgently retrieved
Second submission
  Raw data: Ends its days on an external hard drive on the researcher’s desk
  Metadata: Reformatted and included as PDF in the supplementary
Publication
  “Data available upon request”

FAIR data

Strive to make your data FAIR for both machines and humans:

  • Findable
  • Accessible
  • Interoperable
  • Reusable
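One way to serve both machines and humans is to store a small machine-readable metadata record next to each data file. The sketch below is only an illustration: the field names (`description`, `licence`, and so on) and the `.meta.json` suffix are assumptions made for this example, not a metadata standard, and real repositories define their own schemas.

```python
import hashlib
import json
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_file, description, licence):
    """Write <file>.meta.json next to `data_file` and return its path.

    The record layout is hypothetical and only meant as a starting point.
    """
    data_file = Path(data_file)
    record = {
        "file": data_file.name,
        "description": description,
        "licence": licence,
        "size_bytes": data_file.stat().st_size,
        "sha256": sha256sum(data_file),
    }
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2) + "\n")
    return sidecar
```

The checksum lets anyone who downloads the file later verify that it is the same file that was described.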

Data management plan

  • Check requirements of funding agency and field of research
  • Determine required storage space for short and long term
  • Provide helpful metadata
  • Consider legal/ethical restrictions if working with sensitive data
  • Find suitable data repositories
  • Strive towards uploading data to its final destination at the beginning of a project
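For the storage point above, a rough estimate of the current footprint is a useful starting number. This is a minimal sketch using only the Python standard library; the directory names `data` and `results` are placeholders for wherever your files live, and the figure reported is the sum of file sizes, which may differ from actual disk usage on compressed or quota-managed file systems.

```python
from pathlib import Path


def directory_size(directory):
    """Total size in bytes of all regular files below `directory`."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def human_readable(n_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    # Placeholder directory names; adjust to your own project.
    for name in ("data", "results"):
        if Path(name).is_dir():
            print(f"{name}/: {human_readable(directory_size(name))}")
```

Long-term needs depend on what is kept after publication, so treat the output as a lower bound for planning rather than a final figure.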

Data sharing

Why Open Access?

  • Publicly funded research should be unrestricted
  • Published results should be verifiable by others
  • Enables others to build upon previous work

Organising your projects

Which sample file represents the most up-to-date version?

$ ls -l data/
-rw-r--r--  user  staff  Nov 12 22:00 samples.tsv
-rw-r--r--  user  staff  Nov 16 11:39 samplesFinal.tsv
-rw-r--r--  user  staff  Nov 18 22:41 samplesFinalV2.tsv
-rw-r--r--  user  staff  Nov 18 13:25 samplesUSE_THIS_ONE.tsv
-rw-r--r--  user  staff  Nov 15 22:39 samplesV2.tsv
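Sorting by modification time is the obvious first attempt, sketched below with the standard library (the `samples*.tsv` pattern and the `data` directory mirror the listing above). It does not settle the question: by timestamp the newest file is samplesFinalV2.tsv (Nov 18 22:41), while the file name that asks to be used, samplesUSE_THIS_ONE.tsv, is older (Nov 18 13:25). Timestamps also change when files are copied or touched, so they are a weak substitute for a clear naming scheme or version control.

```python
from datetime import datetime
from pathlib import Path


def newest_first(directory, pattern="samples*.tsv"):
    """Return files matching `pattern`, most recently modified first."""
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in newest_first("data"):
        stamp = datetime.fromtimestamp(path.stat().st_mtime)
        print(f"{stamp:%b %d %H:%M}  {path.name}")
```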

The project directory

The first step towards working reproducibly: Get organised!

  • Divide your work into distinct projects
  • Keep all files needed to go from raw data to final results in a dedicated directory
  • Use relevant subdirectories

There are many ways to organise a project

A simple but effective example is the following:

code/             Code needed to go from input files to final results
data/             Raw data - this should never be edited
doc/              Documentation of the project
env/              Environment-related files, e.g. Conda environments or Dockerfiles
results/          Output from workflows and analyses
README.md         Project description and instructions
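The layout above can be created with a few lines of Python, which helps keep new projects consistent. This is a sketch, not a prescribed tool: the function name `create_project` and the project name `my-project` are made up for the example, and the generated README text is only a stub to be replaced with a real description.

```python
from pathlib import Path

# Subdirectories and their purpose, taken from the layout above.
LAYOUT = {
    "code": "Code needed to go from input files to final results",
    "data": "Raw data - this should never be edited",
    "doc": "Documentation of the project",
    "env": "Environment-related files, e.g. Conda environments or Dockerfiles",
    "results": "Output from workflows and analyses",
}


def create_project(root):
    """Create the project skeleton under `root` without overwriting anything."""
    root = Path(root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        lines = [f"# {root.resolve().name}", "", "Project description and instructions.", ""]
        lines += [f"- `{name}/`: {purpose}" for name, purpose in LAYOUT.items()]
        readme.write_text("\n".join(lines) + "\n")
    return root


if __name__ == "__main__":
    create_project("my-project")  # hypothetical project name
```

Running it twice is safe: existing directories are left alone and an existing README.md is never overwritten.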

A Snakemake-based example: snakemake-workflows/template

config/
  config.yaml
results/
resources/
workflow/
  rules/
    module1.smk
  envs/
    tool1.yaml
  scripts/
    script1.py
  Snakefile
LICENSE.md
README.md

A Nextflow-based example: fasterius/nbis-support-template

bin/
data/
doc/
env/
results/
README.md
LICENSE
main.nf
nextflow.config

Helpful tools

  • Syntax highlighting, autocomplete, Git integration, etc.
  • Working on an HPC cluster over SSH from the command line

Questions?

Topics for discussion

  • Do you organise your work in distinct projects?
  • How do you organise your files in this context?
  • Are you happy with the way you work today?
  • Does your group have a data management plan in place?
  • Do you know “your” repositories and how to submit data to them?