Managing your data

21-Aug-2025

Data (mis)management in practice

	Raw data	Metadata
Data acquisition	Data arrives in a cumbersome, proprietary format	In the researcher’s lab journal
Analysis	Gets converted to the format of choice. Original files (and conversion settings) are lost	Hard-coded in various analysis scripts
First submission		Mailed back and forth between collaborators in ever-changing (but nicely coloured) Excel sheets
Review	Leads a quiet life on the HPC cluster, until the project expires and the data has to be urgently retrieved
Second submission	Ends its days on an external hard drive on the researcher’s desk	Reformatted and included as PDF in the supplementary
Publication	“Data available upon request”

FAIR data

Strive to make your data FAIR¹ for both machines and humans:

Findable
Accessible
Interoperable
Reusable

Data management plan

Check requirements of funding agency and field of research¹
Determine required storage space for short and long term
Provide helpful metadata
Consider legal/ethical restrictions if working with sensitive data
Find suitable data repositories
Aim to upload data to its final destination already at the beginning of a project
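
Two of the points above, estimating storage needs and providing metadata, lend themselves to a small script. The following is a minimal sketch, not a prescribed tool: the function names `describe_file` and `build_manifest` and the manifest fields are assumptions made for illustration, and a real project should follow the metadata standard of its field and target repository.

```python
import hashlib
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Return the size and SHA-256 checksum of a single file."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large data files do not fill memory
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {"size_bytes": path.stat().st_size, "sha256": digest.hexdigest()}


def build_manifest(data_dir: Path) -> dict:
    """Summarise every file below data_dir (hypothetical manifest layout)."""
    files = {
        p.relative_to(data_dir).as_posix(): describe_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    total = sum(entry["size_bytes"] for entry in files.values())
    return {"total_size_bytes": total, "files": files}
```

Calling `build_manifest(Path("data"))` returns a dictionary that can be written to disk with `json.dumps`. The total size gives a first estimate of the storage needed, and the checksums make it possible to verify later that the files are unchanged.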

Organising your projects

Which sample file represents the most up-to-date version?

$ ls -l data/
-rw-r--r--  user  staff  Nov 12 22:00 samples.tsv
-rw-r--r--  user  staff  Nov 16 11:39 samplesFinal.tsv
-rw-r--r--  user  staff  Nov 18 22:41 samplesFinalV2.tsv
-rw-r--r--  user  staff  Nov 18 13:25 samplesUSE_THIS_ONE.tsv
-rw-r--r--  user  staff  Nov 15 22:39 samplesV2.tsv
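
File names like these carry no reliable ordering, so the only objective hint left is the modification time. As a hedged illustration (the helper name `newest_first` and the default pattern are made up here), the candidates could be ranked as follows:

```python
from pathlib import Path


def newest_first(directory: Path, pattern: str = "samples*.tsv") -> list[Path]:
    """Return files matching pattern, most recently modified first."""
    return sorted(
        directory.glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
```

Going by the timestamps in the listing, `newest_first(Path("data"))` would rank samplesFinalV2.tsv ahead of samplesUSE_THIS_ONE.tsv. The most recently modified file is not necessarily the one that should be used, so this only narrows the question down.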

The project directory

The first step towards working reproducibly: Get organised!

Divide your work into distinct projects
Keep all files needed to go from raw data to final results in a dedicated directory
Use relevant subdirectories

There are many ways to organise a project

A simple but effective example is the following:

code/             Code needed to go from input files to final results
data/             Raw data - this should never be edited
doc/              Documentation of the project
env/              Environment-related files, e.g. Conda environments or Dockerfiles
results/          Output from workflows and analyses
README.md         Project description and instructions
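
The layout above can be created with a few lines of code, which helps keep new projects consistent. This is a minimal sketch under stated assumptions: the function names `create_project` and `make_read_only` are hypothetical, and removing write permission is just one possible way to guard raw data against accidental edits (it follows POSIX permission semantics).

```python
import stat
from pathlib import Path

# Subdirectories taken from the example layout
SUBDIRS = ["code", "data", "doc", "env", "results"]


def create_project(root: Path) -> None:
    """Create the project skeleton; existing files and directories are kept."""
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n\nProject description and instructions.\n")


def make_read_only(data_dir: Path) -> None:
    """Remove write permission from all files below data_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in data_dir.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
```

A typical use would be `create_project(Path("my_project"))`, followed by `make_read_only(Path("my_project/data"))` once the raw data has been copied in.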

A Snakemake-based example: snakemake-workflows/template

config/
  config.yaml
results/
resources/
workflow/
  rules/
    module1.smk
  envs/
    tool1.yaml
  scripts/
    script1.py
  Snakefile
LICENSE.md
README.md

A Nextflow-based example: fasterius/nbis-support-template

bin/
data/
doc/
env/
results/
README.md
LICENSE
main.nf
nextflow.config

Helpful tools

Syntax highlighting, autocomplete, Git integration, etc.:

Working on an HPC cluster over SSH from the command line:

Questions?

Topics for discussion

Do you organise your work in distinct projects?
How do you organise your files in this context?
Are you happy with the way you work today?
Does your group have a data management plan in place?
Do you know “your” repositories and how to submit data to them?
