Planning for data publication

Licensed under CC-BY 4.0 and OSI-approved licenses; see licensing.

Overview

Teaching: 5 min
Exercises: 30-60 min
Questions
  • How do I create a data publication plan?

Objectives
  • After the exercise you will have experience of finding a suitable repository and of finding out what is required for a submission.

Exercise - make a plan for data publication

One of the questions that needs to be addressed in a Data Management Plan (DMP) is “How, when and where will research data or information about data (metadata) be made accessible?”. Since the first version of a DMP is written before a project begins, you need a plan for data sharing early on. Different repositories also have different requirements regarding which metadata to provide and which formats to use when submitting. Knowing the requirements early in a project allows you to collect and format the appropriate metadata while working on the project, when it is still fresh in your and your collaborators' minds. This will greatly reduce the time spent on submission.

During this exercise you will create a plan for making research data available to the public, by identifying a suitable repository and finding out what the requirements are for submission. Three data output scenarios will be presented.

Data publication plan

Things to consider when making a data publication plan:

1. What types of outputs will you be creating or collecting?

Make a list of the outputs. Start with the main, most important data type(s) for the study, since these will indicate which repository is suitable.

2. What are suitable repositories for your outputs?

Depositing your data in a publicly accessible, recognised repository that assigns a globally persistent identifier (e.g. a DOI) helps ensure that your dataset remains available to both humans and machines in a usable form in the future. Try to identify suitable repositories for your main data type(s).

Tools

3. What are the documentation guidelines for the repositories?

Repositories often have documentation guidelines that follow domain-specific standards, and following them will improve the FAIRness of your data. Organising your output documentation according to these guidelines and standards from the start will make your FAIRification journey much easier.

Which data and metadata formats are required for submission? Include a link to the documentation guidelines.

4. Under what licenses will your research outputs be shared?

From a FAIR perspective, you should aim for a license that allows your data to be shared as openly as possible. Does the repository decide the license, or is this decided by you as the submitter? Identify the terms of the repository. If you decide, which license would you choose and why? The Creative Commons license chooser is a tool to assist you in finding an appropriate license for data. (Other tools exist if you instead want to license code or software, e.g. Choose a license.)

The scenarios

Select one of the scenarios below and answer the questions.

Project A

Gene expression profiling of fowlpox knock-out mutant viruses using genome microarrays, to investigate the ability of avian fowlpox virus to modulate host antiviral immune responses.

Project B

Protein expression analysis (proteomics) in order to identify novel protein binders of human MIRO2 in prostate cancer using tandem mass spectrometry (MS/MS).

Project C

Thermal proteome profiling dataset of Hazara virus-infected cells. Thermal proteome profiling is a new data type developed by your group. The dataset consists of a spreadsheet listing the proteins identified and the experimental conditions.

Questions to answer

  • What types of outputs will you be creating or collecting?
  • What are suitable repositories for your outputs?
  • What are the documentation guidelines for the repositories? Which formats for data and metadata are required to be able to submit?
  • Under what licenses will your research outputs be shared?
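
One way to keep track of your answers is to record them in a small structured form. The following is a minimal sketch in Python, not part of the original exercise; the class name DataPublicationPlan and all of its field names are hypothetical and chosen only to mirror the four questions above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataPublicationPlan:
    """Hypothetical record of the answers to the four planning questions."""

    outputs: list[str] = field(default_factory=list)
    repositories: list[str] = field(default_factory=list)
    documentation_guidelines: list[str] = field(default_factory=list)  # links
    required_formats: list[str] = field(default_factory=list)
    license: str = ""

    def unanswered(self) -> list[str]:
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Example: only the first question has been answered so far.
plan = DataPublicationPlan(outputs=["Microarray gene expression data"])
print(plan.unanswered())
```

Calling unanswered() lists the questions that still need an answer, which can serve as a simple checklist while working through a scenario.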

Solutions

Solution project A

Data types

  • Microarray gene expression data

Repository

Documentation guidelines

License

No explicit license, but the ArrayExpress data access policy states: “No restrictions, all public data from ArrayExpress can be used by anyone and our services are completely free of charge.”

Solution project B

Data types

  • Mass spectra

Repository

Documentation guidelines

Prepare submission

  • Ensure that we have RAW (raw data files), RESULT (analysis files in mzIdentML format) and PEAK files
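
A check like the one above can be automated before starting a submission. The sketch below is only an illustration: the mapping from file extensions to the RAW, RESULT and PEAK categories is an assumption made for this example (it is not taken from the exercise or from the repository's documentation), and the function name missing_categories is hypothetical. Adjust the mapping to the files your instrument and analysis software actually produce.

```python
from pathlib import Path

# ASSUMPTION: illustrative extension-to-category mapping, not an official list.
CATEGORY_BY_SUFFIX = {
    ".raw": "RAW",      # raw data files
    ".mzid": "RESULT",  # analysis files in mzIdentML format
    ".mgf": "PEAK",     # peak list files
}
REQUIRED = {"RAW", "RESULT", "PEAK"}


def missing_categories(filenames):
    """Return the required categories that no file matches, sorted by name."""
    present = set()
    for name in filenames:
        suffix = Path(name).suffix.lower()
        if suffix in CATEGORY_BY_SUFFIX:
            present.add(CATEGORY_BY_SUFFIX[suffix])
    return sorted(REQUIRED - present)


# Example with hypothetical file names: no peak list file is included.
print(missing_categories(["sample1.raw", "sample1.mzid"]))
```

An empty result means every required category has at least one file; it does not validate the contents of the files.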

License

No explicit license, but according to Citing PRIDE: “All datasets in PRIDE (as part of ProteomeXchange) are made fully open, once the corresponding paper is published.”

Solution project C

Data types

  • Spreadsheet in xlsx format

Repository

Documentation guidelines

License

There are several licenses to choose between; we decided on Creative Commons Zero (CC0) to make the dataset as open and freely available as possible.