Keeping records

Licenced under CC-BY 4.0 and OSI-approved licenses, see licensing.

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • Why and how do we keep good research records?

  • How do we keep our records FAIR?

Objectives
  • Identify the pros and cons with analogue vs. digital notes

  • Adopt good practices for data analysis documentation

About this episode

The data that you collect, organise, prepare, and analyse to answer your research questions, and the documentation describing it, are the lifeblood of your research. Put bluntly: without data, there is no research. And while data is important, it may be useless to others, and to your future self, without proper documentation.

  1. About this episode
  2. Starting up Your Research
    1. Exercise 1
    2. Examples of major issues
    3. Exercise 2
    4. Problem areas
  3. Why do we need to keep good-quality records?
  4. Principles for good records
  5. Exercise 3 - Test yourself on record keeping statements
  6. Solution
  7. Reaching for FAIR using README-files
  8. Using Markdown for documentation
    1. Exercise 4
  9. Backing up your files and folders
    1. Exercise 5
    2. Comments
    3. Creating a backup strategy in 10 steps
    4. Further reading

Starting up Your Research

Congratulations, researcher! You have been recruited to the lab of Terribly (aka Terry) Famous, an authority in the field with many benchmark publications on the CV. It is truly an honor to work here, and you outcompeted quite a few applicants to earn the position. It will be a major addition to your CV, and will likely increase your chances of receiving funding for your own projects in the future.

The road to success is open!

On your first day of work, Famous lets you know that you will continue, and build on, the work left behind by a previous PhD student, Wang Fang (王芳), who left the Famous lab six months ago to pursue a career elsewhere. (Former co-workers claim she is in China, or the USA. Or maybe the UK. Or Germany. Nobody knows for sure.) Before leaving, Wang sent Famous a photo of her lab notes documenting the data you will work with.

She also left behind a USB stick containing all known files and folders in a zip file.

Lab notes by Wang

Puzzled by the note you ask Famous for additional information. The reply is that all researchers in the Famous lab work independently and are responsible for their own data and lab notes. If the information is not in the notes, the publication, or the zipped folder, it does not exist.

Exercise 1

Can you list at least five major issues with the lab documentation in the image above?

Examples of major issues

  • Unknown if more pages of the notes exist
  • Difficult to read the text
  • Not all text is included in the image
  • Explicit dates are lacking
  • Text follows no clear structure
  • Notes follow no clear timeline
  • Notes mixed with scribbles
  • Unknown if errors in data and/or file names are/were corrected or not
  • Unknown how notes relate to data
  • Changes to files are not referenced in zip file
  • Notes indicate more information exists (email to supervisor)
  • Notes raise questions about data quality
  • Lab notes are mixed with personal notes
  • Unknown what analysis pipeline was used
  • Etc…

Exercise 2

Give one or more examples of the general questions that the note and Famous’ answer raise about the work done in the lab.

Problem areas

  • Is this typical of all the work done in the lab?
  • What results from the lab can be trusted?
  • Are there no established routines for data documentation?
  • Are there no established routines for data backup?
  • Etc…

We will come back to the work at the Famous lab later. What we need to ask ourselves now is…

Why do we need to keep good-quality records?

Good scientific practice depends on keeping and maintaining good records. Good records ensure the data, analysis and results are transparent, reproducible and traceable to relevant persons. Traceability is also a guarantee that someone is accountable and can be contacted for further questions and clarifications.

Keeping good records prevents future problems, where revelations about past data handling and metadata quality call into question not only the original results, but also the subsequent research building on such data (a recent example). As science is cumulative, uncorrected mistakes may multiply over time.

Good practice reduces the risk of data mistakes, data manipulation and research fraud. Making data and documentation open and transparent promotes the values of open science and, in the longer perspective, safeguards the integrity of science itself. An inability to share data and documentation, or inconsistencies in published results, have revealed high-level scientific fraud, e.g. the infamous cases of Dr. Yoshitaka Fujii and Joachim Boldt, who both fabricated data and results, resulting in hundreds of retracted papers.

The fabrication of data and/or results is not only a threat to the integrity of science in itself. Once published, fraudulent papers can also keep being cited for years after they have been retracted.

While fraudulent activity is indeed a problem, the more positive arguments for keeping good-quality records can be described by the FAIR principles. Good records help make data and documentation

  • Findable,
  • Accessible,
  • Interoperable, and
  • Reusable

In that context, written lab notes on paper can still fulfil the FAIR principles, but to a lesser degree than digital ones. Making your lab notes and protocols digital, and even available online, makes it easier to share them with anyone who needs them for a publication. Submitting them to a public repository (e.g. Zenodo or Figshare) provides them with persistent identifiers (PIDs) and makes them readily citable.

There are several platforms for keeping digital lab notes (see here for a comprehensive list and comparison of different platforms). They help you document your workflow and make the data and documentation easier to access and share among people and across time.

Principles for good records

Protocols and lab notes should be kept detailed, up-to-date, and accurate. They should be accessible and easily understood by both yourself and others, now and in the future. Keeping records in digital format makes them easy to back up and to share. Content of records can include, but should not be limited to:

  • Your name, affiliation and contact information
  • Who the originator of the protocol is (if not you)
  • Detailed and structured information on why and how an experiment was done
  • Health and safety advice
  • Required hardware, software, or materials/instruments being used and when/where they were obtained
  • Sufficient information so that someone can understand what has been done without having to ask others
  • Described mistakes, so they can be avoided in future applications of the protocol

While the protocol is a confirmed recipe for producing research data in an experiment, it is good to keep separate records of some of the information surrounding the experiment in a lab notebook. The notebook is your place for notes and comments on the protocol. In addition to being kept well-organised and accurate, a lab notebook can include the following:

  • Relevant details on what you did in the lab, when, and how
  • Your name and affiliation
  • What project the experiment is part of
  • Information on lot and batch numbers for used consumables (e.g. reagents and chemicals)
  • Information on what metadata is collected for each data type
  • What happened, and what did not happen
  • How the result was treated and analysed
  • Your interpretation of the outcome and how you plan to proceed
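
The notebook items above can also be captured as a structured digital entry. The following is a minimal Python sketch, not a prescribed format: the `NotebookEntry` class, its field names, and all example values are hypothetical placeholders chosen to mirror the list above.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json


@dataclass
class NotebookEntry:
    """One lab notebook entry; field names are illustrative, not a standard."""
    author: str
    affiliation: str
    project: str
    entry_date: str              # ISO 8601 date, e.g. "2024-03-01"
    procedure: str               # what was done in the lab, and how
    lot_numbers: dict = field(default_factory=dict)  # consumable -> lot/batch
    observations: str = ""       # what happened, and what did not happen
    analysis: str = ""           # how the result was treated and analysed
    interpretation: str = ""     # your reading of the outcome and next steps

    def to_json(self) -> str:
        """Serialise the entry as plain-text JSON, which is easy to back up."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values for illustration only.
entry = NotebookEntry(
    author="A. Researcher",
    affiliation="Example University",
    project="example-project",
    entry_date=date(2024, 3, 1).isoformat(),
    procedure="Prepared buffer according to protocol v2.",
    lot_numbers={"buffer salt": "LOT-0000"},
    observations="No precipitate formed.",
)
print(entry.to_json())
```

Storing entries as plain text in this way keeps them searchable and lets them follow the same backup routine as the data they describe.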

Exercise 3 - Test yourself on record keeping statements

Read the following statements and decide which ones are true (T) or false (F)

  • Analogue and digital records make information equally findable.
  • New information in digital records can be easily shared with other users.
  • Analogue records can be kept safe from any physical accidents.
  • All researchers in a shared lab should have access to the same platform for keeping records and taking notes.
  • Digital records should follow the same backup strategy as the data they describe.

Solution

  • Analogue and digital records make information equally findable (F)
  • New information in digital records can be easily shared with other users (T)
  • Analogue records can be kept safe from any physical accidents (F)
  • All researchers in a shared lab should have access to the same platform for keeping records and taking notes (T)
  • Digital records should follow the same backup strategy as the data they describe (T)

Reaching for FAIR using README-files

README-files serve many purposes and are perhaps most often encountered when installing software. The mere presence of a README-file attracts attention.

Unless you are of the opinion that data speaks for itself, some explanation of file contents and folder structure may be required. Why not include it in a README and place it where it is easy to find?

Such files can be added at many levels in the file hierarchy:

  • Folder level, aiming at explaining the contents, naming, file history, and organisation of a folder structure, and/or
  • Together with e.g. data, explaining file naming convention and/or file contents.

The purpose of adding README files is to explicitly document everything you (and others) need in order to understand the files and folders in the future.
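
A folder-level README can be started from a generated skeleton and then filled in by hand. The sketch below is one possible approach, written in Python under stated assumptions: the function names `build_readme` and `write_readme`, the section headings, and the `TODO` placeholders are all hypothetical, not a required layout.

```python
from pathlib import Path
from datetime import date


def build_readme(folder: Path, title: str, contact: str) -> str:
    """Return Markdown text listing the items found directly in `folder`."""
    lines = [
        f"# {title}",
        "",
        f"Contact: {contact}",
        f"Last updated: {date.today().isoformat()}",
        "",
        "## Files",
        "",
    ]
    for path in sorted(folder.iterdir()):
        if path.name == "README.md":
            continue  # do not list the README itself
        kind = "folder" if path.is_dir() else "file"
        lines.append(f"- `{path.name}` ({kind}): TODO describe contents")
    lines += ["", "## File naming convention", "", "TODO describe the convention"]
    return "\n".join(lines) + "\n"


def write_readme(folder: Path, title: str, contact: str) -> Path:
    """Write README.md into `folder` and return its path."""
    target = folder / "README.md"
    target.write_text(build_readme(folder, title, contact), encoding="utf-8")
    return target
```

Calling, for example, `write_readme(Path("data"), "Raw measurements", "name@example.org")` would create `data/README.md` with one line per file or subfolder. The generated list only saves typing; the descriptions that make the folder understandable still have to be written by a person.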

Using Markdown for documentation

Markdown was developed as a lightweight, easy-to-read and easy-to-write text format for the web (for example, the web pages for this course are written in Markdown). By creating your README-files using Markdown (.md), you ensure that the files contain only plain text with a simple yet powerful formatting syntax, which can be combined with e.g. HTML tags.

But why use Markdown? Why not make a plain text file (.txt)?

  • A plain text file is limited to plain text. You lack any options to format the text or contents.

  • Markdown is highly compatible with e.g. GitHub.

  • Markdown allows explanatory comments to be included in the text without being displayed in the rendered document.

  • Markdown is versatile and easy to edit, and requires no particular skills.
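
To illustrate the point about comments: Markdown accepts HTML comments (`<!-- ... -->`), which stay in the source file but are not displayed when the page is rendered. The Python sketch below only approximates that behaviour with a regular expression; it is not a Markdown renderer, and the README text and the `visible_text` helper are made up for this example.

```python
import re

# A hypothetical README in Markdown, containing one HTML comment.
README_TEXT = """\
# Sample measurements

<!-- Reminder: confirm the instrument settings before publishing. -->

Files are named `YYYY-MM-DD_sample-ID.csv`.
"""

# HTML comments may span several lines, hence re.DOTALL.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def visible_text(markdown: str) -> str:
    """Approximate what a reader of the rendered page sees: drop HTML comments."""
    return COMMENT.sub("", markdown)


print(visible_text(README_TEXT))
```

The reminder remains available to anyone who opens the `.md` file in a text editor, while readers of the rendered page see only the heading and the naming convention.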

Exercise 4

Think of an example where you would have benefited from having access to a README-file when working with data. Describe to your neighbor what you would have wanted such a file to contain.

Backing up your files and folders

Even if memory serves you well, technology might not. Know your storage needs and plan solutions accordingly. Factors playing a role are, for example, data sensitivity, ease of access, file size and overall data volume. You can also ask yourself where, how and by whom your data will be produced, accessed, transformed, and transferred throughout and beyond the project.

  • Nearly all data, metadata and project information necessary to understand your analysis and results require some sort of backup strategy.
  • Try to keep backups in three separate locations, on at least two different kinds of media (e.g. server, portable hard drive, cloud). Consider off-site backups.
  • Never back up your data on portable drives only (SSD or HDD), and particularly not on USB sticks!
  • Robust backups need to be automated.
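
As a rough illustration of the points above, the following Python sketch copies a project folder to several destinations and verifies each copy with a checksum. It is a minimal example under stated assumptions, not a complete backup solution: the function names `sha256_of` and `back_up` are hypothetical, the destinations are assumed to be already mounted and writable, and scheduling the script (e.g. with your operating system's task scheduler) is left out.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> dict[str, str]:
    """Copy `source` into each destination and verify every copied file.

    Returns a mapping of relative file path -> checksum of the original.
    Raises ValueError if any copy differs from the original.
    """
    checksums = {
        str(p.relative_to(source)): sha256_of(p)
        for p in sorted(source.rglob("*")) if p.is_file()
    }
    for dest in destinations:
        target = dest / source.name
        # dirs_exist_ok lets repeated runs refresh an existing backup copy.
        shutil.copytree(source, target, dirs_exist_ok=True)
        for rel, expected in checksums.items():
            if sha256_of(target / rel) != expected:
                raise ValueError(f"Checksum mismatch for {rel} in {dest}")
    return checksums
```

A call such as `back_up(Path("project"), [Path("/mnt/server"), Path("/mnt/external")])` uses placeholder paths; in practice the destinations should sit on different kinds of media, with at least one off-site. Note that this sketch overwrites earlier copies and keeps no version history, which a real backup tool would normally provide.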

Exercise 5

Discuss in pairs the validity of the following statements on data backup:

A. I have my most important data backed up on my laptop. I have never experienced a hard drive failure, and my current laptop has a brand new state-of-the-art hard drive. Therefore, I don’t need external backups.

B. All my data is stored in a cloud service, or on a computation cluster (e.g. UPPMAX).

C. My data is on a portable hard drive. There is a backup of the most important files on a shared USB stick in my research group.

D. My data is on a departmental backup administered by my university. Additionally, we have a server for all the data stored in our project.

E. We have no shared backup at all. All members in our research group are responsible for their own data.

Comments

A. Unsafe and not recommended. All hard drives can be subject to failure. In case of failure, all data will be lost.

B. Cloud services can be sufficient as backup, but are not fail-safe. They can be sufficient in combination with a secondary backup on e.g. a shared server. For certain types of data (e.g. sensitive information), a cloud service may be outright inappropriate.

C. Not a good solution. Both portable hard drives and USB sticks are prone to failure.

D. A good solution in general. Data is stored independently in two separate systems. Centrally administered services are usually organised in such a way that partial failures do not affect the users.

E. Worst possible alternative. A disaster waiting to happen.

Creating a backup strategy in 10 steps

  1. Find out whether your institution has a backup strategy
  2. Determine what you want to back up
  3. Decide how many backups you will need and how frequently to back up
  4. Decide where backups will be stored
  5. Determine how much storage capacity will be needed
  6. Determine if there are tools you could use to automate backup
  7. Determine how long backups will be kept and how they will be destroyed
  8. Determine how personal data will be protected
  9. Devise a disaster recovery plan
  10. Assign responsibilities

Further reading