Controlled vocabularies & ontologies

Licensed under CC-BY 4.0 and OSI-approved licenses, see licensing.

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How can metadata be described consistently?

Objectives
  • To understand what controlled vocabularies and ontologies are

Someone once said, “A biologist would rather share a toothbrush with another biologist than share a gene name.” This is probably true for other domains of research too. If we each name things as we see fit, without regard for our fellow researchers, we run the risk of being inconsistent and unclear. To help with consistency when describing our data, we can use standardized collections of terms: controlled vocabularies, ontologies, thesauruses, and taxonomies. The names for these different types of collections are often used interchangeably.

How many phenomena?

How many different medical conditions do you think this list of terms describes?

Bloodstream Infection, Circulatory Failure, Toxic Shock Syndrome, Pyemia, Circulatory Collapse, Blood Poisoning, Endotoxin Shock, Pyohemia, Hypovolemic Shock, Septicemia, Sepsis-associated hypotension, Pyaemia

Solution

  • Sepsis: Blood Poisoning, Bloodstream Infection, Pyaemia, Pyemia, Pyohemia, Septicemia
  • Shock: Circulatory Collapse, Circulatory Failure, Hypovolemic Shock
  • Septic shock: Endotoxin Shock, Sepsis-associated hypotension, Toxic Shock Syndrome

A controlled vocabulary is a list of terms that describes a certain domain of knowledge. In a controlled vocabulary, exactly one term is used to describe each phenomenon, and all of its synonyms are excluded from use. The vocabulary should provide a definition for each term and list its synonyms. In a publicly managed controlled vocabulary, the terms should also have unique identifiers so that they can be referenced.
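As a minimal sketch of this idea, a controlled vocabulary can be modelled as a lookup table in which each preferred term carries an identifier, a definition, and its synonyms. The synonyms are taken from the exercise above. Everything else is an assumption made for illustration: the identifiers (`EX:0001` and so on), the one-line definitions, and the names `Term`, `VOCABULARY`, and `normalise` do not come from any real vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    identifier: str      # hypothetical identifier, not from a real vocabulary
    label: str           # the single preferred term
    definition: str      # placeholder definition for illustration only
    synonyms: tuple = () # names that should be replaced by the label


VOCABULARY = [
    Term("EX:0001", "Sepsis",
         "Life-threatening response of the body to an infection.",
         ("Blood Poisoning", "Bloodstream Infection", "Pyaemia",
          "Pyemia", "Pyohemia", "Septicemia")),
    Term("EX:0002", "Shock",
         "Failure of the circulation to supply the tissues with blood.",
         ("Circulatory Collapse", "Circulatory Failure",
          "Hypovolemic Shock")),
    Term("EX:0003", "Septic shock",
         "Shock that occurs as a complication of sepsis.",
         ("Endotoxin Shock", "Sepsis-associated hypotension",
          "Toxic Shock Syndrome")),
]


def build_index(vocabulary):
    """Map every label and synonym (case-insensitively) to its Term."""
    index = {}
    for term in vocabulary:
        for name in (term.label, *term.synonyms):
            index[name.casefold()] = term
    return index


INDEX = build_index(VOCABULARY)


def normalise(free_text):
    """Return the preferred Term for a label or synonym, or None if unknown."""
    return INDEX.get(free_text.strip().casefold())
```

With this table, `normalise("blood poisoning")` and `normalise("Pyemia")` both resolve to the term labelled `Sepsis`, while a name that is not in the vocabulary yields `None` and can be flagged for review.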

An ontology (in this context) is a controlled vocabulary that, apart from being a list of agreed terms, also captures the relationships between these terms.

For example,

  • in the Human Phenotype Ontology, Myocardial infarction is a type of Abnormal cardiovascular system physiology, which is a type of Abnormality of the cardiovascular system, which is a type of Phenotypic abnormality.
  • in the BRENDA Tissue Ontology, the heart valve is a part of the heart, which is a part of the cardiovascular system, which is a part of the whole body.
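The two chains above can be sketched as a small table of relationships that code can walk. This is only an illustration under simplifying assumptions: the relation labels `is_a` and `part_of`, the name `RELATIONS`, and the function `ancestors` are chosen here, each term is given a single parent per relation, and plain term names stand in for the unique identifiers a real ontology would use.

```python
# (term, relation) -> parent term, copied from the two examples above.
RELATIONS = {
    ("Myocardial infarction", "is_a"):
        "Abnormal cardiovascular system physiology",
    ("Abnormal cardiovascular system physiology", "is_a"):
        "Abnormality of the cardiovascular system",
    ("Abnormality of the cardiovascular system", "is_a"):
        "Phenotypic abnormality",
    ("heart valve", "part_of"): "heart",
    ("heart", "part_of"): "cardiovascular system",
    ("cardiovascular system", "part_of"): "whole body",
}


def ancestors(term, relation):
    """Follow one kind of relationship upwards and collect the terms passed."""
    chain = []
    while (term, relation) in RELATIONS:
        term = RELATIONS[(term, relation)]
        chain.append(term)
    return chain


def is_within(term, candidate, relation):
    """True if `candidate` is reachable from `term` via `relation`."""
    return candidate in ancestors(term, relation)
```

For example, `is_within("heart valve", "cardiovascular system", "part_of")` is true, so data annotated with the specific term can also be found by a search for the broader one. A flat list of terms cannot support that kind of query.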

As you can see, a domain of knowledge can be structured in several different ways, depending on how you look at reality: the first example organizes terms by type, the second by parts. There is no all-encompassing ontology that captures everything, so you will have to rely on several different ontologies to describe your research. The question is which to choose.

You can decide to make your own controlled vocabularies. It might seem like less work than finding good ones that already exist. In the long run, however, you are better off using publicly managed vocabularies and lists, as much of the thinking about how to describe different domains has already been done. Another important aspect is that doing so supports the machine-readability aspect of FAIR. With unique identifiers for the terms that describe your data (and for their relationships), it becomes possible for computer code to generate knowledge descriptions from data.