Overview

Teaching: 10 min
Exercises: 15 min

Questions

How do I create a metadata file compatible with the repository of choice?

Objectives

Export a file from OpenRefine with a subset of columns needed for submitting the data to the repository ENA.

Lesson

Sometimes you would like to export a file that only contains a subset of the data in your project that conforms to a specific standard.

In this lesson of the Introduction to Data Management Practices Workshop we will act as a researcher who wants to submit a subset of the data to the repository ENA (European Nucleotide Archive). To do so we need to prepare sample metadata to conform to the metadata standards of the repository.

We need to consider the following questions:

Which of the existing columns are relevant for the submission?
Are they named correctly?
Are there additional columns that need to be added?

First, we know the data contain three samples per individual, i.e. every individual is represented by three rows in the data. Now, we might want to extract one of those samples for submission, namely the ones produced by the Illumina NEBNext prep kit. We identify these as “NEBNext” in the column configuration.

Create a filter for the column configuration by clicking the down arrow in the column header and selecting Text filter.
A filter box will appear on the left side. Type NEB in the text field and press return. 29 matching rows will be displayed.
In the same box, press invert in the top right to select all the rows which do not have the configuration NEBNext. Note that the box header will turn orange to indicate inverted results.
Confirm that there are 62 matching rows.
Click the down arrow next to All in the left-most column header > Edit rows > Remove matching rows
Remove or reset the filter. Now, all remaining 29 rows should be NEBNext samples. This subsection of the data can be saved using the Export drop down menu as comma-separated values, to be imported into other software like R.

ENA sample metadata

ENA sample metadata can be divided into three groups.

Mandatory metadata
Checklist-based metadata
User-defined metadata

In a previous lesson in this workshop around metadata you already came across the ENA checklists. When creating a data dictionary you identified ENA variables based on the default checklist and ontologies to identify allowed values. We will return to these in a while, but first, we will look at the mandatory fields.

Mandatory fields

Some metadata are mandatory for all samples submitted to ENA regardless of the checklist chosen. We will try to map the existing columns to these variables.

Basic details
sample_alias - The unique name is a submitter provided unique identifier

Organism details
tax_id - The NCBI taxonomy id
scientific_name - based on tax_id

Exercise 6.1

Can any of the existing columns be used as sample_alias, tax_id and scientific_name?

Solution

sample_alias needs to be created. We will do that under the header “Join columns”.

A column tax_id already exists

A column scientific_name already exists

Creating new columns

Sometimes a new variable needs to be added to a dataset, a new set of data input, or a transfer of information from another data source. So far, we have only covered editing already existing columns and cells, but how do we create space for new data in an already open project?

For example, in our dataset we are missing a column for naming the institute responsible for collecting the listed samples. To create such a column, we select a column to the left of where we want to create a new one, and in that column select Edit column > Add column based on this column.... In the new window, in New column name, type collector name, and enter null as value. Clicking OK now creates a new column to the right of the one we just used, without cell values. Repeating what we did earlier in the lesson, edit the contents to the input Valeria Ghiselli and select Apply to All Identical Cells.

Exercise 6.2

Add a new column called collecting entity, and fill all cells with the input Amedeo di Savoia. Can you generate the column to the right of the column collector name?

Solution

On the collector name column, select Edit column > Add column based on this column....

Type collecting entity in New column name, and provide the input value as null just as before.

edit the contents of the cells to Amedeo di Savoia and select Apply to All Identical Cells.

We made a mistake! The column name collecting entity was an incorrect input. The ENA checklist suggests it is better named collecting institution. We need to rename the column, but how do we do it?

To rename the column collecting institution

Click the down arrow next to collecting entity > Edit column > Rename this column. A pop-up window will appear on top.
Type in collecting institution and press return.

Join columns

As we noted in the metadata module, there was a column in the data for sample_alias. In this dataset, we have lost that information and need to re-enter it. Aliases are only used as communicative references in a project but can help in identifying, separating, and clustering individual samples in downstream analyses.

To create aliases for our samples, we can combine cell information from the columns configuration and host subject id to create unique combinations.

First we click the down arrow next to configuration and select Edit column > Join columns.... A pop-up window will appear.
On the left side, choose which columns to join. Verify that configuration is already ticked and tick host subject id as well. (Notice that you can change the order of the information to combine by drag-and-drop the order of ticked columns in the list). To the right, there are several options for the join.
In the separator field, enter an underscore _.
Click Write result in new column named… and enter sample_alias.
If ticked, untick the option Delete joined columns. (we wish to keep the originating columns) and click OK.
Move the new sample_alias column to the beginning by selecting Edit column > Move column to beginning.

Generate information by joining multiple columns

Our dataset still lacks information for the mandatory ENA checklist field isolate, which is defined as the “individual isolate from which the sample was obtained”. Such information can be generated in a number of formats, but here, we will use a combination of [organism/host/location/isolate/date].

Exercise 6.3

Create a new column called isolate, and populate the cells with information for [organism/host/location/isolate/date] as defined by the ENA checklist. Notice that our column names are not (yet) aligned with the ontology of the checklist.

Solution

In the virus_identifier column, select Edit column > Join columns....

In the list, tick boxes for the columns:

organism -> virus identifier

host -> host common name

location -> geographic location (country)

isolate -> sample_alias

date -> collection date.

Add a slash / as Separator between the content of each column:

Tick Write result in new column named... and type isolate. This will create a new column named isolate populated with the above information.

Make sure the box for Delete joined columns is not ticked. Once you are confident your selection is correct, make sure to drag-and-drop the columns to appear in the desired order.

Select OK. All cells should now be populated with isolate information.

Checklist-based and user-defined metadata

Now, let’s have a look at the data dictionary from the metadata lesson. Which variables did you identify an ENA variable name for based on the ENA checklist? Rename the columns where necessary.

Solution

Data dictionary:

Current variable name ENA Variable name Measurement unit Allowed values Definition Description

sample id

patient id host subject id

sex host sex male, female, not collected Sex of the individual

date collection date format: YYYY-MM-DD, >=proj_start_date & <=today Date of sampling

location geographic location (country and/or sea)

location geographic location (region and locality) , , ...

age host age years Age of individual at the time of sampling

health state host health state diseased, healthy, not applicable, not collected, not provided, restricted access Health state of individual at time of sampling

symptoms illness symptoms fever, sore throat, fatigue, loss of taste or smell, not applicable Symptoms experienced in connection with illness

disease outcome host disease outcome recovered, dead Final outcome of disease

tissue isolation source host-associated Tissue sampled

isolate isolate individual isolate from which the sample was obtained

Renaming columns:

Click the down arrow next to a column title to rename > Edit column > Rename this column. A pop-up window will appear on top.

Type in the ENA variable name from the Data dictionary above and press return.

Repeat the steps to rename all columns mentioned in the list above.

Current variable name	ENA Variable name	Measurement unit	Allowed values	Definition
sample id
patient id	host subject id
sex	host sex		male, female, not collected	Sex of the individual
date	collection date		format: YYYY-MM-DD, >=proj_start_date & <=today	Date of sampling
location	geographic location (country and/or sea)
location	geographic location (region and locality)		, , ...
age	host age	years		Age of individual at the time of sampling
health state	host health state		diseased, healthy, not applicable, not collected, not provided, restricted access	Health state of individual at time of sampling
symptoms	illness symptoms		fever, sore throat, fatigue, loss of taste or smell, not applicable	Symptoms experienced in connection with illness
disease outcome	host disease outcome		recovered, dead	Final outcome of disease
tissue	isolation source host-associated			Tissue sampled
isolate	isolate			individual isolate from which the sample was obtained

With all columns renamed to comply with the ERC000033 checklist, what remains is to check if the entered values for the field name also correspond with the correct field restrictions.

First, we notice the values for host health state are only semi-correct. In our column we have stated the values ill and healthy. The correct values should be diseased and healthy.

Second, we can see some inconsistencies in the illness symptoms column. Make sure to correct any mis-spelling (e.g. sore throt).

Removing and reordering data

We have now identified and correctly named the columns we want to include for the ENA submission. The final step is to remove unwanted columns and to make sure they conform with the controlled vocabulary (ENA Checklist).

Exercise 6.4

Using your knowledge from the previous exercises, make sure all required variables are in order or no extra columns are included in your subset of data.

Export the NEBNext subset of data in tsv format.

Solution

Remove all but the following columns and order them according to the list below by

Click the down arrow in the leftmost column header All and select Edit columns Re-order / remove columns…

In the pop/up, drag and re/order the columns to keep in the left box, and put the columns to remove in the right box

When everything looks good, press OK

sample_alias

collection date

tax_id

scientific_name

sample description

geographic location (country)

geographic location (region and locality)

host disease outcome

host common name

host subject id

host age

host health state

host sex

collector name

collecting institute

isolate

illness symptoms

Click Export in the top right and select the file type you want to export the data in. In this case we will choose Tab-separated values (tsv).

The file will be saved in your default downloads folder. Move the file to your course folder and save it as ENA_samples_openrefine_lesson.tsv

Transforming and Selecting a Subset of the Data

Overview

Lesson

ENA sample metadata

Mandatory fields

Exercise 6.1

Solution

Creating new columns

Exercise 6.2

Solution

Join columns

Generate information by joining multiple columns

Exercise 6.3

Solution

Checklist-based and user-defined metadata

Solution

Data dictionary:

Removing and reordering data

Exercise 6.4

Solution