Overview
Teaching: 10 min
Exercises: 15 minQuestions
How do I create a metadata file compatible with the repository of choice?
Objectives
Export a file from OpenRefine with a subset of columns needed for submitting the data to the repository ENA.
Lesson
Sometimes you would like to export a file that only contains a subset of the data in your project that conforms to a specific standard.
In this lesson of the Introduction to Data Management Practices Workshop we will act as a researcher who wants to submit a subset of the data to the repository ENA (European Nucleotide Archive). To do so we need to prepare sample metadata to conform to the metadata standards of the repository.
We need to consider the following questions:
- Which of the existing columns are relevant for the submission?
- Are they named correctly?
- Are there additional columns that need to be added?
First, we know the data contain three samples per individual, i.e. every individual is represented by three rows in the data. Now, we might want to extract one of those samples for submission, namely the ones produced by the Illumina NEBNext prep kit. We identify these as “NEBNext” in the column configuration
.
- Create a filter for the column
configuration
by clicking the down arrow in the column header and selectingText filter
. - A filter box will appear on the left side. Type
NEB
in the text field and press return. 29 matching rows will be displayed. - In the same box, press
invert
in the top right to select all the rows which do not have the configuration NEBNext. Note that the box header will turn orange to indicate inverted results. - Confirm that there are 62 matching rows.
- Click the down arrow next to
All
in the left-most column header >Edit rows
>Remove matching rows
- Remove or reset the filter. Now, all remaining 29 rows should be
NEBNext
samples. This subsection of the data can be saved using theExport
drop down menu ascomma-separated values
, to be imported into other software likeR
.
ENA sample metadata
ENA sample metadata can be divided into three groups.
- Mandatory metadata
- Checklist-based metadata
- User-defined metadata
In a previous lesson in this workshop around metadata you already came across the ENA checklists. When creating a data dictionary you identified ENA variables based on the default checklist and ontologies to identify allowed values. We will return to these in a while, but first, we will look at the mandatory fields.
Mandatory fields
Some metadata are mandatory for all samples submitted to ENA regardless of the checklist chosen. We will try to map the existing columns to these variables.
Basic details
sample_alias - The unique name is a submitter provided unique identifier
Organism details
tax_id - The NCBI taxonomy id
scientific_name - based on tax_id
Exercise 6.1
Can any of the existing columns be used as
sample_alias
,tax_id
andscientific_name
?Solution
sample_alias
needs to be created. We will do that under the header “Join columns”.- A column
tax_id
already exists- A column
scientific_name
already exists
Creating new columns
Sometimes a new variable needs to be added to a dataset, a new set of data input, or a transfer of information from another data source. So far, we have only covered editing already existing columns and cells, but how do we create space for new data in an already open project?
For example, in our dataset we are missing a column for naming the institute responsible for collecting the listed samples. To create such a column, we select a column to the left of where we want to create a new one, and in that column select Edit column
> Add column based on this column...
. In the new window, in New column name
, type collector name
, and enter null
as value
. Clicking OK
now creates a new column to the right of the one we just used, without cell values. Repeating what we did earlier in the lesson, edit
the contents to the input Valeria Ghiselli
and select Apply to All Identical Cells
.
Exercise 6.2
Add a new column called
collecting entity
, and fill all cells with the inputAmedeo di Savoia
. Can you generate the column to the right of the columncollector name
?Solution
- On the
collector name
column, selectEdit column
>Add column based on this column...
.- Type
collecting entity
inNew column name
, and provide the inputvalue
asnull
just as before.edit
the contents of the cells toAmedeo di Savoia
and selectApply to All Identical Cells
.
We made a mistake!
The column name collecting entity
was an incorrect input. The ENA checklist suggests it is better named collecting institution
. We need to rename the column, but how do we do it?
To rename the column collecting institution
- Click the down arrow next to
collecting entity
>Edit column
>Rename this column
. A pop-up window will appear on top. - Type in
collecting institution
and press return.
Join columns
As we noted in the metadata module, there was a column in the data for sample_alias
. In this dataset, we have lost that information and need to re-enter it. Aliases are only used as communicative references in a project but can help in identifying, separating, and clustering individual samples in downstream analyses.
To create aliases for our samples, we can combine cell information from the columns configuration
and host subject id
to create unique combinations.
- First we click the down arrow next to
configuration
and selectEdit column
>Join columns...
. A pop-up window will appear. - On the left side, choose which columns to join. Verify that
configuration
is already ticked and tickhost subject id
as well. (Notice that you can change the order of the information to combine by drag-and-drop the order of ticked columns in the list). To the right, there are several options for the join. - In the separator field, enter an underscore
_
. - Click
Write result in new column named…
and entersample_alias
. - If ticked, untick the option
Delete joined columns.
(we wish to keep the originating columns) and clickOK
. - Move the new
sample_alias
column to the beginning by selectingEdit column
>Move column to beginning
.
Generate information by joining multiple columns
Our dataset still lacks information for the mandatory ENA checklist field isolate
, which is defined as the “individual isolate from which the sample was obtained”. Such information can be generated in a number of formats, but here, we will use a combination of [organism/host/location/isolate/date].
Exercise 6.3
Create a new column called
isolate
, and populate the cells with information for [organism/host/location/isolate/date] as defined by the ENA checklist. Notice that our column names are not (yet) aligned with the ontology of the checklist.Solution
- In the
virus_identifier
column, selectEdit column
>Join columns...
.- In the list, tick boxes for the columns:
organism
->virus identifier
host
->host common name
location
->geographic location (country)
isolate
->sample_alias
date
->collection date
.- Add a slash
/
asSeparator between the content of each column:
- Tick
Write result in new column named...
and typeisolate
. This will create a new column namedisolate
populated with the above information.- Make sure the box for
Delete joined columns
is not ticked. Once you are confident your selection is correct, make sure to drag-and-drop the columns to appear in the desired order.- Select
OK
. All cells should now be populated with isolate information.
Checklist-based and user-defined metadata
Now, let’s have a look at the data dictionary from the metadata lesson. Which variables did you identify an ENA variable name for based on the ENA checklist? Rename the columns where necessary.
Solution
Data dictionary:
Current variable name ENA Variable name Measurement unit Allowed values Definition Description sample id patient id host subject id sex host sex male, female, not collected Sex of the individual date collection date format: YYYY-MM-DD, >=proj_start_date & <=today Date of sampling location geographic location (country and/or sea) location geographic location (region and locality) , , ... age host age years Age of individual at the time of sampling health state host health state diseased, healthy, not applicable, not collected, not provided, restricted access Health state of individual at time of sampling symptoms illness symptoms fever, sore throat, fatigue, loss of taste or smell, not applicable Symptoms experienced in connection with illness disease outcome host disease outcome recovered, dead Final outcome of disease tissue isolation source host-associated Tissue sampled isolate isolate individual isolate from which the sample was obtained Renaming columns:
- Click the down arrow next to a column title to rename >
Edit column
>Rename this column
. A pop-up window will appear on top.- Type in the ENA variable name from the Data dictionary above and press return.
- Repeat the steps to rename all columns mentioned in the list above.
With all columns renamed to comply with the ERC000033 checklist, what remains is to check if the entered values for the field name also correspond with the correct field restrictions.
First, we notice the values for host health state
are only semi-correct. In our column we have stated the values ill
and healthy
. The correct values should be diseased
and healthy
.
Second, we can see some inconsistencies in the illness symptoms
column. Make sure to correct any mis-spelling (e.g. sore throt).
Removing and reordering data
We have now identified and correctly named the columns we want to include for the ENA submission. The final step is to remove unwanted columns and to make sure they conform with the controlled vocabulary (ENA Checklist).
Exercise 6.4
- Using your knowledge from the previous exercises, make sure all required variables are in order or no extra columns are included in your subset of data.
- Export the
NEBNext
subset of data intsv
format.Solution
Remove all but the following columns and order them according to the list below by
- Click the down arrow in the leftmost column header
All
and selectEdit columns
Re-order / remove columns…
- In the pop/up, drag and re/order the columns to keep in the left box, and put the columns to remove in the right box
- When everything looks good, press
OK
- sample_alias
- collection date
- tax_id
- scientific_name
- sample description
- geographic location (country)
- geographic location (region and locality)
- host disease outcome
- host common name
- host subject id
- host age
- host health state
- host sex
- collector name
- collecting institute
- isolate
- illness symptoms
- Click
Export
in the top right and select the file type you want to export the data in. In this case we will chooseTab-separated values
(tsv
).- The file will be saved in your default downloads folder. Move the file to your course folder and save it as
ENA_samples_openrefine_lesson.tsv