Step 1 — Compilation and Standardisation of Public Protist Trait Databases
Overview and Rationale
Trait-based approaches are increasingly recognised as essential for understanding protist ecology, biogeography, and ecosystem functioning. However, compared to plants and animals, functional trait information for protists remains highly fragmented across studies, taxonomic groups, ecosystems, and data formats.
Although several public trait datasets exist, they differ strongly in:
- taxonomic scope and resolution,
- trait definitions and terminology,
- ecological focus (marine, freshwater, soil),
- file structure and accessibility.
The goal of Step 1 is therefore to construct a transparent, reproducible foundation by structurally integrating existing public trait resources while preserving their original meaning and provenance.
Specifically, we aimed to:
- Identify relevant peer-reviewed protist trait databases with public access.
- Retrieve these datasets from their original repositories.
- Read each dataset using appropriate, source-specific methods.
- Convert all datasets into a unified long-format trait representation.
- Preserve taxonomic resolution, trait context, and citations.
- Defer semantic harmonisation to later workflow stages.
We deliberately separate structural standardisation from semantic harmonisation to avoid premature interpretation of trait concepts.
Conceptual Motivation and Literature Context
- Jamy et al. (2025) — Towards a trait-based framework for protist ecology and evolution (Trends in Microbiology, DOI: 10.1016/j.tim.2025.08.008)
- Burki et al. (2021) — Diversity and ecology of protists revealed by metabarcoding (Current Biology, DOI: 10.1016/j.cub.2021.07.066)
Both papers highlight that trait-based protist ecology is currently limited not by a lack of data, but by inconsistent trait definitions, scattered data sources, and missing integrative frameworks.
Step 1 addresses this limitation by compiling existing datasets into a machine-readable backbone for downstream ontology matching.
Identification of Public Trait Databases
Trait databases were included based on the following criteria:
- availability via public repositories (supplementary materials, GitHub, Zenodo, SEANOE),
- explicit linkage between taxa and functional traits,
- relevance to trophic mode, morphology, size, habitat, or life history.
Currently Included Databases
Overview table
| Dataset / Reference | Taxonomic Scope | Data Format / Repository |
|---|---|---|
| Dumack et al. (2019) | Cercozoa & Endomyxa | tab-delimited text / GitHub |
| Freudenthal et al. (2025) | Amoebozoa | tab-delimited text / GitHub, Zenodo |
| Ramond et al. (2018, 2019) | Marine protists | CSV / SEANOE 56963 |
| Schneider et al. (2020) | Protists | trait tables / DOI: 10.3897/BDJ.8.e56648 |
| Giachello et al. (2023) | Soil protists | Excel / DOI: 10.1016/j.soilbio.2023.109207 |
| Mitra et al. (2023) | Mixoplankton | Excel / Zenodo 7839780 |
| Bjørbækmo et al. (2019) | Protist interactions | CSV / Zenodo |
| Põlme et al. (2020) | Fungi & fungus-like Stramenopiles | Google Sheets / DOI: 10.1007/s13225-020-00466-2 |
| Lentendu et al. (2025) | Soil eukaryotes | Zenodo / GitHub / DOI: 10.1111/1755-0998.14118 |
| Rimet et al. (2018) | Freshwater phytoplankton | CSV / Zenodo |
| Laplace-Treyture et al. (2021) | French freshwater phytoplankton | CSV / Figshare |
For more detailed information on the public databases, see public-databases.
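Because the sources arrive in different file formats, each one needs a reader matched to its format. The sketch below illustrates this for the two delimited-text formats in the table only. The names `DELIMITERS` and `read_delimited` are hypothetical and the sample text is a made-up placeholder, so this should be read as one possible approach rather than the workflow's actual readers. Excel and Google Sheets sources would need a separate, library-based reader.

```python
import csv
import io

# Assumption: format labels mirror the "Data Format" column of the overview table.
DELIMITERS = {
    "tab-delimited text": "\t",
    "CSV": ",",
}


def read_delimited(text, data_format):
    """Parse delimited text into a list of row dictionaries keyed by header."""
    delimiter = DELIMITERS[data_format]
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# Placeholder content, not taken from any of the listed databases.
sample_text = "taxon\tfeeding\nTaxon A\tbacterivore\n"
sample_rows = read_delimited(sample_text, "tab-delimited text")
```

Reading every value as a string, as `csv.DictReader` does, is consistent with deferring value standardisation to a later stage.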
Unified Long-Format Trait Schema
All datasets were converted into a shared long-format schema designed to:
- preserve original taxonomic resolution,
- separate trait meaning from source-specific column names,
- retain full citation metadata,
- support later ontology-based harmonisation.
Core Fields
Taxonomy
- taxon_name
- taxon_rank
- genus, family, order, class, phylum, supergroup, domain
Trait Information
- trait_category
- trait_name
- trait_value
- trait_unit
Context
- habitat
- environment
- life_stage
Provenance
- source_db
- source_table
- reference_id
- reference_full
Audit Metadata
- original_column
- original_value
- notes
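As an illustration, the field groups above can be written down as a single ordered column list. This is a minimal sketch, not the workflow's actual implementation: the names `SCHEMA_FIELDS`, `LONG_COLUMNS`, and `empty_record` are hypothetical, and only the field names themselves come from the schema.

```python
# Field groups of the long-format schema, in the order listed above.
SCHEMA_FIELDS = {
    "taxonomy": (
        "taxon_name", "taxon_rank", "genus", "family", "order",
        "class", "phylum", "supergroup", "domain",
    ),
    "trait": ("trait_category", "trait_name", "trait_value", "trait_unit"),
    "context": ("habitat", "environment", "life_stage"),
    "provenance": ("source_db", "source_table", "reference_id", "reference_full"),
    "audit": ("original_column", "original_value", "notes"),
}

# Flattened column order shared by every per-source table.
LONG_COLUMNS = tuple(
    column for group in SCHEMA_FIELDS.values() for column in group
)


def empty_record():
    """Return a record with every schema column present and unset."""
    return {column: None for column in LONG_COLUMNS}


record = empty_record()
record["taxon_name"] = "Taxon A"  # placeholder value for illustration
```

Plain dictionaries are used here because `class` is a reserved word in Python and cannot serve as an attribute name in a class definition.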
Dataset-Specific Extraction Strategy
Each database was processed using a dedicated extraction function that:
- Identified trait-relevant columns.
- Pivoted the data into long format.
- Assigned broad trait categories (e.g. trophic, morphology, size).
- Preserved original values and column names.
- Attached explicit provenance metadata.
No semantic harmonisation or value standardisation was applied at this stage.
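The steps above can be sketched as a single function that pivots wide rows into long-format records. This is a hedged illustration under stated assumptions: `extract_long`, its parameters, and the example rows are hypothetical, and skipping empty cells is an assumption rather than documented behaviour of the workflow.

```python
def extract_long(rows, taxon_column, trait_columns, provenance):
    """Pivot wide rows into long-format trait records.

    rows          -- list of dicts, one per taxon (wide format)
    taxon_column  -- name of the column holding the taxon name
    trait_columns -- mapping of original column name to broad trait category
    provenance    -- dict of source_db, source_table, reference_id, reference_full
    """
    records = []
    for row in rows:
        for column, category in trait_columns.items():
            value = row.get(column)
            # Assumption: empty cells produce no long-format record.
            if value is None or str(value).strip() == "":
                continue
            records.append({
                "taxon_name": row[taxon_column],
                "trait_category": category,
                "trait_name": column,
                "trait_value": value,       # copied unchanged, no standardisation
                "original_column": column,  # audit trail back to the source
                "original_value": value,
                **provenance,
            })
    return records


# Placeholder rows, not taken from any of the listed databases.
wide_rows = [
    {"species": "Taxon A", "feeding": "bacterivore", "length_um": "12"},
    {"species": "Taxon B", "feeding": "", "length_um": "30"},
]
records = extract_long(
    wide_rows,
    taxon_column="species",
    trait_columns={"feeding": "trophic", "length_um": "size"},
    provenance={
        "source_db": "example_db",
        "source_table": "example.tsv",
        "reference_id": "example_ref",
        "reference_full": "Hypothetical reference",
    },
)
```

Here `trait_value` and `original_value` are identical by design, since no value standardisation happens in Step 1. They would only diverge once later stages harmonise the values.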
Output of Step 1
Step 1 produces a set of structurally aligned long-format trait tables, one per source database. Together, these form a comprehensive, provenance-aware corpus spanning marine, freshwater, soil, and host-associated protists.
This corpus serves as the sole input for subsequent workflow stages.
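One way to confirm that the per-source tables are structurally aligned before passing them on is to check each record against the required columns while combining them. The sketch below assumes records are plain dictionaries. The function `combine_tables`, the reduced `REQUIRED` column list, and the example tables are hypothetical.

```python
def combine_tables(tables, required_columns):
    """Concatenate per-source tables, failing on any structurally incomplete record.

    tables           -- mapping of source name to a list of record dicts
    required_columns -- columns every record must contain
    """
    combined = []
    for source, records in tables.items():
        for index, record in enumerate(records):
            missing = [c for c in required_columns if c not in record]
            if missing:
                raise ValueError(
                    f"{source} record {index} is missing columns: {missing}"
                )
            combined.append(record)
    return combined


# Reduced column list and placeholder tables, for illustration only.
REQUIRED = ("taxon_name", "trait_name", "trait_value", "source_db")
tables = {
    "source_a": [
        {"taxon_name": "Taxon A", "trait_name": "feeding",
         "trait_value": "bacterivore", "source_db": "source_a"},
    ],
    "source_b": [
        {"taxon_name": "Taxon B", "trait_name": "length_um",
         "trait_value": "30", "source_db": "source_b"},
    ],
}
corpus = combine_tables(tables, REQUIRED)
```

The check is purely structural: it verifies that columns exist, not that their values are meaningful or comparable across sources.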
Position in the Overall Workflow
- Step 1 (this document): Structural compilation and standardisation of public protist trait databases.
- Step 2 (next): Semantic harmonisation of trait categories, values, and vocabularies using a formal trait ontology.