GTDB-Tk v2: memory friendly classification with the genome taxonomy database


Volume 38 Issue 23 1 December 2022


Journal Article

Pierre-Alain Chaumeil
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, QLD 4072, Australia
Research Computing Center, The University of Queensland, St Lucia, QLD 4072, Australia

Aaron J Mussig
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, QLD 4072, Australia

Philip Hugenholtz
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, QLD 4072, Australia

Donovan H Parks
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, QLD 4072, Australia

To whom correspondence should be addressed: Pierre-Alain Chaumeil and Donovan H Parks. Email: p.chaumeil@uq.edu.au or donovan.parks@gmail.com

Bioinformatics, Volume 38, Issue 23, 1 December 2022, Pages 5315–5316, https://doi.org/10.1093/bioinformatics/btac672

Article history

Received: 10 July 2022
Revision received: 23 September 2022
Editorial decision: 03 October 2022
Accepted: 07 October 2022
Published: 11 October 2022
Corrected and typeset: 25 October 2022


Abstract

Summary

The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been widely adopted by the microbiology community. However, the growing size of the GTDB bacterial reference tree has resulted in GTDB-Tk requiring substantial amounts of memory (∼320 GB), which limits its adoption and ease of use. Here, we present an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives, followed by placement into an appropriate class-level subtree comprising species representatives. This substantially reduces the memory requirements of GTDB-Tk while having minimal impact on classification.

Availability and implementation

GTDB-Tk is implemented in Python and licenced under the GNU General Public Licence v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been used to assign taxonomic classifications to tens of thousands of bacterial and archaeal isolate genomes and metagenome-assembled genomes recovered from environmental and human-associated samples (Almeida et al., 2021; Chaumeil et al., 2019; Nayfach et al., 2021). These classifications are consistent with the GTDB framework and based on the same relative evolutionary divergence (RED) and average nucleotide identity (ANI) criteria for circumscribing taxa (Parks et al., 2020, 2022). A primary step in assigning classifications is placing genomes into the GTDB bacterial or archaeal reference trees using the maximum likelihood (ML) placement tool pplacer (Matsen et al., 2010). Unfortunately, ML placement with pplacer is a memory-intensive operation requiring ∼320 GB of RAM when using the GTDB R07-RS207 bacterial reference tree, which comprises 62291 genomes. Adding to this challenge is the lack of favourable alternatives to pplacer, as the EPA-ng ML placement method requires more memory than pplacer and distance-based methods such as APPLES-2 have inferior performance (Balaban et al., 2022; Barbera et al., 2019; Koning et al., 2021). The GTDB bacterial reference tree has been growing rapidly in size with each GTDB release and is expected to grow by upwards of 30% per year for the next few years (Parks et al., 2022). As a result, the memory requirements of GTDB-Tk have become impractical. Here, we show that the memory requirements of GTDB-Tk are reduced by dividing the GTDB bacterial reference tree into class-level subtrees and demonstrate that taxonomic classifications are largely unimpacted by this change.

2 Materials and methods

GTDB-Tk v2 divides the GTDB bacterial reference tree into class-level subtrees to reduce memory requirements. Placement of a genome with pplacer now consists of two steps. First, a genome is placed into a backbone tree consisting of a single genome representative for each family (see Supplementary Methods). If the genome is assigned to a class within this backbone tree, it is then placed into a class-level subtree to obtain a more refined placement for the genome.

The class-level subtrees were constructed in a greedy manner, with the maximum size of a class-level subtree set based on the number of species representatives in the largest class (Gammaproteobacteria with 9582 genomes in GTDB R07-RS207). Each class-level subtree was formed by selecting the largest class in the reference tree and traversing towards the root for as long as the subtree contained at most 10540 genomes (10% more than the Gammaproteobacteria). This subtree was then pruned from the reference tree and the procedure repeated until all classes were assigned to a class-level subtree. For GTDB R07-RS207, this resulted in seven class-level trees. Each class-level subtree was then expanded to contain a single genome from each phylum in order to allow query genomes to be placed as the most basal member of a class.
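The greedy construction described above can be sketched on a toy tree. This is a minimal illustration and not the GTDB-Tk implementation: the Node class, the helper names, the tree shape and every genome count other than the 9582 Gammaproteobacteria representatives are hypothetical, and the final step of adding one genome per phylum to each subtree is left out.

```python
class Node:
    """Toy tree node; leaves stand for classes and carry a genome count."""

    def __init__(self, name, genomes=0, children=()):
        self.name = name
        self.genomes = genomes          # species representatives (leaves only)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def class_leaves(node):
    """Return every class (leaf node) below `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in class_leaves(child)]


def remaining_size(node, pruned):
    """Count genomes below `node`, ignoring classes that were already pruned."""
    return sum(leaf.genomes for leaf in class_leaves(node) if leaf.name not in pruned)


def greedy_subtrees(root, slack=0.10):
    """Group classes into subtrees holding at most (1 + slack) x the largest class."""
    classes = class_leaves(root)
    max_size = int(max(c.genomes for c in classes) * (1 + slack))
    pruned, subtrees = set(), []
    while len(pruned) < len(classes):
        # Start from the largest class that has not been assigned yet.
        node = max((c for c in classes if c.name not in pruned), key=lambda c: c.genomes)
        # Move towards the root while the enclosing subtree still fits the limit.
        while node.parent is not None and remaining_size(node.parent, pruned) <= max_size:
            node = node.parent
        members = [leaf.name for leaf in class_leaves(node) if leaf.name not in pruned]
        subtrees.append(members)
        pruned.update(members)          # "prune" the subtree from the reference tree
    return max_size, subtrees


def toy_tree():
    # Hypothetical topology; only the Gammaproteobacteria count comes from the text.
    return Node("root", children=[
        Node("n1", children=[Node("Gammaproteobacteria", 9582), Node("class_A", 900)]),
        Node("n2", children=[Node("class_B", 6000), Node("class_C", 5000)]),
        Node("class_D", 300),
    ])


limit, subtrees = greedy_subtrees(toy_tree())
print(limit, subtrees)
```

On this toy tree the size limit evaluates to 10540, matching the value given above, and the five classes fall into three subtrees. Lowering the slack parameter gives smaller subtrees at the cost of having more of them.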

Final taxonomic classifications use the same RED and ANI criteria as GTDB-Tk v1 (Chaumeil et al., 2019) with the following additional rules:

  1. A genome not placed into a class-level subtree is assigned the classification determined in the backbone tree.

  2. A genome placed into a class-level subtree and assigned to a phylum belonging to one of the classes contained in the subtree is assigned the classification determined in the subtree.

  3. Otherwise, the genome is classified by taking the lowest common ancestor between the backbone and class-level subtree.
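The three rules can be expressed as a short decision function. This is a sketch under stated assumptions rather than the GTDB-Tk code: the function names are hypothetical, classifications are assumed to be semicolon-separated GTDB-style strings such as "d__Bacteria;p__Proteobacteria", and rule 2 is read as "the phylum assigned in the subtree is the phylum of one of the classes the subtree was built from".

```python
def lowest_common_ancestor(tax_a, tax_b):
    """Return the ranks shared by both classifications, starting from the domain."""
    shared = []
    for rank_a, rank_b in zip(tax_a.split(";"), tax_b.split(";")):
        if rank_a != rank_b:
            break
        shared.append(rank_a)
    return ";".join(shared)


def final_classification(backbone_tax, subtree_tax, subtree_phyla):
    """Combine backbone and class-level subtree placements.

    backbone_tax  -- classification from the backbone tree
    subtree_tax   -- classification from a class-level subtree, or None if the
                     genome was not placed into one
    subtree_phyla -- phyla of the classes the subtree was built from (assumed input)
    """
    if subtree_tax is None:                      # rule 1
        return backbone_tax
    phylum = next((r for r in subtree_tax.split(";") if r.startswith("p__")), None)
    if phylum in subtree_phyla:                  # rule 2
        return subtree_tax
    return lowest_common_ancestor(backbone_tax, subtree_tax)   # rule 3


backbone = "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"
refined = backbone + ";o__Enterobacterales;f__Enterobacteriaceae"

print(final_classification(backbone, None, {"p__Proteobacteria"}))
print(final_classification(backbone, refined, {"p__Proteobacteria"}))
print(final_classification(backbone, "d__Bacteria;p__Firmicutes;c__Bacilli", {"p__Proteobacteria"}))
```

The third call shows the conservative fallback: when the backbone and subtree placements disagree at the phylum rank, only the ranks they share are reported.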

3 Results

Here, we demonstrate that the taxonomic classifications produced by the divide-and-conquer approach implemented in GTDB-Tk v2 are nearly equivalent to those produced by GTDB-Tk v1 while providing a substantial reduction in required memory.

3.1 Similarity of classifications on diverse sets of genomes

The concordance between GTDB-Tk v1 and v2 classifications was first assessed using 16710 bacterial genomes from the GEMs dataset (Nayfach et al., 2021) that represent novel taxa relative to GTDB R07-RS207 (Table 1). Only 12 genomes (0.07%) did not have identical classifications between GTDB-Tk v1 and the divide-and-conquer approach used in GTDB-Tk v2 (Supplementary Table S1). The majority of incongruence was due to genomes being over- (six genomes) or under-classified (four genomes) by a single taxonomic rank. Only two genomes had conflicting taxonomic assignments, and these were both relatively poor-quality genomes assigned as new classes in alternative phyla (Supplementary Table S1).

Table 1. Novelty of GEM genomes relative to GTDB R07-RS207 based on GTDB-Tk v1 classifications

                              GTDB-Tk v2 classifications relative to GTDB-Tk v1 classifications
Taxon novelty   No. genomes   Congruent   Conflict (a)   Underclassified (b)   Overclassified (c)
Novel phylum              3           2              0                     0                    1
Novel class              42          36              2                     2                    2
Novel order             144         143              0                     0                    1
Novel family            543         540              0                     1                    2
Novel genus            3222        3221              0                     1                    0
Novel species         12756       12756              0                     0                    0

Note: GTDB-Tk v2 predicts a different taxon (a), or less (b) or more (c) resolved classifications than GTDB-Tk v1 (see Supplementary Table S1).

GTDB-Tk v1 and v2 classifications were further evaluated by dereplicating the ∼60000 genomes introduced in GTDB R07-RS207 to 23548 genomes by randomly selecting a single genome per species. These 23548 genomes were then classified using the GTDB-Tk R06-RS202 reference package to further evaluate classifications on genomes with varying degrees of taxonomic novelty and to ensure results were robust with different GTDB reference packages (Supplementary Table S2). Only 13 genomes (0.06%) had different GTDB-Tk v1 and GTDB-Tk v2 classifications with 5 having conflicting assignments, 5 being overclassified and 3 being underclassified (Supplementary Table S3).
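The categories used in these comparisons (congruent, conflicting, underclassified, overclassified) can be made concrete with a small comparison function. This is a sketch and not the authors' evaluation script: the function names and the example taxonomy strings are illustrative, and classifications are assumed to be semicolon-separated GTDB-style strings in which an unresolved rank appears as a bare prefix such as "g__". The last two lines recompute the reported percentages from the counts given in the text.

```python
from collections import Counter


def resolved_ranks(taxonomy):
    """Keep only ranks that carry a name, dropping bare prefixes such as 'g__'."""
    return [rank.strip() for rank in taxonomy.split(";") if len(rank.strip()) > 3]


def compare(v1_tax, v2_tax):
    """Categorise a v2 classification relative to the v1 classification."""
    v1, v2 = resolved_ranks(v1_tax), resolved_ranks(v2_tax)
    if v1 == v2:
        return "congruent"
    if v2 == v1[:len(v2)]:
        return "underclassified"    # v2 stops at a higher rank than v1
    if v1 == v2[:len(v1)]:
        return "overclassified"     # v2 resolves more ranks than v1
    return "conflict"               # the two disagree at some shared rank


# Illustrative (v1, v2) pairs; these are not genomes from the study.
pairs = [
    ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
     "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"),
    ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
     "d__Bacteria;p__Proteobacteria;c__"),
    ("d__Bacteria;p__Proteobacteria",
     "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"),
    ("d__Bacteria;p__Proteobacteria", "d__Bacteria;p__Firmicutes"),
]
tally = Counter(compare(v1, v2) for v1, v2 in pairs)
print(tally)

# Percentages of non-identical classifications, from the counts reported above.
print(round(100 * 12 / 16710, 2))   # GEMs dataset
print(round(100 * 13 / 23548, 2))   # dereplicated R07-RS207 genomes
```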

3.2 Reduced memory requirements

The divide-and-conquer approach implemented in GTDB-Tk v2 reduced the maximum memory requirements from ∼320 GB to <55 GB when run with the GTDB R07-RS207 reference trees. GTDB-Tk v2 also ran 22–35% faster when processing 1000 genomes with 1–64 CPUs (Supplementary Fig. S1A) and was >40% faster when processing 5000 genomes using 32 CPUs (Supplementary Fig. S1B).

4 Summary

GTDB-Tk v2 requires only a sixth of the memory of GTDB-Tk v1 while providing almost identical classifications. More importantly, the divide-and-conquer approach used in GTDB-Tk v2 allows memory requirements to be controlled by tailoring the size of the largest subtree. This ensures GTDB-Tk can continue to be used on readily available computing hardware even as the size of the GTDB bacterial reference tree increases.

Acknowledgements

We thank Morgan Price for insightful advice on using FastTree with pplacer, Brian Kemish for his help in maintaining our computing infrastructure, Maria Chuvochina and Christian Rinke for helpful discussions, and the GTDB-Tk community for their bug reports and suggestions on features to improve GTDB-Tk.

Funding

This work was supported by UQ Strategic Funding and Australian Research Council Laureate Fellowship [FL150100038].

Conflict of Interest: none declared.

References

Almeida A. et al. (2021) A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol., 39, 105–114.

Balaban M. et al. (2022) Fast and accurate distance-based phylogenetic placement using divide and conquer. Mol. Ecol. Resour., 22, 1213–1227.

Barbera P. et al. (2019) EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol., 68, 365–369.

Chaumeil P.-A. et al. (2019) GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics, 36, 1925–1927.

Matsen F.A. et al. (2010) Pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11, 538.

Nayfach S. et al.; IMG/M Data Consortium. (2021) A genomic catalog of earth’s microbiomes. Nat. Biotechnol., 39, 499–509.

Parks D.H. et al. (2020) A complete domain-to-species taxonomy for bacteria and archaea. Nat. Biotechnol., 38, 1079–1086.

Parks D.H. et al. (2022) GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res., 50, D785–D794.

© The Author(s) 2022. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor: Karsten Borgwardt
