Robust Expansion of Phylogeny for Fast-growing Genome Sequence Data

Yongtao Ye, Marcus H. Shum, Joseph L. Tsui, Guangchuang Yu, David K. Smith, Huachen Zhu, Joseph T. Wu, Yi Guan, Tommy Tsan-Yuk Lam

Abstract

Massive sequencing of SARS-CoV-2 genomes has created an urgent need for methods that add new samples to existing phylogenies efficiently instead of inferring trees de novo. ‘TIPars’ was developed for this challenge by integrating parsimony analysis with pre-computed ancestral sequences. It took about 21 seconds to insert 100 SARS-CoV-2 genomes into a 100k-taxa reference tree using 1.4 gigabytes of memory. In benchmarks on four datasets, TIPars achieved the highest accuracy for phylogenies of moderately similar sequences. For highly similar and for divergent sequences, fully parsimony-based and likelihood-based phylogenetic placement methods performed best, respectively, while TIPars was second best. TIPars thus accomplishes efficient and accurate expansion of phylogenies of both similar and divergent sequences, which should have broad biological applications beyond SARS-CoV-2. TIPars is accessible from https://tipars.hku.hk/ and the source code is available at https://github.com/id-bioinfo/TIPars.

Introduction

Next-generation sequencing (NGS) technology enables large-scale exploration of the diversity of organisms and monitoring of their temporal evolution, which often involves generating and analysing large numbers of new sequences on an ongoing basis. For instance, more than 15 million SARS-CoV-2 genomes were sequenced within two years of the pandemic [1], and these have been crucial for transmission tracking and disease control. Conventional methods of de novo phylogeny inference, such as those implemented in IQ-TREE2 [2] and FastTree2 [3], build the tree from scratch after all relevant sequences have been collected and are therefore unsuitable for such huge, rapidly growing sequence datasets. Placing new sequences directly into an existing reference tree is a more efficient alternative. Such ‘phylogenetic placement’ has been useful for taxonomic classification, while cumulative addition of sequences (which increments the phylogeny as a result) allows the growing phylogeny to be updated efficiently in a global context.

Materials and methods

Implementation of TIPars

After assigning ancestral sequences to every internal node and taxon sequences to the external nodes, TIPars inserts a set of new samples into the reference phylogenetic tree sequentially based on parsimony criteria.

For a query sequence Q, TIPars computes the substitution score against every branch in the tree. When inserting Q into a branch A-B (parent node to child node) at a potential newly added node P (Fig 1A), the substitution score is the sum of the mutations at sites where Q differs from both node A and node B, scored using a substitution scoring table (the IUPAC nucleotide ambiguity codes for nucleotides, S5 Table, or the BLOSUM62 scoring matrix [33] for amino acids, S6 Table). For example, when inserting Q (ACGT) into the branch between node A (ACCG) and node B (ACGC) (Fig 1A), there is only 1 mutation: at the 4th site, the character ‘T’ in Q differs from both ‘G’ in node A and ‘C’ in node B. At the 3rd site, ‘G’ in Q differs from node A but matches node B, so it is not counted. The single branch with the minimum substitution score σ is reported as the best placement.
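The scoring rule described above can be illustrated with a minimal Python sketch. It is not the TIPars implementation: it assumes a unit cost per differing site and plain A/C/G/T characters, so it ignores the IUPAC ambiguity and BLOSUM62 scoring tables, and the function names `substitution_score` and `best_branch`, the branch labels, and the second branch (A-C with sequence AGCA) are hypothetical and introduced only for illustration. Ties are resolved here by taking the first branch encountered, which is also an assumption of this sketch.

```python
from typing import Dict, Tuple

Branch = Tuple[str, str]  # (parent name, child name); labels are hypothetical


def substitution_score(query: str, parent: str, child: str) -> int:
    """Count aligned sites where the query differs from BOTH branch ends.

    Assumption: every such site costs 1 (no IUPAC or BLOSUM62 weighting).
    """
    if not (len(query) == len(parent) == len(child)):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for q, a, b in zip(query, parent, child) if q != a and q != b)


def best_branch(
    query: str, branches: Dict[Branch, Tuple[str, str]]
) -> Tuple[Branch, int]:
    """Return the branch with the minimum substitution score and that score.

    `branches` maps a branch label to its (parent sequence, child sequence).
    Assumption: on ties, the first branch in iteration order is returned.
    """
    scores = {
        name: substitution_score(query, parent_seq, child_seq)
        for name, (parent_seq, child_seq) in branches.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Worked example from the text: Q = ACGT, node A = ACCG, node B = ACGC.
branches = {
    ("A", "B"): ("ACCG", "ACGC"),  # branch from the text
    ("A", "C"): ("ACCG", "AGCA"),  # hypothetical second branch
}
print(substitution_score("ACGT", "ACCG", "ACGC"))  # 1 (only the 4th site)
print(best_branch("ACGT", branches))  # (('A', 'B'), 1)
```

In this toy tree the hypothetical branch A-C scores 2 (sites 3 and 4 differ from both ends), so branch A-B, with a score of 1, is the one reported.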

Results

Computational performance of TIPars and other methods

A number of approaches have been proposed for phylogenetic placement or insertion, but they are impractical or computationally prohibitive for the vast number of SARS-CoV-2 genome sequences. We generated a reference SARS-CoV-2 phylogenetic tree (SARS2-100k) from 96,020 unmasked, high-quality SARS-CoV-2 sequences (detailed in Methods), and evaluated our program alongside UShER, EPA-ng, APPLES-2, IQ-TREE2, RAPPAS, PAGAN2 and MAPLE by sequentially inserting 100 new sequences. Table 1 summarizes the runtime of the whole program as well as the time of the placement/insertion process, the key step that searches for the best position at which to place or insert the 100 queries in the reference tree. Placement/insertion time is not available for APPLES-2, IQ-TREE2, RAPPAS and PAGAN2. Only TIPars, UShER and MAPLE were practicable in terms of insertion time and memory usage. RAPPAS and PAGAN2 were unable to complete within 96 hours, so no data were available for them. Although IQ-TREE2 required the longest runtime among all programs, it had a lower peak memory than EPA-ng, which used about 1 terabyte (TB) and is therefore impractical for general use (Table 1). In contrast, TIPars took only 21 seconds on a 64-core server with a peak memory usage of about 1.4 gigabytes (GB), whereas MAPLE took 33 seconds but used 24.04 GB.

Acknowledgments

We gratefully acknowledge the following Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via GISAID Initiative, on which this research is based. A full acknowledgement table can be found with four EPI_SET-IDs, i.e., EPI_SET_20220531kz, EPI_SET_20211201vz, EPI_SET_20211206tc and EPI_SET_20220701rg, in Data Acknowledgement Locator under GISAID resources (https://www.gisaid.org/). We thank Vivian Leung for her edits and comments on the manuscript.

Citation: Ye Y, Shum MH, Tsui JL, Yu G, Smith DK, Zhu H, et al. (2024) Robust expansion of phylogeny for fast-growing genome sequence data. PLoS Comput Biol 20(2): e1011871. https://doi.org/10.1371/journal.pcbi.1011871

Editor: Joel O. Wertheim, University of California San Diego, UNITED STATES

Received: September 5, 2023; Accepted: January 29, 2024; Published: February 8, 2024

Copyright: © 2024 Ye et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The benchmark datasets and source code are available at https://github.com/id-bioinfo/TIPars. The SARS-CoV-2 data used in this work were all downloaded from GISAID (https://www.gisaid.org/), and their accession IDs are listed in the directories SARS2-100k and SARS2-660k under https://github.com/id-bioinfo/TIPars/tree/master/Benchmark_datasets/.

Funding: This project is supported by the National Natural Science Foundation of China’s Excellent Young Scientists Fund (Hong Kong and Macau) (31922087; TL), the Hong Kong Research Grants Council’s General Research Fund (17150816; TL), the Health and Medical Research Fund (COVID1903011-WP1; TL), the Innovation and Technology Commission’s InnoHK funding (D24H; TL, JW, YG, HZ), and funding support from the Guangdong Government (2019B121205009, HZQB-KCZYZ-2021014, 200109155890863, 190830095586328 and 190824215544727; YG, HZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

