Topological stratification of continuous genetic variation in large biobanks
Alex Diaz-Papkovich, Shadi Zabad, Hannah Snell, Chief Ben-Eghan, Luke Anderson-Trocmé, Georgette Femerling, Vikram Nathan, Jenisha Patel, Simon Gravel
Abstract
Biobanks now contain genetic data from millions of individuals. Dimensionality reduction, visualization and clustering are standard when exploring data at these scales. Efficient and tractable methods exist for the first two, but clustering remains challenging because of the many ways in which demography and sampling can affect structure. In practice, clustering is commonly performed by drawing shapes around dimensionally reduced data, or by assuming that each population is represented by a “type” genome or a characteristic set of allele frequencies.
Introduction
Following improvements in genomic technologies, large-scale biobanks have become commonplace. The Global Biobank Meta-analysis Initiative (GBMI), for example, lists 23 biobanks with genetic data and health records from over 2.2 million individuals [1]. The growth in sample sizes has increased the potential for scientific findings: genome-wide association studies (GWAS) have associated thousands of genetic loci with phenotypes, and these loci are used to predict disease traits via polygenic scores (PGS).
Materials and methods
We propose to perform clustering based on distances computed in genotype space, that is, where each individual is represented by a vector of allele counts for genetic variants. We will use Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) for clustering.
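As a minimal sketch of what distances in genotype space can look like, the following computes pairwise distances between allele-count vectors and the mutual reachability distance that HDBSCAN-style algorithms build on. The Euclidean metric, the toy genotypes, the function names and the choice of k are assumptions made for illustration, and core distances here exclude the point itself (a convention that varies between implementations). It does not reproduce the pipeline used in this study.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two allele-count vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_distances(genotypes):
    """Full symmetric distance matrix as a list of lists."""
    n = len(genotypes)
    return [[euclidean(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]

def core_distances(dist, k):
    """Distance from each point to its k-th nearest neighbour, excluding itself."""
    return [sorted(row[:i] + row[i + 1:])[k - 1] for i, row in enumerate(dist)]

def mutual_reachability(dist, k):
    """d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)) for a != b."""
    core = core_distances(dist, k)
    n = len(dist)
    return [[0.0 if i == j else max(core[i], core[j], dist[i][j])
             for j in range(n)]
            for i in range(n)]

# Hypothetical toy data: six individuals, four variants, allele counts in {0, 1, 2}.
genotypes = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [0, 0, 1, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 0, 2, 0],
]

dist = pairwise_distances(genotypes)
mreach = mutual_reachability(dist, k=2)
```

Mutual reachability never shrinks a distance: points in sparse regions are pushed further from everything else, which is what lets a density-based method separate dense groups from background noise.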
Results
Clustering captures population structure from sample design
The 1KGP’s relatively balanced global sample design makes it useful for testing algorithms that identify population structure. We have previously shown that UMAP produces clear visual clusters from 1KGP data in two dimensions [4]. Fig 4 shows a UMAP representation of the 1KGP: Fig 4a shows the data without population labels (to mimic data with unknown populations), Fig 4b shows the data with the corresponding 1KGP population labels, and Fig 4c shows the data with cluster labels generated by HDBSCAN(ε̂) run on a 5D UMAP.
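One simple way to compare cluster labels with known population labels, as the panels of Fig 4 do visually, is to cross-tabulate them. The sketch below is illustrative only: the labels are made-up toy values (not results from this study), the function names are hypothetical, and it assumes the common convention that a cluster label of -1 marks points left unclustered as noise.

```python
from collections import Counter, defaultdict

def crosstab(cluster_labels, population_labels):
    """Count how many individuals of each population fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, population in zip(cluster_labels, population_labels):
        table[cluster][population] += 1
    return table

def majority_population(table, noise_label=-1):
    """For each non-noise cluster, report its most common population and that population's share."""
    summary = {}
    for cluster, counts in table.items():
        if cluster == noise_label:
            continue  # noise points are not assigned to any cluster
        population, n = counts.most_common(1)[0]
        summary[cluster] = (population, n / sum(counts.values()))
    return summary

# Hypothetical toy labels for nine individuals.
populations = ["YRI", "YRI", "YRI", "CEU", "CEU", "GBR", "CHB", "CHB", "JPT"]
clusters = [0, 0, 0, 1, 1, 1, 2, 2, -1]

table = crosstab(clusters, populations)
summary = majority_population(table)
```

A cluster whose majority share is well below 1 groups individuals across several population labels, which is not necessarily an error: sample labels and genetic structure need not coincide.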
Discussion
We present UMAP-HDBSCAN(ε̂), a new approach to describe population structure that approximates the topology of high-dimensional genetic data and detects dense clusters in a low-dimensional space. Among its distinguishing characteristics:
Acknowledgments
We are grateful to the participants in each biobank who provided their genetic data. We thank the CARTaGENE team for troubleshooting data with us, and C. Bhérer, M. L. Spear, and P. Verdu for scientific discussion.
Citation: Diaz-Papkovich A, Zabad S, Snell H, Ben-Eghan C, Anderson-Trocmé L, Femerling G, et al. (2026) Topological stratification of continuous genetic variation in large biobanks. PLoS Genet 22(3): e1012068. https://doi.org/10.1371/journal.pgen.1012068
Editor: David Balding, University of Melbourne, AUSTRALIA
Received: July 8, 2025; Accepted: February 24, 2026; Published: March 16, 2026
Copyright: © 2026 Diaz-Papkovich et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code is available at https://github.com/diazale/topstrat. 1000 Genomes Project data is available to the public. We used the genotype file ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf.gz and population labels affy_samples.20141118.panel 20131219.populations.tsv, available at http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/. UKB genotype data can be accessed through the process specified at https://www.ukbiobank.ac.uk/about-our-data/types-of-data/genetic-data/. CARTaGENE data is accessible at https://cartagene.qc.ca/en/researchers/access-request.html after scientific and ethical review.
Funding: This research was supported by the Canadian Institutes of Health Research (CIHR; https://cihr-irsc.gc.ca/e/193.html) project grant 437576, Natural Sciences and Engineering Research Council of Canada (NSERC; https://www.nserc-crsng.gc.ca/) grant RGPIN-2017-04816, the Canada Research Chair program (https://www.chairs-chaires.gc.ca/home-accueil-eng.aspx), and the Canada Foundation for Innovation (https://www.innovation.ca/). These awards were received by S.G. The research was also supported by NSERC PDF-599527-2025, received by A.D.P. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.


