A mixture of attention experts-embedded flow-based generative model to create synthetic cells in single-cell RNA-Seq datasets

Sultan Sevgi Turgut Ögme, Nizamettin Aydin, Zeyneb Kurt

Abstract

Single-cell RNA-seq (scRNAseq) analyses aim to understand the cellular landscape of tissue sections, offer insights into rare cell types, and identify marker genes for annotating distinct cell types. ScRNAseq analyses are widely applied in cancer research to understand tumor heterogeneity, disease progression, and resistance to therapy. Processing single-cell data is challenging because the data are high-dimensional and sparse, and their class (cell-type) distributions are imbalanced.

Introduction

Single-cell RNA-sequencing (scRNAseq) analysis aims to identify the different cell types that make up tissues or tumour microenvironments [1], as well as the marker genes that distinguish particular cell types from others. Various software platforms and tools have been presented for implementing and evaluating these analyses [2,3]. Determining cell types and marker genes usually requires manual steps and is time-consuming. Therefore, in recent years, emphasis has been placed on automating these steps.

Materials and method

4.1. Preprocessing datasets

Pancreatic tissue has received relatively limited attention in the literature compared with other tissues. To conduct a comprehensive large-scale analysis that integrates various datasets, human pancreas scRNAseq datasets were curated from sources including the GEO Repository [46]. Among the datasets frequently used in computational studies, Baron [4], Muraro [5], Segerstolpe [6], and Xin [43] were selected for our large-scale analysis.
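As an illustration of one step that integrating several datasets commonly involves, the sketch below restricts two toy expression matrices to their shared genes and aligns the column order. This is an assumption for illustration only: the text above does not state how the datasets were harmonised, and the function names (`shared_genes`, `align`), gene lists, and counts are hypothetical.

```python
# Minimal sketch (assumption): align datasets on the genes they have in common.
# This is not the authors' pipeline; names and toy values are hypothetical.

def shared_genes(gene_lists):
    """Return genes present in every dataset, in the order of the first dataset."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [gene for gene in gene_lists[0] if gene in common]


def align(counts, genes, target_genes):
    """Subset and reorder the columns of a cells-by-genes matrix to target_genes."""
    column = {gene: i for i, gene in enumerate(genes)}
    return [[row[column[gene]] for gene in target_genes] for row in counts]


# Toy data invented for illustration (rows are cells, columns are genes).
genes_a = ["INS", "GCG", "SST", "PPY"]
counts_a = [[7, 0, 0, 1], [0, 5, 2, 0]]
genes_b = ["GCG", "INS", "KRT19"]
counts_b = [[4, 9, 1]]

common = shared_genes([genes_a, genes_b])
merged = align(counts_a, genes_a, common) + align(counts_b, genes_b, common)
print(common)
print(merged)
```

Keeping the gene order of the first dataset makes the merged matrix reproducible regardless of how the other datasets order their columns.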

Results

Fig 1 presents our study’s overall framework. Our workflow consists of pancreatic tissue dataset curation, data preprocessing, proposed model framework, synthetic cell generation, benchmarking with different generative models and external datasets, evaluation with discrepancy metrics and cell-type classification. It also incorporates marker gene identification and cell-cell interaction analysis as key components in downstream analyses.

Conclusions and discussions

Preprocessing steps in scRNAseq analysis are crucial for identifying cell types, because single-cell data are sparse and high-dimensional and have imbalanced class distributions. Although various scRNAseq datasets are publicly available, certain cell types are under-represented and form a minority category, which makes the automated identification of these cell groups especially challenging. To address these issues, generative models have been proposed and used in recent years.
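The two data properties named above, sparsity and class imbalance, can be quantified with a few lines of code. The following is a minimal sketch on toy data rather than the procedure used in this study; the function names, the 5% cutoff for a "minority" cell type, and the toy values are all assumptions made for illustration.

```python
# Minimal sketch (assumption): quantify sparsity and class imbalance.
# Function names, the 5% cutoff, and the toy data are hypothetical.
from collections import Counter


def sparsity(counts):
    """Fraction of zero entries in a cells-by-genes count matrix."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total if total else 0.0


def imbalance_ratio(labels):
    """Size of the largest cell-type class divided by the size of the smallest."""
    sizes = Counter(labels)
    return max(sizes.values()) / min(sizes.values())


def minority_types(labels, threshold=0.05):
    """Cell types whose share of all cells is below threshold."""
    sizes = Counter(labels)
    n_cells = len(labels)
    return sorted(t for t, size in sizes.items() if size / n_cells < threshold)


# Toy data invented for illustration.
toy_counts = [
    [0, 3, 0, 0],
    [0, 0, 1, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 2],
]
toy_labels = ["alpha"] * 30 + ["beta"] * 18 + ["epsilon"] * 2

print(sparsity(toy_counts))
print(imbalance_ratio(toy_labels))
print(minority_types(toy_labels))
```

Cell types flagged by a check of this kind are the natural candidates for augmentation with synthetic cells.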

Citation: Turgut Ögme SS, Aydin N, Kurt Z (2025) A mixture of attention experts-embedded flow-based generative model to create synthetic cells in single-cell RNA-Seq datasets. PLoS Comput Biol 21(10): e1013525. https://doi.org/10.1371/journal.pcbi.1013525

Editor: Hatice Ulku Osmanbeyoglu, University of Pittsburgh, UNITED STATES OF AMERICA

Received: January 8, 2025; Accepted: September 15, 2025; Published: October 6, 2025

Copyright: © 2025 Turgut Ögme et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: No primary data were generated or processed in this paper. The datasets analyzed in this study are publicly available in the Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/) with accession numbers GSE84133, GSE85241, and GSE83139, and in the ArrayExpress repository with accession number E-MTAB-5061. Additionally, the PBMC3K dataset (https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) is publicly accessible from 10x Genomics. The HCA-BM dataset (https://bioconductor.org/packages/release/data/experiment/html/HCAData.html) was obtained from the HCAData package in Bioconductor (https://www.bioconductor.org/), and the PBMC68K dataset (https://activa-material.s3.amazonaws.com/PreprocessedData/68kPBMC_preprocessed.h5ad) was downloaded from the ACTIVA repository (https://github.com/SindiLab/ACTIVA). The code supporting the findings of this study is available at https://github.com/sevgiturgut/Comparison-Generative-Models.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.