Pharma Focus Europe

Identification of Monotonically Expressed Long Non-coding RNA Signatures for Breast Cancer Using Variational Autoencoders

Dongjiao Wang, Ling Gao, Xinliang Gao, Chi Wang, Suyan Tian 

Abstract

Because breast cancer is a multistage progressive disease resulting from a sequence of genetic mutations, identifying the genes whose expression values increase or decrease monotonically across pathologic stages can provide clues about how breast cancer initiates and advances. Using variational autoencoder (VAE) networks in conjunction with traditional statistical testing, we identify long non-coding RNAs (lncRNAs) that exhibit monotonically differential expression in breast cancer, and we then validate that the identified lncRNAs do present monotonic patterns. The proposed procedure identified 248 lncRNAs with monotonically decreasing expression and 115 with monotonically increasing expression. These correspond to 65 and 33 genes, respectively, that have unique known gene symbols. Some of them have been associated with breast cancer in previous studies. Furthermore, the pathways enriched among the target mRNAs of the identified lncRNAs include the Wnt signaling pathway, human papillomavirus (HPV) infection, and the Rap1 signaling pathway, which have been shown to play crucial roles in the initiation and development of breast cancer.

Introduction

Breast cancer is the most commonly diagnosed cancer and the most frequent cause of cancer death among women worldwide [1]. Since cancer is a multistage progression process resulting from a sequence of genetic mutations [2], the genes whose expression values increase or decrease monotonically across stages are expected to play essential roles in tumor progression and metastasis. Identifying these monotonically differentially expressed genes (MEGs) therefore offers valuable insight into the progression of cancer. Several studies have investigated MEGs across stages for a variety of cancer types, such as lung cancer [3, 4], colon cancer [4, 5], liver cancer [6], and breast cancer [7].

Materials and methods

Experimental data

The gene expression profiles and corresponding clinical information of the breast cancer (BRCA) cohort from The Cancer Genome Atlas (TCGA) were obtained from the TCGA’s Genomic Data Commons (http://www.cbioportal.org/). The downstream analysis included 140 patients at stage I, 480 patients at stage II, 180 patients at stage III, and 105 para-carcinoma tissue samples serving as normal controls. Patients at stage IV were excluded because only 14 such patients were available in the TCGA BRCA cohort.

Results

First, the differentially expressed genes (DEGs) were identified using moderated t-tests, with the false discovery rate (FDR) cutoff set at 0.01. In the first comparison (stage I versus normal controls), 2,629 lncRNAs were identified as down-regulated DEGs and 838 as up-regulated DEGs. In the second comparison (stage II versus control), there were 2,966 down-regulated and 1,205 up-regulated DEGs, and in the last comparison (stage III versus control) the numbers were 2,970 and 887, respectively. Interestingly, in every comparison the down-regulated DEGs outnumbered the up-regulated DEGs by a factor of roughly 2.5 to 3.3. We then imposed additional restrictions (as described in the Methods section) to discard patterns other than monotonic ones. This resulted in 248 monotonically decreasing DEGs and 115 monotonically increasing DEGs, which were input into the VAE models for further validation.
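The filtering logic in the paragraph above can be illustrated with a minimal sketch. It is not the authors' pipeline: the moderated t-tests are not reproduced (the p-values are taken as given), the FDR control is assumed to be the Benjamini-Hochberg procedure, and the "additional restrictions" are assumed here to mean that a gene is significant in all three stage-versus-control comparisons and that its group means are strictly ordered from normal to stage III. The gene names and numbers in `toy` are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def monotonic_direction(group_means):
    """Classify means ordered as (normal, stage I, stage II, stage III)."""
    pairs = list(zip(group_means, group_means[1:]))
    if all(a > b for a, b in pairs):
        return "decreasing"
    if all(a < b for a, b in pairs):
        return "increasing"
    return None  # any tie or reversal is treated as non-monotonic


def select_monotonic(genes, fdr=0.01):
    """Keep genes significant in all three comparisons with monotone means.

    genes maps a name to {"p": [p_I, p_II, p_III],
                          "means": [normal, stage I, stage II, stage III]}.
    """
    names = list(genes)
    # Assumption: each stage-versus-control comparison is adjusted separately.
    adjusted = [
        benjamini_hochberg([genes[n]["p"][k] for n in names]) for k in range(3)
    ]
    selected = {}
    for j, name in enumerate(names):
        significant = all(adjusted[k][j] < fdr for k in range(3))
        direction = monotonic_direction(genes[name]["means"])
        if significant and direction:
            selected[name] = direction
    return selected


# Hypothetical toy input, for illustration only.
toy = {
    "lnc_A": {"p": [1e-6, 1e-7, 1e-8], "means": [9.0, 7.5, 6.1, 5.0]},
    "lnc_B": {"p": [1e-5, 1e-6, 1e-6], "means": [2.0, 3.1, 4.4, 5.2]},
    "lnc_C": {"p": [1e-6, 1e-6, 1e-6], "means": [5.0, 7.0, 6.0, 8.0]},
    "lnc_D": {"p": [0.2, 0.5, 0.4], "means": [1.0, 2.0, 3.0, 4.0]},
}
result = select_monotonic(toy)
print(result)
```

In this toy input, `lnc_C` is significant but not monotonic and `lnc_D` is monotonic but not significant, so only `lnc_A` and `lnc_B` are retained.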

Discussion

We acknowledge that the present study has several limitations. First, it is well known that breast cancer is a very heterogeneous disease, and different subtypes have distinct underlying molecular mechanisms and prognoses. Subgroup analyses were not considered in this study because further stratification by subtype would dramatically increase the likelihood of over-fitting, given that the number of samples in each stratum would drop substantially. Second, patients at stage IV were excluded from this study because only 14 such patients were available in the TCGA BRCA cohort. Understanding the progression of breast cancer to this advanced stage is of significant interest, and a large-scale study is highly desirable. Third, the deep learning model considered in this study, a VAE model with 7 layers, is relatively shallow. Given that the sample size for each stratum ranges approximately from 100 to 500, the discovery data are insufficient to train a deep learning model with a substantially larger number of hidden layers.

Conclusions

A significant contribution of this study is the combination of a VAE network, which generates a single representation feature for all genes sharing a specific change pattern, with the classical approach of statistical hypothesis testing. This integrated analysis enables the identification and subsequent validation of monotonically changing genes across different stages of breast cancer. It thus suggests a potential extension of the application of deep learning methods.
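For readers unfamiliar with VAEs, the two ingredients that distinguish them from plain autoencoders can be sketched as follows. This is generic textbook machinery, not the authors' 7-layer network: the encoder and decoder are omitted, the squared-error reconstruction term and the unweighted sum in `vae_loss` are assumptions, and all function names are hypothetical. In such a model, the encoder's latent mean is the kind of low-dimensional summary that could serve as a representation feature.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def vae_loss(x, x_hat, mu, log_var):
    """Squared-error reconstruction term plus the KL regulariser."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


# Usage with hypothetical encoder outputs for a two-dimensional latent space.
rng = random.Random(0)
z = reparameterize([0.3, -1.2], [0.0, -0.5], rng)
```

The KL term is zero exactly when the encoder outputs a standard normal (`mu = 0`, `log_var = 0`) and grows as the latent distribution moves away from it, which is what regularises the learned representation.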

Our results show that the identified lncRNAs do change monotonically across stages and have meaningful biological implications in breast cancer. Experimental validation of these lncRNAs in a large-scale study is warranted in the future. In conclusion, we recommend the proposed procedure.

Citation: Wang D, Gao L, Gao X, Wang C, Tian S (2023) Identification of monotonically expressed long non-coding RNA signatures for breast cancer using variational autoencoders. PLoS ONE 18(8): e0289971. https://doi.org/10.1371/journal.pone.0289971

Editor: Jinhui Liu, The First Affiliated Hospital of Nanjing Medical University, CHINA

Received: March 1, 2023; Accepted: July 29, 2023; Published: August 10, 2023

Copyright: © 2023 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The expression profiles of the TCGA breast cancer (BRCA) cohort and the clinical information were downloaded from the TCGA’s Genomic Data Commons (http://www.cbioportal.org/). The data used in the validation were downloaded from the GEO database under the accession number GSE42568 (https://ncbi.nlm.nih.gov/gds/?term=GSE42568). All data are open and publicly available.

Funding: This study was supported by a fund (JJKH20190032KJ) from the Education Department of Jilin Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

 

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0289971#abstract0

 
