Leveraging Epigenomes and Three-dimensional Genome Organization for Interpreting Regulatory Variation
Brittany Baur, Junha Shin, Jacob Schreiber, Shilu Zhang, Yi Zhang, Mohith Manjunath, Jun S. Song, William Stafford Noble, Sushmita Roy
Abstract
Understanding the impact of regulatory variants on complex phenotypes is a significant challenge because the genes and pathways that are targeted by such variants and the cell type context in which regulatory variants operate are typically unknown. Cell-type-specific long-range regulatory interactions that occur between a distal regulatory sequence and a gene offer a powerful framework for examining the impact of regulatory variants on complex phenotypes. However, high-resolution maps of such long-range interactions are available only for a handful of cell types. Furthermore, identifying specific gene subnetworks or pathways that are targeted by a set of variants is a significant challenge. We have developed L-HiC-Reg, a Random Forests regression method to predict high-resolution contact counts in new cell types, and a network-based framework to identify candidate cell-type-specific gene networks targeted by a set of variants from a genome-wide association study (GWAS). We applied our approach to predict interactions in 55 Roadmap Epigenomics Mapping Consortium cell types, which we used to interpret regulatory single nucleotide polymorphisms (SNPs) in the NHGRI-EBI GWAS catalogue.
Introduction
Genome-wide association studies (GWAS) have identified a large number of variants associated with different phenotypes and diseases [1]. Approximately 93% of all GWAS variants are regulatory variants, located in non-coding regions that can regulate gene expression, with nearly 20% located over 100kb away from any genic feature [2,3]. Understanding the mechanisms by which such variants contribute to phenotypic variation is a significant challenge because the target genes and pathways of non-coding variants, as well as the specific cell types in which these variants operate, are unknown. Recent studies have shown that regulatory sequences such as enhancers can harbor non-coding variants that impact gene expression [4–7] in a cell-type-specific manner [8,9]. Three-dimensional organization of the genome enables long-range regulatory interactions between distal enhancers and genes through chromosomal looping that brings the enhancer into spatial proximity with its target genes.
Materials and methods
L-HiC-Reg is based on HiC-Reg, a Random Forests regression approach that predicts contact counts using one-dimensional regulatory genomic data sets, e.g., histone modifications, architectural proteins, and chromatin accessibility [18]. Because the focus of L-HiC-Reg is to generalize across a large number of cell types, we made several modifications to the original HiC-Reg approach, which improved performance. First, we used models trained on a smaller number of datasets to maximize the number of cell types for which we can make predictions. Second, L-HiC-Reg uses discrete features (described below), whereas HiC-Reg uses continuous features; discretization makes the features more comparable across cell types in training and prediction tasks. Lastly, to train the L-HiC-Reg models, we segmented a chromosome into non-overlapping 1Mb segments and trained a Random Forests regression model for each 1Mb segment using high-resolution (5kb) Hi-C SQRTVC-normalized data downloaded from Rao et al. [31]. Based on our previous work, other normalization methods such as Knight-Ruiz matrix balancing or Iterative Correction and Eigenvector decomposition (ICE) can be used as well. For a given 1Mb segment, the training set included all 5kb region pairs in which one or both of the 5kb regions in the pair were inside the 1Mb segment (Fig 1A).
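The segment-wise construction of training sets described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `segments` and `training_pairs` are hypothetical, and the `max_dist` cap on the genomic distance between the two regions of a pair is an assumption added here to keep the pair set finite, since the text above does not state a distance limit. Regions are identified by their start coordinate.

```python
BIN_SIZE = 5_000          # 5kb Hi-C resolution, as in the text
SEGMENT_SIZE = 1_000_000  # 1Mb segments, as in the text


def segments(chrom_length, segment_size=SEGMENT_SIZE):
    """Yield (start, end) for non-overlapping segments tiling a chromosome."""
    for start in range(0, chrom_length, segment_size):
        yield start, min(start + segment_size, chrom_length)


def training_pairs(seg_start, seg_end, chrom_length,
                   bin_size=BIN_SIZE, max_dist=1_000_000):
    """Return (i, j) region-start pairs with i < j in which one or both
    regions start inside [seg_start, seg_end).

    max_dist is an assumed cap on j - i; it is not taken from the source.
    Coordinates are assumed to be multiples of bin_size.
    """
    def inside(pos):
        return seg_start <= pos < seg_end

    # Only regions within max_dist of the segment can pair with a region in it.
    lo = max(0, seg_start - max_dist)
    hi = min(chrom_length, seg_end + max_dist)
    pairs = []
    for i in range(lo, hi, bin_size):
        for j in range(i + bin_size, min(i + max_dist + 1, hi), bin_size):
            if inside(i) or inside(j):
                pairs.append((i, j))
    return pairs


# Toy example: a 40kb "chromosome", 20kb segments, pairs up to 10kb apart.
toy_segments = list(segments(40_000, segment_size=20_000))
first = training_pairs(0, 20_000, 40_000, max_dist=10_000)
second = training_pairs(20_000, 40_000, 40_000, max_dist=10_000)
```

In this toy example the pair (10000, 20000) straddles the segment boundary, so it appears in the training sets of both segments. Under this reading of the rule, neighboring locus-specific models share the region pairs that cross their common boundary.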
Results
We previously developed HiC-Reg, an approach to computationally predict contact counts based on one-dimensional signals such as histone marks and transcription factor binding sites [18]. HiC-Reg performs well across chromosomes within the same cell type; however, predicting across cell types is still challenging and requires a large number of one-dimensional signals. We hypothesized that 3D genome conformation may be driven by different factors at different genomic loci. While a Random Forests prediction model should be able to capture multiple combinations of factors, a predictive model trained in a locus-specific manner can be more expressive and capture more nuanced dependencies than a global model for a whole chromosome.
Discussion
Interpreting non-coding variation and how it impacts phenotypic variation is a significant challenge because of our limited knowledge of which genes and pathways these variants act upon [56]. Long-range regulatory interactions between distal sequence elements and genes are emerging as a major mechanism by which variants can impact gene expression [4–7]. However, high-resolution maps of such interactions are missing for most cell types and biological contexts due to the cost of generating high-resolution Hi-C datasets. Furthermore, tools to identify pathways that are impacted by a set of variants in a cell-type-specific manner are limited.
Acknowledgments
We would like to thank Erika Da-Inn Lee for assistance with visualizations for this project. We would also like to thank the Center for High-Throughput Computing (CHTC) at UW-Madison for providing resources to complete this project.
Citation: Baur B, Shin J, Schreiber J, Zhang S, Zhang Y, Manjunath M, et al. (2023) Leveraging epigenomes and three-dimensional genome organization for interpreting regulatory variation. PLoS Comput Biol 19(7): e1011286. https://doi.org/10.1371/journal.pcbi.1011286
Editor: Shihua Zhang, Academy of Mathematics and Systems Science, Chinese Academy of Science, CHINA
Received: October 29, 2022; Accepted: June 20, 2023; Published: July 10, 2023
Copyright: © 2023 Baur et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Predicted significant interactions, SNP-gene interactions, node and edge scores, networks, transitioning gene sets and all code associated with this project can be found at: https://github.com/Roy-lab/Roadmap_RegulatoryVariation. A user-friendly interface is available at: https://regvar-networks.wid.wisc.edu/. A network-based Shiny app is available at https://net-based-regvar-interpretation.wid.wisc.edu/.
Funding: This work was supported by the Genomics Sciences Training Program at UW-Madison (NHGRI 5T32HG002760) for BB, NHGRI R01 grant R01-HG010045-01 for SR and BB, the Center for Predictive Computational Phenotyping (NIH BD2K U54 AI117924) for BB and SR, NIH award U24HG009446 for WSN and JSc, NIH R01 grant R01-CA163336 for JSS, YZ, and MM, and James McDonell Foundation Grant 3194-133-349500-4-AAB5159 for SR and JSh. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.