Deep Self-supervised Learning for Biosynthetic Gene Cluster Detection and Product Classification

Abstract

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, the number of complete microbial isolate genomes and metagenomes has grown, and a vast number of the BGCs they contain remain undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.

Introduction

Natural products are chemical compounds that form the basis of many pharmaceuticals and clinical therapeutics [1]. Their chemical structures are used in the development of antimicrobial drugs, anticancer therapies, and treatments in other therapeutic areas [2]. To initiate the discovery of natural products, the pharmaceutical industry has traditionally relied on laboratory research, yet this approach cannot feasibly capture the entire chemical diversity of natural products. Thus, new methods are needed to advance natural product discovery [3]. Diverse natural products can be produced in living organisms via groups of genes called biosynthetic gene clusters (BGCs). Genome mining has become a powerful tool for exploring the complex and diverse chemical space of natural products [3]. Fast, inexpensive genome sequencing technology has contributed to the advancement of BGC identification and, by extension, natural product discovery. This approach has been particularly successful in microbes, where BGCs are often a group of physically colocalized genes whose sequence and function dictate the synthesis of natural products. This discovery of BGCs supports the assembly-line enzymology model, where biosynthetic systems are multimodular and each module contains a set of domains that collectively catalyze one round of elongation and chemical modification of the growing natural product peptide chain [4].

Material and methods

In this section, we describe our self-supervised deep learning framework for detecting BGCs in bacterial genomes and classifying them into their natural product classes. The workflow, summarized in Fig 1, consists of curating data, pretraining Pfam domain embeddings, training BiGCARP, and using BiGCARP to characterize BGCs.

Results

Self-supervised training

We first developed a self-supervised training scheme to train BiGCARP to learn representations of BGCs. As BGCs have a hierarchical structure, they can be represented at four main levels. From least to most granular, these are: genes, Pfam domains (families of evolutionarily related proteins), amino acids, and nucleotides. We note that more granular units of representation lead to longer sequences. BGCs typically contain several dozen genes, each of which contains one or more Pfam domains. Each Pfam domain contains tens to hundreds of amino acids, and each amino acid is encoded by three nucleotides. This introduces a trade-off between modeling short sequences where each unit is complex and modeling long sequences where each unit is simple. To balance input sequence length against the information content of individual units, we chose to represent BGCs as sequences of Pfam domains. This is the same level chosen by DeepBGC [12].
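
The trade-off described above can be made concrete with a small sketch. The counts below are hypothetical round numbers picked to match the qualitative ranges in the text ("several dozen" genes, "tens to hundreds" of amino acids per domain); they are not measurements from the paper. The second half shows, under similar assumptions, how a BGC written as a list of Pfam accession strings might be corrupted for masked language modeling. The mask token, the 15% mask fraction, and the function names are illustrative choices and are not taken from the BiGCARP code.

```python
import random

# Hypothetical order-of-magnitude counts, for illustration only.
GENES_PER_BGC = 30            # "several dozen genes"
PFAMS_PER_GENE = 2            # "one or more Pfam domains" per gene
RESIDUES_PER_PFAM = 100       # "tens to hundreds of amino acids"
NUCLEOTIDES_PER_RESIDUE = 3   # one codon per amino acid

def sequence_lengths():
    """Approximate input length of one BGC at each level of granularity.

    Only residues inside Pfam domains are counted, so the amino acid and
    nucleotide figures are rough lower bounds.
    """
    genes = GENES_PER_BGC
    pfams = genes * PFAMS_PER_GENE
    residues = pfams * RESIDUES_PER_PFAM
    nucleotides = residues * NUCLEOTIDES_PER_RESIDUE
    return {
        "genes": genes,
        "pfam_domains": pfams,
        "amino_acids": residues,
        "nucleotides": nucleotides,
    }

MASK = "<mask>"  # assumed placeholder token, not the model's actual vocabulary

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict mapping each masked position
    to the original token, which a model would be trained to recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets

if __name__ == "__main__":
    for level, length in sequence_lengths().items():
        print(f"{level:>13}: {length}")
    # Pfam-style accession strings used purely as example tokens.
    bgc = ["PF00109", "PF02801", "PF00698", "PF00550", "PF00975"]
    corrupted, targets = mask_tokens(bgc)
    print(corrupted, targets)
```

With these assumed counts, moving from Pfam domains to nucleotides lengthens the input by a factor of several hundred, which is the cost the Pfam-level representation avoids.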

Discussion

Biosynthetic gene clusters (BGCs) are a promising source of natural products, but are difficult to discover, express, and characterize. Recent work in self-supervised deep learning has shown promise for modeling DNA, RNA, proteins, and glycans. We develop Biosynthetic Gene Convolutional Autoencoding Representations of Proteins (BiGCARP), a masked language model that learns representations of BGCs based on their Pfam domains, detects BGCs, and predicts their product classes. To our knowledge, this is the first work to use Pfam domains as tokens in a masked language model. We demonstrate that our model learns biologically reasonable representations of Pfam domains. Representing BGCs as Pfam domains was a compromise between limiting the sequence length and retaining fine-grained sequence information. Models at the level of amino acid residues or nucleotides may be able to resolve more detail at the cost of more computation. BiGCARP is a strong BGC detector even without seeing negative examples, and achieves state-of-the-art accuracy in product class prediction.

Acknowledgments

This research was conducted using computational resources and services at Microsoft. We thank David Prihoda for assistance with the DeepBGC validation and test datasets and Jackson Cahn for inspiring discussions on BGCs.

Citation: Rios-Martinez C, Bhattacharya N, Amini AP, Crawford L, Yang KK (2023) Deep self-supervised learning for biosynthetic gene cluster detection and product classification. PLoS Comput Biol 19(5): e1011162. 

https://doi.org/10.1371/journal.pcbi.1011162

Editor: Shihua Zhang, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, CHINA

Received: July 24, 2022; Accepted: May 7, 2023; Published: May 23, 2023

Copyright: © 2023 Rios-Martinez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data is available on Zenodo at https://doi.org/10.5281/zenodo.6857704. Code is available at https://github.com/microsoft/protein-sequence-models and https://github.com/microsoft/bigcarp.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011162#abstract0