Bioactivity assessment of natural compounds using machine learning models trained on target similarity between drugs
Vinita Periwal, Stefan Bassler, Sergej Andrejev, Natalia Gabrielli, Kaustubh Raosaheb Patil, Athanasios Typas, Kiran Raosaheb Patil
Abstract
Natural compounds constitute a rich resource of potential small-molecule therapeutics. While experimental access to this resource is limited due to its vast diversity and difficulties in systematic purification, computational assessment of structural similarity with known therapeutic molecules offers a scalable approach. Here, we assessed functional similarity between natural compounds and approved drugs by combining multiple chemical similarity metrics and physicochemical properties using a machine-learning approach. We computed pairwise similarities between 1410 drugs for training classification models and used the drugs' shared protein targets as class labels. The best-performing models were random forests, which gave an average area under the ROC curve of 0.9, Matthews correlation coefficient of 0.35, and F1 score of 0.33, suggesting that they captured the structure-activity relation well. The models were then used to predict protein targets of circa 11k natural compounds by comparing them with the drugs. This revealed the therapeutic potential of several natural compounds, including those with support from previously published sources as well as those hitherto unexplored. We experimentally validated the activity of one predicted pair, viz., Cox-1 inhibition by 5-methoxysalicylic acid, a molecule commonly found in tea, herbs and spices. In contrast, another natural compound, 4-isopropylbenzoic acid, which had the highest similarity score under the most heavily weighted similarity metric but was not picked by our models, did not inhibit Cox-1. Our results demonstrate the utility of a machine-learning approach combining multiple chemical features for uncovering the protein binding potential of natural compounds.
Introduction
Around 65% of the small-molecule drugs in use today have originated from natural compounds or their derivatives [1]. Therapeutic effects of natural compounds are thus central to drug discovery [2–5]. Further, identification of bioactive compounds present in the diet and their effect on health has long been an active area of research [6,7]. A number of recent studies have reported that dietary natural compounds (such as polyphenols and alkaloids) can reduce the risk of many chronic diseases [8–10], lead to drug-food interactions (which occur when food and medicine interfere with one another) [11–13], and significantly alter or diversify the composition of the human gut microbiome [14–17]. While natural compounds possess rich structural diversity, often have selective biological actions, and are prevalidated on various biological targets by evolutionary selection [18–21], they are generally less accessible in pure form than synthetic compounds. This is primarily due to their low abundance in natural sources and complex purification methods [22,23]. Recent technological advances in analytical methods such as metabolomics, metabolic engineering, and synthetic biology, as well as those in functional assays and phenotypic screens, are opening new opportunities for natural compound-based drug discovery [2,22,24]. An increasing number of computational tools [25,26], techniques [27,28], and databases [29] are providing more accessible and powerful alternatives to explore the therapeutic potential of natural compounds.
Materials and methods
Data source and processing
All FDA-approved drugs with associated target information (S1A and S1B Table) were taken from DrugBank [51] (accessed January 2018). The natural compound library used for virtual screening was obtained from FooDB (www.foodb.ca; freely available, accessed June 2017; ~11k compounds). It was curated and formatted for integration into our analysis. The full list of compounds with their annotations is provided in S1C Table. It included compounds from both raw and processed foods. We used drug classification codes from ATC (https://www.whocc.no/) to therapeutically classify all the drugs and ClassyFire [50] to structurally classify the drugs and the natural compounds.
Results
Dataset of drugs with known targets
We utilized mappings between 1,410 FDA-approved drugs (S1A Table) and their known, curated targets (S1B Table) as our gold-standard dataset. The drugs were categorized according to their ATC (Anatomical Therapeutic Chemical) class, and into 16 chemical Superclasses (a level in the chemical taxonomy with general structural identifiers such as organic acids and derivatives, or organometallic compounds) [50] based on their chemical structures (Materials and methods). Most of the drugs target the nervous system (264), followed by cardiovascular (180), anti-infectives (148), multiple ATC (131) and anti-neoplastic (127) (S1A Fig). Among the 16 structural classes, benzenoids and organoheterocyclics constitute the major superclasses of drugs (840), encompassing all therapeutic classes except the nutraceuticals (S1A Fig).
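To make the gold-standard labeling concrete, the following minimal Python sketch turns a drug-to-target mapping into labels for drug pairs. It is an illustration under stated assumptions, not the authors' code: the drug and target names are hypothetical, and the rule "label 1 if the two drugs share at least one target" is an assumed reading of how shared targets serve as class labels.

```python
from itertools import combinations

# Hypothetical toy mapping; the real drug-target data is in S1A and S1B Table.
drug_targets = {
    "drug_A": {"COX1", "COX2"},
    "drug_B": {"COX1"},
    "drug_C": {"DRD2"},
}


def label_pairs(targets):
    """Return {(drug_i, drug_j): 0 or 1} for every unordered drug pair.

    Assumption: a pair is labeled 1 when the drugs share at least one target.
    """
    labels = {}
    for a, b in combinations(sorted(targets), 2):
        labels[(a, b)] = int(bool(targets[a] & targets[b]))
    return labels


labels = label_pairs(drug_targets)

# Number of unordered pairs among 1,410 drugs, if every pair is considered.
n_pairs_full = 1410 * 1409 // 2
```

If every unordered pair of the 1,410 drugs were used, this would yield 993,345 pairs. Pairs that share a target are likely to be a small minority of these, which is worth keeping in mind when reading threshold-dependent metrics such as F1 and the Matthews correlation coefficient alongside the AUC.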
Discussion and conclusion
This study has addressed three goals: identifying potential molecular targets of ingested natural compounds and exploring their therapeutic potential; evaluating the utility of a comprehensive ML-based approach to deconvolute the complex SAR between molecules, as opposed to relying on a single similarity measure; and, lastly, complementing in-silico model predictions with experimental validation to build trust in the models' predictions.
Systematically integrating computational chemistry approaches can help deconvolute the intricate structure-activity relationship between small molecules and their biological targets. The data fusion approach used in this study (i.e., integrating multiple similarity metrics based on fingerprints, maximum common substructures, and physicochemical descriptors) proved effective in identifying natural compounds functionally similar to known therapeutic drugs. The RF models achieved good performance, with an average AUC of 0.9. Analysis of the 200 curated drug-food pairs predicted by the models captured drug analogs, host endogenous metabolites, some investigational drugs, as well as novel molecular leads present in various food sources that are predicted to share the same target as the drugs.
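The data fusion idea can be sketched as building one feature vector per molecule pair from several similarity measures and property comparisons. The Python example below is only an illustration: the Tanimoto coefficient, the use of absolute property differences, the dictionary layout, and all names and values are assumptions made for this sketch and are not taken from the study's code.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def fused_features(mol_a, mol_b, fingerprint_names, property_names):
    """Concatenate per-fingerprint similarities and property differences.

    Assumption: each molecule is a dict with 'fps' (name -> set of bits)
    and 'props' (name -> float). This layout is hypothetical.
    """
    features = [
        tanimoto(mol_a["fps"][name], mol_b["fps"][name])
        for name in fingerprint_names
    ]
    features += [
        abs(mol_a["props"][name] - mol_b["props"][name])
        for name in property_names
    ]
    return features


# Hypothetical drug and natural compound with made-up bits and properties.
drug = {
    "fps": {"fp1": {1, 2, 3, 4}, "fp2": {10, 11}},
    "props": {"mol_weight": 180.0, "logp": 1.5},
}
compound = {
    "fps": {"fp1": {2, 3, 4, 5}, "fp2": {10, 11}},
    "props": {"mol_weight": 168.0, "logp": 2.0},
}

vector = fused_features(drug, compound, ["fp1", "fp2"], ["mol_weight", "logp"])
```

Vectors of this kind, one per molecule pair, could then be passed to a classifier such as a random forest, so that the model learns how much weight each similarity measure deserves instead of relying on a single one.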
Citation: Periwal V, Bassler S, Andrejev S, Gabrielli N, Patil KR, Typas A, et al. (2022) Bioactivity assessment of natural compounds using machine learning models trained on target similarity between drugs. PLoS Comput Biol 18(4): e1010029. https://doi.org/10.1371/journal.pcbi.1010029
Editor: Costas D. Maranas, The Pennsylvania State University, UNITED STATES
Received: September 6, 2021; Accepted: March 17, 2022; Published: April 25, 2022
Copyright: © 2022 Periwal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The input data files can be accessed at https://data.mendeley.com/datasets/7ft539gwf3/3, and the code for analysis and figure generation is available at https://github.com/periwal45/periwaletal2020.
Funding: VP and NG were supported by the EMBL Interdisciplinary Postdoc (EI3POD) program under Marie Skłodowska-Curie actions COFUND (grant number 664726). SB was supported by the Joachim Hertz Foundation (fellowship for Interdisciplinary Science). KRP and VP received support from the UK Medical Research Council (grant number MC_UU_00025/11). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.