Bio-activity Prediction of Drug Candidate Compounds Targeting SARS-Cov-2 Using Machine Learning Approaches
Faisal Bin Ashraf, Sanjida Akter, Sumona Hoque Mumu, Muhammad Usama Islam, Jasim Uddin
Abstract
The SARS-CoV-2 3CLpro protein is one of the key therapeutic targets of interest for COVID-19 because of its critical role in viral replication, the availability of various high-quality protein crystal structures, and its use as a basis for computationally screening for compounds with improved inhibitory activity, bioavailability, and ADMETox properties. The ChEMBL and PubChem databases contain experimental data from screening small molecules against SARS-CoV-2 3CLpro, which expands the opportunity to learn the underlying patterns and design a computational model that can predict the potency of any drug compound against the coronavirus before in-vitro and in-vivo testing. In this study, we evaluated 27 machine learning classifiers using several descriptors. We also developed a neural network model that can correctly identify bioactive and inactive chemicals with 91% accuracy on ChEMBL data and 93% accuracy on combined ChEMBL and PubChem data. The F1-scores for inactive and active compounds were 93% and 94%, respectively. We applied SHAP (SHapley Additive exPlanations) to an XGB classifier to find the important fingerprints among the PaDEL descriptors for this task. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, the proposed neural network design was efficient, and the explanatory analysis through SHAP correctly identified the important fingerprints. In addition, we validated the effectiveness of our proposed model using a large dataset encompassing over 100,000 molecules.
Introduction
Recent epidemic outbreaks have emphasized the importance of establishing affordable treatments. Discovering new small molecules, known as ligands, that are substantially bioactive against target proteins, also known as receptors, is an important step in early drug design. A substance's bioactivity, which reflects its potency and ability to have a biological effect, is critical to its pharmacological effects. Classically, promising compounds are screened using low- or high-throughput experimental bioassays; however, these approaches are costly and time-consuming, rendering them unsustainable for large molecules like proteins. To overcome these challenges, computational approaches have made significant strides in accurately and efficiently predicting the biological activity of both small and large molecules. This has resulted in the development of competitive inhibitors, which are considered to be bioactive small molecules with a specific binding affinity and which can subsequently be evaluated experimentally. The COVID-19 pandemic has increased the demand for new antiviral medications and therapies. One of the most significant challenges is the time required to finalize the chemicals for vaccine formulation, which can stymie vaccine development and have serious consequences. Although several trials by many pharmaceutical companies have been successful, using artificial intelligence to predict potential chemicals for vaccine formulation could significantly speed up the process and save lives.
Materials and methods
3.1 Data collection and curation
We collected data for bioactivity prediction from two different databases. The first is the ChEMBL database [29], which stores a number of experimental results against SARS-CoV-2. The second is BioAssay data from the PubChem database [30, 31], which contains activity data for around 300,000 compounds against SARS-CoV-2 3CLpro.
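Combining records from two sources raises the practical question of how to handle compounds that appear in both. The following is a minimal sketch, not the authors' curation procedure: it assumes each record is a hypothetical `(SMILES, label)` pair, matches compounds by exact SMILES string (a real pipeline would canonicalize the SMILES first), and, as one possible policy, drops compounds whose labels disagree between sources.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (SMILES string, "active" or "inactive").
Record = Tuple[str, str]


def merge_sources(*sources: Iterable[Record]) -> List[Record]:
    """Merge labelled compounds from several sources.

    Compounds are matched by exact SMILES string (an assumption; real data
    would need canonical SMILES). Compounds with conflicting labels are
    dropped, which is one possible policy, not necessarily the paper's.
    """
    labels: Dict[str, Set[str]] = {}
    for source in sources:
        for smiles, label in source:
            labels.setdefault(smiles.strip(), set()).add(label)
    # Keep only compounds with a single consistent label; sort for determinism.
    return sorted(
        (smiles, next(iter(seen)))
        for smiles, seen in labels.items()
        if len(seen) == 1
    )


# Toy records standing in for the two databases (illustrative values only).
chembl = [("CCO", "inactive"), ("c1ccccc1O", "active")]
pubchem = [("CCO", "inactive"), ("c1ccccc1O", "inactive"), ("CC(=O)O", "inactive")]

merged = merge_sources(chembl, pubchem)
print(merged)
```

In this toy input, the compound that is labelled differently by the two sources is excluded, and the duplicate with a consistent label is kept once.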
Results
In this study, we examined how well ensemble classifiers and conventional machine learning classifiers performed in predicting the bioactivity of several drugs that target the coronavirus. To find the descriptor that is most appropriate for the task, we generated two different sets of molecular descriptors for this experiment: Lipinski and PaDEL. The experiment had two phases. First, we applied 26 conventional and ensemble classifiers independently to each of the two descriptor sets. This experiment shows which descriptors are better suited to the task. After that, we created and tested a neural network architecture for this classification task using the more suitable chemical descriptors.
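To make the Lipinski descriptor set concrete, here is a minimal sketch of how such a descriptor vector might be represented. It assumes the set consists of the four quantities used in Lipinski's rule of five (molecular weight, LogP, hydrogen-bond donors, hydrogen-bond acceptors); the text above does not list them, so this is an assumption. The class name and the numeric values are hypothetical, and in practice the values would be computed by a cheminformatics toolkit rather than typed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LipinskiDescriptors:
    """Four rule-of-five quantities for one compound (assumed descriptor set)."""

    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors

    def as_features(self) -> List[float]:
        """Return the descriptors as a flat feature vector for a classifier."""
        return [
            self.mol_weight,
            self.log_p,
            float(self.h_donors),
            float(self.h_acceptors),
        ]

    def rule_of_five_violations(self) -> int:
        """Count how many of Lipinski's four thresholds are exceeded."""
        return sum(
            [
                self.mol_weight > 500,
                self.log_p > 5,
                self.h_donors > 5,
                self.h_acceptors > 10,
            ]
        )


# Hypothetical compounds with made-up descriptor values.
small_compound = LipinskiDescriptors(180.2, 1.2, 1, 4)
large_compound = LipinskiDescriptors(720.9, 6.1, 6, 12)

print(small_compound.as_features(), small_compound.rule_of_five_violations())
print(large_compound.as_features(), large_compound.rule_of_five_violations())
```

A vector this short carries far less structural information than a substructure fingerprint, which is one plausible reason to compare the two descriptor sets before choosing the input for a neural network.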
Discussion
The present study utilized classification and ensemble methods on a carefully curated dataset. The results indicated that the PaDEL descriptors were effective in predicting bioactivity, and the proposed neural network design was efficient. Additionally, support vector-based classifiers demonstrated good performance, owing to their advantage in handling high-dimensional data with a limited number of samples. The neural network models were able to identify the key features contributing to classification through in-depth learning, resulting in high accuracy in predicting the bioactivity class. To prevent overfitting, the neural network architecture was equipped with a dropout layer, and synthetic data was used in training. Recently, machine learning techniques have been utilized to find potential drugs against SARS-CoV-2. Besides generating effective molecular descriptors for coronaviruses, ML is also used to predict the bioactivity of existing drugs. In a few studies, the use of ML algorithms has been limited to in-vitro and in-vivo experiments. A study by [26] showed that the SVM classification algorithm had an accuracy of 88%. However, our method performed better, with a 93% accuracy rate. Additionally, we investigated multiple molecular description methods and identified important molecular substructures that impact bioactivity.
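Accuracy comparisons of this kind are easier to interpret alongside per-class scores, since a classifier can reach high accuracy while doing poorly on one class. The sketch below shows how accuracy and a per-class F1-score are computed from true and predicted labels. The function names and the toy labels are illustrative assumptions; they are not the study's code or data.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable
) -> float:
    """F1-score treating `cls` as the positive class: 2TP / (2TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (1 = active, 0 = inactive); illustrative only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

acc = accuracy(y_true, y_pred)
f1_active = f1_for_class(y_true, y_pred, 1)
f1_inactive = f1_for_class(y_true, y_pred, 0)
print(acc, f1_active, f1_inactive)
```

Reporting the F1-score separately for the active and inactive classes shows whether performance is balanced across both, which a single accuracy figure cannot reveal.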
Conclusion
It is evident that drug development is a costly and time-consuming process. By using the proper molecular descriptors and the capabilities of machine learning techniques, this process could be expedited efficiently. In this study, the bioactivity of drug compounds against the SARS-CoV-2 3CLpro protein was predicted using classic machine learning and ensemble approaches. The study examined the Lipinski and PaDEL molecular descriptors to determine whether one is more appropriate for such a classification task. Subsequently, this study also presents an efficient neural network model that outperformed competing approaches with a prediction accuracy of 93%. Our model was trained on a large dataset that was carefully curated and collected from two different data sources. In addition, the trained model was applied to a list of 1186 candidate compounds, of which 486 were identified as active inhibitors. The model identified 456 compounds as active, which were then evaluated for further activities against SARS-CoV-2 using REDIAL. We have therefore demonstrated that our approach can effectively identify undiscovered active substances.
Citation: Ashraf FB, Akter S, Mumu SH, Islam MU, Uddin J (2023) Bio-activity prediction of drug candidate compounds targeting SARS-Cov-2 using machine learning approaches. PLoS ONE 18(9): e0288053. https://doi.org/10.1371/journal.pone.0288053
Editor: Ahmed A. Al-Karmalawy, Ahram Canadian University, EGYPT
Received: February 22, 2023; Accepted: June 18, 2023; Published: September 5, 2023
Copyright: © 2023 Ashraf et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data, data extraction, and coding are shared at the link: (https://github.com/fbabd/bioactivity-against-SARS-CoV-2).
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.