By Andrea Flores Esparza and Karishma Krishna
In 2020, more than 2.5 million animals were used in scientific research in the UK.1 The use of animals in pharmaceutical development pipelines poses significant ethical, financial, and translational challenges – and there is an urgent need for technologies that facilitate efficient drug discovery whilst reducing the need for animal testing. Chemometric methods – often incorporating machine-learning algorithms (MLAs) – can be used to establish quantitative structure-metabolism relationships (QSMRs) that help identify and prioritise candidate compounds at early stages of development.
There are many MLAs that enable statistical analysis of high-dimensional multivariate datasets. Some of the most common include Random Forests, Principal Component Analysis (PCA) and (Orthogonal) Partial Least Squares ((O)PLS).
Random Forests are a supervised MLA used for predictive modelling. Within a Random Forest model is a collection of decision trees, where each decision tree generates a result, and the final model prediction is the aggregate of results from each individual decision tree. For regression problems, the model prediction is the mean of the results from all decision trees.2
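The averaging step described above can be illustrated with a minimal sketch. It assumes NumPy, a made-up one-dimensional dataset, and one-split trees ("stumps") fitted to bootstrap samples; the helper names `fit_stump` and `predict_stump` are hypothetical, and a real Random Forest would also grow deeper trees and subsample the predictors at each split.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree (a 'stump') on a single predictor."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to predicting the mean
        return x[0], y.mean(), y.mean()
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

# Made-up illustrative data (not the data from this project)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = np.sin(x) + rng.normal(0, 0.1, size=60)

# Each tree is trained on a bootstrap sample (drawn with replacement)
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(x), size=len(x))
    stumps.append(fit_stump(x[idx], y[idx]))

x_new = np.array([1.0, 5.0, 9.0])
per_tree = np.array([predict_stump(s, x_new) for s in stumps])  # shape (25, 3)

# Regression forest prediction: the mean over all trees
forest_prediction = per_tree.mean(axis=0)
print(forest_prediction)
```

Each row of `per_tree` holds one tree's predictions, and the forest's output is simply the column-wise mean.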
PCA, on the other hand, is an unsupervised MLA used to reduce the dimensionality of large datasets whilst preserving as much of the variability of the original data as possible.3 It does this by creating uncorrelated linear combinations of the original variables, called principal components. When the original variables are strongly correlated, most of the information is compressed into the first two or three principal components, which can then be plotted and easily interpreted.4
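As a rough illustration of how principal components are obtained, the sketch below computes them from the singular value decomposition of a mean-centred matrix. It assumes NumPy and a simulated descriptor matrix (30 hypothetical compounds, 5 correlated descriptors); in chemometrics practice the columns are often also scaled to unit variance, which is omitted here.

```python
import numpy as np

# Simulated data: 5 descriptors driven by 2 underlying factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(30, 5))

# PCA via SVD of the mean-centred data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                       # coordinates of each compound on the components
loadings = Vt.T                      # weight of each original variable per component
explained = S**2 / (S**2).sum()      # fraction of variance per component

print(np.round(explained, 3))
```

Plotting the first two columns of `scores` against each other gives the familiar scores plot, in which compounds lying far from the rest stand out as potential outliers.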
PLS regression is a multivariate MLA used for predictive modelling of numeric dependent variables (Y) using a set of predictor variables (X). PLS reduces dimensionality whilst focusing on covariance by finding orthogonal latent variables that maximise the covariance between X and Y.5 OPLS is a modification of the PLS model which removes variation within the input data X that is not correlated to the dependent variable Y. This reduces complexity and improves interpretability of the model – however, this does not improve the predictive ability of the model.6
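The idea of extracting latent variables that maximise covariance with Y can be sketched for a single response (PLS1) using a NIPALS-style loop. This is a simplified illustration assuming NumPy and simulated data; the function names `pls1_fit` and `pls1_predict` are hypothetical, and it is not the software or the exact algorithm used in the project described here.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (PLS1) with a NIPALS-style loop."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean           # mean-centre X and y
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                          # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                             # latent variable (score vector)
        p = Xk.T @ t / (t @ t)                 # X loading
        c = yk @ t / (t @ t)                   # regression of y on the score
        Xk = Xk - np.outer(t, p)               # deflate X
        yk = yk - c * t                        # deflate y
        W.append(w); P.append(p); T.append(t); q.append(c)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # coefficients in the original X space
    return x_mean, y_mean, coef, T

def pls1_predict(model, X):
    x_mean, y_mean, coef, _ = model
    return (X - x_mean) @ coef + y_mean

# Simulated predictors and response (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=40)

model = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(model, X)
scores = model[3]
print(np.sqrt(np.mean((y - y_hat) ** 2)))
```

Because X is deflated after each component, successive score vectors are orthogonal to one another, which is what allows a handful of latent variables to summarise many correlated predictors.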
But how exactly are these mathematical models useful in drug development? In short, these pattern-recognition methods (PRMs) make it possible to build predictive models that describe the metabolic pathways compounds tend to follow, based on their physicochemical characteristics. They can handle high-dimensional data and produce robust quantitative and/or qualitative predictive models using multivariate statistics.7,8 The output is an algorithmic model that predicts a compound’s metabolic fate before experimentation.9 Interpreting these models requires care, as they assume that a compound’s metabolic fate is determined solely by its molecular characteristics.10 With the right judgement, candidate compounds that do not meet the desired requirements can therefore be discarded before entering in vitro and in vivo experimentation.11 Doing so ensures that only viable compounds are developed further, saving resources and money, and preventing unnecessary animal suffering.
The use of these mathematical models in predicting the metabolic fate of a compound is illustrated through our current project in which we created models to predict the rate of in vitro glycine conjugation of monosubstituted benzoic acids using their molecular physicochemical properties as predictors. Each model was trained on a subset of the data, then tested using unseen data to assess the predictive ability of the model.
The Random Forest regression models we generated had poor predictive ability and were overfitted, as indicated by the Root Mean Square Error (RMSE), the metric used to evaluate them. A limitation of Random Forests is their inability to extrapolate: the models cannot predict values outside the range of the training data. This causes problems when the training and testing datasets have different ranges and distributions, and may explain our poor model predictions.
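The extrapolation limit follows from how regression trees predict: each leaf returns the mean of the training responses that fall into it, so no prediction can exceed the largest training value, and averaging many such trees does not change that. The sketch below shows this with a single small tree, assuming NumPy and a made-up linear relationship; `fit_tree` and `predict_tree` are hypothetical helper names.

```python
import numpy as np

def fit_tree(x, y, depth=3):
    """Recursively fit a small 1-D regression tree; leaves store the mean of y."""
    if depth == 0 or len(np.unique(x)) < 2:
        return float(y.mean())
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold)
    threshold = best[1]
    mask = x <= threshold
    return (threshold,
            fit_tree(x[mask], y[mask], depth - 1),
            fit_tree(x[~mask], y[~mask], depth - 1))

def predict_tree(tree, value):
    """Walk down the tree until a leaf (a plain float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if value <= threshold else right
    return tree

# Made-up training data: y = 2x, observed only for x between 0 and 5
x_train = np.linspace(0.0, 5.0, 26)
y_train = 2.0 * x_train
tree = fit_tree(x_train, y_train)

inside = predict_tree(tree, 2.5)     # within the training range
outside = predict_tree(tree, 10.0)   # true value would be 20
print(inside, outside, y_train.max())
```

For `x = 10` the tree returns the value of its right-most leaf, which cannot be larger than the maximum response seen in training, even though the underlying relationship keeps increasing.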
On the other hand, the models generated by PLS and OPLS had greater predictive ability, as indicated by a high goodness of prediction (Q2). Prior to conducting PLS and OPLS, we created a PCA model to describe the data as a whole. This PCA model not only showed how each of the predictor variables affects the rate of glycine conjugation of the monosubstituted benzoic acids, but also identified outliers within the dataset. The outliers were the nitrobenzoic acids, suggesting that this substitution affects the rate differently. For this reason, we created two versions of each PLS and OPLS model: one that included the outliers and one that did not. To our surprise, the PLS and OPLS models that contained the outliers had a lower prediction error, as measured by RMSE. This could be explained by the fact that the training and testing splits differed between the datasets with and without the outliers. When this issue was addressed, the RMSE for the models that did not include the outliers decreased. Furthermore, when interpreting the results it is important to bear in mind that these models assume the predictor variables are normally distributed, which can reduce the accuracy of the model when that assumption does not hold.
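For readers unfamiliar with the two metrics, the sketch below computes RMSE and one common form of Q2, namely 1 minus the ratio of the prediction error sum of squares to the total sum of squares of the observed values. The exact Q2 definition varies between software packages (it is often computed from cross-validated predictions), so this formula is an assumption rather than the one used in our analysis, and the numbers are made up for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def q2(observed, predicted):
    """Goodness of prediction: 1 - PRESS / total sum of squares (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - press / total

# Made-up values for held-out compounds (not results from this project)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(round(rmse(observed, predicted), 3))  # approximately 0.158
print(round(q2(observed, predicted), 3))    # approximately 0.98
```

A lower RMSE and a Q2 closer to 1 both indicate better predictions. RMSE is expressed in the units of the response, whereas Q2 is unitless, which makes it easier to compare across models.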
This research question has also been addressed using in vivo data by B. C. Cupid and colleagues, who quantified the urinary excretion of the metabolic conjugates of each monosubstituted benzoic acid in rats and rabbits.9 They successfully classified these compounds according to their metabolic fate using PRMs such as PCA. Doing so enabled them to establish QSMRs linking the compounds’ physicochemical properties to their metabolic fate.
Quantitative structure-activity relationships (QSAR) are an established component of the modern drug discovery process, with QSAR being successfully implemented as a virtual screening tool to identify compounds with activity against a certain biological target.12 As an extension of QSAR, QSMR was developed and is now also being implemented in drug discovery with published works showing successful generation of models to predict the metabolic fate of compounds.9,11,13 Our project demonstrates how predictive QSMR models can be used to investigate and hopefully further our understanding of monosubstituted benzoic acid metabolism. Going forward, the combination of QSMR models for different organic compound classes would be useful in generating a system that can predict the metabolic fate of structurally complex molecules and may even be applied to novel drugs.11
- Understanding Animal Research. Number of animals used in scientific research. https://www.understandinganimalresearch.org.uk/animals/numbers-animals/#Numbers%20of%20animals%20used%20in%202016 [Accessed 29th September 2021].
- IBM. What is random forest? https://www.ibm.com/cloud/learn/random-forest [Accessed 29th September 2021].
- Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016;374(2065): 20150202. https://doi.org/10.1098/rsta.2015.0202
- Jaadi Z. A Step-by-Step Explanation of Principal Component Analysis (PCA). https://builtin.com/data-science/step-step-explanation-principal-component-analysis [Accessed 29th September 2021].
- Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems. 2001;58(2): 109-130. https://doi.org/10.1016/S0169-7439(01)00155-1
- Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics. 2002;16(3): 119-128. https://doi.org/10.1002/cem.695
- Kwon S, Bae H, Jo J, Yoon S. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinformatics. 2019;20(1): 521. https://doi.org/10.1186/s12859-019-3135-4
- Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences. 2003;43(6): 1947-1958. https://doi.org/10.1021/ci034160g
- Cupid BC. Quantitative structure-metabolism relationships (QSMR) using computational chemistry: pattern recognition analysis and statistical prediction of phase II conjugation reactions of substituted benzoic acids in the rat. Xenobiotica. 1999;29(1): 27-42. https://doi.org/10.1080/004982599238795
- Yoo C, Shahlaei M. The applications of PCA in QSAR studies: A case study on CCR5 antagonists. Chemical Biology & Drug Design. 2017;91(1): 137-152. https://doi.org/10.1111/cbdd.13064
- Athersuch TJ, Wilson ID, Keun HC, Lindon JC. Development of quantitative structure-metabolism (QSMR) relationships for substituted anilines based on computational chemistry. Xenobiotica. 2013;43(9): 792-802. https://doi.org/10.3109/00498254.2013.767953
- Neves BJ, Braga RC, Melo-Filho CC, Moreira-Filho JT, Muratov EN, Andrade CH. QSAR-Based Virtual Screening: Advances and Applications in Drug Discovery. Frontiers in Pharmacology. 2018;9: 1275. https://doi.org/10.3389/fphar.2018.01275
- Cupid BC, Beddell CR, Lindon JC, Wilson ID, Nicholson JK. Quantitative structure-metabolism relationships for substituted benzoic acids in the rabbit: prediction of urinary excretion of glycine and glucuronide conjugates. Xenobiotica. 1996;26(2): 157-176. https://doi.org/10.3109/00498259609046697