Ensemble learning application to discover new trypanothione synthetase inhibitors

Trypanosomatid-caused diseases are among the neglected infectious diseases with the highest disease burden, affecting about 27 million people worldwide and, in particular, socio-economically vulnerable populations. Trypanothione synthetase (TryS) is considered one of the most attractive drug targets within the thiol-polyamine metabolism of typanosomatids, being unique, essential and druggable. Here, we have compiled a dataset of 401 T. brucei TryS inhibitors that includes compounds with inhibitory data reported in the literature, but also in-house acquired data. QSAR classifiers were derived and validated from such dataset, using publicly available and open-source software, thus assuring the portability of the obtained models. The performance and robustness of the resulting models were substantially improved through ensemble learning. The performance of the individual models and the model ensembles was further assessed through retrospective virtual screening campaigns. At last, as an application example, the chosen model-ensemble has been applied in a prospective virtual screening campaign on DrugBank 5.1.6 compound library. All the in-house scripts used in this study are available on request, whereas the dataset has been included as supplementary material.


Introduction
Protozoa from the trypanosomatid lineage (which include leishmaniasis, Chagas disease and African sleeping sickness) affect more than 27 million people worldwide and are the etiologic agents of some of the neglected diseases with the highest disease burden [1]. Their treatments are exclusively dependent on chemotherapy since no vaccines are available. They primarily affect populations living in conditions of socio-economic vulnerability and limited access to healthcare facilities. For example, only 10% of the 6-7 million people suffering from Chagas disease are diagnosed, and less than 1% have access to treatment [2]. The available treatment relies on only two drugs, benznidazole and nifurtimox, which were developed more than 50 years ago. Both drugs are primarily used for treating acute cases, as their efficacy in the chronic stage of the disease is limited, or at least controversial [3][4][5]. Altogether, this highlights the historical lack of investment in novel therapeutic solutions to fight neglected conditions.
The thiol-polyamine metabolism of trypanosomatids has long been regarded as a suitable pharmacological target because of its indispensable role in redox homeostasis and the uniqueness of several of its molecular components [6]. In these organisms, the low-molecular mass thiol-polyamine conjugate trypanothione (bisglutathionyl spermidine) [7] takes over most redox functions carried out by glutathione in other living organisms [8,9] ( Fig. 1). Trypanothione and the enzyme in charge of its synthesis, namely trypanothione synthetase (TryS), are absent in mammals, which turns the latter an attractive drug target for the development of selective therapeutic agents against trypanosomatid-caused conditions. In this regard, the essentiality and druggability of TryS have been genetically and chemically verified in the major pathogenic species [10][11][12]. Moreover, metabolic control analysis positions TryS among the pathway components of the trypanothione-dependent redox system most vulnerable to therapeutic intervention [13]. This study predicted that a reduction of 50% in the trypanothione synthesis flux can be achieved by partial (63%) inhibition of TryS.
TryS inhibitors have been reported using experimental approaches that included the development of substrate-like and mechanistic-based inhibitors [14][15][16] as well as small to large random or scaffold-focused screening campaigns [17,18]. So far, there are no reports on the identification of TryS ligands/inhibitors using computational or structure-based strategies. Both approaches are commonly applied in the field of medicinal chemistry and represent powerful tools that may speed-up and increase the chances of identifying novel molecular entities with drug-like properties and TryS inhibitory activity. A consistent in silico screening strategy demands the development of robust analytical tools to mine chemical libraries and predict with high confidence the inhibitory potential of a molecule against a specific molecular target.  1 Schematic representation of the low-molecular mass thioldependent redox systems of trypanosomatids and mammals. a The low-molecular mass thiol-dependent redox system of trypanosomatids depends on bisglutathionylspermidine (trypanothione), which is stepwise synthesized by trypanothione synthetase (TryS) at the expense of ATP and the substrates glutathione (GSH), spermidine (Sp) and monoglutathionylspermidine (Gsp). The NADPH + -dependent enzyme trypanothione reductase (TR) recycles reduced trypanothione (T(SH) 2 ) from its oxidized/disulfide form (TS2). The oxidoreductase tryparedoxin (TXN) and, to a minor extent, glutaredoxin (Grx) receive reducing equivalents from T(SH) 2 . Reduced TXN and Grx catalyze the reduction of several protein and non-protein homo-or hetero-disulfides (XS 2 and XS-SX´) that fulfil different and indispensable functions in the cells. b Mammals lacks TryS and, hence, trypanothione, and their low-molecular mass thiol-dependent redox system relies on the use of the tripeptide glutathione. Glutathione is maintained in its reduced form (GSH) by the action of the NADPH + -dependent enzyme glutathione reductase (GR), which is absent in trypanosomatids. Grx uses GSH as redox cofactor to catalyze the oxidoreduction of different protein targets engaged in the regulation of important cellular processes Thus, as a first step toward the identification of new TryS inhibitors, we report the development and validation of TryS ligand-based models. The individual models were subsequently combined through ensemble learning to improve their robustness and predictive ability. Noteworthy, the models were inferred from a dataset combining compounds with activity data retrieved from the literature and from results of TryS assays measured in house. A prospective screening campaign on the DrugBank database was performed applying the model ensemble that showed the best performance.

Dataset collection
A dataset of 401 compounds previously tested against Trypanosma brucei TryS was collected: One hundred and fortyone of such compounds were retrieved from the literature [10,12,[14][15][16][17][18][19][20][21][22], whereas the data on the remaining 260 compounds were acquired in house, as described earlier [23]. Standardized molecular representations of the dataset compounds were initially obtained using Instant Jchem's Standardizer (v 16.10.10.0) (ChemAxon), applying the following commands: Clean 2D, Aromatize, Strip salts, Clear stereo, Remove absolute stereo, Remove solvents, and Add explicit hydrogens. The compounds were flagged as either ACTIVE or INACTIVE based on their experimentally observed inhibitory data. Those that inhibited TryS by more than 50% at concentrations ≤ 30 µM were assigned to the ACTIVE class; if not, they were assigned to the INACTIVE class. Considering such criterion, the dataset comprised 111 active compounds and 290 inactive compounds. The diversity of the entire dataset and within each class is graphically displayed in the heatmap in Fig. 2

Molecular descriptors calculation
A set containing 66,128 conformation-independent descriptors of the TryS datasets was calculated using various freely available and open-source software packages, as described below, with compounds provided in MDL sdf format.

Molecular descriptors pruning
The 66,128 molecular descriptors computed were analyzed to eliminate collinear descriptors with redundant information. Thus, highly correlated descriptor pairs were identified (descriptors pairs with maximum squared correlation coefficient R 2 max ij = 1 ), and only the most interpretable variable from each pair was maintained for further analysis. Moreover, descriptors with scarce information content (those with constant and near-constant values) and descriptors with missing values were also eliminated. This procedure led to a final descriptor pool comprising 44,209 linearly independent descriptors.

Dataset partition
The molecular set was partitioned into training, validation, and test sets by means of the Balanced Subsets Method (BSM) [29]. The training set (train) includes compounds that are used for model calibration purposes, the validation set (val) comprises instances used for internal model validation, while the test set (test) comprises compounds completely independent from the calibration procedure, which help to assess the true predictive power of the obtained QSAR.
The BSM approach is a sampling procedure that was developed by our group to ensure that balanced, representative subsets are derived from the dataset, in such a way that training, validation and test instances are not chosen randomly but provide similar structure-activity relationships within each subset. It is based on k-means cluster analysis (k-MCA) [30]: in essence, k-MCA creates k groups of compounds in such manner that the compounds within a given cluster are very similar in terms of a distance metric (here, Euclidean distance), whereas compounds in different clusters are very dissimilar.

Variable subset selection
The Replacement Method (RM) [31] has been devised in our group as an efficient variable selection approach to build multiple linear regression models from training samples, by searching for a subset comprising d descriptors through a large pool of D descriptors (d being much lower than D). In the current study, the RM technique has been modified in order to be applicable to classification problems. Therefore, the highest value of the non-error rate (NER) or average sensitivity [32] has been searched in the training set. The quality of the models built by this method is close to that of the models obtained through a full search of all the possible combinations of d independent variables from a pool of D variables.
The fundamental principle underlying the RM is that one can approach the maximum of NER train by sensibly considering the relative errors of the coefficients of the least-squares model provided by the subset of d descriptors. This is to say, the global maximum of NER train (d) will be pursued in a All the Octave [33] programmed algorithms used for this study have been developed in-house and are available to the reader upon request.

Retrospective screening campaigns
By conducting thorough simulated ranking experiments, Truchon and Bayly [34] have formerly proven that the area under the receiver operating characteristic curve (AUC ROC), a commonly used enrichment metric to evaluate and compare the performance of virtual screening methods, depends on the proportion of active compounds in the screened library, and that the standard deviation of this metric tends to a constant value when a small proportion of active compounds (or yield of active compounds, Ya) is present in the screened library (Ya below 0.05 seems to provide robust results). What is more, a small Ya avoids or at least limits the saturation effect. Accordingly, a large number of decoys (about 1,000 or above) and a small Ya contribute to a controlled statistical behavior.
Therefore, a retrospective virtual screening experiment was implemented for further validation of the individual classifiers (which were built as described in the preceding subsections). A library for such retrospective screening campaign, which will be termed DUDE-1, was obtained by seeding 18 known active compounds among 913 decoys retrieved from the Directory of Useful Decoys Enhanced [35]), a broadly used benchmarking application used to obtain putative inactive compounds. Besides assessing the performance of the individual classifiers, DUDE-1 was used to train classifier ensembles (see next subsection). A second retrospective screening library (DUDE-2), obtained in a similar manner than DUDE-1 and comprising 18 active compounds and 872 decoys was utilized to independently estimate the performance of the best model ensemble.

Ensemble learning
Generally speaking, it is recognized that classifier ensembles (meta-classifiers) tend to enhance the predictivity and robustness in comparison with individual (single model) classifiers [36,37] and it may be useful to mitigate the impact of noisy data [38].
In the present study, the best individual classifiers were hence systematically combined using the AUC ROC in the first retrospective screen (DUDE-1) as criterion of performance. To choose the optimal number of individual classifiers to be included in the ensemble, the performance of systematic combinations of the 2 to 10 best-performing individual classifiers was assessed (the two best-performing classifiers were assembled, then the three best-performing classifiers, and so on up to a total of 10 classifiers). Four schemes have been used to get a combined score: Average Score (AVE); Average Ranking (RANK), MIN operator (which considers the minimum score across the scores of the individual classifiers included in an ensemble) and Average Voting (VOT). Voting was computed as formerly described by Zhang and Muegge [39]. AUC ROCs were computed with the pROC package [40] and Delong's method was applied for the statistical comparison of the AUC ROC using the open-source application Rocker [41]. BEDROC (alpha = 20) and the enrichment factor in the top-ranked 1% of the libraries (EF1%) were also computed.

Use of positive predictive value surfaces to guide the choice of a score cut-off value
A practical concern in any virtual screening campaign relates to estimating the probability that a predicted in silico hit will corroborate its activity when subject to experimental testing (such probability is termed Positive Predictive Value, PPV, also known as precision). Still, prospective estimation of such probability is not feasible owing to its dependency on the yield of active compounds of the screened library, which is a priori unknown in a prospective (real) virtual screening experiment: where Se symbolizes the sensitivity associated to a given score cut-off value, and Sp denotes the specificity. As in previous reports (see, for instance, [42] and [43]), we applied Eq. (1) to build PPV surfaces. In order to decide on an optimal score cut-off value to later select hits in prospective virtual screening experiments, 3D plots displaying the interplay between the Se/Sp ratio, Ya and PPV were generated for the best individual classifier and for every classifier ensemble. Using the library built for the retrospective screening experiment, Se and Sp were calculated in all the range of possible score cut-off values. Because controlled statistical behavior has been observed for libraries of about 1,000 compounds or more and Ya below 0.05, we may reasonably assume that the ROC curve and its derived metrics will be similar when applying the classifiers or meta-classifiers to screen other chemical libraries with similar features. Bearing in mind that in prospective virtual screening experiments Ya is ignored beforehand but always low, the hypothetical yield of active compounds has been varied between 0.1% and 1%. PPV graphs were obtained via the R package plotly. PPV surfaces were visually inspected to select a score cut-off associated to the required PPV range.

Prospective virtual screening
The best model ensemble was applied to screen 10,759 compounds from DrugBank 5.1.6, an online database containing extensive information about approved, withdrawn, experimental and investigational small drugs and biologics, as well as nutraceuticals and illicit drugs [44]. DrugBank is commonly utilized for computer-guided drug repurposing campaigns. The screened compounds were pre-processed using Standardizer 16.10.10.0 (ChemAxon). The following actions were applied to obtain standardized representations of the molecular structure for the subsequent in silico screen: Whether each screened compound belonged or not to the applicability domain of the models included in the best ensemble was estimated through the extent of extrapolation approach [45], with the warning leverage fixed at 2 k/n, k being the number of descriptors in the model and n being the number of training set examples.

Results and discussion
The BSM procedure led to a dataset partition with 74:73:254 representative compounds in the training, validation, and test sets, respectively. The training and validation sets were sampled in such a way that they contain the same number of active and inactive compounds. Thus, active compounds were in a 37:37:37 proportion, while inactive compounds are in a 37:36:217 proportion.
Equation (2) Table 1 includes the classification results for the best models established in the present QSAR study through the RM technique; here, different classification measures are reported: the non-error rate (NER), accuracy (Acc) and the Matthews correlation coefficient (MCC).
The best model is highlighted in Table 1 (M7), which had the best performance in the validation set for the NER parameter. This QSAR classification model involves five conformation-independent molecular descriptors selected out of 44,209 variables: two fragment descriptors: Frag43, the count of C-N-C*C*C-Cl fragments (where * denotes aromatic), and Frag295, the count of C*N*N-C-C-N fragments; a 2D-autocorrelation descriptor: ATSC5p, the centered Broto-Moreau autocorrelation-lag 5/weighted by polarizabilities; and two algebraic MD-LOVIs indices:
The ATSC5p descriptor is an autocorrelation function between selected atom pairs of the molecule, with the main purpose of capturing the degree of interaction between atoms at a given topological distance (lag). The contributions to this bidimensional autocorrelation are obtained by considering at lag 5 the atomic polarizabilities involved.
The two MD-LOVIs local descriptors represent an alternative strategy that generalizes the traditional method of obtaining a topological descriptor by summation of the local vertex invariants. The first one is 1CN-NE(DE)-E-WH-HT, which describes heteroatoms, and the second one is 2TS-NE(DE)-R-WH-GL_1_2_3_4_5_6_7_8, describing group lags 1-8. The hydrogen-filled graph is used as molecular representation in both cases. The Pauling's electronegativity (E) and the covalent radius (R) are used, respectively, as atomic weights for characterizing the nature of atoms. For the first descriptor, the Kier-Hall connectivity of order 1 (1CN) operator is first applied on the atomic weights of the heteroatoms, while for the second descriptor the total sum of 2 lags (2TS) operator is applied. The standard deviation is finally used as aggregation operator of these no-standardized LOVIs. Noteworthy, all the regression coefficients are above zero (positive terms), indicating that all the features described above directly contribute to the inhibitory activity against TryS. For instance, the counts of C-N-C*C*C-Cl and C*N*N-C-C-N directly correlate with inhibitory activity, indicating that the presence of aromatic heterocycles containing nitrogen atoms (e.g., pyridine or pyrrole rings) are favorable to activity. Similarly, inhibitory activity is seemingly potentiated by the present of atoms with relatively high product of their correspondent atomic polarizability at a topological distance of five, as suggested by ATSC5p. Table 2 shows the AUC ROC of the single best-performing model and the model ensembles obtained by systematically combining 2-10 models using the AVE, RANK, MIN and VOT operators (first retrospective virtual screening, DUDE-1). The standard deviation estimated by bootstrapping is also included. As expected, the performance of the model ensembles is, in all cases, significantly better (p < 0.05) than the one of the best individual model. This seems to confirm the ability of ensemble learning to enhance predictivity. On the other hand, the robustness is also apparently improved through ensemble learning, as judged from the comparatively lower standard deviations of the best-performing ensembles compared with the one of the best individual model.
By jointly considering the performance (based on the AUC ROC metric), the standard deviation of such metric and the principle of parsimony, we chose the 4-model ensembles for further validation (note that these ensembles behave practically identically or, in some cases, better than the 5-. 6-and 7-model ensembles, but with a smaller number of combined models). The principle of parsimony or Occam's razor indicates that, given two solutions of similar performance, the simplest one should be kept.
After selecting the 4-model ensembles for further analysis, we relied on PPV surface analysis to decide on an optimal score threshold for forthcoming prospective virtual screening applications. With the support of PPV surfaces, the relationship between the Se/Sp ratio and the PPV or precision (i.e., the probability that the predicted activity of an in silico hit will be verified when subject to experimental validation) can be visually (or, eventually, mathematically) optimized across a pertinent range of Ya values. Note that, from a pragmatical viewpoint, PPV is possibly the most relevant metric when realizing virtual screening experiments, as it allows deciding what number of in silico hits should be acquired and tested to expect one true, verified experimental hit. For such analysis, we have considered that the association between the Se/Sp and the score values of the model ensembles observed in the first retrospective screen would remain approximately unmodified when conducting screens on other libraries. This strong assumption is not necessarily Table 2 Performance of the best individual classifier and the classifier ensembles in the first retrospective screening experiment (DUDE-1) The AUC ROC and its correspondent standard deviation (sd) are shown for the best individual model and for each model ensemble. The 4-model ensembles (in bold) displayed, in general, the best performance (highest AUC ROC with comparatively low sd). The 2-to 7-model ensembles showed excellent performance (AUC ROC above 0.99) and low dispersion (sd < 0.010), no matter which combination scheme was utilized. Remarkably, the performance dropped from the combination of 8 models and beyond, with a considerable raise in the standard deviation of the metric. true. Since the AUC ROC values obtained for the retrospective screen are, however, undeniably high (above 0.99, which indicates a nearly ideal performance), while on the other hand, the yield of actives (0.02) and size (≈1000 molecules) of the library favor a controlled statistical behavior [33], our assumption is sensible in the present setting (also note the extremely low standard deviations obtained by bootstrapping for the ensemble models). The resulting PPV surfaces are shown in Fig. 3. All the combination schemes resulted in similar surfaces (similar shape and height) except for AVE, which seems to provide smaller PPVs than the other operators. The 4-model ensembles were further validated through our second retrospective virtual screening experiment (DUDE-2). Table 3 shows the correspondent results, in comparison with the best-performing individual model. Note that, in good agreement with previous reports [42,43,46], the rather conservative MIN operator (which assigns as the ensemble output the lowest score given by the individual models comprised by the ensemble) seems to exhibit the best and most robust performance, either in terms of average performance across the whole ranking (as reflected by AUC ROC) or in terms of early recognition (quantified by BEDROC and EF1%). The comparisons of the AUC ROCs for the best individual model and the 4-model ensembles for both the first and second retrospective screen are shown in Fig. 4. Table 4 displays the optimal score cut-off value selected for the 4-model ensembles for each combination scheme, and the associated PPV (Ya = 0.01) obtained from the second retrospective screen. It can be observed that the PPV associated to the optimal cut-off value for the 4-model ensemble obtained with the MIN operator is higher than the ones achieved with the other schemes. These theoretic values suggest that, if a prospective screening were implemented and the yielding of active compounds was 1%, about one in three in silico hits would be a true, confirmed hits at the experimental validation step. Table 5 shows the ten top-ranked in silico hits emerging from the prospective virtual screen of the DrugBank database. It can be appreciated that the compounds in the

Conclusions
Linear classifiers to identify TryS inhibitors were derived using the Replacement Method and a pool of molecular descriptors entirely computed through publicly available and open-source software, which maximizes the portability of the obtained models (furthermore, all the in-house scripts used are available on request). Remarkably, a substantial proportion of the dataset used here (65%) corresponds to highly standardized in-house acquired data of inhibitory activity against T. brucei TryS, which can be regarded as low noise data, Whereas the individual models obtained showed good performance at the validation and retrospective screening steps, they were considerably outperformed by ensemble learning. The so derived meta-classifiers displayed not only improved enrichment metrics, but also a   more robust behavior according to the standard deviation in the enrichment metrics estimated by bootstrapping. The score cut-off values of the best-performing model combinations were rationally optimized though inspection of Positive Predictive Value surfaces; the optimized score threshold was then applied here in the illustrative prospective screening of DrugBank, a database commonly used for computer-aided drug repurposing implementations. It is worth mentioning that out models are only predictive of inhibitory effects on TryS and should thus be complemented with other in silico filters or in vitro assays related to pharmaceutically relevant properties, such as drug bioavailability (e.g., Lipinski rules, models to predict P-glycoprotein efflux liability, etc.). Computer-aided drug discovery represents a key strategy for the identification of new active scaffolds in a cost-and time-efficient manner, which is particularly relevant when seeking for novel therapeutic solutions for neglected conditions, such as trypanosomatid-caused diseases.