Application of a novel ranking approach in QSPR-QSAR

In this study we present a simple algorithm based on the Partial Order Ranking (POR) technique that ranks a series of compounds according to their molecular descriptor values. A training set composed of 82 normal boiling points for structurally diverse organic compounds is analyzed by considering a pool of 1202 molecular descriptors obtained from the Dragon 5 software and two “flexible” types of variables. The predictive performance of the proposed approach is assessed by means of a test set of 82 “unknown” structurally related molecules.


Introduction
The prediction of physicochemical and biological data of substances through the application of Quantitative Structure Property-Activity Relationships (QSPR-QSAR) theory has acquired increasing importance in recent decades. This is especially so when the experimental values of an endpoint cannot be determined in the laboratory, for example for economic reasons or simply because the measurements demand too much time. QSPR-QSAR studies are considered to be the most effective computational approaches for the estimation of different types of properties [1].
Although a great number of molecular descriptor definitions are available in the literature, it is well known that a single variable is unable to carry all the information on molecular structure, and this leads to the use of more parameters in the QSPR-QSAR relationship. Nowadays, different standard statistical methods are common practice for model design, such as Multivariable Linear Regression (MLR) [2], Principal Component Analysis (PCA) [3], Partial Least Squares (PLS) [4], Genetic Algorithms [5][6][7] or Artificial Neural Networks (ANN) [8]. However, all of these elaborate techniques require knowledge of a specific functional form of the model (linear or non-linear) and optimized regression parameters in the equation, which may still not lead to the best results. QSPR-QSAR studies are usually based on such complex statistical analyses and sophisticated local and global descriptor definitions.
The Partial Order Ranking (POR) approach provides an interesting and simplified alternative for establishing the desired structure-property connection, as it does not depend on statistics. It is a parameter-free technique that avoids the definition of analytical functional relationships and the use of optimized parameters. Consequently, it offers a clear advantage when experimental data are scarce, which is one of the main drawbacks of statistical procedures.
The aim of this research is to introduce a new algorithm, developed recently in our group, that considers different sorting ideas. The development of an efficient and practical technique for performing the best quality predictions of a given endpoint is not an easy task. Many programs have to be written to test ideas and to decide whether the algorithm sought tends to reproduce the "tendencies" in the numerical data. For ranking-based algorithms, this design involves two main objectives:
(a) locating the upper and lower limits of the experimental property interval within which a compound is to be predicted;
(b) performing the prediction in the selected interval in the best possible manner by resorting to interpolation formulae.
Clearly, if the algorithm is unable to position a compound X in an adequate interval, then the predictions performed in (b) will be of poor quality.
Step (a) is a key step. A good interval can be understood as one that is able to position compounds from both the training and test sets with accuracy. The shorter the interval, the better the predictions performed in (b) will be. The present study is concerned with step (a), using one or more molecular descriptors for ranking.

Molecular descriptors and data set
All the structures of the compounds were preoptimized by means of the Molecular Mechanics Force Field (MM+) included in Hyperchem version 6.03 [9]. Since various molecules contain sulfur atoms, final refined molecular structures were obtained using the semiempirical method PM3 (Parametric Method-3). We chose a gradient norm limit of 0.01 kcal/Å for the geometry optimization.
Several types of molecular descriptors were derived, such as constitutional, topological, geometrical, charge, GETAWAY (GEometry, Topology and Atoms-Weighted AssemblY), WHIM (Weighted Holistic Invariant Molecular descriptors), 3D-MoRSE (3D-Molecular Representation of Structure based on Electron diffraction), molecular walk counts, BCUT descriptors, 2D-Autocorrelations, aromaticity indices, Randic molecular profiles, radial distribution functions, functional groups and atom centered fragments, by means of the software Dragon version 5, available on the Web for evaluation [10]. We excluded the empirical and property-based descriptors. In addition, two flexible molecular descriptors were added to this pool of variables: the so-called Correlation Weighting of Atomic Orbitals with Extended Connectivity of Zero- and First-Order (DCW1 and DCW2), based on the Graph of Atomic Orbitals (GAO) [11].
The data set employed consists of a representative set of 200 normal boiling points (NBP) of organic molecules studied earlier [12]. In this set, 36 compounds do not obey the Similarity Principle [13]; that is, NBP is a property that includes degenerate values and assigns the same number to several substances, even though different structures are involved. These conflicting molecules were removed from the set, leaving 164 molecules to be analyzed.
Since there are many molecules for calibrating the model, we decided to partition the set into two subsets of 82 structures each, one for training the model and the other for testing its predictive performance. Notice that, as the POR relationship does not depend on regression coefficients, both subsets can have the same size, contrary to the usual situation in regression-based analyses when dealing with a great number of data [14]. The compounds belonging to the training and test series, shown in Tables 1 and 2, were chosen so as to have a representative sample of experimental NBP values in both subsets. In other words, the members of each series were chosen in such a way that the experimental values of the compounds of the test set can be interpolated in the training set.

Principles of POR-based QSPR-QSAR
The methodology of POR [15] involves an extremely simple principle from the mathematical point of view. Consider a subset $\mathbf{d} = \{d_1, \ldots, d_d\}$ containing $d$ molecular descriptors $d_i$ ($i = 1, \ldots, d$), usually called an information basis (IB). If a compound A is characterized by the subset $\mathbf{d}(A)$ and a compound B by the subset $\mathbf{d}(B)$, then the two compounds, each exhibiting an experimental property $p$, can be compared (ranked) through comparison (ranking) of their single descriptor values according to the binary relation "≤". That is to say,

$$A \leq B \iff d_i(A) \leq d_i(B) \quad \text{for all } i = 1, \ldots, d \tag{1}$$

The demand "for all i" to set up the order relation is called "The Generality Principle," and this condition transforms Partial Ordering into a vectorial approach. Each molecule is characterized by a vector whose elements are its attribute values [16]. When the inequality of equation (1) is true, A and B are said to be comparable. If all the descriptors for A are equal to the corresponding descriptors for B, the two compounds will have identical order (rank) and will be considered as "equivalent" rather than "identical," belonging to the same equivalence class. In consequence, the binary relation "≤" becomes a quasi-order. Condition (1) gives rise to a net of comparisons among the N compounds of the training set. Notice that the d descriptors participating in the POR model need to have the same designations, i.e. "high" and "low," and it may be necessary to multiply some of these descriptors by −1 in order to achieve identical designations. Furthermore, in POR-based QSPR-QSAR analyses, the set of compounds under study has to follow the Similarity Principle, and two molecules do not belong to the same equivalence class if they exhibit different property values. Therefore, the absence of equivalent molecules leads to the conclusion that, whenever the property does not involve degenerate values for different structures, a given compound C can be positioned in the net of comparisons of the model if and only if there exist two neighbors A and B with a lower and a higher rank than C, respectively. This situation is depicted in Figure 1.

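The component-wise comparison behind the Generality Principle can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `below_or_equal` and `compare` and the toy descriptor vectors are assumptions introduced here, not part of the original implementation.

```python
from typing import Sequence


def below_or_equal(d_a: Sequence[float], d_b: Sequence[float]) -> bool:
    """True when d_i(A) <= d_i(B) holds for every descriptor i (Generality Principle)."""
    if len(d_a) != len(d_b):
        raise ValueError("both compounds need the same information basis")
    return all(x <= y for x, y in zip(d_a, d_b))


def compare(d_a: Sequence[float], d_b: Sequence[float]) -> str:
    """Classify a pair of compounds under the partial order."""
    a_le_b = below_or_equal(d_a, d_b)
    b_le_a = below_or_equal(d_b, d_a)
    if a_le_b and b_le_a:
        return "equivalent"      # all descriptors coincide
    if a_le_b:
        return "A<=B"
    if b_le_a:
        return "B<=A"
    return "incomparable"        # the descriptors disagree on the order


# Hypothetical two-descriptor vectors, for illustration only.
print(compare([1.0, 2.0], [2.0, 3.0]))
print(compare([1.0, 3.0], [2.0, 2.0]))
```

With a single descriptor every pair is comparable, which is why incomparability only appears once two or more descriptors are combined.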
The algorithm
Our proposed ranking approach is also valid for a single-descriptor model net [17]. Consider a training set $a$ composed of $N$ compounds. Applying condition (1) to this set generates two different subsets, $a_1$ and $a_2$: in $a_1$ all the compounds satisfy (1), and this subset is therefore called the "ranking subset." The second subset, $a_2$, contains incomparable compounds that do not follow the rule. Notice that if $a_2$ is an empty set, then there is a total order in $a$. It is still possible to order each compound Z belonging to $a_2$ by searching among the $N - 1$ remaining compounds of set $a$ (all except Z) for its two associated neighbors, thus generating new ranking subsets, each containing three compounds.
The first step of the algorithm consists of searching for a subset of molecular descriptors $D_0$ (containing $D_0$ descriptors) from the pool $D$ (with $D$ total available descriptors), such that these variables are able to generate $N - 2$ ranking subsets for $N - 2$ compounds of the training set according to inequality (1). It is not possible to generate ranking subsets for the two compounds that have the highest ($p_{\max}$) and lowest ($p_{\min}$) values of the observed property in this set. Obviously, each descriptor in $D_0$ needs to take its highest and lowest numerical values for the compounds exhibiting $p_{\max}$ and $p_{\min}$, respectively. Otherwise, ranking subsets would not exist for all these $N - 2$ compounds. The number $D_0$ depends on the property, the training set of compounds considered, and the molecular descriptor set under investigation.
A second aspect to keep in mind is that the subset of descriptors $\mathbf{d}$ participating in the POR model must be able to identify and characterize each molecule of the training set independently. Therefore, those descriptors of $D_0$ having lower degeneracy meet this requirement better when establishing the model net. Needless to say, different combinations of d descriptors may be equally suitable for describing the property intervals of the training set satisfactorily.
In order to apply the ranking-subset intervals obtained from the training set to predict the property intervals of the "unknown" compounds belonging to the test set, the algorithm considers the concept of average ranks. A compound X not belonging to the training series may have its descriptor values lying in more than one ranking subset. Therefore, average upper and lower property limits for X have to be calculated over these subsets. The average limits are then translated to the nearest-lying experimental property values of two compounds of the training set.
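For the single-descriptor case, the search for the two neighbors of a compound can be sketched as follows. The sketch assumes that the neighbors are the training compounds with the closest descriptor values strictly below and strictly above the target, a choice the text does not spell out; the names `ranking_subset` and `property_interval` and all numerical values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def ranking_subset(target: str, descriptor: Dict[str, float]) -> Optional[Tuple[str, str]]:
    """Return (lower neighbour, upper neighbour) of `target`, or None if a side is missing."""
    x = descriptor[target]
    # Candidates are all other compounds, split by strict descriptor order.
    below = [(v, name) for name, v in descriptor.items() if name != target and v < x]
    above = [(v, name) for name, v in descriptor.items() if name != target and v > x]
    if not below or not above:
        return None  # e.g. the compounds holding the extreme descriptor values
    # Assumption: the neighbours are the closest values on each side.
    return max(below)[1], min(above)[1]


def property_interval(target: str, descriptor: Dict[str, float],
                      prop: Dict[str, float]) -> Optional[Tuple[float, float]]:
    """Experimental property interval spanned by the two neighbours of `target`."""
    neighbours = ranking_subset(target, descriptor)
    if neighbours is None:
        return None
    low, high = (prop[n] for n in neighbours)
    return min(low, high), max(low, high)


# Hypothetical toy data (not taken from Tables 1-6).
descriptor = {"A": 1.0, "C": 2.0, "B": 3.5, "E": 5.0}
nbp = {"A": 36.0, "C": 69.0, "B": 98.0, "E": 126.0}
print(ranking_subset("C", descriptor))
print(property_interval("C", descriptor, nbp))
```

Each three-compound ranking subset is represented here only by its two outer members, since the target compound is known from the call.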
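A minimal sketch of this descriptor screening is given below. It assumes that a descriptor qualifies when, possibly after multiplication by −1, the compounds with the lowest and highest property values hold its strictly lowest and strictly highest values; the function name `select_d0` and the toy pool are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def select_d0(pool: Dict[str, List[float]], prop: List[float]) -> Dict[str, int]:
    """Map each retained descriptor to the sign (+1 or -1) that orients it with the property."""
    i_max = max(range(len(prop)), key=prop.__getitem__)  # compound with p_max
    i_min = min(range(len(prop)), key=prop.__getitem__)  # compound with p_min
    selected: Dict[str, int] = {}
    for name, values in pool.items():
        for sign in (1, -1):
            oriented = [sign * v for v in values]
            lo, hi = oriented[i_min], oriented[i_max]
            others = [v for i, v in enumerate(oriented) if i not in (i_min, i_max)]
            # Every other compound must lie strictly between the two extremes,
            # so that it has a lower and an upper neighbour.
            if lo < hi and all(lo < v < hi for v in others):
                selected[name] = sign
                break
    return selected


# Hypothetical property values and descriptor pool.
prop = [10.0, 50.0, 30.0, 90.0]
pool = {
    "d_up": [1.0, 3.0, 2.0, 4.0],    # already oriented like the property
    "d_down": [8.0, 4.0, 6.0, 1.0],  # usable after multiplication by -1
    "d_bad": [1.0, 5.0, 2.0, 4.0],   # extreme value on the wrong compound
}
print(select_d0(pool, prop))
```

The sign returned for each descriptor records whether it had to be reversed to share the "high" and "low" designation of the property.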
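One simple way to compare descriptors by degeneracy is to count repeated values. The sketch below assumes that degeneracy can be measured as the number of training compounds sharing a descriptor value with another compound listed earlier; this measure, the rounding tolerance, and the function names are illustrative assumptions rather than the definition used in the study.

```python
from typing import Dict, List


def degeneracy(values: List[float], decimals: int = 6) -> int:
    """Number of compounds that repeat a descriptor value already seen in the list."""
    return len(values) - len({round(v, decimals) for v in values})


def rank_by_degeneracy(pool: Dict[str, List[float]]) -> List[str]:
    """Descriptor names sorted from least to most degenerate."""
    return sorted(pool, key=lambda name: degeneracy(pool[name]))


# Hypothetical descriptor values for three compounds.
print(degeneracy([1.0, 1.0, 2.0, 3.0]))
print(rank_by_degeneracy({"x": [1.0, 1.0, 2.0], "y": [1.0, 2.0, 3.0]}))
```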
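The average-rank step can be sketched as follows, under stated assumptions. Each ranking subset is represented by the descriptor vectors and property values of its lower and upper members; a test compound is taken to lie in a subset when its descriptors fall between those of the two members for every descriptor; and the averages are plain arithmetic means. The function name `predict_interval` and all numbers are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

# (d(A), p(A), d(B), p(B)) for the lower member A and upper member B of a subset.
Subset = Tuple[Sequence[float], float, Sequence[float], float]


def predict_interval(x: Sequence[float], subsets: List[Subset],
                     training_props: List[float]) -> Optional[Tuple[float, float]]:
    """Average the limits of every subset containing x, then snap to training values."""
    hits = [(p_a, p_b) for d_a, p_a, d_b, p_b in subsets
            if all(a <= v <= b for a, v, b in zip(d_a, x, d_b))]
    if not hits:
        return None  # descriptor values outside the net: no prediction possible
    avg_low = mean(p_a for p_a, _ in hits)
    avg_high = mean(p_b for _, p_b in hits)

    def nearest(target: float) -> float:
        # Nearest-lying experimental value in the training set.
        return min(training_props, key=lambda p: abs(p - target))

    return nearest(avg_low), nearest(avg_high)


# Hypothetical single-descriptor example.
training_props = [30.0, 45.0, 62.0, 100.0, 120.0, 130.0]
subsets = [((1.0,), 30.0, (3.0,), 100.0), ((2.0,), 62.0, (4.0,), 130.0)]
print(predict_interval((2.5,), subsets, training_props))
print(predict_interval((5.0,), subsets, training_props))
```

In this toy case the compound with descriptor value 2.5 lies in both subsets, so its averaged limits (46 and 115) are translated to the nearest training values, while the compound with value 5.0 falls outside the net and receives no interval.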

Results and discussion
In the present study, D = 1201 and $D_0$ comprises 63 descriptors for the training set of compounds shown in Table 1. Since most of the molecular descriptors of D have high orientation, and our main intention is to show the performance of the proposed algorithm, the few descriptors with low orientation were not considered in the analysis. Also, it is not our purpose here to interpret the models found in structural terms.
It is possible to try all combinations of the descriptors from $D_0$ to analyze the property intervals derived from the ranking subsets. However, the less degenerate the descriptors employed, the better the discrimination among the compounds in the model. The descriptor search for the model nets is performed in such a way that the variables provide narrow prediction intervals for the compounds of the training set and, at the same time, allow a position to be assigned to the highest number of compounds of the test set in accordance with their descriptor values.

Best single descriptor-model net
Among the 63 available descriptors, the flexible variable DCW1 is the least degenerate of all, and it also satisfies the conditions mentioned previously. Table 3 provides an illustrative example with experimental intervals for ten ranking subsets obtained with the descriptor DCW1, where the designation for each compound and its associated experimental NBP are given. The resulting 80 ranking subsets for the training set lead to the experimental NBP intervals for each compound indicated in Table 4. As mentioned in the Introduction, we are concerned here only with whether the algorithm is able to position the compounds in a certain property interval of the net, leaving the prediction stage (b) to be discussed in a future publication. As can be observed from Table 4, all the experimental intervals assign the correct position to the training molecules once these are removed successively from the set (one at a time) and relocated in the net of ranking subsets. This is a sort of leave-one-out cross-validation technique [18,19], a common practice in regression analyses. Now, as a further step to test the predictive power of the one-descriptor model, the ranking subsets are applied to predict the intervals for the 82 compounds of the test set (Table 5). Two of these compounds (1 and 82) cannot be predicted in the present situation, as these molecules have descriptor values outside the training set intervals. It should be noted, however, that if some of the compounds from the training set were excluded and the ranking subsets recalculated, it would also be possible to predict the property intervals for these compounds.
It can be seen that 29 compounds have their property intervals well predicted, and that many others have experimental NBP values close to the predicted intervals.
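The leave-one-out check described above can be sketched for a single descriptor as follows. The sketch assumes that the neighbors of a removed compound are the remaining compounds with the closest descriptor values below and above it, and that a compound counts as correctly positioned when its experimental value falls inside the interval spanned by those two neighbors; the function name `leave_one_out` and the toy values are hypothetical.

```python
from typing import List, Tuple


def leave_one_out(descriptor: List[float], prop: List[float]) -> Tuple[int, int]:
    """Return (compounds that could be positioned, compounds inside their interval)."""
    positioned = hits = 0
    for k, (x, p) in enumerate(zip(descriptor, prop)):
        # Remove compound k and look for its neighbours among the rest.
        rest = [(d, q) for i, (d, q) in enumerate(zip(descriptor, prop)) if i != k]
        below = [pair for pair in rest if pair[0] < x]
        above = [pair for pair in rest if pair[0] > x]
        if not below or not above:
            continue  # extreme compounds have no ranking subset
        positioned += 1
        low, high = sorted((max(below)[1], min(above)[1]))
        if low <= p <= high:
            hits += 1
    return positioned, hits


# Hypothetical data: a perfectly ordered set and one with two compounds swapped.
print(leave_one_out([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
print(leave_one_out([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 35.0, 30.0, 50.0]))
```

Because the removed compound's own property value is never used to build its interval, the check behaves like a cross-validation of the positioning step rather than a fit.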

Two-descriptor model net
A higher number of descriptors in the POR model tends to characterize the intervals for the training compounds better, especially when average ranks are employed, and thus tends to generate shorter property intervals. As a result, the compounds of the test set have their descriptor values lying in fewer ranking subsets, which avoids a large uncertainty when locating them inside the net. On the other hand, if such characterization is excessive, it becomes much more difficult to position the compounds of the test set in the model net. For the training set considered here, the two descriptors from $D_0$ that best differentiate among the compounds are again DCW1 and Qindex, the topological descriptor "quadratic index" [20].
As can be seen from Table 4, all the training compounds are assigned a correct interval with two descriptors when the leave-one-out procedure is applied, and these intervals tend to be narrower than those found for the single-descriptor model. The first ten ranking subsets for the two-descriptor model are included in Table 6. When this model is applied to predict the test set molecules, more compounds are found to have descriptor values outside the intervals given by the net (12 molecules in Table 5). It is also noted that 32 compounds have their property intervals well predicted, and that many others have experimental values close to the predicted intervals. This model with DCW1 and Qindex is of better quality, although it predicts a smaller number of test compounds.

Three-descriptor model net
The model containing three descriptors is the worst of the three cases presented. It is composed of the descriptors DCW1, RDF050v [21], and Qindex, with RDF050v being a Radial Distribution Function (5.0, weighted by atomic van der Waals volumes). Table 4 reveals that the length of the intervals has increased for some compounds of the training series. Table 5 also shows that 27 compounds of the test set cannot be predicted.

Conclusions
We introduced a novel algorithm based on partial ordering ideas that is able to properly assign experimental endpoint intervals to 82 training compounds, and further analyzed the predictability of the POR nets established by using a test set of 82 "unknown" compounds. The models including different numbers of molecular descriptors for characterizing the net tend to be predictive on this test set. It is very important to have the opportunity to assess the predictive performance of the training model via an external test set of molecules. The POR methodology includes an analogue of the leave-one-out cross-validation technique usually employed in regression analyses since, once the interval where a compound X is to be predicted has been located, the prediction is performed with its neighbor compounds without taking into account the experimental value of X. However, this kind of leave-one-out procedure does not guarantee that the descriptors involved in the POR model are able to rank compounds from a test set.

Table 1
Experimental values of NBP (°C) for the training set (82 compounds).

Table 2
Experimental values of NBP (°C) for the test set (82 compounds).

Figure 1. Representation of the two neighbors of C.
If equation (1) is false, then A and B are incomparable and cannot be assigned a mutual order.

Table 3
Illustrative example with experimental intervals for ten ranking subsets obtained with the descriptor DCW1.

Table 4
Experimental intervals for the training set derived from the ranking subsets of model nets including different numbers of descriptors.

Table 5
Experimental intervals for the test set derived from the ranking subsets of different model nets.

Table 6
Illustrative example with experimental intervals for ten ranking subsets obtained with the descriptors DCW1 and Qindex.