QSPR modeling the aqueous solubility of alcohols by optimization of correlation weights of local graph invariants

Optimization of correlation weights of local graph invariants is an approach to model molecular properties and/or activities of chemical and/or biological interest. The essence of the approach may be described in three main steps: first, a descriptor that is a function of the weights of local graph invariants is defined by a suitable choice from the pool of molecular descriptors; second, the correlation weight values that produce the largest possible correlation coefficient between the selected property values and the descriptor under consideration are calculated by a Monte Carlo optimization procedure (the correlation coefficient is used as the objective function); third, a relationship of the form property = C_0 + C_1 × descriptor is calculated and validated with the structures of a training set by the standard least squares method. We obtain quite satisfactory results using this procedure to model the aqueous solubility of alcohols, with the following statistical characteristics: n = 30, r = 0.9843, s = 0.176, F = 870 (training set); n = 33, r = 0.9965, s = 0.0902, F = 4456 (test set); n = 63, r = 0.9931, s = 0.121, F = 4379 (complete set of alcohol molecules).

The importance of this sort of calculation rests on the fact that today more than 100,000 chemical substances are in use, and they constitute a potential risk to environmental health. It is not practically possible to generate all the experimental input necessary for the risk assessment of these compounds. Thus, part of the information concerning their fate and effect in the environment must be obtained from models, e.g., through comparison with structurally related, well-investigated compounds [18].
Alcohols are toxic materials and thus represent dangerous environmental pollutants, especially when a mishap occurs and large quantities of alcohols are accidentally released into the environment. The first step in the polluting action of alcohols is their dissolution in water. Methanol, ethanol, and propanol mix with water in any ratio, whereas this is not so for other alcohols, mainly the higher ones [19]. Aqueous solubility is a particularly important property of organic compounds, especially in the fields of pharmaceutical chemistry, biological chemistry, and environmental science. It is also helpful in understanding drug transport and environmental impact. Alcohols are also technologically important materials and are used in the manufacture of a large number of products. Again, their usefulness in this respect depends, among other things, on their solubility in water [20]. Of all the molecular properties that can profoundly affect a compound's biological activity, aqueous solubility is probably one of the most fundamental and deserves attention in the early phases of drug discovery. Not surprisingly, therefore, aqueous solubility has been extensively studied, and a large number of computational methods for the estimation of this highly important property have been reported [21,22].
This paper is organized as follows: the next section deals with the method and gives a succinct description of the computational procedure; we then present and discuss the results, comparing them with previously published data; and in the final section we state the general conclusions derived from this study.
This approach has been tested with the water solubility of 63 alcohols taken from [1]. The correlation weights (CWs) for calculating the descriptors have been obtained by means of the Monte Carlo method [3]. The descriptors are calculated from the following quantities: a_k is the chemical element represented by the k-th vertex of the graph, ECX_k is the X-th order Morgan extended connectivity value of the k-th vertex, and CW(a_k) and CW(ECX_k) are the CWs corresponding to these local graph invariants in labeled hydrogen-filled graphs (LHFGs).
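As an illustration of the quantities just defined, the following minimal sketch computes Morgan extended connectivity values on a hydrogen-filled graph and combines correlation weights into a descriptor. It assumes the usual recursive definition (order 0 is the vertex degree, and order X sums the order X-1 values over the neighbours) and, as a labeled assumption, a multiplicative combination of the weights; the function names, the methanol encoding, and the numerical weights are hypothetical and are not taken from Table 2.

```python
from math import prod

def extended_connectivity(adjacency, order):
    """Morgan extended connectivity of every vertex.

    Order 0 is the vertex degree; each higher order is the sum of the
    previous-order values over the neighbours of the vertex.
    """
    ec = {v: len(nbrs) for v, nbrs in adjacency.items()}
    for _ in range(order):
        ec = {v: sum(ec[u] for u in nbrs) for v, nbrs in adjacency.items()}
    return ec

def dcw(elements, adjacency, order, cw_element, cw_ec):
    """Descriptor built from correlation weights of local invariants.

    ASSUMPTION: the weights are combined as a product over all vertices;
    the functional form used in the paper is not reproduced here.
    """
    ec = extended_connectivity(adjacency, order)
    return prod(cw_element[elements[v]] * cw_ec[ec[v]] for v in adjacency)

# Methanol, CH3OH, as a hydrogen-filled graph (hydrogens are vertices too).
elements = {0: "C", 1: "O", 2: "H", 3: "H", 4: "H", 5: "H"}
adjacency = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}

ec1 = extended_connectivity(adjacency, 1)
# ec1 == {0: 5, 1: 5, 2: 4, 3: 4, 4: 4, 5: 2}

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
cw_element = {"C": 1.0, "O": 2.0, "H": 1.0}
cw_ec = {5: 1.0, 4: 1.0, 2: 0.5}
value = dcw(elements, adjacency, 1, cw_element, cw_ec)
```

An additive combination of the same weights would only require replacing the product by a sum; the local invariants themselves are unaffected by that choice.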

Results and discussion
Results of three OCWLGI probes based on DCW(EC0), DCW(EC1), and DCW(EC2), carried out in order to validate the reproducibility of the scheme, are presented in Table 1. The statistical characteristics of the OCWLGI models are the same across these optimization runs (see Table 1). The DCW values for each alcohol are shown in Table 3; they have been calculated with the CWs of the first OCWLGI probe, presented in Table 2.
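The idea of repeating the optimization in several probes can be sketched as follows. This is a minimal illustration, not the procedure of [3]: it assumes a simple accept-if-better random perturbation of one weight at a time, a multiplicative descriptor, and the absolute value of the correlation coefficient as the objective. The molecules are encoded only by hypothetical element labels, and the property values are made up for the example.

```python
import random
from math import prod, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def descriptor(invariants, cw):
    """ASSUMPTION: product of the weights of all local invariants."""
    return prod(cw[i] for i in invariants)

def optimize_cw(molecules, prop, seed, sweeps=200, step=0.1):
    """One probe: perturb each weight in turn, keep only improvements."""
    rng = random.Random(seed)
    labels = sorted({i for m in molecules for i in m})
    cw = {lab: 1.0 for lab in labels}

    def score(weights):
        return abs(pearson_r([descriptor(m, weights) for m in molecules], prop))

    best = score(cw)
    for _ in range(sweeps):
        for lab in labels:
            old = cw[lab]
            cw[lab] = old * (1.0 + rng.uniform(-step, step))
            new = score(cw)
            if new > best:
                best = new      # accept the perturbed weight
            else:
                cw[lab] = old   # reject and restore
    return cw, best

# Toy training set: saturated alcohols with 4 to 9 carbons, encoded by
# element labels only. The property values are invented for illustration.
molecules = [["C"] * n + ["H"] * (2 * n + 2) + ["O"] for n in range(4, 10)]
prop = [0.1, 0.5, 1.1, 1.6, 2.3, 3.1]

for seed in (1, 2, 3):  # three independent probes
    weights, r = optimize_cw(molecules, prop, seed)
    print(f"probe {seed}: |r| = {r:.4f}")
```

If the scheme is reproducible, the probes should end with similar correlation coefficients even though the individual weights may differ from run to run.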
The linear models for water solubility, based on weighting the presence of chemical elements together with the extended connectivity of first order (EC1) and obtained with the least squares method, are presented below. We find it worth mentioning that computer algebra systems (CAS) such as Derive [23] prove extremely useful for the calculations presented in this paper. Although CAS may not be suitable for massive computations because they are slow, they are extremely useful for algebraic and numerical investigations of relatively small problems because they allow us to set arbitrary precision, and even to carry out multivariate regression in exact rational arithmetic, thus avoiding any round-off errors.
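The benefit of exact rational arithmetic can be illustrated without a CAS. The sketch below uses Python's standard `fractions` module in place of Derive (a substitution made only for this example) to fit property = C_0 + C_1 × descriptor by least squares with no round-off; the function name and the data are hypothetical.

```python
from fractions import Fraction

def exact_linear_fit(x, y):
    """Least-squares intercept and slope in exact rational arithmetic.

    Inputs are given as decimal strings so that each value is converted
    to a Fraction without any binary floating-point error.
    """
    xs = [Fraction(v) for v in x]
    ys = [Fraction(v) for v in y]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(a * a for a in xs)
    sxy = sum(a * b for a, b in zip(xs, ys))
    c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    c0 = (sy - c1 * sx) / n                         # intercept
    return c0, c1

# Made-up descriptor and property values.
c0, c1 = exact_linear_fit(["1.0", "2.0", "3.0", "4.0"],
                          ["0.9", "2.1", "2.9", "4.1"])
print(c0, c1)  # prints: -1/10 26/25
```

The coefficients come out as exact fractions, so any later rounding is a matter of presentation rather than of the calculation itself.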
In these models, n is the number of alcohols, r is the correlation coefficient, s is the standard error of estimation, F is the Fisher F ratio, r_cv is the leave-one-out cross-validated correlation coefficient, and s_cv is the leave-one-out cross-validated standard error of estimation. log(1/S) Calc. Eq. (3) stands for the log(1/S) value calculated using Eq. (3) and descriptors from the test set.
One can see from Table 1 that the best results have been obtained with DCW(EC1), because this descriptor gives a better model for the test set than the other ones. The CWs of local graph invariants for water solubility modeling with DCW(EC1) are presented in Table 2.
We have tried several partitions of the molecules into training and test sets, but the final results do not depend significantly on the way they are chosen, so we present results for one particular case.
The comparison of the statistical characteristics of Eq. (3) with those derived from the three-variable model of the aqueous solubility of alcohols published previously [1] (n = 63, r = 0.9874, s = 0.1669) shows that the one-variable model of Eq. (3) is better than the three-variable model at predicting the water solubility data taken from [1]. Besides, our results for the test set are true predictions, because only molecules belonging to the training set were employed to determine Eq. (3). When using the entire molecular set (i.e., 63 molecules) as in [1], we also obtained better results (i.e., r = 0.9908 vs. r = 0.9874). The results for the prediction of log(1/S) for the entire molecular set are given in Table 3 together with the corresponding deviations from the experimental data.
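For a one-variable linear model these statistics can be computed as in the following minimal sketch. The paper does not state its formulas, so the definitions used here are assumptions: s is taken as sqrt(SSE/(n-2)), F as r²(n-2)/(1-r²), r_cv as the correlation between observed and leave-one-out predicted values, and s_cv as sqrt(PRESS/(n-2)). The function names and the data are hypothetical.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least-squares intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    c1 = sxy / sxx
    return my - c1 * mx, c1

def pearson_r(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def f_from_r(r, n):
    """Fisher ratio of a one-variable linear regression."""
    return r * r * (n - 2) / (1.0 - r * r)

def regression_statistics(x, y):
    n = len(x)
    c0, c1 = fit_line(x, y)
    sse = sum((b - (c0 + c1 * a)) ** 2 for a, b in zip(x, y))
    r = pearson_r(x, y)
    # Leave-one-out: refit without point i, then predict point i.
    loo = []
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        loo.append(a0 + a1 * x[i])
    press = sum((b - p) ** 2 for b, p in zip(y, loo))
    return {
        "n": n,
        "r": r,
        "s": sqrt(sse / (n - 2)),
        "F": f_from_r(r, n),
        "r_cv": pearson_r(y, loo),
        "s_cv": sqrt(press / (n - 2)),
    }

# Made-up descriptor and property values.
stats = regression_statistics([1.0, 2.0, 3.0, 4.0, 5.0],
                              [1.1, 1.9, 3.2, 3.9, 5.1])
```

As a consistency check on the assumed F formula, the training-set values n = 30 and r = 0.9843 give F ≈ 871, in line with the reported F = 870 once the rounding of r is taken into account.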
To improve the results derived from the simple linear Eq. (3), we have taken into account that simple linear regressions involving a single descriptor restrict the regression analysis considerably [24]. Many correlations, particularly those involving molecules of different size, need not be linear. But even for molecules of the same or similar size, a quadratic regression may describe the relationship better than a simple linear model. In general, one should test single-descriptor regressions for quadratic dependence and, if warranted, for higher-order polynomial relationships or other functional dependence. Finding the best functional form for such regressions, besides being far from trivial, may also yield non-unique alternatives. In some cases, the difference in the quality of a regression between different forms is too small to be significant, which leaves the question of functional form open.
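A test for quadratic dependence of this kind can be sketched as below: the same data are fitted with polynomials of degree 1 and 2, and the standard errors of estimation are compared. This is only an illustration under stated assumptions; the normal-equation solver, the function names, and the data are hypothetical, and the small Gauss-Jordan routine is adequate only for low degrees and well-scaled descriptors.

```python
from math import sqrt

def polyfit(x, y, degree):
    """Least-squares polynomial coefficients, lowest power first.

    Builds the normal equations and solves them by Gauss-Jordan
    elimination with partial pivoting.
    """
    m = degree + 1
    a = [[sum(v ** (j + k) for v in x) for k in range(m)]
         + [sum(w * v ** j for v, w in zip(x, y))]
         for j in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(m):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [p - factor * q for p, q in zip(a[row], a[col])]
    return [a[j][m] / a[j][j] for j in range(m)]

def standard_error(x, y, coeffs):
    """sqrt(SSE / (n - number of fitted coefficients))."""
    sse = sum((w - sum(c * v ** p for p, c in enumerate(coeffs))) ** 2
              for v, w in zip(x, y))
    return sqrt(sse / (len(x) - len(coeffs)))

# Made-up data with visible curvature.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 3.4, 7.1, 11.4, 17.2, 23.4]

s_lin = standard_error(x, y, polyfit(x, y, 1))
s_quad = standard_error(x, y, polyfit(x, y, 2))
print(f"linear s = {s_lin:.3f}, quadratic s = {s_quad:.3f}")
```

Because the standard error divides by the residual degrees of freedom, the extra coefficient of the quadratic model is penalized, so a clearly smaller quadratic s suggests that the curvature is real rather than an artifact of the additional parameter.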
We have tried higher-order polynomial expressions, but the results do not differ significantly from those obtained with the linear fitting polynomial [3]. However, when trying a particular alternative algebraic formula, an optimum exponential relationship, the final results improved significantly. Regression equations of this form were obtained for the different sets. The results derived from Eq. (6) are given in Table 3. The analysis of the data reveals that, for the training set, both approaches give similar predictions (average absolute deviation = 0.13), but for the true predictions (test set), Eq. (6) provides better results than Eq. (3) (average absolute deviations = 0.04 and 0.10, respectively).

Conclusions
Optimization of Correlation Weights of Local Graph Invariants (in LHFGs) may be used as a convenient tool for predicting the water solubility of alcohols. One must take into account that the present calculation scheme is based on one-variable equations, whereas previous results were derived from a three-variable fitting polynomial. Besides this, using a more general functional form for the fitting equation is convenient in order to obtain better predictions than those derived from a simple polynomial structure.
A possible way to improve the present results is to resort to different local graph invariants [25]. Research on this is under way, and results will be presented in the near future.

Table 2. Correlation weights of local graph invariants for water solubility modeling with the DCW(EC1)

Table 3. Experimental (taken from [1]) and calculated values of the aqueous solubility of alcohols expressed as log(1/S) (absolute deviations are given in parentheses)