New similarity-based algorithm and its application to classiﬁcation of anticonvulsant compounds

A similarity-based algorithm based on a previously developed model is applied in the classiﬁcation of two sets of anticonvulsant and non-anticonvulsant drugs. Each set is composed of a) anticonvulsant compounds that have shown moderate to high activity in the Maximal Electroshock Seizure (MES) test and b) drugs with other biological activities or poor activity in the MES test. The results from the analysis of variance (ANOVA) indicate that the proposed algorithm is able to differentiate anticonvulsant from non-anticonvulsant drugs. The proposed model may then be useful in the identiﬁcation of new anticonvulsant agents through virtual screening of large virtual libraries of chemical structures


Introduction
Despite the continuous development of 3D and 4D QSAR methodologies, 2D-descriptors still remain among the most widely used descriptors in pharmaceutical research today [1][2][3][4][5]. This is, in part, because 3D descriptors depend on the conformation used to represent the molecule, which not necessarily corresponds to the conformation responsible for activity, which may be sometimes difficult to define [6 -8]. Furthermore, although their physicochemical meaning is not always clear, 2D descriptors do not depend on the molecule conformation, have a low computational expense and can be easily calculated for all the existing, new, and in-development chemical structures [6,9].
There are, essentially, two major types of molecular descriptors [10]. The first of them has been defined as "holistic". Descriptors of this class are numbers that usually represent some important physical property of the whole molecule. These descriptors are either empirical or theoretical. Examples that could be mentioned among many others are: octanol-water partition coefficient, [11] Kier Shape Index and [12] the shape indices of Hopfinger [13] [14].
The other descriptor category consists of the socalled chemical substructures [1,10]. They could be interpreted as subgraphs of the graph that is associated to a molecular structure. These subgraphs are labeled or colored in a way that reflects information such as: atomic species, bond types and connectivity of two or more atoms from the structure. Some examples of descriptors in this category are atom pairs (AP), [1,14,15] topological torsions (TT), [1,10] atom sequences (AS) and the Index of Electrotopological state [15,16]. Chemical substructures are particularly useful in the generation of similarity algorithms for retrieval systems and compound classification [17,18], which are commonly used in virtual screening of large chemical structure libraries. One of the main advantages of similarity algorithms in pharmaceutical discovery is that little information is needed to formulate a reasonable query; similarity methods can be applied at the beginning of a drug discovery project when there is little or no information available on the molecular target [6]. We present a similarity-based algorithm generated on the basis of the one previously developed by Bagchi and Maiti [19]. We have applied the proposed model in the classification of two sets of molecules composed by both anticonvulsant and nonanticonvulsant drugs.

Motivation
Epilepsy is one of the most common neurological disorders that affect human condition. It is characterized by recurrent seizure attacks that range from a brief lapse of attention to severe, frequent convulsions due to synchronous neuronal firing. About 50 million people in the world suffer from epilepsy, especially in childhood, adolescence and old age [20,21]. Available treatment fails to control epilepsy in up to 30% of the patients. In non-developed countries about three-fourths of patients do not receive the treatment they need [21]. Moreover, even the new generation of antiepileptic drugs (AEDs) causes important side effects, including headache, drowsiness, diplopia, ataxia, dizziness, nausea, allergies, blood dyscrasias and hepatotoxicity [22]. Thus safer, more efficient, less toxic new AEDs are urgently needed.

Atom pair definition
In 1985 Carhart et al. [15] defined an atom pair as a substructure composed of two non-hydrogen atoms and an interatomic separation.

ðatom1descriptionÞÿðseparationÞÿðatom2descriptionÞ
"Separation" is measured as the number of atoms in the shortest bond-by-bond path that contains atoms 1 and 2. Since the "topological distance" is defined as the number of edges, by the shortest bondby-bond path, between two atoms [9], Carhart separation corresponds to the topological distance plus one. "Description" of each atom informs its chemical type, the number of non-hydrogen atoms attached to it and the number of bonding p electrons it bears. An asterisk or dot following an atom symbol represents the presence of a bonding p electron. Xn following an atom symbol indicates the presence of n non-hydrogen neighboring atoms.
For any given chemical structure denoted by j, the total number of APs d(j) that can be derived from it is obtained through the following expression: n being the number of non-hydrogen atoms present in the compound j.

Pharmacophore definition
Pharmacophore models are hypotheses on the 3D arrangement of structural properties, such as hydrogen bond donor and acceptor properties, hydrophobic groups and aromatic rings of compounds that bind to a biological target [23,24]. Our group has identified a pharmacophore related to anticonvulsant activity mediated by sodium channels blockade [25,26] through fitting in SYBYL 6.6 [27]. A model of the identified pharmacophore can be seen in Figure 1.
In the present work we have used information derived from this identified pharmacophore to generate a new mathematical model, specific for the search for potential new anticonvulsants, on the basis of the one from Bagchi and Maiti [19].

On similarity determination
There are two major classes of similarity coefficients: association and distance coefficients. The essential difference between them is that the latter considers the common absence of certain structural features as the evidence of similarity between two chemical structures, whereas the former does not. Mentionable examples of association coefficients are Tanimoto, Dice and Cosine coefficients. Examples of distance coefficients are Hamming and Euclidean distances [15,18].
The work from Chen et al. [15] indicates that the Tanimoto coefficient shows better performance than the Euclidean distance in 2D fragment-based similarity searching. Based on this information, we decided to use the Tanimoto coefficient as the similarity measure in the proposed model. We have tested the model with the Tanimoto coefficient in both its binary (Equation 2) and algebraic (Equation 3) forms: where a is the number of types of fragments in compound A; b is the number of types of fragments in compound B and c is the number of types of fragments shared by compounds A and B. where na i is the number of times that the ith AP appears in the AP set derived from compound A and nb i is the number of times that the ith AP appears in the AP set derived from compound B.

Construction of the compounds sets
Two sets of compounds were constructed. Set A includes 12 compounds with high and moderate activity in the MES test (ED 50 below 100 mg/kg, in mice) as well as 11 drugs with other therapeutic uses, extracted from Merck Index 13 th [28]: aspirin (analgesic), abikoviromycin (antibiotic), oxantel (antiparasitary), lindane (pediculicide), acipimox (antihyperlipoproteinemic), acecarbromal (hypnotic), etanidazole (antineoplasic), ABT-594 (analgesic), bergapten (antipsoriatic), methizasone (antiviral) and methocarbamol (muscle relaxant). Two anticonvulsants with poor activity in the MES test were also included (ethosuximide and TV-1901). The anticonvulsant agents included in Set A ( Figure 2) are structurally diverse. Carbamazepine was chosen as the reference drug of this set, this is, the structure to which all the other molecules in the set will be compared through the proposed algorithm. Carbamazepine's measured activity in the MES test is the highest among set A (ED 50 ¼ 37 mmol/kg ip) [25]. Set B (Figure 3) is defined by 26 compounds. Besides the non-anticonvulsant drugs already used in set A, it includes 13 anticonvulsants: one benzoquinoxaline derivative (NBQX), nine benzodiazepine and tetrahydroisoquinoline derivatives selected from a set of drugs tested in the MES test [29,30], as well as carbamazepine and valpramide. THIQ -10c, the compound with the highest measured activity in the MES test among set B (ED 50 ¼ 5.17 mmol/kg) was chosen as reference drug. Note that the anticonvulsants in set B are structurally less heterogeneous than those in set A.
The structures of each set with high and moderate anticonvulsant activity in the MES test will be referred to as active compounds from now on. This category is presented in black in Figures 2 and 3. On the other hand, the compounds with other therapeutic uses or poor anticonvulsant activity in the MES test are presented in grey and will be referred to as inactive compounds.

Construction of the new model
We propose the following model:  A is a measure of the expected relative activity to a reference drug; this is chosen as the most active structure of the working set (carbamazepine for set A and THIQ-10c for set B). S is the binary form of the Tanimoto similarity coefficient. ni represents the number of ith AP in the reference drug; Dni stands for the difference of ith AP between the reference drug and the tested drug; l i is the separation in the ith AP and a is the separation in the longest AP that can be derived from the identified pharmacophore (or, in other words, a represents the maximal topological distance in the pharmacophore plus one). For the term between brackets we consider only those types of AP that contain heteroatoms, based on the common knowledge that bioactive compounds tend to present heteroatoms and heterocycles and that functional groups involved in non-covalent short-distance interactions are essential for the binding of the drug to its site of action. b and c are constants which moderate the influence over the predicted relative activity of the similarity coefficient and the second term between brackets.
The value of each one of the terms in the sum between brackets can be interpreted as the contribution of the ith type of AP to the activity of the structure. Whether these terms are positive or negative (adding to or reducing the relative activity of the molecule that is being compared) is defined by the Dni factor, which is the only one among the factors involved in the model that can take values either greater or smaller than zero. Since the reference drug of each set is chosen as the most active structure in it, the types of AP that are present in the reference drugs can be thought of as desired features involved in the interaction of the molecule and its site of action. This was one of the fundamental hypotheses in the development of our model. Therefore, if the number of APs of the ith type is greater in the reference drug than in the tested compound, then the tested molecule is lacking a desired feature and the fraction of the activity that can be explained, through that particular element will be smaller in the tested compound than in the reference one. If the number of ith type elements is greater in the tested compound than in the reference drug, then the tested compound has more of a desired characteristic and a greater probability of this feature being expressed at the time of interacting with the site of action.
However, are all the types of atom pairs equally significant to the activity of a particular structure? Since the reference is the drug with the highest activity among the available structures, we believe that the more times a particular type of AP appears in the reference structure, the more important is the contribution of that type of AP to the activity. Thus, our model includes the factor ni in the numerator of the second term between brackets. We also believe that, having indentified a specific pharmacophore, we should give more attention to the types of AP that somehow accomplish the characteristics of the pharmacophoric pattern. The topological distances are one of the features of the pharmacophore that can be considered. We decided to give prominence to those types of AP that include heteroatoms and a separation of six, since six is the maximum number of atoms comprised in the possible AP derived from the identified pharmacophore for anticonvulsant drugs with activity in the MES test.
Although the pharmacophore itself could be used as a 3D-search query in virtual screening for new anticonvulsant agents through compound databases, this requires the previous generation of reasonable, low-energy conformers of the molecules in the database to be screened. Our model presents the advantage that, since only topological, 2D-features of the pharmacophore are considered, the results of the query do not depend on the conformation of the molecules from the database being screened.

Application of the models
The AP set derived of each of the compounds considered for this study was generated. Table I shows the complete sets of AP for carbamazepine and THIQ-10c, the reference compounds of set A and set B, in that order. Once we had knowledge of all the AP sets the similarity coefficients of Tanimoto were calculated as specified in Equations (2) and (3); the values of these coefficients for each structure are shown in Table III. In order to determine the common heteroatom APs, the APs of each structure was compared with the APs of the corresponding reference drug. Different values for b and c were tested for each set of compounds. b was given values of 0.5, 0.75 and 1. c was given values of 50 and 75. As an example, Table II shows the types of heteroatom AP common to carbamazepine and phenytoin, and the occurrence of each type in each one of these structures. The calculated value for the negative term between brackets in the proposed model is also shown; c was chosen as equal to 75 for the example presented in Table II. Once we calculated the value of A for the different forms of model (4), analysis of variance (ANOVA) was performed in order to determine if the proposed model could be used to classify anticonvulsant and non-anticonvulsant agents. In other words, it was checked if there was a statistically significant difference between the mean values of A for the active and the inactive categories of both set A and B. It was also verified if that difference was either similar or higher, in terms of statistical significance, than the difference between the mean values of the Tanimoto coefficients for both groups (active and inactive) and both sets.
Scatter plots of log ED 50 in the MES test versus A (obtained through Equation (5) using Tanimoto coefficient in its algebraic form) were drawn in order to determine if there was correlation between the   activity predicted with our model and the experimental activity measured in the MES test, for both sets A and B. For comparison purposes, the scatter plots of log ED 50 versus the Tanimoto coefficient in its algebraic form were also drawn. The model was finally validated with a set of ten compounds with bioactivities different from anticonvulsant activity ( Figure 6). Table IV shows the values of A for one of the forms of the proposed model that showed good performance in both sets A and B, both with the binary and algebraic forms of the Tanimoto coefficient: ð5Þ Table V shows the results from the ANOVA test for the different forms of the proposed model that were evaluated and for the bare Tanimoto coefficient (in its binary and algebraic form). We present the value of the F Statistical (which represents the ratio between the variance between-groups and the variance withingroup) and the p-value (which indicates the level of significance of the difference between the mean values of the considered categories). It can be observed that  the performance of various forms of the proposed model in the classification of compounds from set A is similar to that of the Tanimoto coefficients, while in the case of set B the difference between the mean value of A is more significant than the difference between the mean values of the Tanimoto coefficients for each considered category. Taking into account that most of the anticonvulsants which compose set B are structurally related to THIQ 10c, results seem to indicate that the proposed model is effective in the identification of anticonvulsant compounds whose structure is close-related to the structure of the compound chosen as a reference (Figure 4). Figure 4 also shows that the difference between the categories is statistically much more significant when using in the model the Tanimoto coefficient in its algebraic form. This reflects the fact that the differences between the mean values of the Tanimoto coefficient for the active and inactive categories is also much greater for Set B for the algebraic than for the binary form. Figure 5 shows that there are strong correlations between the measured log ED 50 in the MES test and the values of A. The correlation coefficients r for both sets A and B (0.8149 and 0.9055, in that order) are higher than those obtained when plotting log ED 50 versus the algebraic form of the Tanimoto coefficient (0.7962 and 0.8568). Table VI shows the results from the validation of model (5) (using the algebraic form of the Tanimoto coefficient and taking THIQ10c as the reference compound). Nine of the ten inactive compounds have low A values that fall inside or below the 0.95  confidence interval for the inactive category (see Figure 4). The A value of the remaining compound (lefetamine), although higher, falls below the 0.95 confidence interval for the active category.

Conclusions
A new similarity-based algorithm for the classification of anticonvulsant and non-anticonvulsant drugs is introduced and applied in the classification of two sets of compounds, both composed of anticonvulsant and non-anticonvulsant agents in the MES test. The ANOVA test showed statistically significant differences between the mean values of the algorithm for each considered category. Results indicate that the proposed algorithm performs better in the identification of anticonvulsant compounds structurally close-related to the reference compound chosen for the query. The algorithm could be applied in the search for new anticonvulsant agents through virtual screening.