Chirality in a quaternionic representation of the genetic code

A quaternionic representation of the genetic code, previously reported by the authors, is updated in order to incorporate chirality of nucleotide bases and amino acids. The original representation assigns to each nucleotide base a prime integer quaternion of norm 7 and involves a function that associates with each codon, represented by three of these quaternions, another integer quaternion (amino acid type quaternion) in such a way that the essentials of the standard genetic code (particularly its degeneration) are preserved. To show the advantages of such a quaternionic representation we have, in turn, associated with each amino acid of a given protein, besides the type quaternion, another real one according to its order along the protein (order quaternion) and have designed an algorithm to go from the primary to the tertiary structure of the protein by using type and order quaternions. In this context, we incorporate chirality in our representation by observing that the set of eight integer quaternions of norm 7 can be partitioned into a pair of subsets of cardinality four each, with their elements mutually conjugate, and by putting them in one-to-one correspondence with the two sets of enantiomers (D and L) of the four nucleotide bases adenine, cytosine, guanine and uracil, respectively. Thus, guided by two diagrams proposed for the evolution of the codes, we define functions that in each case assign an L- (D-) amino acid type integer quaternion to the triplets of D- (L-) bases. The assignment is such that, for a given D-amino acid, the associated integer quaternion is the conjugate of the one corresponding to the L enantiomer. The chiral type quaternions obtained for the amino acids are used, together with a common set of order quaternions, to describe the folding of the two classes, L and D, of homochiral proteins.


Introduction
Homochirality of nucleic acids and proteins is one of the attributes that characterize life on Earth [1,2]. At present, all living organisms on our planet have nucleic acids (DNA, RNA, etc.) whose nucleotide bases take, of their two possible chiral forms, only the one usually labelled D (for dextro, or right-handed); likewise, their proteins are chains of amino acids all of which are enantiomers of (exclusively) the class L (for levo, or left-handed), except for the amino acid glycine, which is not a chiral molecule. To understand why the combination D for nucleotide bases and L for amino acids (and not any other) occurs in living systems is one of the greatest quests of Biology. The attempts to answer this question involve either biotic or abiotic arguments. In general, the hypotheses of the first type assume that homochirality is determined by biological necessity and that it is the result of diverse selection mechanisms. Among the abiotic hypotheses we have those that propose a "frozen accident" as responsible for the homogeneous chirality and those that assume the existence of some asymmetric force that selects just one of the chiral forms [3][4][5]. The problem with these theories is that, in general, they cannot be experimentally checked. This difficulty is (at least partially) overcome by some hypotheses that claim that homochirality and the universal genetic code arose closely related; more precisely, that the genetic code, its translation direction and homochirality emerged through a common natural process of selection [6]. This approach has the advantage that it can incorporate the idea of an ancestral direct affinity between amino acids and nucleotide triplets, with the latter later acquiring new functions, within a more modern translation machinery, in the form of codons and anticodons.
The key point is that this early affinity between triplets of nucleotide bases and amino acids can at present be studied in standard laboratories by simply synthesizing small RNA oligonucleotides [7]. In this way the preference of D-base triplets for L-amino acids has been demonstrated [8][9][10][11][12][13][14][15][16]. Moreover, the non-biological affinity of L-codons for D-amino acids is also observed [8,16,17]. Besides, the hypothesis of coevolution of homochirality with the genetic code makes a number of additional predictions, such as the possibility that each codon can encode at least two different amino acids (according to the reading direction: 5′→3′ or 3′→5′), so the eventual existence of more than a single code becomes possible [16]. We must mention, however, that, despite the possible existence of ancestral L-ribonucleic sites for D-amino acids, there is also experimental evidence that D-amino acids cannot be incorporated into protein structures in present biosynthetic pathways. Specifically, chiral discrimination has been reported during the aminoacylation in the active site of the aminoacyl tRNA synthetase and also during the peptide bond formation in the ribosomal peptidyl transferase center [18,19].
The aim of this article is to show how a mathematical representation of the genetic code recently reported by us [20] can naturally incorporate chirality in such a way that the resulting description be consistent with many of the previous observations. Our original representation is guided by a diagram that we have proposed to sketch the evolution of the genetic code (see Figure 2 in next Section). The diagram is based on pioneering ideas by Crick [21,22] and includes the physical concept of broken symmetry [23][24][25] in a very simple form that resembles the energy levels of an atom. The representation uses Hamilton quaternions [26,27] as main tool. These mathematical objects are a sort of generalization of the complex numbers and obey an algebra in many aspects similar to theirs but with the very important (for our purposes) property that the product is, in general, non commutative. In addition, the quaternions are ideal for representing spatial rotations [28] with important advantages over the classical matrix representation.
In our quaternionic representation we assign to each amino acid in a given protein two quaternions: an integer one according to which of the 20 standard amino acids it is (type quaternion) and a real one that determines its order inside the protein primary structure (order quaternion). The type quaternions are obtained as a quaternionic function of the codons, where each of the three nucleotide bases is associated with an integer quaternion; the form of this function is inspired by the code evolution diagram. The order quaternions, on the other hand, play a fundamental role in relation to the folding of the protein [29,30].
The integer quaternions that we choose to associate with the nucleotide bases belong to a maximum-cardinality subset of the set

$$H_7(\mathbb{Z}) = \{(a_0, a_1, a_2, a_3) : a_0, a_1, a_2, a_3 \in \mathbb{Z};\ a_0^2 + a_1^2 + a_2^2 + a_3^2 = 7,\ a_0 > 0 \text{ and even}\}$$

with the property that it does not contain pairs of conjugate quaternions. The set $H_7(\mathbb{Z})$ has 7 + 1 = 8 elements [31] and so the chosen subset has 4 quaternions, as it should. This suggests partitioning the set $H_7(\mathbb{Z})$ in the form $H_7(\mathbb{Z}) = H_{7;D}(\mathbb{Z}) \cup H_{7;L}(\mathbb{Z})$ (where the four quaternions of $H_{7;D}(\mathbb{Z})$ and the four quaternions of $H_{7;L}(\mathbb{Z})$ are mutually conjugate) and associating the elements of $H_{7;D}(\mathbb{Z})$ and $H_{7;L}(\mathbb{Z})$ with the four D-nucleotide bases and the four L-nucleotide bases of D-RNA and L-RNA molecules, respectively. In this way, chirality can be included in our formalism.
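The counting argument above can be checked by brute force. The following is a minimal sketch in Python, not part of the original formalism: it enumerates $H_7(\mathbb{Z})$ from the definition and splits it into two mutually conjugate subsets. The names `H7`, `subset_one` and `subset_two` are hypothetical, and the splitting rule is an arbitrary illustrative choice; the partition actually used in the representation (Eqs. 4 and 5) may differ.

```python
from itertools import product


def conj(q):
    """Quaternion conjugate: keep the real part, negate the vector part."""
    a0, a1, a2, a3 = q
    return (a0, -a1, -a2, -a3)


# All integer quaternions of norm 7 with a0 > 0 and a0 even.
# Each squared component is at most 7, so the range -2..2 is enough.
H7 = [q for q in product(range(-2, 3), repeat=4)
      if sum(a * a for a in q) == 7 and q[0] > 0 and q[0] % 2 == 0]

# Illustrative split only (an assumption, not the paper's Eqs. 4 and 5):
# from each conjugate pair, keep the member that sorts first.
subset_one = sorted(q for q in H7 if q < conj(q))
subset_two = [conj(q) for q in subset_one]

print(len(H7))
print(subset_one)
print(subset_two)
```

Any rule that picks exactly one member from each of the four conjugate pairs would give an equally valid partition in this sketch.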
In the next Section two diagrams for the evolution of the genetic code are presented, assuming that the two possible combinations (D, L) and (L, D) for the chirality of base triplets and amino acids were present from the beginning. The diagrams freeze at a certain step, displaying two different genetic codes; the one that corresponds to the combination (D, L) gives the standard code of present-day living systems. Guided by these diagrams, in Section III we consider the quaternionic representation of both codes and assign type quaternions to L- and D-amino acids in such a way that, for a given D-amino acid, the associated integer quaternion is the conjugate of the one corresponding to the L enantiomer. Section IV is devoted to showing the advantages of the quaternionic representation by considering the folding of L- and D-proteins. Some remarks are finally made in Section V.

Chiral diagrams for the evolution of the genetic code
Here we generalize, by including chirality, the diagram proposed in Ref. [20] to describe the genetic code evolution. First we take into account that Miller-Urey-like experiments on the synthesis of organic molecules in a primordial environment [32,33] show the formation of amino acids in racemic mixtures. Also, as we have already mentioned, by studying the affinity of amino acids with small RNA oligonucleotides, the preference of D- and L-base triplets for L- and D-amino acids, respectively, has been observed [16], so we assume that, from the beginning, the two chirality systems (D-bases/L-amino acids) and (L-bases/D-amino acids) evolved independently of each other. Moreover, as we describe next, the proposed evolution diagrams for the two systems are very similar, since we assume that the changes in the corresponding genetic codes are basically independent of the chirality. Chirality just manifests itself in settling the two affinity systems.

Figure 1: [...] Note the achirality of the amino acid glycine (G). The translational direction considered in each case is also shown. Rectangles with four bases imply fourfold degeneration with respect to those ones, so at that moment each of the four considered amino acids was encoded by 16 triplets.
To construct our diagram of the code evolution in Ref. [20] we have followed Crick [22] and have considered that, in the first evolution steps, only the second base of the bases triplets was effective in codifying (binding) amino acids. Here we assume that this fact is valid for the L-bases triplets as well as for the D-ones. Accordingly only four L-(D-) amino acids could be codified, each one by one of the four enantiomers $C_I$ (I-cytosine), $G_I$ (I-guanine), $U_I$ (I-uracil) and $A_I$ (I-adenine) (I = D (L), respectively) independently of which the first and third bases are. In Figure 1, that sketches this first step of the codes evolution, this fact is denoted with a rectangle containing the four letters. This is consistent with Crick's suggestion that only a few amino acids were coded at the beginning.

Figure 2: Complete diagram of the genetic code evolution for the chiral combination D-bases/L-amino acids. The one letter convention for amino acids is used and, for simplicity, the chirality subindices in bases and amino acids have been dropped. The direction of the temporal evolution is from left to right. Rectangles with two or more bases imply degeneration with respect to those ones. The broken lines link different sets of codons that encode the same amino acid in the case of sixfold degeneration. Arrows and common lines indicate what codons follow codifying the same amino acid and what will start to codify a new one, respectively, after the symmetry is broken (see text). The natural numbers at the right of the amino acids give the temporal order of the amino acids in the Trifonov consensus scale [34].

According with the diagram, $C_I$ would codify J-alanine ($A_J$); $G_I$, glycine (G); $U_I$, J-valine ($V_J$) and $A_I$, J-aspartic acid ($D_J$) (I,J = D,L or L,D) whatever the first and third bases are. It is worth noting here that the four amino acids that we assume were the first ones to be codified are the first four in the Trifonov [34] consensus temporal order scale for the appearance of the amino acids (column of natural numbers in Figure 2). The four amino acids A, G, V and D were also the first four that appeared under simulation of the primitive earth conditions in Miller experiments [32,33]. We must also point out the two reading frames we are considering: 5′→3′ for D-triplets and 3′→5′ for L-triplets.
In figures 2 and 3 we show how the evolution follows for the pairs (D,L) or (L,D), respectively. Since confusion is not possible, we have ignored the subindices that denote chiral class for notation simplicity.
As the left (right) part of the diagram of Figure 2 (3) shows, our version of the primitive code is highly degenerate: in principle each of the four amino acids, $A_J$, G, $V_J$ and $D_J$, could be encoded by $4^2 = 16$ codons (see also Figure 1). Physically, the idea of degeneration is closely related to the concept of symmetry, and a very illustrative way to think about these concepts is by analogy with the energy levels of an atom. In our case we would have four levels, each one indexed by the letter corresponding to the second codon base, say $C_I$, $G_I$, $U_I$ and $A_I$ (main quantum number). We thus assume that, as the code evolves, the symmetry that makes the amino acid codification independent of the first (third) base of the codon disappears. Because of this symmetry breaking, a part of the degeneration also disappears. In the diagrams each of the four initial levels splits into four new levels, one for each of the possible bases ($C_I$, $G_I$, $U_I$ and $A_I$) at the first (third) place of the codon (secondary quantum number). Now we have a total of 16 levels, each one indexed by two letters (the first and second (second and third) bases of the codon). Each level is fourfold degenerate in the third (first) base of the codon. One of the new levels continues codifying the same amino acid as before the level split, whereas the other three codify a new amino acid each. We indicate with an arrow the four groups of codons that conserve the amino acid and with a simple line those that substitute the amino acid by a new one.
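The primitive stage described above can be illustrated with a short Python sketch. It groups the 64 codons into four levels by their second base and checks which second base the present-day codons of alanine, glycine, valine and aspartic acid carry. The table is the standard genetic code written in the usual compact form (codons ordered U, C, A, G); the names `CODE`, `levels` and `second_base` are introduced here for illustration only, and chirality indices are ignored.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered by first, second and third base
# in the order U, C, A, G; '*' denotes the stop signal.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Primitive stage: only the second base matters, so each level contains
# every codon sharing that base, whatever the first and third bases are.
levels = {b: [c for c in CODE if c[1] == b] for b in BASES}

# Second base of the present-day codons of the four earliest amino acids.
second_base = {aa: {c[1] for c, a in CODE.items() if a == aa}
               for aa in "AGVD"}

print({b: len(cs) for b, cs in levels.items()})
print(second_base)
```

In this sketch each level holds 16 codons, and each of the four amino acids keeps a single second base in the present-day code, which is the feature the primitive levels are built on.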
As the code keeps evolving it undergoes new symmetry breakings, so that the third (first) base of some codons comes into use or, in the atomic analogy, some of the fourfold degenerate levels split into two levels, each one twofold degenerate. Those levels marked with an arrow continue codifying the same amino acid, whereas the other levels substitute it for a new one. Eventually, in subsequent steps, a few of the twofold degenerate levels split once more, giving two non-degenerate levels each. This is the case of the codons that codify methionine ($M_J$), tryptophan ($W_J$) and (again) the stop signal. The case of isoleucine ($I_J$) is a particular one, since the split level coincides with the twofold one which represents the two codons that continue codifying the same amino acid. In this way, isoleucine is the only amino acid which is coded by three codons. The stop signal is also threefold degenerate, since it is coded by two groups of codons, one twofold degenerate and the other one non-degenerate. At this step of the evolution the code freezes to give what would be its present form. It is worth mentioning that the code evolution gives as a particular result that the amino acids serine ($S_J$), arginine ($R_J$) and leucine ($L_J$) are (would be) at present coded by two groups of codons each. In the three cases one of the groups is fourfold degenerate and the other one is twofold degenerate, so that these amino acids are the only three which are sixfold degenerate. We point out this property in the diagrams with a broken line linking the two groups of codons. The two groups of codons that codify the stop signal are also linked by a broken line. We remark again the similarity of the two evolution diagrams, which means that, in our description, the symmetry breaking and the code freezing must depend on causes other than chirality.
We observe that the diagram of Figure 2 is consistent with the present-day standard genetic code (Figure 4). The diagram of Figure 3, on the other hand, gives the hypothetical code shown in Figure 5. Actually, this code is not observed in Nature. We would argue that, although our world (and perhaps the whole Universe) has the right symmetry for the existence of at least two systems of homochirality, say (D-nucleic acids and L-proteins) and (L-nucleic acids and D-proteins), the refinements in the decoding and synthesis machineries of living organisms, maybe looking for better robustness and optimization of error-correcting tools, have broken that symmetry by discriminating between the two systems along the evolution, causing the disappearance of the second one from the Earth.
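The degeneracies quoted above for the frozen code can be verified directly on the standard genetic code. The following Python sketch counts the codons per amino acid; the compact table is the standard code with codons ordered U, C, A, G, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered by first, second and third base
# in the order U, C, A, G; '*' denotes the stop signal.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons that encode each amino acid (or the stop signal).
degeneracy = Counter(CODE.values())

sixfold = sorted(aa for aa, n in degeneracy.items() if n == 6)
threefold = sorted(aa for aa, n in degeneracy.items() if n == 3)
single = sorted(aa for aa, n in degeneracy.items() if n == 1)

print(sixfold, threefold, single)
```

The three lists collect, respectively, the sixfold degenerate amino acids, the threefold degenerate entries (isoleucine and the stop signal) and the amino acids coded by a single codon.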

Quaternionic representation of the genetic code and chirality
Based on the diagrams of Figures 2 and 3, we propose, within a common formalism, quaternionic representations of the genetic codes corresponding to both homochiral combinations (D,L) and (L,D) according to the scheme [...] with I,J = D,L and L,D, where $H(\mathbb{Z})$ denotes the set of integer quaternions (Lipschitz integers), and [...] Here [...] In what follows, in order to simplify the notation, we assign natural numbers to identify the bases and the amino acids: [...] by Eqs. (7) and (8):

Stop$_L$ → $\alpha_{21L} = q_{3D}\, q_{2D} + \gamma_{2D;24} + \delta_{2D;4} = q_{3D}\, q_{4D} + \gamma_{4D;24}$ (β = 3, γ = 2, δ = 4 or γ = 4, δ = 2,4)

P$_D$ → $\alpha_{1D} = q_{1L}\, q_{1L}$ (β = 1,2,3,4, γ = 1, δ = 1)

Stop$_D$ → $\alpha_{21D} = q_{2L}\, q_{3L} + \gamma_{2L;24} + \delta_{2L;4} = q_{4L}\, q_{3L} + \gamma_{4L;24}$ (β = 4, γ = 2, δ = 3 or β = 2,4, γ = 4, δ = 3)

The importance of working with objects that obey a noncommutative algebra is evident from these functions, since otherwise amino acids $A_J$ and $R_J$, and also $S_J$ and $L_J$, would have the same quaternion associated with them. The expression for the amino acid glycine (G) takes into account that it is not chiral. The factor 7 has to do with the facts that: a) the norm of the type quaternions can roughly be taken as a measure of the information needed to codify the corresponding amino acid, in the sense that the larger the norm, the larger the necessary information (see Ref. [20]); b) G is fourfold degenerate and the norm for all the other amino acids which are fourfold degenerate is 49 (see Eqs. 7-10).
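The role of noncommutativity and conjugation can be made concrete with a small Python sketch. It multiplies two norm-7 integer quaternions in both orders and checks two standard facts of quaternion algebra: the norm is multiplicative (so a product of two base quaternions has norm 49), and the conjugate of a product is the product of the conjugates in reversed order. The quaternions `p` and `q` below are arbitrary elements of norm 7 chosen for illustration; they are not claimed to be the ones assigned to any particular base.

```python
def qmul(p, q):
    """Hamilton product of two quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = p
    b0, b1, b2, b3 = q
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


def conj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])


def norm(q):
    """Norm as used in the text: the sum of the squared components."""
    return sum(a * a for a in q)


# Two hypothetical base quaternions of norm 7 (illustrative choice only).
p = (2, 1, 1, 1)
q = (2, 1, -1, 1)

pq = qmul(p, q)
qp = qmul(q, p)

print(pq, qp)                           # the two orders differ
print(norm(pq), norm(qp))               # both products have norm 7 * 7
print(conj(pq) == qmul(conj(q), conj(p)))
```

The last identity suggests why a conjugate type quaternion for a D-amino acid goes together with reading the L-base triplet in the opposite direction: conjugating a product of base quaternions reverses their order.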

Folding of L-and D-proteins in the quaternions formalism
We have presented quaternionic representations of the standard genetic code for living systems, where the chiral combination (D-bases/ L-amino acids) is preferred, and also of a hypothetical genetic code for systems, in which we assume that the combination (L-bases/ D-amino acids) prevails among other possible chiral combinations. These representations reproduce the structure of the corresponding codes, particularly their degeneration. However, the fact that distinguishes the quaternionic representation over most of available mathematical representations is the resulting assignation of quaternions to the amino acids. Because of the advantages of using quaternions to describe spatial rotations, the association of amino acids with quaternions opens new horizons beyond the genetic code representation. In this context, we consider here the suitability of this association, together with our characterization of chirality, to take account of the folding of homochiral proteins formed by exclusively L or D amino acids.
The primary structure of a J-protein (J = L,D) formed by N J-amino acids is a sequence $A_{J1}, A_{J2}, \ldots, A_{JN}$ with $A_{Ji} \in A_J$. Our aim is to obtain from this sequence the spatial coordinates of each one of the atoms of all the amino acids that constitute the protein when it is in the native (or functional) state (tertiary structure). For L-proteins, we take as such the one corresponding to the protein in physiological solution, whose coordinates can be obtained, after crystallization, by application of, for example, X-ray diffraction methods. That is the case of most of the proteins whose coordinates are stored at the Protein Data Bank [35]. For D-proteins, since very few such proteins have been synthesized in the laboratory [36,37], we take in these cases as tertiary structure the mirror image of the corresponding experimental L-protein structure.
In principle we restrict ourselves to determining the coordinates of just the alpha-carbon atoms of the protein chain, which is not a severe restriction since it is known that there exist (at least for L-proteins) very efficient algorithms for going from this trace representation to the full-atom one [38]. We also take into account that, in our quaternionic representation, the J-amino acid sequence is expressed as a sequence of quaternions $p_{J1}, p_{J2}, \ldots, p_{JN}$ with $p_{Ji} \in H_{\alpha J}(\mathbb{Z})$. Under these conditions we now present an algorithm to determine the spatial coordinates of the alpha-carbon atoms of the protein.
First we observe that although adjacent alpha-carbon atoms are not covalently bonded their distance is notably stable and takes very similar values for all the pairs within a given protein and also for those belonging to different proteins. So in our calculations we assume that all these distances are equal to a unique value $d_{C\alpha-C\alpha} = 3.80$ Å. Thus we determine on the unit sphere with center at the origin a point for each of the amino acids (alpha-carbon atoms) in the protein sequence. To the last one we assign directly the origin; the preceding one is located at the intersection between the z axis and the sphere surface (versor $e_z$). To each of the remaining alpha-carbon atoms we assign a point on the sphere surface that results from rotating the versor $e_z$ (north pole) by a quaternion. For the ith alpha-carbon atom in the J-sequence, the quaternion responsible for the rotation is denoted $\beta_{Ji}$ (i = 3, 4, ..., N). We then expand the chain of alpha-carbon atoms from their location on the sphere into the backbone three-dimensional configuration of the protein (see Figure 6) by means of the following iterative procedure, where initially the $r_j$'s are on the sphere surface:

    do i = 1, N − 2
        δr = r_{i+1}
        do j = 1, i
            r_j = r_j + δr
        end do
    end do

According to the algorithm, the distance between adjacent alpha-carbon atoms is the unit so, to establish the correct distance, we must multiply the final calculated coordinates by $d_{C\alpha-C\alpha}$. It remains to determine how to calculate the quaternions $\beta_{Ji}$ (i = 3, 4, ..., N). In Ref. [20] we do this in a somewhat heuristic way. We take into account that the ith amino acid interacts in some way with the i − 1 previous amino acids in the sequence and also with the N − i subsequent ones. Of course, in these interactions the effect of the medium should be incorporated in some form, for example in the form of effective interactions between amino acids.
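The iterative expansion can be transcribed almost literally. The following is a minimal Python sketch, assuming 0-based lists in place of the 1-based indices of the text; the function names `expand_chain` and `on_sphere` and the sample angles are hypothetical, and the sphere points are given directly rather than computed from the quaternions $\beta_{Ji}$.

```python
import math

D_CA = 3.80  # C-alpha to C-alpha distance assumed in the text, in angstroms


def on_sphere(theta, phi):
    """Point on the unit sphere from polar angle theta and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def expand_chain(points):
    """Expand sphere points into a chain with fixed adjacent distance.

    points[-1] is the origin, points[-2] is the north pole e_z and the
    remaining entries lie on the unit sphere (0-based version of the
    1-based procedure in the text).
    """
    r = [list(p) for p in points]
    n = len(r)
    for i in range(n - 2):                  # do i = 1, N - 2
        dr = list(r[i + 1])                 # delta r = r_{i+1}
        for j in range(i + 1):              # do j = 1, i
            r[j] = [a + b for a, b in zip(r[j], dr)]
    # Rescale so that adjacent atoms are D_CA apart instead of 1.
    return [[D_CA * c for c in p] for p in r]


# Hypothetical five-residue example: three arbitrary sphere points,
# then the north pole and the origin.
sphere_points = [on_sphere(0.5, 0.3), on_sphere(1.2, 2.0),
                 on_sphere(2.0, -1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
chain = expand_chain(sphere_points)
gaps = [math.dist(a, b) for a, b in zip(chain, chain[1:])]
print(gaps)
```

In this sketch each final position is the sum of the original unit vectors from that residue to the end of the chain, so consecutive positions differ by exactly one unit vector before rescaling.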
Actually we are attempting a sort of decoding, so we are not directly interested in the detailed form of the interactions, but we recognize that in any codification of information that involves those interactions, some trace of their general form should remain. In general it is reasonable to think that the global interaction includes just two-body (effective) interactions, so by analogy we choose with generality for $\beta_{Ji}$ the normalized version of the quaternion [...] where • denotes the quaternionic dot product [...] and $c_{Jr} \in H(\mathbb{R})$ (r = 1, 2, ..., N) are unknown real quaternions (order quaternions), which are determined by means of an optimization technique. For this we use the particle swarm optimization (PSO) procedure of Kennedy and Eberhart [39], taking as fitness function the difference between the coordinates of the alpha-carbon atoms calculated following the previous procedure and the corresponding experimental ones. For the latter we take those directly read from the PDB (for L-proteins) or their mirror images (for D-proteins). We take the rmsd (root-mean-square deviation) as a measure of the fitness difference, using to that effect Bosco K. Ho's implementation of the Kabsch algorithm [40].

Actually, it is enough to consider J = L since $p_{Dr} = \bar{p}_{Lr}$, so, if we take [...] with i = (0, 1, 0, 0), then a similar relationship is verified by $\beta_{Di}$: [...] We observe that if we denote by $x_{Li} = ((x_{Li})_1, (x_{Li})_2, (x_{Li})_3)$ the point on the sphere that results from rotating the versor $e_z$ by the quaternion $\beta_{Li}$, that is $(0, x_{Li}) = \beta_{Li}\,(0, e_z)\,\bar{\beta}_{Li}$, then the sphere point that results from rotating the north pole by $\beta_{Di}$ is given by $x_{Di} = (-(x_{Li})_1, (x_{Li})_2, (x_{Li})_3)$; that is, $x_{Di}$ is the mirror image of $x_{Li}$ with respect to the plane x = 0 in Cartesian axes (x, y, z) (see Figure 6). When we expand the two chains (L and D) of alpha-carbon atoms from their location on the sphere by using the previous iterative procedure, they turn out to be mirror images of each other.
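The mirror-image property with respect to the plane x = 0 can be illustrated with a short Python sketch. As an assumption made only for this illustration, the partner of a rotation quaternion β is taken to be −i β i with i = (0, 1, 0, 0); this is one relation that sends the rotated north pole to its mirror image, and it is not claimed to be the exact relation between $\beta_{Di}$ and $\beta_{Li}$ used in the formalism. The quaternion `beta_L` is an arbitrary example.

```python
import math


def qmul(p, q):
    """Hamilton product of two quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = p
    b0, b1, b2, b3 = q
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


def conj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])


def rotate_north_pole(beta):
    """Vector part of b (0, e_z) conj(b), with b the normalized beta."""
    size = math.sqrt(sum(a * a for a in beta))
    b = tuple(a / size for a in beta)
    rotated = qmul(qmul(b, (0.0, 0.0, 0.0, 1.0)), conj(b))
    return rotated[1:]


def mirror_partner(beta):
    """Assumed partner quaternion: -i * beta * i with i = (0, 1, 0, 0)."""
    unit_i = (0, 1, 0, 0)
    return tuple(-a for a in qmul(qmul(unit_i, beta), unit_i))


beta_L = (3, 6, 0, 2)              # hypothetical rotation quaternion
beta_D = mirror_partner(beta_L)    # flips the j and k components

x_L = rotate_north_pole(beta_L)
x_D = rotate_north_pole(beta_D)
print(x_L)
print(x_D)
```

Because the chain expansion only adds these sphere points together, reflecting every sphere point in the plane x = 0 reflects the whole expanded chain in the same plane.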
In figures 7 to 9 we show the L and D enantiomers of three small proteins as obtained by using our procedure: in figure 7 of the hormone glucagon (PDB ID: 1GCN -length: 29 amino acids); in figure 8 of the ion channel inhibitor osk1 toxin (PDB ID: 2CK5 -length: 31 amino acids) and, in figure 9, of a type III antifreeze protein (PDB ID: 1HG7 -length: 66 amino acids). In all the cases the theoretic curves are compared with the corresponding experimental ones as obtained directly from the PDB for L-proteins and from the mirror-image of these for D-proteins. The corresponding rmsd´s for the L-proteins are: 0.103Å for 1GCN; 0.091Å for 2CK5 and 0.163Å for 1HG7.

Remarks
Assuming that at the beginning amino acids and nucleotide bases were synthesized from primordial elements in racemic mixtures, and that D-bases always had a higher affinity for L-amino acids whereas L-bases preferred D-amino acids, we have proposed diagrams for the evolution of genetic codes which at present would settle the correspondence between codons and amino acids according to those affinities. Actually, of both codes only the one corresponding to the affinity system (D-bases/L-amino acids) is nowadays observed on the Earth. Although the existence of the chiral combination (L-bases/D-amino acids) is in principle possible, none of the organisms that live on our planet shows L-nucleic acids or D-proteins. However, the existence of life on other planets, including the possibility that it is governed by such a genetic code (Figure 5), is an open issue.
Our evolution diagrams (Figures 2 and 3) are based on pioneering ideas by Crick and introduce in a very simple way the concept of broken symmetry. The fact that the diagrams describe the degeneration breaking and the code freezing following a similar pattern for both affinity systems, (D-bases/L-amino acids) and (L-bases/D-amino acids), means that we are considering that these aspects basically do not depend on the chirality of the molecules but on other physical, chemical and/or biological causes.
Inspired by the evolution diagrams we propose a quaternion-based mathematical representation of the corresponding genetic codes. We assign to each nucleotide base an integer quaternion, so the codons are triplets of such quaternions. The representation assigns to each triplet another integer quaternion that is associated with one of the 20 amino acids (type quaternions). The base quaternions belong to the set of eight prime integer quaternions of norm 7, and the chirality of the nucleotide bases is introduced by partitioning this set into two subsets (Eqs. 4 and 5) of cardinality 4 each, with their elements mutually conjugate, and associating their elements with the D- and L-bases, respectively. The correspondences between triplets of quaternions and type quaternions for both chiral combinations, (D/L) and (L/D), are given by functions (Eqs. 7 and 8) that use the sum and ordinary product of quaternions and are such that the type quaternions assigned to both enantiomers of a given amino acid are mutually conjugate.
Apart from preserving the degeneration of the genetic codes, as it should, our representations stand out among other mathematical representations of the genetic code because they assign quaternions to the amino acids as a final result so that, in view of the close relationship between quaternions and spatial rotations, a door towards the study of protein folding opens. In this context we propose an algorithm to go from the primary to the tertiary structure of L- as well as D-proteins. The algorithm uses, besides the integer type quaternions, a set of real quaternions associated with the order of the amino acids in the protein sequence. These order quaternions are basically the same for L- and D-proteins, so the algorithm is such that, for a given primary sequence, the 3D structures of the L- and D-proteins are mirror images of each other. Finally, one more observation about this algorithm, whose critical step is the building of the quaternion $\beta_{Ji}$ (Eq. 11). In Ref. [20] we use for it an expression that involves the ordinary product between the type quaternion corresponding to the position i and all the others. However, such an expression is not adequate for describing the folding of L- and D-proteins with a common set of order quaternions. To overcome this problem, in this article we have replaced the ordinary product by a dot product between type quaternions. As a consequence, the number of sets of order quaternions that fit a given protein diminishes, so that the search for the unique set that describes the folding of all the proteins (if it exists!) would be facilitated. We are currently working on this issue.

Figure 9: Trace representation of the alpha-carbon atoms backbone for L- and D-1HG7. Red (dark grey) inner tube: from the coordinates obtained using our procedure. Cyan (light grey) external transparent tube: from the coordinates stored at PDB for L and its mirror image for D.