|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BIOINFORMATICS |
1 Department of Chemistry, New York University, New York, New York 10003, USA
2 Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA
| ABSTRACT |
|---|
|
|
|---|
Keywords: in vitro selection; RNA pool design; mixing matrix; sequence-structure map; graph theory
| INTRODUCTION |
|---|
|
|
|---|
Many RNAs identified from random pools have simple structural motifs (e.g., stemloop, stembulgestemloop) (Lee et al. 2004
). Indeed, our graph-based analysis of random pools (25100 nucleotides [nt]) showed that different RNA secondary topologies are far from uniformly distributed, with low yields for multiply branched structures, although complex structures gradually become more frequent as RNA length increases (Gevertz et al. 2005
). Interestingly, recent experimental findings suggest that enhancing the structural diversity of RNA pools increases the possibility of obtaining novel RNAs with high activity (Carothers et al. 2004
, 2006
). Specifically, GTP aptamers with high-binding affinities are found to be more complex structurally than low-binding-affinity aptamers. The principal reason for the lack of structural diversity in random pools is due to incomplete and inefficient sampling of the astronomical size of the sequence space; random sequence sampling is inefficient because structures are not uniformly distributed in sequence space (Gevertz et al. 2005
). To overcome this problem, heuristic approaches have been used to enhance the structural diversity of RNA pools. For example, structured pools have been synthesized by maintaining a constant stemloop (GTP aptamer selection) (Davis and Szostak 2002
) and by introducing random segments in existing RNA structures (e.g., purine nucleotide synthase and domains of group I ribozymes) (Jaeger et al. 1999
; Ohuchi et al. 2002
, 2004
; Lau et al. 2004
; Yoshioka et al. 2004
). Recent works have also investigated the effects of sequence length (Legiewicz et al. 2006
) and nucleotide composition (Knight et al. 2005
) on recovery of specific RNAs. In addition, different functional classes of single-stranded RNAs have been found to have similar nucleotide compositions, implying evolutionary convergence (Schultes et al. 1997
). The success of heuristic approaches depends on the details of the introduced sequence biases and the RNA function targeted. It is thus a challenge to develop systematic pool design approaches based on deeper understanding of pool sequence and structural complexity for the discovery of novel and complex RNAs.
To enhance in vitro selection experiments, RNA pools must possess sufficient sequence and structural complexity to ensure that the target RNA property exists in the pool. Given that we know little about the distribution of active RNAs in sequence and structural space (Carothers et al. 2004
), an important goal of pool design is to maximize sequence and structural diversity without synthesizing all possible sequences. Even if complete coverage of sequence space is possible, not all regions of the space are likely to be productive for finding novel RNAs. This was suggested by recent analysis showing that the properties of GTP aptamers are correlated with their sequence/structural information content (Carothers et al. 2004
). Unlike sequence space, the complexity of RNA structure space is more difficult to characterize quantitatively. At the secondary structural level, structural distributions of RNA pools can be analyzed using graph theory (Gan et al. 2003
; Kim et al. 2004
). Such an analysis shows that random pools are not structurally diverse (Gevertz et al. 2005
), suggesting that pool structural diversity depends on how the sequence space is sampled. Indeed, understanding the relationship between sequence and secondary/tertiary structure spaces is essential for the design of effective pools for in vitro selection of RNAs. Thus, developing methodologies for generating and analyzing sequence pools possessing diverse RNA sequences and structures could enhance in vitro selection technology. Ultimately, a deeper understanding of the distribution of active RNAs in sequence and structural space will emerge through productive interactions between theoretical analysis and experiment.
Here we develop a computational approach for improving pool sequence and structural diversity by sampling sequences representing diverse regions of sequence space. We show that effective sampling of sequence space regions can be performed using nucleotide base "mixing matrices" for nucleotide transition rates applied to chosen starting sequences. Mixing matrices applied to given sequences are essentially generators of sequence pools and can be used to guide the reactants during in vitro selection experiments. Since we show that different regions of the sequence space are associated with distinct structural distributions, designed pools with specified target secondary structures can be obtained by optimizing a set of mixing matrices and starting sequences to approximate the target structural distributions. Figure 1 illustrates the relations among pool sequence/structure analysis, mixing matrix and starting sequence, and pool synthesis.
|
| MATERIALS AND METHODS |
|---|
|
|
|---|
Our strategy is to define substrate pools with both random and biased sequence mutations around specific (starting) sequences to generate designed sequences with constant and variable regions. For this purpose, we introduce the mixing matrix M, whose elements specify mixing (or "contamination" or "doped") in the four phosphoramidite (A, C, G, and U [i.e., T in DNA synthesizer]) vials; applying mixing matrices to starting sequences leads to designed sequence pools. (Note that since successive nucleotide bases are selected independently from the vials, sequence synthesis methods correspond to the zeroth-order Markov process.) This representation of pool generation or synthesis using mixing matrices enables computational analysis of pool sequence and structural diversity. Our goal is to define, via computational analysis, an optimal set of starting sequences, mixing matrices, and associated weights for a given target structural distribution in the pool. Figure 1 illustrates pool design and synthesis via mixing matrix and analysis of sequence/structure space.
For pool synthesis using four vials or ports, the corresponding mixing matrix M is a 4x4 matrix that specifies the molar fractions of nucleotide components A, C, G, and U (T) in the four vials. Thus, the "ij" element of M (i.e., M ij ) denotes the molar fraction of base j in the vial "for base i." It describes how we can dope that vial for i by introducing other bases j
i into it as well. The design problem involves selecting those doping ratios and starting sequences. For example, M AU is the fraction of U (T) nucleotides in the vial for A, M AA is the fraction of A in the vial A, and M UA is the fraction of A in the vial U (T). Thus, the elements of each row of the matrix sum to unity:
|
| (1) |
If the DNA synthesizer is to produce a fixed sequence, then a vial for base i has 100% base i and zero fraction of other bases (i.e., M ii =1 and M ij =0 for i
j). If M ii <1 and M ij
0, contaminations are introduced, as specified by the off-diagonal elements of M. The expected number of mutations in a synthesized sequence is determined by
|
| (2) |
where Nj is the number of nucleotides of type j in the original sequence.
Ideally, we would like to determine the mixing matrices M for a target structural distribution or on the basis of specific biological-motivated contamination protocols. In practice, the inverse design problemspecify M and analyze the resulting structural distributionis much easier to perform. We thus construct different mixing matrices motivated by biological covariance mutations and analyze their coverage of sequence space via a standard clustering method. Direct modeling of pools generated by a specific mixing matrix can be made by exploiting correlations between bases in folded RNAs. For example, the bases in paired and unpaired regions are correlated, allowing assignment of matrix elements for mutating bases in such regions.
Our biological motivation for choosing the mixing matrix classes is as follows. We consider mixing matrices with symmetric elements, M AU=M UA, M CG=M GC, M GU=M UG, to preserve base pairs. Such matrices cover the sequence subspace approximating covariance mutations (e.g., AU to UA, CG to GC, GC to UA). Covariance mutations have been used to analyze the secondary structure and sequence consensus of RNA sequence families. For example, this approach has been successfully applied to search for tRNA-related sequences and other small RNAs (Eddy and Durbin 1994
). Alternatively, to disrupt stems and generate new structures, we can consider mixing matrices that do not preserve base pairs. Such matrices include asymmetric matrices without the property of covariance mutations. Noncovariance mutations, including random mutations, are commonly used to generate sequence pools for in vitro selection applications.
To sample the sequence space, we define five classes of mixing matrices motivated by biological considerations, based primarily on sequence transformations associated with covariance mutations. The mixing matrix classes are characterized by the following matrix elements: (A) varying diagonal elements M ii with the condition M AA=M CC=M GG=M UU; (B) M CC=M GG=1; (C) M AA=M UU=1; (D) M AC=M UG=1; and (E) M CA=M GU=1. Within each class, several mixing matrices are constructed whose elements are distributed uniformly in steps of 0.25. A total of 22 mixing matrices representing the five classes are displayed in Figure 2. The matrix classes to which they belong are as follows: class A matrices 16, class B matrices 710, class C matrices 1114, class D matrices 1518, and class E matrices 1922. Note that in vitro experiments effectively use random pools generated by a constant 4x4 mixing matrix, where all 16 elements are 0.25; this corresponds to our matrix 4.
|
|
In RAG, RNA graphs are organized into n-vertex families, and members of a family are ordered using a topological index (i.e., Laplacian eigenvalues) (Fera et al. 2004
; Gan et al. 2004
). Structural complexity can be measured by the graph's vertex number (V) and the second smallest Laplacian eigenvalue (
2). For example, a linear chain has a smaller eigenvalue than a branched structure. The number of motifs in an n-vertex family increases with n, the number of vertices. For example, the 6-vertex tree family has six distinct trees and the 7-vertex family has 11 trees (see RAG Web resource at http://monod.biomath.nyu.edu/rna). For easy reference, each tree motif is labeled by vertex number and ordering within the family; for example, members of the 6-vertex family are labeled 61, 62, ..., 66. Since vertex number (V) is related to RNA sequence length L, these are constant length pools; in fact, we found empirically that L=20(V1) (Gan et al. 2003
). Our pool design will focus on tree structures because RNA folding algorithms for tree structures are efficient; computationally demanding pseudoknot folding algorithms are also available (Rivas and Eddy 1999
; Ren et al. 2005
). RNA graph theory also provides a complete set of pseudoknot and nonpseudoknot motifs for more general assessment of pool structural diversity.
Starting sequences for pool generation
The six starting sequences with distinct tree structures (Fig. 3) are 70S (chain F) (80 nt), tRNA (81 nt), P5abc domain of group I intron (56 nt), GTP-binding aptamer (69 nt), modified P5abc domain (51 nt), and modified GTP-binding aptamer (54 nt). As shown in Figure 3, distinct tree structures are represented as graphs by converting stems to edges and other structural elements (e.g., loop, bulge, etc.) to vertices according to tree graph rules developed previously (Gan et al. 2003
). As shown in Figure 3, the Laplacian eigenvalue (
2) indicates the structural complexity of starting sequences. Generally, the starting structure allows exploration of the structural neighbors of that structure via mutations. For random mutation rates (constant matrix elements of 1/4), the generated pools have no memory of the starting sequence. We generate pools with all possible combinations of 22 mixing matrices and six starting sequences for pool structured designs.
|
|
| (1) |
where the four-component vector X i , i=1, 2, ..., n, represents the nucleotide base:
|
| (2) |
The matrix Y represents the sequence pool generated by M with starting sequence S. For example, if
|
| (3) |
|
| (4) |
|
| (5) |
The probability of finding a new sequence S' in the pool can be calculated from Y. For example, P(S'=ACU)=Y 11Y 22Y 43=0.10.10.7 and P(S'=GUC)=Y 31Y 42Y 23=0.10.10.1. Similarly, we can calculate the frequency of sequences with a specified base-pairing scheme using this mathematical formulation. However, a rigorous mapping of sequence space (Y) to shape space (possible RNA graphs) requires an RNA folding algorithm, as described in our previous work (Gan et al. 2003
).
The challenge in computational pool design is to find an optimal set of mixing matrix (M), starting sequence (S), and weight (or pool fraction) for generation of target-structured pools. In principle, the mixing matrix can be calculated using statistical thermodynamics from the distribution of shapes in the designed pool. Assuming that the designed pool consists of N noninteracting RNA molecules, the probability of finding topology t in the pool is
|
| (3) |
where E(Si ) is the energy of sequence Si ,
=1/kT,
(Ei ) is the density of states, and S i,
=1/kT,
(E i) is an RNA topology operator defining tree or pseudoknot shapes enumerated by RNA graph theory. Equation (3) defines the relation between sequence pool {Si } and target structural distribution P(t). Recently, we calculated P(t) distributions for 25100 nt random pools using a folding algorithm and a program for converting secondary structures into tree graphs (Gevertz et al. 2005
). Thus, the goal is to determine the sequence pool {Si }, or mixing matrices generating that pool, to produce the target distribution P(t). In the Appendix, we describe a practical protocol for finding optimal mixing matrices approximating the target P(t) based on analyses of sequence space and pool structural distribution. Alternative pool design methods may also be developed based on Equation (3).
Pool sizes
For practical reasons, our computations used relatively small pools of 10,000 sequences. To show the effect of pool size, Figure 4 plots the frequency of several tree motifs (41, 42, 51, 52, 53, and 61) for pool sizes of 5,00060,000 sequences using mixing matrix 4 (MM4) and the initial tRNA sequence. We see that the pool fractions for distinct tree motifs saturate beyond 5000 sequences, indicating that the error due to sample size is small. The rapid saturation of pool fraction stems from mapping secondary structures using simple graphs. If detailed motif features (size of loops, stems, etc.) are incorporated into the mapping, larger pool sizes will certainly be required.
|
| RESULTS |
|---|
|
|
|---|
To analyze the clustering patterns in sequence space, we cluster all sequences generated by the mixing matrices using a standard clustering technique (e.g., MDS), allowing visualization of sequence similarity/dissimilarity properties (Cox and Cox 1994
). A similar procedure is commonly used for investigating the diversity of chemical compound libraries (Xie et al. 2000
). Such analysis helps establish the relation between each mixing matrix and the generated sequence space. Given a pool of sequences, we define Hamming distances (number of dissimilar bases) between all pairs of sequences (see Materials and Methods), allowing data projections in 2D, 3D, and higher dimensions.
Figure 5, A and B, shows the 2D and 3D clustering of sequences generated by 22 mixing matrices using starting sequences for the modified P5abc (Fig. 3E) and 70S (Fig. 3A) RNAs, respectively. In Figure 5A, we see that the sequences generated by the five mixing matrix classes and the P5abc starting sequence cover distinct regions of the sequence space, especially the boundary and central regions. The boundaries are spanned by matrix classes BE, and the central region by matrix class A. Intriguingly, the random MM4 produces sequences that are localized in sequence space, showing that the standard approach does not provide an efficient sampling of diverse regions of sequence space in agreement with observations. More adequate sampling of sequence space is provided by the nonrandom mixing matrices. The 70S starting sequence yields similar global 2D and 3D clustering patterns: the five matrix classes yield clusters in distinct sequence regions (Fig. 5B). Although the 22 mixing matrices of Figure 2 provide a comprehensive coverage of the sequence space, some regions remain sparsely populated, indicating that the matrix classes must be expanded for more complete coverage. Still, the chosen matrix set is diverse enough for initial assessment of our structured pool design concept.
|
|
|
8%, for the random mixing matrix, versus 79%, 46%, and 34% in matrix pools 79, which do not mutate C and G bases, respectively. At the other extreme, matrices 2022 produce small proportions (5%6%) of 42 tree structure. For the 31 tree motif, mixing matrices 5, 6, 14, 15, 19, 20, 21, and 22 generate higher pool fractions than the random pool 4, whereas matrices 713 yield considerably lower numbers. Thus, motif distributions depend on both the mixing matrix and the starting sequence because different sequence space regions result. Because the overlap of sequence regions of our 22 pools is weak, the motif distribution is very different in each case (Fig. 5). Second, we note a pattern in the correlation between 42 and 52 trees and between 31 and 41 trees. With the Figure 3E starting sequence, Figure 6 and Table 2 show that sequence pools from matrices 713 have a large proportion of 42 and 52 tree structures compared with the random pool 4. Similarly pools from matrices 1422 possess >30% 31 and 41 tree structures. This pattern emerges because the 42/52 and 31/41 tree-motif pairs are related by an internal loop or bulge, which can be created by a few mismatched base pairs.
Third, the structural distributions generated by a tRNA sequence (53 tree motif; Fig. 3B) differ from those for the modified P5abc domain (42 tree motif; Fig. 3E) in one important respect (Table 2). The most likely motifs are the simpler 51 and 52 trees rather than the starting 53 motif. MM1, for example, generates only 5% 53 motif, but 26% 52 tree. In contrast, MM7, which preserves C and G bases, generates 23% 53 trees, while other combinations of mixing matrices and starting structures yield almost no 53 trees (Table 2). The mean mutation rate for MM7 from the starting tRNA sequence (Fig. 3B) is
0.1 (
8.5 positions among 81 nt). Thus, the 53 motif populations produced by matrices 1 and 711 are much higher than in random pools (1.3%). The difficulty of generating significant populations with the tRNA-like 53 tree motif likely stems from the lower thermodynamic stability of 53 compared to 51 and 52 trees. Our analysis shows that matrices 79 generate sequences that are favorable for stabilizing the 53 motif because these matrices preserve energetically favorable CG base pairs.
To increase the population of complex folds like the tRNA-like 53 tree motif, we consider refining the mixing matrices 79. Since class B matrices 79 produce a higher frequency of 53 tree, we search for matrices in the neighborhood of this class by exhaustively varying the elements in each row with
M ij =0.2, yielding 56 possible cases. Assuming that each row is independent, the total number of mixing matrices around the class B matrix region is 562 or 3136, since two rows (second and third) are identical and the other two rows (first and last) have a total of 56 cases each. We filter the 3136 trial mixing matrices, yielding better than a 23% tRNA-like 53 tree structure. Remarkably, 12 of the 3136 mixing matrices for tRNA-like topology fulfill our requirement forming 53 motifs. We use these "MMT" matrices (Fig. 7) to generate graph-structural distributions with tRNA shapes, as shown in Table 3. For example, MMT6 generates a 51% tRNA-like 53 tree motif with 15 mutations out of 81 bases.
|
|
Sequence/structure correlations exist in designed pools
The above survey of tree structural distributions provides an analysis of RNA shapes in designed pools. We now analyze sequence/structure distributions at the nucleotide base level generated by the 22 mixing matrices starting with a 51-nt P5abc domain (Fig. 3E). In Figure 8, we use sequence Hamming and tree edit distances to quantify sequence and structure distances, respectively, as defined in Materials and Methods. Recall that the Hamming distance is the number of differing letters between two equal-length RNA sequences aligned end to end (Hamming 1987
). The tree edit distance between two (full) tree secondary structures measures the minimum sum of the cost (insertion, deletion, and replacement of nodes) along an edit path for converting one tree into another (Hofacker 2003
).
|
As the mixing matrices are altered, the distribution of the tree edit distances also changes. Generally, tree edit distance increases with mutation rate. For example, Figure 8 shows that tree edit distances become larger as diagonal elements of matrix classes A, B, and C decrease. Changing C and G bases (matrix class C) has a larger effect on the starting structure than changing A and U bases (matrix class B), as evident from the pool distances from the origin. This is due to lower free energies associated with GC base pairs compared to AU base pairs. Thus, Figure 8 indicates that the distribution of sequence/structure distances from the initial sequence is controlled by the elements of the mixing matrices. Although the patterns of sequence/structure distributions are not sensitive to the starting sequences (data not shown), the densities within the localized regions are markedly changed. Figure 8 shows that the pools generated by most mixing matrices (except for 1, 8, and 9) and starting sequences of a modified P5abc domain produce a single cluster. We find that contour plots with string edit distance or base-pair distance (data not shown) show somewhat less information about pool structural properties than those with tree edit distance. Other secondary structure measures may also be used to capture structural differences among folds in the same vertex or tree class. For example, it is informative to know the distribution of stem, loop, and bulge sizes (Fontana et al. 1993
).
Parameter optimization can lead to design of structured RNA pools
The preceding analysis of sequence space and assessment of structural distributions generated by nonrandom mixing matrices allow design of target structured pools. Here we use the pool design algorithm (Appendix) to develop several structured pools by selecting an optimal combination of starting sequences, mixing matrices, and associated weights {(Si , M i ,
i )}. The best combination for a target pool is dictated by the frequency data (Fig. 6; Tables 2, 3).
To illustrate, Table 4 shows four examples of designed pools that are rich in specific tree structures (e.g., 41, 51, 52); also displayed are their pool characteristics (mixing matrix weights and tree motif frequencies). Specifically, our target pools are: Pool TA with 41 and 42 structures; Pool TB with 51, 52, and 53 structures; Pool TC with 42, 52, and 53 structures; and Pool TD with 41, 42, and 53 structures. Pools TA and TB are 4- and 5-vertex pools, respectively, and Pools TC and TD are pools with mixed n-vertex structures. Each designed pool represents an optimal combination of starting sequences, mixing matrices, and associated weights derived using Step 5 of our design algorithm (see Appendix). Briefly, we initially choose pool fractions T 1, T 2, ..., Tn for target motifs and the number of mixing matrices to approximate the target pool. We then use Equation (6) in the Appendix to calculate the weight
1, which depends on T 1, starting sequence S 1, and mixing matrix M 1. Next, we minimize the error between the target and estimated target pool fractions, Equation (8) in the Appendix, over all pools generated by starting sequence/mixing matrix pairs {(Si , M i )}. This procedure yields optimized starting sequences, mixing matrices, and weights; the mean mutation rate is calculated based on these sequences, mixing matrices, and their weights.
|
Pool TB, targeting 20% each of the 51, 52, and 53 structures, is generated from the Figure 3A sequence for 51 with matrix 13 at a weight of 0.18 and the Figure 3B sequence for 53 with matrix T12 at a weight of 0.82. For this pool optimization, we expanded our mixing matrix/starting sequence repertoire to include those in Table 3 (the 12 mixing matrices for generating pools with a high frequency of 53 motifs, which are extremely rare in random pools). Thus, this optimization was performed over the set of 144 (22x6+12x1) mixing matrix/starting sequence pairs.
Resulting Pool TB contains 20% each 51, 52, and 53 tree motifs, compared with 23%, 16%, and 0% for the random pool (MM4), matching the target exactly. We found that using the 12 MMT matrices dramatically increases the population of the 53 motif (at a cost of decrease of 31, 41, and 42 motifs). The 61 structure (9%) is the next highest species in Pool TB.
Target Pools TC and TD are mixed pools with both 4- and 5-vertex tree structures, designed from our 144 mixing matrix/starting sequence pairs. The targets for Pool TC are 42, 52, and 53 tree motifs, and those for Pool TD are 41, 42, and 53 tree motifs (20% for each). Pool TC is generated by the Figure 3E sequence (MM9, 0.60) and the Figure 3B sequence (MMT2, 0.40), and Pool TD is produced by the Figure 3B sequence (MMT6, 0.329) and the Figure 3F sequence (MM13, 0.608). The results are as expected: Pool TC has frequencies for 42, 52, and 53 motifs of between 19% and 20%; Pool TD has frequencies for 41, 42, and 53 trees of 17%, 20%, and 20%, respectively, all within 3% of the target.
Our designed pools above, involving three of the 22 mixing matrices and two of the 12 MMT matrices, only touched the surface of possibilities. Still, in practice, it might be preferable to approximate a target pool using a small number of mixing matrices. Once our algorithm is automated (Appendix), exploration of pool design can be routinely performed.
A designed pool improves the selection of GTP aptamers
We now apply our pool design approach for enhancing GTP-binding aptamers. Szostak's group recently found that the GTP aptamer's binding affinity is correlated with the informational complexity (Carothers et al. 2004
, 2006
). Informational complexity is correlated with structural complexity (e.g., number of stems, vertex number of tree graph). As the information content and binding affinity decrease (Carothers et al. 2004
, see their Fig. 1, panels A and B), the aptamers have simple structures such as 21 or 31 tree motifs. Specifically, a high-affinity GTP aptamer with high informational complexity (Carothers et al. 2004
, see their Fig.1, panel C) has the 52 tree structure (Fig. 3D). Interestingly, no GTP aptamer with a 42 tree structure (Fig. 3F) has been reported, although it is structurally similar to the 52 tree. Because the frequency of the 42 motif is only 12% in the random Pool TF (Table 4), we propose designing a GTP aptamer pool by enriching the pool with 52 and 42 motifs. Our target pool fractions (Ti ) are 20% for 42 and 26% for 52. Our optimization yields Pool TE (Table 4) as a combination of two subpools: the Figure 3D sequence (MM13, 0.625) and the Figure 3F sequence (MM10, 0.375). The frequencies of 42 and 52 trees in the designed pool are 21% and 26%, respectively, nearly as desired and very different for the 12% and 16% distributions of these motifs in the random Pool TF. The sequence/structure contour plots in Figure 9 show differences between the designed and random pools; the designed pool has a relatively high mean mutation rate of 0.349.
|
| DISCUSSION |
|---|
|
|
|---|
Thus, our work suggests that designing pools for enhancing in vitro selection can follow several research avenues. Maximizing sequence and structural diversity broadly can increase the probability of finding a given RNA property using nonrandom mixing matrix/starting sequence pairs. An advantage of this approach is that designed pools can be directly implemented in pool synthesis. Alternatively, we can target a specific structural distribution by determining optimal mixing matrices and starting sequences without explicit sequence/structure mapping. Our targeted pool design can be applied to known structures, novel motifs (Kim et al. 2004
), complete sets like n-vertex pools, or perhaps submotifs of RNAs (Zorn et al. 2004
). Of course, a more comprehensive set of mixing matrices covering wider regions of the sequence space should be sought systematically. For example, matrices conserving noncanonical base pairs (AC, CA, GA, AG, etc.) can complement our current set, which conserves canonical base pairs; there are 12 such classes from a total of 16 possible base pairs.
Another design theme is enrichment of pools with structures resembling a target-active molecule. We illustrated this approach using GTP-binding aptamers. The conventional approach designing pools in the sequence neighborhood of a target molecule (Lau et al. 2004
; Ohuchi et al. 2004
; Yoshioka et al. 2004
), however, does not ensure that the designed pools will cover the structural neighbors of the target molecule, unless sequence mutations are made to localized sequence segments, as is commonly done in many experiments. In contrast, our optimized pool design approach (Appendix), allows enrichment of pools with specific RNA topologies or structures (e.g., tRNA-like 53 tree). In addition, novel tree topologies in the neighborhood of the target molecule, as suggested by structural enumeration (Kim et al. 2004
), could be similarly engineered.
Clearly, further developments of sequence/structure analysis techniques are needed to improve the pool design and overcome specific limitations. Understanding the sequence/structure relationship is one of the most fundamental biological problems not only for RNA but also for proteins. In our analysis, we are limited by the usage of numerical secondary structure folding algorithms, which are still imperfect and inefficient for predicting pseudoknot structures. However, our risk has been reduced here by "folding" of small RNAs (<100 nt) only and focusing on statistical properties (e.g., frequencies of topologies). A general strategy for improving structure prediction is to consider many suboptimal structures using, for example, the Boltzmann sampling method (Ding and Lawrence 2003
).
Ultimately, RNA tertiary and higher-order folding is essential to understand RNA function. Perhaps progress on this problem will be realized in the near future. For now, we offer our sequence pools possessing diverse RNA secondary structures as an approach to enhance in vitro selection technology.
Our pool design algorithm can be fully automated given target RNA shapes (and possibly starting sequences). We are developing a publicly available Web server to allow experimentation of pool design and analysis of RNA pool properties (e.g., base composition, size distribution of stems, bulges, etc.), and to define optimal mixing matrices for pool synthesis. Experimental synthesis of designed pools (specific structural motifs and their frequency) can be performed by using optimized starting sequences, mixing matrices, and associated weights. When available, location of this server will be noted on our group Web site (http://monod.biomath.nyu.edu). We hope that this tool will help stimulate the productive interaction between theoretical and experimental efforts.
| APPENDIX |
|---|
|
|
|---|
Our pool design algorithm involves the following steps:
i ) to approximate the target RNA shape distribution. The mathematical procedures for this step are detailed below.
The designed pool is composed of k smaller subpools defined by the set {(Si , M i ,
i )}, i=1, 2, ..., k. The above pool design algorithm can be fully automated given target RNA shapes (and possibly starting sequences). We are planning to make publicly available a Web server to allow experimentation of pool design and analysis of RNA pool properties, and to obtain mixing matrices for pool synthesis. Experimental synthesis of designed pools can be performed by using trial Si , M i , and
i .
In Step 3, the pool structural distribution is calculated by mapping RNA secondary structures into graph space. This is done by predicting secondary structures of all sequences using the Vienna RNAfold package and then converting them into tree graphs, as described elsewhere (Gevertz et al. 2005
). It is known that 73% of known base pairs are predicted by free-energy minimization algorithms such as RNAfold for sequences with <700 nt (Mathews and Turner 2006
). For greater accuracy, the Boltzmann sampling method can be used to generate a set of 1000 suboptimal structures (Ding and Lawrence 2003
), although at a higher computational cost (1000 times pool size). Specifically, base-pairing information in the .ct file generated by the RNAfold program is used to convert a secondary fold into a tree graph. The topologies of the folds are determined using Laplacian eigenvalues of tree graphs as implemented in our RNA Matrix Program (Gan et al. 2004
) (server available at http://monod.biomath.nyu.edu/rna). Specifying tree topologies using eigenvalues is inexact because different topologies can have the same spectrum; the assignment error rate is a few percent for small tree topologies (<10 vertices). This step is similar to the RNAshapes program, which uses bracket notations for representing secondary structures (Giegerich et al. 2004
; Steffen et al. 2006
). Unless stated otherwise, each sequence pool has 10,000 sequences, which is adequate for assessing structural distributions using simple tree graphs. Structure prediction and conversion to tree graphs for 10,000 80-nt sequences require
1 h on an SGI 300 MHz MIPS R12000 IP27 processor.
In Step 5, we approximate a target structural distribution by optimizing a set of starting sequence/mixing matrix pairs based on pool structural frequency data. Generally, we consider a designed pool composed of k subpools, each generated with a mixing matrix/starting sequence pair and associated with a weight
i : p(S 1, M 1,
1), p(S 2, M 2,
2), ..., p(Sk , M k ,
k ), where
1+
2+...+
k =1 and p(Si , M i ,
i ) denotes synthesizing the
i fraction of the pool sequences using starting sequence Si and mixing matrix M i . Optimization of the three pool parameters Si , M i , and
i can be formulated as follows: If the nx1 matrix T is the target distribution with Ti fractions of structures 1, 2, ..., n and Fl (Si , M i ) is the pool fraction of structure l generated by starting sequence Si and mixing matrix M i in Tables 2 and 3, the pool parameters (Si , M i ,
i ) can be optimized by the following equation:
|
| (4) |
where
=(
1,
2, ...,
k ) subject to the conditions
1+
2+...+
k =1 and
i
0. Since experimental implementation of pool synthesis is simpler with fewer mixing matrices, we consider the solution of
for k=2 below; the optimization procedure can be generalized. Formula (4) with only two mixing matrices M 1 and M 2 reduces to
|
| (5) |
The solution for the only weight is
|
| (6) |
The estimated pool fractions for the other shapes or topologies 2, 3, ..., n are derived from the known
1, F 1(S 1, M 1), and F 1(S 2, M 2) as follows:
|
| (7) |
We then optimize (S 1, M 1), and (S 2, M 2) by minimizing the error
|
| (6) |
The above procedure will allow us to obtain the optimized parameters
1, (S 1, M 1), and (S 2, M 2) for a target distribution T. The convergence of the procedure depends on the number of mixing matrices and starting sequences, or coverage of the sequence/structure space.
| ACKNOWLEDGMENTS |
|---|
|
|
|---|