BIOINFORMATICS
1 Department of Chemistry and Biochemistry, and 2 Department of Molecular, Cellular, and Developmental Biology, University of Colorado at Boulder, Boulder, Colorado 80309, USA
Reprint requests to: Rob Knight, Department of Chemistry and Biochemistry, Campus Box 215, University of Colorado at Boulder, Boulder, CO 80309, USA; e-mail: rob{at}spot.colorado.edu; fax: (303) 492-7744.
ABSTRACT
Keywords: RNA secondary structure; base composition; selection; self-organization
INTRODUCTION
Because organisms vary widely in genome GC content in a manner consistent with directional mutation pressure (Sueoka 1962, 1988), we might expect the different parts of the RNA molecule to change in composition at different rates due to the different selective constraints in different regions, just as the three reading frames within mRNAs change in composition at different rates that reflect the average effect of mutations in each frame (Muto and Osawa 1987). Specifically, third position changes have the least effect because they are often synonymous, and second position changes have the greatest effect because they often substitute an amino acid that is chemically very different. This gives rise to substantially different slopes when regressing the GC content at a particular codon position against the overall GC content and also affects the amino acid composition of the protein correspondingly (Sueoka 1961; Muto and Osawa 1987; Lobry 1997; Sueoka 1999; Singer and Hickey 2000; Knight et al. 2001). Indeed, different RNA molecules such as tRNAs, rRNAs, and mRNAs also change in composition at different rates relative to overall GC content (Muto and Osawa 1987). Even within a single molecule, the paired and unpaired regions of 16S rRNA in bacteria and archaea have been shown to differ in slope substantially (Wang and Hickey 2002). In this article, we test whether these differences in response to overall genome GC content hold for finer-grained structural categories, within both large and small subunit rRNAs in all three domains of life. We also test whether the differences in response are due to differences in purifying selection in the different regions, or whether they are due to intrinsic differences in the amount of base-pairing expected in sequences of different composition (Schultes et al. 1999).
In RNA, each nucleotide can be assigned to one of six secondary structure categories: stem, loop, bulge, junction, end, or a type of unpaired base that we provisionally call "flexible" (Fig. 1). Stems are the base-paired regions of the molecule. Loops, bulges, and junctions are unpaired regions enclosed by stems. Let the degree of an unpaired region be the number of stems attached to it. Then, loops have degree one, bulges have degree two, and junctions have a degree higher than two. The ends are all unpaired bases on the 5' and 3' end of the molecule. Flexible bases (also known as "freely rotating joints" [Schuster et al. 1994], although this may be a misnomer at the tertiary structure level) make up unpaired regions that connect two stems but that are not part of a closed RNA structure.
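The six categories can be assigned mechanically once a structure is written in dot-bracket notation. The following is a minimal sketch of one way to do so, not the classification code used in this study: the function name `classify_bases` is hypothetical, the input is assumed to be pseudoknot-free, and the degree of a closed region is taken as the number of enclosed stems plus the closing stem.

```python
def classify_bases(structure):
    """Label each position of a pseudoknot-free dot-bracket string."""
    n = len(structure)
    pair = [-1] * n            # pair[i] = partner index, or -1 if unpaired
    open_positions = []
    for i, ch in enumerate(structure):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = open_positions.pop()
            pair[i], pair[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if open_positions:
        raise ValueError("unbalanced '('")

    # Outermost paired positions separate the ends from flexible bases.
    paired = [i for i in range(n) if pair[i] != -1]
    first, last = (paired[0], paired[-1]) if paired else (n, -1)

    labels = [None] * n
    # Each work item is one region: (start, stop, closed_by_a_pair).
    # An explicit work list avoids deep recursion on long stems.
    work = [(0, n, False)]
    while work:
        start, stop, closed = work.pop()
        unpaired, child_stems = [], 0
        i = start
        while i < stop:
            if pair[i] == -1:
                unpaired.append(i)
                i += 1
            else:
                j = pair[i]
                labels[i] = labels[j] = "stem"
                child_stems += 1
                work.append((i + 1, j, True))   # region enclosed by pair (i, j)
                i = j + 1
        if closed:
            degree = child_stems + 1            # enclosed stems plus the closing stem
            name = "loop" if degree == 1 else "bulge" if degree == 2 else "junction"
            for k in unpaired:
                labels[k] = name
        else:
            # Exterior region: unpaired bases outside every base pair.
            for k in unpaired:
                labels[k] = "end" if (k < first or k > last) else "flexible"
    return labels
```

For example, in `"((...)).((...))"` the single unpaired base between the two hairpins (index 7) is labeled `"flexible"`, because it connects two stems without being enclosed by any base pair.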
In this study we asked the following four questions about rRNA structure and composition:
RESULTS AND DISCUSSION
The base composition for each of the six structural elements and the overall base composition of the molecule are visualized in composition space (Fig. 2). An important feature for orientation in this space is Chargaff's axis. This axis, where the amounts of C and G are equal and the amounts of A and U are equal, indicates the line in composition space where Watson-Crick base-pairing holds exactly. Deviations from Chargaff's axis tell us about compositional differences due to processes other than changes in GC content, which can simply result from compensatory mutations in stems. Our results show that all structural elements have distinct compositions. The compositions of the whole molecules and the stems show linear distributions along Chargaff's axis, as expected (Schultes et al. 1997), with considerable variation in GC content but very little variation in the other directions (Fig. 2B). Remarkably, the three unpaired regions that contain a substantial number of bases (loops, bulges, and junctions) have separate distributions. The ends and the flexible bases are scattered throughout composition space, because there are very few bases in these categories, so the sampling error is large. Therefore these latter two categories are excluded from the rest of the analysis.
We present several general observations based on these calculations. First, looking at the composition of the total molecule, LSU sequences are more biased than are SSU sequences, and bacteria are more biased than are archaea, which are more biased than are the eukaryotes. Second, the molecules contain a purine bias, which consists of more G than A: ~60% G for the archaea and bacteria and up to 94% for eukaryotes (in the SSU eukaryotes, the other part is 6% U). Third, most of the variation in GC content of the total molecules can be explained by the stems that form very similar distributions along Chargaff's axis. The stems have an almost equal bias in all domains of life toward U and G, because of wobble base pairs. Interestingly, SSU sequences have a higher GU bias in their stems than LSU sequences do. Finally, the unpaired regions explain the purine bias in the molecules. For both archaea and bacteria, we find that the bulges are the most biased, that the loops are least biased, and that the junctions are between the loops and the bulges. The purine bias is on average 65% A in these domains. In eukaryotes, the bias in the unpaired regions is much smaller overall, and the order from least to most biased is as follows: loop, bulge, junction.
Because rRNA sequences are biased toward purines and because the overall composition is constrained by a sum, the paired regions and the unpaired regions will necessarily differ in composition. Specifically, a line drawn through points representing the composition of the paired and the unpaired parts of an RNA molecule will pass through the overall composition, showing that the compositions of the paired and the unpaired regions differ from the overall composition in precisely opposite directions. However, the magnitude of the change in composition can differ, because the paired and the unpaired regions can contain different numbers of bases, and the different types of unpaired regions, for example, loops, bulges and junctions, are not constrained to share the same composition. Thus, for example, if GC pairs were preferentially incorporated into stems, the compositional differences between paired and unpaired regions would be much greater than would be the case if the bases that participate in pairs were randomly chosen from the whole molecule. Similarly, the amount of the sequence contained in each structural category is potentially free to vary, affecting the extent to which each component can differ in composition from the overall sequence. Thus the compositions of individual structural components cannot be inferred from the number of base pairs and the overall composition of the molecule, and this compositional information may provide important clues about the assembly of RNA structures.
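The sum constraint described above can be written out explicitly. The notation here is introduced only for illustration and is not taken from the original text: let $\mathbf{c}$ be the overall composition vector, $\mathbf{c}_p$ and $\mathbf{c}_u$ the compositions of the paired and unpaired bases, and $f_p$ and $f_u = 1 - f_p$ the fractions of bases that are paired and unpaired.

```latex
\begin{align}
\mathbf{c} &= f_p\,\mathbf{c}_p + f_u\,\mathbf{c}_u, \qquad f_p + f_u = 1,\\
\mathbf{c}_p - \mathbf{c} &= \phantom{-}f_u\,(\mathbf{c}_p - \mathbf{c}_u),\\
\mathbf{c}_u - \mathbf{c} &= -f_p\,(\mathbf{c}_p - \mathbf{c}_u).
\end{align}
```

The two deviations point in exactly opposite directions along the line joining $\mathbf{c}_p$ and $\mathbf{c}_u$, and their magnitudes stand in the ratio $f_u : f_p$. The constraint fixes neither the direction of that line nor the value of $f_p$, and it says nothing about how the unpaired composition is divided among loops, bulges, and junctions.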
The unexpected differences in composition among the three different unpaired structural categories suggest that these categories should be considered separately in studies of RNA composition, and underscore the importance of the fine-grained approach.
Do the different structural categories have the same composition in the large and small subunit rRNA, and across all domains of life?
The three domains of life diverged billions of years ago and, although the rRNA molecule is conserved for function in the different domains, the nonfunctional parts presumably varied independently in each lineage. Since there is no known sequence homology between the two ribosomal subunits, these subunits have no apparent shared ancestry. It is therefore surprising to see the same patterns of variation across all domains and both subunits. These patterns of variation, or space in which the rRNA molecules can mutate freely without losing their function, are represented by the tight distributions in composition space.
Figure 3 shows the compositions of the structural elements for large and small subunit sequences from archaea, bacteria, and eukaryotes. The distributions for LSU and SSU sequences within one domain are remarkably similar with respect to location and variation. The separation among the various structural elements is more pronounced in the SSU sequences, because many more SSU sequences than LSU sequences were available for analysis.
Samples that look distinct by eye need not be statistically different. Therefore, we applied Monte Carlo simulations to determine whether the differences between any combination of two samples (structural elements) within one species and domain (126 combinations in total) were significant. The calculations showed that the differences between all combinations but one were highly significant: The remaining P-values were <0.02 and for all SSU samples the P-values were <1/n, where n is the number of randomizations (10,000 in our experiment). Consequently, the different structural categories have significantly different compositions in the large and the small subunits and in the three domains of life, although these differences might be due to overall differences in the composition of the molecule (see below).
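A randomization test of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the test statistic (Euclidean distance between the mean composition vectors of the two samples) and the names `mean_vector` and `permutation_p_value` are assumptions made here, since the statistic is not specified in the text.

```python
import random
from math import dist


def mean_vector(points):
    """Component-wise mean of a list of equal-length composition vectors."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def permutation_p_value(sample_a, sample_b, n_randomizations=10000, seed=0):
    """Fraction of label shufflings whose between-group distance is at
    least as large as the observed one (assumed test statistic)."""
    rng = random.Random(seed)
    observed = dist(mean_vector(sample_a), mean_vector(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_randomizations):
        rng.shuffle(pooled)                       # reassign group labels at random
        shuffled = dist(mean_vector(pooled[:n_a]), mean_vector(pooled[n_a:]))
        if shuffled >= observed:
            hits += 1
    return hits / n_randomizations
```

When no randomization reaches the observed distance, the function returns 0, which should be reported as P < 1/n rather than as P = 0.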
Despite the statistical significance of the differences, the observed compositional biases are visually strikingly similar across both subunits and all domains of life. Thus we tested whether the patterns were significantly similar using one-tailed two-sample t-tests on the distances between different groups. Looking separately at the differences in subunit, domain, and structural element tells us which variable causes the most difference in means. Figure 4 (top) shows the distance distributions of all possible combinations (Fig. 4A) and matches across subunits (Fig. 4B), domains (Fig. 4C), and structural elements (Fig. 4D). Comparing clusters within a domain and a structural element on subunit gave the highest significance (P = 0.00032, t = 3.5, df = 286). In other words, we found the greatest similarities between samples that came from the same structural category and the same domain but from different subunits. The visual similarities between clusters within a subunit and structural element, but across domains, were confirmed by the t-test (P = 0.00075, t = 3.2, df = 298). The data did not cluster by structural element (P = 0.68, t = 0.47, df = 310).
Are the constraints on rRNA composition due to natural selection?
Although the different structural components of rRNA are tightly constrained within characteristic regions of the space of possible compositions, these constraints might arise naturally from the process of RNA folding rather than because of purifying selection on the natural rRNA molecules. To test whether the differences between structural components within a molecule and the constraints on the composition of each structural component were due to selection, we compared the natural sequences to arbitrary, randomized sequences with the same composition. Any effects due to selection on the natural sequences should not be observed in the randomized sequences.
Because obtaining structures for long, arbitrary RNA sequences is prohibitively expensive, we estimated the secondary structures of the randomized sequences using the Vienna RNA folding package (Hofacker et al. 1994). Because the predicted structures are likely to contain errors, we also predicted the structures for the natural rRNA sequences using the same methods. This allowed us to separate effects due to inaccuracies in the structure prediction, which would be expected to be similar for natural and randomized sequences, from effects due to special properties of the natural sequences. We thus examined three types of data: annotated structures of natural sequences (NA), computer-predicted structures of the natural sequences (NP), and computer-predicted structures of randomized sequences (RP). We used the NP structures to test the effects of computer prediction, and we used the RP structures to test whether the compositional biases in structural components depended on the sequence, as opposed to the composition, of the natural rRNAs (the randomized sequences were constrained to have the same composition as the natural sequences).
The compositional biases observed in RP structures are much more similar to the annotated sequences than was expected (Fig. 5, top and bottom). Comparing the NP structures to the NA structures reveals some loss of information due to the computer predictions (Fig. 5, top and middle). Specifically, the variance of the samples increases in the NP structures, and some of the distinction among structural components is lost. Remarkably, however, the separation among loops, bulges, and junctions is still visible. The prediction of the composition of the stems is very good, probably because base-pairing dominates in the predictions.
We also tested whether the compositions of each structural category in the NA, NP, and RP structures were significantly similar to one another by using the same test as for similarities between domains and subunits. On the lower half of Figure 4 are the graphs associated with the comparison of natural annotated (NA) structures with the computer predictions of the natural sequences (NP), and the predictions of the randomized sequences (RP). The statistics confirm the visual observations discussed above. Results of t-tests between the subsets and the distribution of all combinations (Fig. 4E) show that matches across the computer predicted structures (NP vs. RP) (Fig. 4H) are most significant (P = 2.1 × 10⁻⁹, t = 5.9, df = 2578). The matches both across NA and NP structures (Fig. 4F), and across NA and RP structures (Fig. 4G) are still highly significant (P = 1.6 × 10⁻⁷, t = 5.1 and P = 5.9 × 10⁻⁷, t = 4.9, respectively), despite the observed shifts of the unpaired regions in composition space.
Consequently, the different compositions of paired and unpaired regions, and of the different types of unpaired regions, do not depend on the sequence (to the limits of our ability to predict the structure with RNAfold) but only on the overall composition of the molecule. This suggests that differential selection for composition in the different structural categories does not cause the differences in composition, but rather that they arise automatically from the process of RNA folding.
Are the different responses to overall GC content in paired and unpaired regions due to natural selection?
If the constraints on the composition of the bases in each structural component are not due to selection, the different responses of each category to overall changes in genomic GC content might not depend on selection either. Accordingly, we tested whether the slope of the regression line relating GC content in each structural component and in the rRNA molecule overall or in the coding sequences in the genome differed between the natural rRNA sequences and randomized sequences with the same composition.
Figure 6 (top) shows the known correlation between genomic GC content at the selectively neutral third codon position and GC content of the total ribosomal RNA (Muto and Osawa 1987). Positive correlations between the GC content of the third codon position in protein-coding regions and the GC content of paired and unpaired regions in rRNA have been observed in bacteria (N. Sueoka, pers. comm.). The same positive correlation holds true for each structural category individually, and the major difference in slope is between paired and unpaired elements. Graphs of the GC content of the ribosomal RNA versus the GC content in the different structural elements magnify these differences (Fig. 6, middle), since the values are now constrained by a sum, and, at low overall GC content, the composition of the stems is thus much closer to the composition of the unpaired regions than at high overall GC content. The slopes of the stems are much steeper than the slopes of the unpaired regions. There is no systematic distinction in slopes among loops, bulges, and junctions. We find that the correlations are positive for all structural elements (i.e., stem, loop, bulge, and junction) for both subunits and all domains. This means that there is no compensation in base composition across different structural elements.
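The slope comparison amounts to an ordinary least-squares regression of the GC content of one structural element on the GC content of the whole molecule, with one point per sequence. The sketch below shows that calculation under the assumption that each base already carries a structural label; the helper names (`gc_fraction`, `element_gc`, `least_squares_slope`) are hypothetical and are not the analysis code used in this study.

```python
def gc_fraction(seq):
    """Fraction of G and C in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def element_gc(seq, labels, element):
    """GC fraction of the bases whose structural label equals `element`.

    `labels` is assumed to hold one label per base (e.g. "stem", "loop").
    """
    bases = [b for b, lab in zip(seq, labels) if lab == element]
    if not bases:
        raise ValueError("no bases labeled %r" % element)
    return sum(b in "GC" for b in bases) / len(bases)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Here `xs` would hold the total GC content of each sequence and `ys` the GC content of one element in the same sequences. A slope above 1 means the element responds more strongly than the molecule as a whole, which is the pattern reported above for stems.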
We visualized these correlations in the tetrahedron by grouping sequences by GC content and color-coding them accordingly (Fig. 7A). A given color in each structural element thus refers to the same set of sequences, which are grouped by the GC content of the total molecule. The simplex gives us more information than the previously shown graphs in the sense that we can see the relative positioning of the sets with similar GC content in the different structural elements. The clusters of sequences with similar GC content are still distinct clusters in all structural elements.
Randomized sequences show strikingly similar patterns to the natural sequences (Fig. 7B). These sequences are constructed by calculating the base composition on 2% intervals on a line through the mean of the SSU bacteria, parallel to Chargaff's axis, creating 100 random sequences of length 1500 in each interval, folding the sequences with RNAfold, and applying the same classification as used throughout the analysis. The randomized sequences form very smooth distributions through composition space with seemingly mathematical precision. The clusters of sequences with the same GC content in their total molecule (dots in the graph) are visible as tight clusters in each structural category. This pattern (Fig. 7) has two implications: First, the base composition of structural categories is consistent at a given sequence composition, and second, similar base composition in the whole molecule implies similar base composition in each structural category.
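The construction of these sequences can be sketched as below, with two assumptions labeled plainly. First, moving parallel to Chargaff's axis is taken to mean adding half of the GC change to each of G and C and removing half from each of A and U, which leaves the purine-pyrimidine and amino-keto offsets unchanged. Second, bases are drawn independently with the target frequencies, whereas the original procedure may have fixed the counts exactly. The function names and the starting composition in the usage note are illustrative and are not values from the study.

```python
import random


def shift_gc(composition, target_gc):
    """Move a composition parallel to Chargaff's axis to a new GC content.

    `composition` maps "U", "C", "A", "G" to fractions that sum to 1.
    """
    delta = target_gc - (composition["G"] + composition["C"])
    shifted = {
        "U": composition["U"] - delta / 2,
        "C": composition["C"] + delta / 2,
        "A": composition["A"] - delta / 2,
        "G": composition["G"] + delta / 2,
    }
    if min(shifted.values()) < 0:
        raise ValueError("target GC content lies outside composition space")
    return shifted


def random_sequence(composition, length, rng):
    """Draw `length` bases independently with the given frequencies
    (an assumption; exact-count shuffling is an alternative)."""
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))
```

With a hypothetical mean composition such as `{"U": 0.20, "C": 0.23, "A": 0.25, "G": 0.32}`, calling `shift_gc` for target GC contents spaced 2% apart and then `random_sequence(..., 1500, rng)` 100 times per target would reproduce the sampling scheme in outline; folding each sequence with RNAfold is a separate step not shown here.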
Several mechanisms might influence these structure-dependent compositional biases. The first is purifying selection, which would cause the nucleotide composition of the whole sequence (and thus of all elements of the structure) to change in one direction by mutation, limited by the rate at which deleterious mutations are filtered out by selection. Purifying selection would explain the difference in slope between paired and unpaired elements of related and functional sequences in terms of different functional constraints for each structural element (Wang and Hickey 2002). For comparison, in coding sequences, the three codon positions have different rates of change in response to changes in genome GC content, which can be interpreted in terms of purifying selection (Muto and Osawa 1987; Sueoka 1988; Lobry and Sueoka 2002). However, the purifying selection model would predict that randomized sequences would show no difference in slopes between paired and unpaired regions, because they have no functions that need to be conserved and, in any case, share no evolutionary history.
Contrary to this prediction, we found that even randomized sequences have different rates of response to change in composition in each of the structural elements. Although it is possible that purifying selection accentuates these differences, much of the observed pattern can be attributed to the effects of folding, and claims about the extent of purifying selection based on these slopes (Wang and Hickey 2002) should be treated with caution. Purifying selection is not required to explain the compositional differences among stems, loops and bulges, although it may affect details of the slopes.
The second mechanism is adaptive (or positive) selection, which means selection in favor of a particular composition, presumably because the composition is required for function, such as GNRA tetraloops (Woese et al. 1990) and the other motifs described above. Selection for a particular sequence could in principle generate any possible composition, divided in any way among the structural components. In other words, the function of the ribosomal RNA might require more of certain bases in certain structural components, and this positive selection might generate the compositional differences (Lao and Forsdyke 2000; N. Sueoka, pers. comm.). We need adaptive selection to explain the existence of functional RNAs and many ubiquitous structural motifs, but we do not need it to explain the compositional biases, because they also occur in nonevolved, nonfunctional sequences as an effect of RNA folding. However, positive selection might explain the subtle deviations in real rRNAs from what is expected by chance based on the randomized sequences.
Although rRNA sequences are highly selected and conserved, the compositional biases are consistent with those in randomized sequences, suggesting that the compositional biases in all structural elements are inherent to any sequence with the same base composition. Thus, the major force behind the formation of structural biases appears to be what we call "self-organization," the intrinsic factors such as base-pairing and stacking that drive secondary structure formation.
What explains the trends in the composition of the structural elements?
Having demonstrated that the different structural components of rRNA differ in composition from one another in both subunits and all three domains of life and that these differences appear to be driven by the overall composition of the molecule, we tested which parameters affect the result. First, we investigated the accuracy of the RNA folding in terms of its ability to assign bases to the correct structural categories. In addition to the base composition of structural elements, we examined the fraction of bases in all categories. We analyzed the NA, NP, and RP structures. We found that the fraction of bases ending up in each structural feature is approximately the same for all domains of life and that there are consistent differences between large and small subunit sequences: SSU rRNA has a higher percentage of base pairs than LSU sequences do (Fig. 8, left). It seems that the amount of base-pairing differs between LSU and SSU sequences but that the remaining bases are divided almost equally over the loops, the bulges, and the junctions. On average, <4% of bases appear in ends and flexible regions. Figure 8 (middle and right) shows that computer predictions systematically result in too many base pairs and thus too few bases in the unpaired regions, which might account for the observed increase in variation for the NP structures. In addition, the predictions are similar for sequences with the lengths of either typical LSU or SSU sequences. There is no visible difference between the predictions of the natural and the randomized sequences, suggesting that the bias is due to the folding procedure rather than being sequence-specific. Although covariation methods, with which the annotated structures are predicted, can systematically underpredict base-pairing because they cannot detect pairing involving absolutely conserved positions, the magnitude of the change (>10% of the sequence is incorrectly predicted to be paired) is much greater than the error in the covariation structures.
Third, we tested whether the thermodynamic parameters affected the result. The energies for tetraloops and certain other "special" sequences used by RNAfold are calculated by using sequence databases that include rRNA sequences and might unfairly bias the structures for arbitrary sequences to resemble the structures for natural sequences in composition. However, repeating the analysis with the "-4" option in RNAfold, which eliminates the contribution of tetraloop energies, did not affect the compositions of the different structural components significantly.
We next tested whether the differences between the unpaired structural components arose simply from the difference in pairing strength between AU and GC base pairs. The RNAfold program provides an option to fold sequences by using the abstract "ABCD" alphabet, in which A pairs with B and C pairs with D and in which all kinds of base pairs have the same energy parameters for pairing and stacking. Repeating the analysis by translating the sequences into the ABCD alphabet and folding with the thermodynamic parameters for AU or GC pairs (i.e., all pairs were treated as AU, or all were treated as GC) gave strikingly different compositions from normal folding, in part because GU pairs could not be incorporated in this model. However, in all cases, the loops, the bulges, and the junctions differed from each other in composition. Reassuringly, sequences in which the meanings of the bases were permuted (e.g., U might be exchanged with C) gave symmetric patterns, indicating that whichever bases are in excess over the 1:1 purine:pyrimidine ratio required for stems will be found in the unpaired regions to a similar degree to the bases that were in excess in the original composition. In other words, when all bases have the same energies, any bases in excess will be found more frequently in the unpaired regions; however, when the thermodynamic parameters are taken into account, the identity of the bases matters because of differences in pairing and stacking energies.
These results suggest that the causes of differences among bulges, loops, and junctions are not related to their properties as parts of nucleic acid sequences per se but are rather a general property of the class of formal grammars that includes non-pseudo-knotted structures when applied to arbitrary character strings. The results also indicate that the null hypothesis for studies of composition should not be that all unpaired structural components are identical in composition.
Conclusions
We have demonstrated several important features of nucleotide composition patterns within ribosomal RNA. First, there are striking similarities in the composition of the different structural categories across both ribosomal subunits and the three domains of life, despite much evolutionary divergence. Second, randomized sequences appear almost identical to natural sequences in the composition of each structural component; furthermore, they show the same patterns of variation, even though these randomized sequences are not evolved and do not have biological functions. Third, the GC content in all structural categories is positively correlated with the GC content of the ribosomal RNA overall, and randomized sequences show similar correlations to the annotated sequences. Finally, the nucleotide composition of individual structural features proves robust over multiple randomizations, since clusters of sequences with similar base compositions yield consistent clusters for each structural element.
These results for randomized sequences emerged solely from the inherent features of RNA folding, as reproduced by the dynamic programming method and thermodynamic parameters used for energy minimization in RNAfold. Our conclusions thus depend on the ability of these algorithms to provide information about arbitrary sequences: Although the predictions are far from perfect, there is no reason to believe that they are biased in ways that would give the observed patterns as an artifact. The thermodynamic parameters are derived from melting experiments on oligonucleotides (Mathews et al. 1999), which are short sequences that are neither evolved nor biologically active. There is thus no reason to believe that the rules derived from experiments on them would apply only to biologically active sequences and not to arbitrary RNA sequences. The predictions also use special bonus energies for particular loop sequences, which are based on experimental data and supported by statistics on known RNA structures. These energies improve the predictions for natural RNA sequences that were not themselves used to derive the parameters (Mathews et al. 1999), and are thus likely to provide the best available estimate of the structures of arbitrary sequences. Changing details of the parameters, such as eliminating the bonuses for tetraloops (which are inferred from a database of structures), did not affect our results.
The computer predictions are sufficiently accurate to capture the features we examined: The predictions of the natural sequences closely resemble the patterns observed from annotated sequences. The predictions are very accurate at specifying whether bases are paired or unpaired (Mathews et al. 1999), suggesting that the composition of the stems is probably most accurate, although there is less accuracy in predicting the overall topology of the molecule (data not shown). The predictions are good enough to show the separation between the unpaired regions. However, this distinction is less sharp than in the annotated sequences, which might be due to some mixing of the unpaired categories.
The discovery of general rules that determine the amount of base-pairing and the nucleotide composition of a molecule will have important consequences for the accuracy of secondary structure prediction programs, such as BayesFold (Knight et al. 2004). If the compositional preferences we have demonstrated for rRNA generalize to other molecules, we may be able to assess the plausibility of a structure by asking whether the compositional patterns comply with the specific compositional statistics, thus improving the predictions. Specifically, a structure that reproduces typical compositional biases in the different structural elements is more likely to be correct. However, the similarities between the compositions in each component of the true structure and the structures predicted by current methods suggest that the power of this approach may be limited to eliminating the more egregious mispredictions. A more promising difference is in the amount of the sequence that is assigned to each structural category, which shows clear differences between the natural and the predicted structures. We should be able to compensate for the systematic deviations in current computer predictions, especially the excess of base pairs.
Because the constraints on the compositions of each structural component and the slopes of the compositional responses of each structural component to changes in overall and genome GC content are very similar, the null model for evolutionary studies of rRNA should not be that these components behave identically but rather that compositional differences would be expected even in random sequences. Our results suggest that only parts of the rRNA are under strong selection and that most of the molecule is able to change neutrally. Testing other classes of RNA that are under stronger selection, such as the 5S rRNA, may reveal cases where the change in each structural component does differ from what would be observed in random sequences of the same composition (and hence reveal the action of selection), but we see no evidence for these effects in rRNA.
MATERIALS AND METHODS
We obtained natural rRNA sequences, which could be used for computer predictions, by stripping out all gaps and secondary structure information from our annotated data. We created randomized versions of our annotated data by shuffling the natural sequences completely, using the Fisher-Yates shuffle algorithm as implemented in the random module of the Python standard library. In this way, all structural motifs are broken, but the overall base composition of the molecule is unaltered.
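The two preprocessing steps described above can be sketched as follows. The gap characters and the function names are assumptions made for illustration, since the text does not specify how gaps are encoded in the annotated data; `random.shuffle` from the Python standard library performs the shuffle.

```python
import random


def strip_gaps(aligned_sequence, gap_characters="-."):
    """Remove alignment gaps; the gap characters are an assumption."""
    return "".join(ch for ch in aligned_sequence if ch not in gap_characters)


def shuffle_sequence(sequence, rng=None):
    """Return a complete shuffle of `sequence`.

    Base composition is preserved exactly; only the order of bases changes.
    """
    rng = rng or random.Random()
    bases = list(sequence)
    rng.shuffle(bases)          # in-place shuffle from the standard library
    return "".join(bases)
```

Passing a seeded `random.Random` instance as `rng` makes a given randomization reproducible, which is convenient when the same shuffled sequences must be folded more than once.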
The structures associated with the rRNA sequences in the database are predicted by comparative sequence analysis. We refer to these structures as "annotated" because they are based on experimental evidence and have been compared to crystal structures. For randomized sequences there are no secondary structure models available. Because experimentally determining structures for these sequences is impossible, we used RNAfold from the Vienna RNA folding package (Hofacker et al. 1994), which implements the Zuker folding algorithm (Zuker and Stiegler 1981), to estimate an optimal secondary structure both for each natural sequence and for each permuted sequence.
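As a rough sketch of how such a prediction might be scripted, the code below calls the `RNAfold` command-line program and parses its output. It assumes that `RNAfold` is installed and on the search path, that a single unlabeled sequence is passed on standard input, and that the last output line has the form `structure (energy)`. The helper names `parse_rnafold_output` and `fold` are hypothetical, and this is not necessarily how the predictions in this study were run.

```python
import re
import subprocess


def parse_rnafold_output(text):
    """Extract (dot-bracket structure, energy) from RNAfold output for one sequence."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumed layout of the last line: "((((....)))) ( -1.20)"
    match = re.match(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s*$", lines[-1])
    if match is None:
        raise ValueError("unexpected RNAfold output")
    return match.group(1), float(match.group(2))


def fold(sequence):
    """Fold one sequence with the RNAfold program (must be installed separately)."""
    result = subprocess.run(
        ["RNAfold", "--noPS"],  # --noPS suppresses the PostScript drawing
        input=sequence + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rnafold_output(result.stdout)
```

Keeping the parser separate from the subprocess call means the parsing can be checked on saved output without the folding program being available.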
RNAfold returns the optimal structures in dot-bracket (or Vienna) format. In order to compare the annotated structures and the computer-predicted structures, we developed an algorithm to convert the distribution format from the database into the Vienna format. Based on the helix numbering, it finds the most likely pairs of upstream and downstream helix parts. We verify the actual base-pairing and solve the matching for helix parts that are incorrectly annotated or unannotated. Pseudoknots are discarded because the Vienna format cannot denote them, but because they comprise <2% of all base pairs in rRNA (Mathews et al. 1999), this limitation has little effect on our results.
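The dot-bracket notation itself is simple to read and write. The sketch below is not the conversion algorithm described above (which works from the database's helix numbering); it only illustrates the target format, using hypothetical helper names and 0-based positions.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted 0-based (i, j) base pairs for a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)  # remember the upstream partner
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


def pairs_to_dot_bracket(pairs, length):
    """Write nested base pairs as a dot-bracket string of the given length."""
    chars = ["."] * length
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)


def pairs_cross(first, second):
    """True if two pairs interleave, i.e. they would form a pseudoknot."""
    (i, j), (k, l) = sorted([first, second])
    return i < k < j < l


if __name__ == "__main__":
    print(dot_bracket_to_pairs("((..))"))
    print(pairs_cross((0, 5), (3, 8)))
```

Because a single bracket type can only express nested pairs, any two pairs for which `pairs_cross` is true cannot both be written in this format, which is why pseudoknots have to be dropped.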
The database contained 21,782 sequences. About 50% of these sequences were unusable: They contained too many undetermined positions (>50), had an odd number of helix parts, contained pairing helix parts of different lengths, etc. From the remaining 50% with good data, our conversion algorithm could reliably convert 10,254 structures into dot-bracket format, which corresponded to a data loss of 0.86% of the total number of sequences (Table 1). In our analysis, we focused on RNA from nuclear genomes: We included archaea, bacteria, and eukaryotes (263, 5530, and 3099 sequences, respectively; 8892 sequences in total).
Calculating and visualizing base composition
We calculated the base composition for each structural element by grouping all bases within a particular element together and counting the number of each of the four bases: U, C, A, and G. We normalized this composition vector by the number of residues in the element (N) in order to compare elements containing different numbers of bases. The base composition of any RNA sequence can be visualized in a tetrahedral unit simplex (Schultes et al. 1997; Fig. 2). In this unit simplex, the three pairwise combinations of bases define three orthogonal axes. For example, the amount of G + C defines a position along Chargaff's axis, where G = C and A = U. The two other axes are the purine-pyrimidine axis, plotting the amount of A + G versus C + U, and the amino-keto axis, plotting the amount of A + C versus G + U. The four bases form the four vertices; sequences containing more of a particular base lie closer to the vertex for that base.
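A minimal sketch of this calculation follows. It assumes that the bases of one structural element are supplied as a single string and that N counts only the four standard bases; the function names and axis labels are hypothetical.

```python
from collections import Counter

BASES = "UCAG"


def composition(bases):
    """Return the fraction of U, C, A, and G among the standard bases of an element."""
    counts = Counter(bases.upper())
    n = sum(counts[b] for b in BASES)  # N: assumed to exclude ambiguous symbols
    if n == 0:
        raise ValueError("element contains no standard bases")
    return {b: counts[b] / n for b in BASES}


def simplex_axes(comp):
    """Project a composition onto the three pairwise-combination axes."""
    return {
        "GC": comp["G"] + comp["C"],      # G + C versus A + U
        "purine": comp["A"] + comp["G"],  # A + G versus C + U
        "amino": comp["A"] + comp["C"],   # A + C versus G + U
    }


if __name__ == "__main__":
    comp = composition("GGCA")  # hypothetical toy element
    print(comp)
    print(simplex_axes(comp))
```

Each axis value lies between 0 and 1, and the complementary combination (for example A + U for the first axis) is one minus that value, so three numbers fully describe a four-base composition.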
For our particular analysis, we plotted seven dots for each sequence: six for the structural elements (stem, loop, bulge, junction, end, and flexible) and one for the overall base composition of the molecule. Plotting the base compositions for many RNA sequences allowed us to see the similarities or the differences among species, structural elements, ribosomal subunits, or domains of life.
We used the program MAGE (Richardson and Richardson 1992) to visualize the composition simplex. This program treats the three dimensions (A/N, C/N, G/N) as orthogonal axes and applies a distortion matrix to make them look like a tetrahedron. However, we could not use these distorted coordinates to calculate distances between points or samples. Therefore, we converted the coordinates by using combinations of the four bases as axes that form the orthogonal right-handed Cartesian coordinate system described above.
Testing whether samples are different
To test whether the difference in location between two samples was significant, we used Monte Carlo simulations. We compared the observed distance between two samples, i.e., the Euclidean distance between the means of the two samples, to the distribution of distances between many pairs of random samples resampled from the original data points. This technique does not depend on assumptions about the shape or variance of the underlying distributions.
To apply this technique, we first pooled the points in the two samples. Next, we randomly permuted the list of points and divided the list into two groups that contained the same number of points as the original samples. Finally, we compared the distance between the means of the randomized samples to the distance between the means of the original samples. We repeated this 10,000 times, except when a small preliminary sample was sufficient to show that the difference was not significant. The P-value is the number of randomizations in which the distance between the randomized samples was greater than or equal to the observed distance, divided by the number of randomizations. Any P-value ≤ 0.05 was considered significant.
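The resampling procedure can be sketched as follows. The sketch assumes each point is a tuple of coordinates and reads the P-value as the fraction of randomizations whose distance is at least as large as the distance between the original samples; the function names are hypothetical, and the early-stopping shortcut mentioned above is omitted.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def distance_between_means(sample_a, sample_b):
    """Euclidean distance between the means of two samples."""
    return math.dist(mean_point(sample_a), mean_point(sample_b))


def permutation_p_value(sample_a, sample_b, n_iter=10000, rng=None):
    """Fraction of random re-partitions at least as far apart as the original samples."""
    rng = rng or random.Random()
    observed = distance_between_means(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # random relabeling of the pooled points
        if distance_between_means(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return hits / n_iter
```

With very small samples the number of distinct partitions is limited, so the smallest attainable P-value is bounded by the sample sizes rather than by the number of iterations.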
Testing whether samples are similar
The Monte Carlo simulations sensitively reveal whether samples differ but cannot directly tell us which samples are similar. We needed to test whether patterns were more similar within structural categories, domains, or ribosomal subunits. For example, looking at this problem in only two dimensions, the data might be clustered as in Figure 9A. This figure shows a situation in which the strongest similarities are within each domain rather than within each subunit, suggesting that the domain is more important in determining composition. Alternatively, the data might be clustered as in Figure 9B, where subunit identity dominates the clustering. In the first case, the distances between points within a domain will on average be smaller than the distances between all combinations of two points. In the second case, the distances between points within a subunit will be smaller than the distances between all combinations of points. To generalize, points within a cluster will on average be closer than points chosen at random. We can compare these two populations of distances (within and between putative clusters) by using a one-tailed two-sample t-test: The lower the P-value, the greater the significance of the relationship represented by the clustering.
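The comparison of within-cluster and between-cluster distances might look like the sketch below. It assumes one label per point for the putative clustering (for example the domain) and uses the unequal-variance (Welch) form of the t statistic, which is an assumption because the text does not state which variant was used. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def split_distances(points, labels):
    """Split all pairwise Euclidean distances into within- and between-cluster lists."""
    within, between = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        target = within if label_p == label_q else between
        target.append(math.dist(p, q))
    return within, between


def welch_t(sample_x, sample_y):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    vx = variance(sample_x) / len(sample_x)
    vy = variance(sample_y) / len(sample_y)
    t = (mean(sample_x) - mean(sample_y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (
        vx ** 2 / (len(sample_x) - 1) + vy ** 2 / (len(sample_y) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical two-dimensional points forming two tight clusters.
    points = [(0.0, 0.0), (0.0, 1.0), (0.5, 0.5), (10.0, 0.0), (10.0, 1.0), (9.5, 0.5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    within, between = split_distances(points, labels)
    print(welch_t(within, between))
```

A strongly negative t value indicates that within-cluster distances are smaller than between-cluster distances. The one-tailed P-value would then be read from a t distribution with `df` degrees of freedom, for instance with a statistics package; that step is left out to keep the sketch dependency-free.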
We also applied this method to confirm the visual similarities between annotated and computer-predicted structures. This gives three times as many samples as above, thus 2556 distances in the full sample. We made three subsets, each time within a subunit, domain, and structural category, but across structure type (NA, NP, and RP). Each of the subsets contained 24 distances. The distributions of distances are visualized with histograms and compared with a one-tailed two-sample t-test.
| ACKNOWLEDGMENTS |
|---|
| Footnotes |
|---|
Received August 3, 2005; accepted October 15, 2005.
| REFERENCES |
|---|
Bernstein, F., Koetzle, T., Williams, G., Meyer Jr., E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T., and Tasumi, M. 1977. The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.
Cannone, J., Subramanian, S., Schnare, M., Collett, J., D'Souza, L., Du, Y., Feng, B., Lin, N., Madabusi, L., Muller, K., et al. 2002. The Comparative RNA Web (CRW) Site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 3: 2.
Cate, J., Gooding, A., Podell, E., Zhou, K., Golden, B., Szewczak, A., Kundrot, C., Cech, T., and Doudna, J. 1996. RNA tertiary structure mediation by adenosine platforms. Science 273: 1696-1699.
Doherty, E.A., Batey, R.T., Masquida, B., and Doudna, J.A. 2001. A universal mode of helix packing in RNA. Nat. Struct. Biol. 8: 339-343.
Elson, D. and Chargaff, E. 1955. Evidence of common regularities in the composition of pentose nucleic acids. Biochim. Biophys. Acta 17: 367-376.
Galtier, N. and Lobry, J. 1997. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol. 44: 632-636.
Gutell, R.R., Cannone, J.J., Shang, Z., Du, Y., and Serra, M.J. 2000. A story: Unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304: 335-354.
Gutell, R., Lee, J., and Cannone, J. 2002. The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12: 301-310.
Guy, L. and Roten, C. 2004. Genometric analyses of the organization of circular chromosomes: A universal pressure determines the direction of ribosomal RNA genes transcription relative to chromosome replication. Gene 340: 45-52.
Hofacker, I., Fontana, W., Stadler, P., Bonhoeffer, L., Tacker, M., and Schuster, P. 1994. Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125: 167-188.
Knight, R., Freeland, S., and Landweber, L. 2001. A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol. 2: http://genomebiology.com.
Knight, R., Birmingham, A., and Yarus, M. 2004. Bayesfold: Rational 2° folds that combine thermodynamic, covariation, and chemical data for aligned RNA sequences. RNA 10: 1323-1336.
Lao, P. and Forsdyke, D. 2000. Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res. 10: 228-236.
Lescoute, A., Leontis, N.B., Massire, C., and Westhof, E. 2005. Recurrent structural RNA motifs, isostericity matrices, and sequence alignments. Nucleic Acids Res. 33: 2395-2409.
Lobry, J. 1997. Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 205: 309-316.
Lobry, J. and Sueoka, N. 2002. Asymmetric directional mutation pressures in bacteria. Genome Biol. 3: http://genomebiology.com.
Mathews, D.H., Sabina, J., Zuker, M., and Turner, D. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288: 911-940.
Molinaro, M. and Tinoco Jr., I. 1995. Use of ultra stable UNCG tetraloop hairpins to fold RNA structures: Thermodynamic and spectroscopic applications. Nucleic Acids Res. 23: 3056-3063.
Muto, A. and Osawa, S. 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. 84: 166-169.
Nissen, P., Ippolito, J., Ban, N., Moore, P., and Steitz, T. 2001. RNA tertiary interactions in the large ribosomal subunit: The A-minor motif. Proc. Natl. Acad. Sci. 98: 4899-4903.