BIOINFORMATICS |
1 Laboratory for Bioinformatics, Institute for Advanced Biosciences, Keio University, Fujisawa, Kanagawa 252-8520, Japan
2 Department of Environmental Information, Keio University, Fujisawa, Kanagawa 252-8520, Japan
Reprint requests to: Takanori Washio, Laboratory for Bioinformatics, Institute for Advanced Biosciences, Keio University, Fujisawa, Kanagawa 252-8520, Japan; e-mail: washy{at}sfc.keio.ac.jp; fax: +81 (466) 47-5099.
| ABSTRACT |
|---|
Keywords: alternative splicing; exonic splicing enhancers; ESEs; splice-site strength
| INTRODUCTION |
|---|
Numerous studies have reported on the consequences of single-nucleotide mutations in donor and acceptor site motifs for pre-mRNA splicing. In the human ATP7A gene, for example, mutation in the invariant donor splice site of a constitutive exon causes complete skipping of the exon, which leads to severe Menkes disease (Moller et al. 2000). Similarly, information-theory-based analysis of splice sites in the human XPC gene has indicated that a strong acceptor site hosts few alternative splicing events whereas a weak acceptor site has frequent exon skipping (Khan et al. 2002). Another recently reported study, however, has indicated that donor sites and acceptor sites themselves contain insufficient information for accurate splicing in higher eukaryotes (Lim and Burge 2001). Therefore, in addition to splice-site strength, it is necessary to consider auxiliary factors for an understanding of alternative splicing regulation.
SR proteins and exonic splicing enhancers (ESEs) play an important role in the regulation of alternative splicing. An SR protein binds to an ESE through its RNA-recognition motifs (RRMs) and contacts the components of the spliceosome through its RS domain. In the human SMN2 gene, for example, point mutations to ESE motifs on exon 7 can disrupt an ASF/SF2 (alternative splicing factor/splicing factor 2)-dependent regulation of alternative splicing and lead to inefficient inclusion of exon 7 (Cartegni and Krainer 2002). Although many SR-protein-binding site motifs have been identified by systematic evolution of ligands by exponential enrichment (SELEX), the distribution of those regulatory sequences in alternative and constitutive exons has remained uncertain.
Although both splice-site strength and exonic regulatory sequences have been reported to be involved in regulation of alternative splicing, most of these studies have been performed on individual genes. Thus, it has been impossible to determine whether the reported feature is (1) specific to the reported gene, (2) specific to the reported species, or (3) universal to all species. On the other hand, large-scale bioinformatics studies have contributed greatly to a general understanding of alternative splicing. Computational and statistical analyses of human genes have identified 10 ESE motifs that all showed enhancer activity in humans (Fairbrother et al. 2002). Another computational analysis of the mouse transcriptome has identified sequence motifs enriched around the donor and acceptor sites in both constitutive and alternative exons (Zavolan et al. 2002). Thus, large-scale bioinformatics analyses of more than 1000 genes can present a broad view of alternative splicing regulatory mechanisms.
In the present study, we have compared alternative and constitutive exons in terms of splice-site strength and frequency of potential regulatory sequences. We have applied an information-theory-based approach for computing splice-site strength, and a phylogenetic footprinting approach for predicting potential regulatory sequences. In addition, we have compared these features among mammals (Homo sapiens and Mus musculus), plants (Arabidopsis thaliana and Oryza sativa), and Drosophila melanogaster. Based on the results from this comparative analysis of alternative and constitutive exons, we have delineated a potential model for alternative splicing regulatory mechanisms.
| RESULTS |
|---|
Information contents of positions -13 to -8 upstream of the acceptor site were higher in mammals than in plants and D. melanogaster (Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/). Information contents between positions +3 and +6 downstream from the donor site were higher in mammals and D. melanogaster than in plants. A chi-squared test (p < 0.05, df = 8) indicated that plants had significantly higher information contents between positions +7 and +15 downstream from the donor site than the other species.
To compare splice-site strength in constitutive and alternative exons, we applied the individual information (Ri) technique. Individual information contents (Ri, bits) can be computed by adding up information content of given nucleotides from individual positions, using the weight matrix generated from the frequency of each nucleotide at each position. We computed the individual information content of each individual splice site by summing information contents of positions -3 to +7 for donor sites and positions -13 to +1 for acceptor sites. All of the weight matrices and individual information contents of all splice sites are available as supplementary material online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/.
Based on the individual information content of all splice sites, we then computed the average individual information contents (Ri) of splice sites in constitutive and alternative exons (Fig. 2; Table 3). Student's unpaired t test (p < 0.05, two-sided) indicated that average individual information contents of alternative splice sites were significantly lower than those of constitutive splice sites in H. sapiens, M. musculus, O. sativa, and D. melanogaster. Although we observed comparatively low information contents of A. thaliana alternative exons, the number of alternative exons (25 exons) was insufficient to draw a statistically significant conclusion.
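As an illustration of the comparison above, the following minimal sketch computes the pooled-variance (Student's) t statistic for two samples of Ri values. The function name `student_t` and the sample values are hypothetical; converting the statistic into a p value requires a t distribution table or a statistics library and is left out here.

```python
import math
import statistics

def student_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for Student's unpaired t test.

    Assumes equal variances in both groups (pooled variance), which is
    what distinguishes Student's test from Welch's test.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se
    return t, df

# Hypothetical Ri values (bits) for alternative and constitutive splice sites.
alternative_ri = [6.1, 7.4, 5.9, 8.0, 6.6]
constitutive_ri = [8.9, 9.4, 8.1, 10.2, 9.0]
t_value, dof = student_t(alternative_ri, constitutive_ri)
```

A negative t value here means that the first sample (alternative sites) has the lower mean.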
These differences in average individual information contents were further assessed with the method of Shapiro and Senapathy, which scores splice-site strength based on the percentage of each nucleotide at each position (Shapiro and Senapathy 1987). We applied Shapiro's method to positions -3 to +7 for donor sites and positions -13 to +1 for acceptor sites, computed Shapiro's score of individual splice sites, and compared the average score of splice sites in constitutive and alternative exons (Table 3). Differences between these two averages were assessed with Student's unpaired t test (p < 0.05, two-sided). Because we obtained the same results as with the individual information technique, this analysis further supports our two main observations: that alternative exons have significantly weaker splice sites than constitutive exons, and that plants have significantly weaker splice sites than mammals.
Distribution of nucleotide substitution rates in the 5' end/3' end of evolutionarily conserved exons (constitutive + alternative exons)
We applied a phylogenetic footprinting approach for analyzing potential regulatory sequences in exons. Phylogenetic footprinting is a well-known approach to predicting regulatory sequences, by which unusually well-conserved sequences among a set of orthologous genes are extracted as candidates for functional regulatory elements (Blanchette and Tompa 2002). To collect orthologous genes, we mapped M. musculus and O. sativa full-length cDNA clones to H. sapiens and Triticum aestivum TIGR gene indices and full-length cDNAs, respectively. A. thaliana and D. melanogaster were excluded from further analyses because of the small sample size of alternative exons. As a result, we extracted 1008 constitutive exons and 413 alternative exons from the M. musculus–H. sapiens comparison, and 275 constitutive exons and 96 alternative exons from the O. sativa–T. aestivum comparison. There were originally 6659 constitutive exons and 1054 alternative exons in M. musculus and 1241 constitutive exons and 558 alternative exons in O. sativa. The numbers of constitutive and alternative exons were substantially reduced because exons were excluded from the data sets when orthologous genes were not found in the corresponding species, when the exons were not evolutionarily conserved, or when they had different splice-site boundaries. All of the evolutionarily conserved constitutive and alternative exons are available as supplemental material online.
Evolutionarily conserved alternative exons and constitutive exons in orthologous genes were aligned using CLUSTAL-W (Thompson et al. 1994). The results of comparisons between M. musculus and H. sapiens, and O. sativa and T. aestivum (wheat) are shown in Figure 3 as distributions of nucleotide substitution rates at the 5' ends and 3' ends of evolutionarily conserved exons (constitutive and alternative). Note that data sets of constitutive and alternative exons were combined in order to see the general features of M. musculus and O. sativa evolutionarily conserved exons. Figure 3 shows a "spike" pattern, in which nucleotide substitution rates were considerably higher at 3n bp from acceptor sites and at 3n + 1 bp from donor sites than at other positions; this might be due to the difference between synonymous and non-synonymous substitutions, because the length of an exon is a multiple of three in most cases (Tomita et al. 1996). Based on this observation, we additionally drew the average nucleotide substitution rate of positions 3n, 3n + 1, and 3n + 2 of the "spikes" in Figure 3.
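The per-phase averaging described above can be sketched as follows. This is a minimal illustration, assuming that substitution rates are given as a list indexed by distance from the splice site (1 bp first); the function name `phase_averages` and the input values are hypothetical.

```python
def phase_averages(rates):
    """Average substitution rate at distances 3n, 3n + 1, and 3n + 2.

    rates[0] is the rate at distance 1 from the splice site,
    rates[1] at distance 2, and so on.
    """
    labels = ("3n", "3n+1", "3n+2")
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for distance, rate in enumerate(rates, start=1):
        phase = distance % 3  # 0 -> 3n, 1 -> 3n + 1, 2 -> 3n + 2
        sums[phase] += rate
        counts[phase] += 1
    return {labels[p]: sums[p] / counts[p] for p in range(3) if counts[p]}

# Hypothetical rates with a spike at every third position.
example = phase_averages([0.1, 0.1, 0.4, 0.1, 0.1, 0.4])
```

With these made-up values, the "3n" average is the highest, which is the kind of spike pattern the text describes.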
Comparison of potential regulatory sequences on alternative exons and constitutive exons
Constitutive exons and alternative exons were then compared in terms of nucleotide substitution rates. A Kolmogorov–Smirnov test indicated that the nucleotide substitution rates of alternative exons were significantly lower than those of constitutive exons in M. musculus (Fig. 4). Student's unpaired t test (p < 0.05, two-sided) confirmed the significantly lower nucleotide substitution rates of alternative exons in M. musculus. The same trend was observed in the O. sativa–T. aestivum comparison, but the difference was not statistically significant because of the small sample size (96 data sets) of alternative exons (data not shown). Based on these results, we observed that alternative exons contain more evolutionarily conserved sequences than do constitutive exons.
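For readers unfamiliar with the Kolmogorov–Smirnov test, the sketch below computes the two-sample statistic D, the largest gap between the empirical cumulative distributions of two samples. It uses only the standard library; the function name `ks_statistic` and the per-exon rates are hypothetical, and the significance threshold for D is not computed here.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        f_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

# Hypothetical per-exon substitution rates.
alternative_rates = [0.1, 0.2, 0.3, 0.4]
constitutive_rates = [0.3, 0.4, 0.5, 0.6]
d_value = ks_statistic(alternative_rates, constitutive_rates)
```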
In M. musculus, purine-rich motifs were frequently observed to exist only in alternative exons, whereas only a few such motifs occurred in constitutive exons (Table 4). In M. musculus, constitutive exons had more than twice as many data sets (1008 exons) as alternative exons (413 exons), but a larger number of potential regulatory sequences was extracted from alternative than from constitutive exons. To compute the level of conservation of these potential regulatory sequences, we divided the number of evolutionarily conserved heptamers by the total number of heptamers observed in the exons. Notably, the level of conservation of these heptamers was higher in alternative exons than in constitutive exons, both in M. musculus and O. sativa.
| DISCUSSION |
|---|
Stamm et al. (1994) have created a database of alternatively spliced exons in neurons and indicated that the donor site consensus (CAG|GTAAGT) is present 40% less often in neuron-specific alternative exons than in constitutive exons. This study seems to suggest that alternative exons possess weaker splice sites than constitutive exons do. However, a later study by Rogan and Schneider (1995) stated that splice-site sequences that deviate from the consensus do not necessarily produce significantly lower amounts of spliced mRNA. In contrast, the information-theory-based method does not destroy subtle details of splice-site sequences. Furthermore, the total information at the splice sites can be easily determined by use of the individual information (Ri) technique, which adds up the information from individual positions.
The individual information technique and Shapiro's method were applied to compare the splice-site strength in constitutive and alternative exons. Both methods obtained exactly the same results, that alternative exons have significantly weaker splice sites than constitutive exons. Because alternative exons contain weaker splice sites than constitutive exons, weak splice sites must be one of the fundamental factors of alternative splicing regulation and a universal characteristic of alternative exons in H. sapiens, M. musculus, A. thaliana, O. sativa, and D. melanogaster.
3' ends of exons are more evolutionarily conserved
We observed a gradual increase of nucleotide substitution rates with increasing distance from splice sites in both M. musculus and H. sapiens but not in O. sativa or T. aestivum (Fig. 3). As indicated by Blanchette and Tompa (2002), it is highly possible that those evolutionarily conserved sequences among a set of orthologous genes are potential regulatory sequences. A number of experimental studies have identified potential regulatory sequences of pre-mRNA splicing, mostly in individual genes. Although these studies may have provided an excellent novel feature of those potential regulatory sequences, most of them have not shown where those regulatory sequences are most likely to be found.
Several statistical analyses of our results validated that nucleotide substitution rates at positions -25 to -5 from the donor site are significantly lower than at other regions in the M. musculus–H. sapiens comparison (Fig. 3A). It is reasonable to suggest that evolutionarily conserved sequences, possibly some regulatory sequences, are more likely to be found at the 3' end of an exon. Although we adopted the same approach and statistical validation methods in the O. sativa–T. aestivum comparison, no remarkable feature was observed in the distribution of nucleotide substitution rates (Fig. 3B). Here, we have observed potential differences between mammals (M. musculus) and plants (O. sativa) in the distribution of evolutionarily conserved sequences, possibly including some regulatory sequences; potential regulatory sequences are more likely to be found at the 3' end of exons in mammals, but not in plants.
Alternative exons have more evolutionarily conserved sequences than constitutive exons
In the M. musculus–H. sapiens comparison, both the KS test and t test indicated that alternative exons contain more evolutionarily conserved sequences than constitutive exons (Fig. 4). This observation has led to our hypothesis that alternative exons have more potential regulatory sequences than constitutive exons, because selective pressure causes functional elements to evolve at a slower rate than that of nonfunctional sequences (Blanchette and Tompa 2002). A recent interesting finding on the evolution of alternative exons can further support our hypothesis.
Modrek and Lee (2003) have indicated that the inclusion level of an exon in H. sapiens ESTs is highly correlated with that in M. musculus, and implied that the evolutionarily conserved alternative exons are similarly regulated in both organisms. This observation may further support our hypothesis that alternative exons have more potential regulatory sequences than constitutive exons; if the evolutionarily conserved exons are similarly regulated, potential regulatory sequences on the exons are more likely to be conserved as well.
Modrek and Lee (2003) have also suggested that when an exon is present in the ortholog of one genome but not the other, there could have been exon creation or exon loss during the evolution of that genome; thus, alternative splicing is closely associated with exon creation and/or exon loss during evolution. Another study by Sorek et al. (2004) has proposed that the conservation of alternative exons in more than one species indicates the functional importance of those alternative exons, in comparison with nonconserved alternative exons. We extracted evolutionarily conserved alternative exons from our M. musculus–H. sapiens comparison and our O. sativa–T. aestivum comparison; as such, those evolutionarily conserved alternative exons must have been created before the branching of M. musculus and H. sapiens or O. sativa and T. aestivum and most likely have some functional importance.
Alternative exons have more potential regulatory sequences than constitutive exons
We compiled large numbers of both constitutive and alternative exons and discovered that alternative exons have more evolutionarily conserved sequences than constitutive exons (Fig. 4), and that alternative exons contain more purine-rich potential regulatory sequences than constitutive exons (Table 4). It must be admitted that 13 cases out of a data set of 413 M. musculus alternative exons might be rather few. However, we believe that this small number is due to our strict requirement that only heptamers with exact matches be included. It is well known that although SR proteins have distinct RNA-binding specificities, the consensus sequences that they recognize are rather degenerate (Graveley 2000). As such, if we included weaker matches, the number might increase substantially. In addition, it is unlikely that those heptamers were extracted by chance, because we used a second-order Markov model to compute the expected value of the heptamers, and extracted the heptamers whose O/E value is greater than 1.5. Although a number of potential SR-protein-binding sites have been compiled in the literature, there has been no knowledge of how those sequences are distributed between alternative and constitutive exons. Thus, it has been difficult to present a broad view of potential regulatory sequences in the regulation of alternative splicing.
In the M. musculus–H. sapiens comparison, both the KS test and t test indicated that alternative exons contain more evolutionarily conserved sequences than constitutive exons (Fig. 4). This observation has led to our hypothesis that alternative exons have more potential regulatory sequences than constitutive exons. To assess this hypothesis, we extracted potential regulatory sequences in both alternative and constitutive exons (Table 4). In M. musculus, constitutive exons had more than twice as many data sets (1008 exons) as alternative exons (413 exons), but a larger number of potential regulatory sequences was extracted from alternative exons. It is also worth noting that the level of conservation of those heptamers was greater in alternative exons than in constitutive exons. Alternative exons in M. musculus contained a number of purine-rich motifs; with a few exceptions, most ESE motifs are reported to be purine rich. Notably, some of these predicted potential regulatory sequences on alternative exons have a high sequence similarity with the binding site for ASF/SF2, which is RGAAGAAC (Tacke and Manley 1995), with the binding site for Tra2, which consists of GAA repeats (Tacke et al. 1998), and with a predicted RESCUE-ESE motif, GAAGAA (Fairbrother et al. 2002). Only a few purine-rich motifs were found in constitutive exons. In O. sativa, despite the small sample size (96 alternative exons), we also observed a lower nucleotide substitution rate in alternative than in constitutive exons and a higher level of conservation of the heptamers in alternative exons than in constitutive exons (Table 4). Based on these results, we concluded that another important factor of alternative splicing is the presence of more purine-rich regulatory sequences than are present in constitutive exons.
Plant–mammal comparison of various splicing regulatory factors
In our comparative analyses of plants and mammals, our two main observations were that plants have significantly higher information contents between positions +7 and +15 downstream from the donor site than other species (Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/), and that plants have significantly weaker splice sites than mammals (Fig. 2). McCullough et al. (1993) have suggested that the donor sites located at transition regions from GC- to AT-rich sequences are preferentially selected in the pea RBCS3A gene. Again, this reported characteristic of plant introns was obtained from experimental validation of only one pea gene; thus, there has been no way to distinguish whether this characteristic is (1) specific to the RBCS3A gene, (2) specific to peas, or (3) universal in plants. Furthermore, the fundamental role of this compositional bias in plant splicing has remained unclear.
Our comparative analysis of plants and mammals may allow us to present a model for plant splicing. The higher information contents in positions +7 to +15 downstream from the donor site possibly represent the compositional bias (i.e., AT richness) of plant introns (Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/). Also, the lower information contents of positions +3 to +6 downstream from the donor site and those of positions -13 to -8 upstream of the acceptor site possibly have resulted in weak donor and acceptor sites, respectively (Fig. 2; Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/). Those observations have been obtained from comparative analyses of more than 1000 splice sites from five different species, and are verified by statistical tests. Thus, we can suggest a model for plant splicing: Plants have a strong compositional bias in their introns to support the relatively weaker splice sites.
In addition to the strong compositional bias and weaker splice sites, we have observed potential differences between mammals (M. musculus) and plants (O. sativa) in the distribution of evolutionarily conserved sequences; we observed a gradual increase of nucleotide substitution rates with increasing distance from splice sites in M. musculus but not in O. sativa (Fig. 3). This result might be due to a compositional bias of plant introns, in that the splice sites located at transition regions from GC- to AT-rich sequences are preferentially selected in plant splicing. Also, although purine-rich motifs were found only in the alternative exons as potential regulatory sequences in M. musculus, several purine-rich motifs were found in both the constitutive and alternative exons in O. sativa (Table 4). This result is most likely due to the rather small sample size (96 alternative exons). However, these features of potential regulatory sequences might be potential differences between mammals and plants, because there are several plant-specific requirements for pre-mRNA splicing, such as weaker splice sites and a compositional bias in their introns.
A "weaker/more" combinatorial model of alternative splicing regulatory mechanisms
Our main discoveries have been that alternative exons have weaker splice sites than constitutive exons, and that alternative exons have more potential regulatory sequences than constitutive exons. These observations raise one important question: Is the weakness of the splice sites connected to the greater number of potential regulatory sequences? A recent experimental study reported a fascinating observation regarding this issue. Cystic fibrosis transmembrane regulator (CFTR) has an alternative exon (exon 12) whose acceptor site has a relatively weak consensus (AAG|GTATGA). Pagani et al. (2003) have shown that a point mutation at position +4 downstream from the consensus can strengthen the acceptor site, which leads the alternative exon to be expressed constitutively. In addition, they have confirmed that a point mutation in the exonic regulatory sequence (GGATAC) of the alternative exon results in a severe splicing defect, which, surprisingly, increases the exclusion of the alternative exon from the mRNA transcript. Pagani et al.'s study indicates that both weak splice sites and exonic regulatory sequences on the alternative exon are indispensable to the alternative regulation of exon 12 in the CFTR gene. Once again, this excellent observation was obtained from just one individual gene and does not present a broad view of alternative splicing regulation.
We have indicated that alternative exons have (1) weaker splice sites and (2) more potential exonic regulatory sequences than constitutive exons. It is reasonable to suggest that the fundamental role of weak splice sites in alternative exons is to have the flexibility to be included or excluded from mature mRNA transcripts; if the splice sites were stronger, the exon would lose this flexibility and express constitutively. Also, such alternative exons may have more potential regulatory sequences to be regulated in a cell-type specific manner because SR proteins, which are known to bind to such regulatory sequences, also express in a cell-type-specific manner. Taking together all of these observations, we propose a "weaker/more combinatorial model" as a potential model of alternative splicing regulatory mechanisms: This model suggests that alternative exons contain weaker splice sites to be regulated alternatively by more potential regulatory sequences on the exons.
We applied our weaker/more combinatorial model of alternative splicing regulatory mechanisms to M. musculus and O. sativa genes. Figure 5A illustrates the gene structure and multiple sequence alignment of the alternative exon for M. musculus immunoglobulin, conserved among M. musculus (mouse), H. sapiens (human), chicken, and cattle. Immunoglobulin is well known to function as an antibody, and has five major classes with distinct functions in immune response. We observed potential regulatory sequences of alternative splicing, AAGAAGA and AGAAGAA, both of which have been evolutionarily conserved more frequently than expected in other alternative exons of M. musculus (boxed in Fig. 5A). As we have described above, the purine-rich motif AGAAGAA has remarkably high sequence similarity with the binding site of ASF/SF2. In addition, information contents of the donor site (9.02 bits) and acceptor site (-1.30 bits) in the alternative exon are lower than those of the donor site (10.42 bits) and acceptor site (9.98 bits) in the adjacent constitutive exons.
Our large-scale comparative analyses and statistical validation of more than 1000 alternative exons have provided substantial evidence for our weaker/more combinatorial model of alternative splicing regulation. Taking together all the observations above and solid statistical validations of our results, we can conclude that alternative exons contain weaker splice sites in order to be regulated alternatively by potential regulatory sequences, which are found more frequently in alternative exons than in constitutive exons.
For the past several years, the regulatory functions of many splicing regulatory sequences in individual genes have only been anecdotally reported. We have applied, for the first time, comparative analyses of various transcriptomes to delineate a potential model of alternative splicing regulatory mechanisms. Our bioinformatics approach may thus represent the best model for transcriptome analysis of alternative splicing.
| MATERIALS AND METHODS |
|---|
Because it is difficult to identify where potential regulatory sequences exist in exons with different donor/acceptor sites and thus impossible to compare them with constitutive exons, we focused only on cassette exons for our analysis (Table 2); we defined an alternative exon as a "cassette exon" and a constitutive exon as "an internal exon that is conserved across the entire transcript in a splice variant cluster".
Information content of constitutive and alternative splice sites
Information content on splice sites was calculated based on Shannon's information theory (Stephens and Schneider 1992). We computed the "uncertainty" at a position by the following equation:
$$H(l) = -\sum_{b \in \{A, C, G, T\}} f(b, l)\,\log_2 f(b, l)$$
where f(b, l) is the probability of base b at position l.
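The uncertainty calculation can be illustrated with a short sketch. It assumes the standard Shannon form, in which the uncertainty at a position is the negative sum of f(b, l) log2 f(b, l) over the four bases, and it takes the information content as 2 bits minus that uncertainty, omitting any small-sample correction. The function names and the four example donor sites are hypothetical.

```python
import math
from collections import Counter

def position_uncertainty(column):
    """Shannon uncertainty (bits) of one alignment column, e.g. "GGGA"."""
    counts = Counter(column)
    total = sum(counts.values())
    h = 0.0
    for n in counts.values():
        f = n / total  # f(b, l): frequency of base b at this position
        h -= f * math.log2(f)
    return h

def position_information(column):
    """Information content (bits): 2 minus uncertainty.

    Assumption: no small-sample correction is applied.
    """
    return 2.0 - position_uncertainty(column)

# Hypothetical aligned donor sites (positions -3 to +6).
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTATGG"]
columns = ["".join(site[i] for site in sites) for i in range(len(sites[0]))]
info = [position_information(col) for col in columns]
```

In this toy alignment the invariant G and T of the donor dinucleotide carry the full 2 bits, whereas variable positions carry less.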
To compare the average information contents of constitutive and alternative splice sites, we applied the individual information (Ri) technique (Schneider 1997). We first generated an individual information weight matrix from the frequencies of each nucleotide at each position for each of five species. All of the weight matrices are available as supplemental online material. The individual information weight matrix can be calculated by the following equation:
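$$R_{iw}(b, l) = 2 - \left(-\log_2 f(b, l) + e(n(l))\right)$$
where e(n(l)) is the small-sample correction for the n sequences observed at position l (Schneider 1997).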
The information content of each individual splice site was calculated by summing Riw(b,l) of the specified positions. We added Riw(b,l) of positions -3 to +7 for donor sites, because information contents were observed to be saturated downstream from position -3 and upstream of position +7 (Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/). Similarly, we summed Riw(b,l) of positions -13 to +1 for acceptor sites to observe the potential differences between plants and mammals downstream from position -13, and because information contents seemed to be saturated downstream from position +1 (Supplemental Fig. 1, available online at http://www.bioinfo.sfc.keio.ac.jp/research/intron/). All individual information contents of all splice sites are available as supplemental online material. Using the information contents of each individual splice site, we then computed the average individual information content of alternative and constitutive splice sites (Fig. 2). Differences between these two averages were assessed with Student's unpaired t test (p < 0.05, two-sided).
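The summation of weight-matrix entries into a per-site Ri value can be sketched as below. This is an illustration under stated assumptions, not the procedure used to produce the reported values: the weight is taken as 2 + log2 f(b, l) without a small-sample correction, a pseudocount (an assumption introduced here) avoids taking the logarithm of zero, and the function names and training sites are hypothetical.

```python
import math

BASES = "ACGT"

def weight_matrix(sites, pseudocount=1.0):
    """Build Riw(b, l) = 2 + log2 f(b, l) from aligned splice sites.

    Assumptions: no small-sample correction; the pseudocount is added to
    every base count so that unseen bases get a finite (negative) weight.
    """
    matrix = []
    for l in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[l]] += 1
        total = sum(counts.values())
        matrix.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return matrix

def individual_information(site, matrix):
    """Ri (bits) of one site: sum of Riw(b, l) over its positions."""
    return sum(matrix[l][base] for l, base in enumerate(site))

# Hypothetical training set of donor sites.
training = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTATGG"]
riw = weight_matrix(training)
strong = individual_information("CAGGTAAGT", riw)
weak = individual_information("TTTACCCCC", riw)
```

A site that matches the frequent bases scores higher than one that deviates at most positions, and Ri can be negative for very poor matches.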
Shapiro's score of constitutive and alternative splice sites
Shapiro and Senapathy (1987) have developed a method to score the strength of a splice site based on percentages of each nucleotide at each position. Shapiro's score of a donor site is 100 * (t - min)/(max - min), where t is the sum of percentages at positions -3 to +7, min is the sum of the lowest percentages at positions -3 to +7, and max is the sum of the highest percentages at positions -3 to +7. On the other hand, Shapiro's score of an acceptor site is 100 * ((t1 - l1)/(h1 - l1) + (t2 - l2)/(h2 - l2))/2, where t1 is the sum of the best 8 of 10 percentages at positions -13 to -4, l1 is the sum of the lowest 8 of 10 percentages at positions -13 to -4, h1 is the sum of the highest 8 of 10 percentages at positions -13 to -4, t2 is the sum of percentages at positions -3 to +1, l2 is the sum of the lowest percentages at positions -3 to +1, and h2 is the sum of the highest percentages at positions -3 to +1.
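The donor-site formula above can be written out as a small function. This is a minimal sketch: the function name `donor_score` is hypothetical, and the three-position percentage table is invented for illustration only (a real table would cover positions -3 to +7 and come from observed splice sites).

```python
def donor_score(site, percentages):
    """Donor score: 100 * (t - min) / (max - min).

    percentages is a list with one dict per position, mapping each base
    to its percentage at that position.
    """
    t = sum(column[base] for base, column in zip(site, percentages))
    lowest = sum(min(column.values()) for column in percentages)
    highest = sum(max(column.values()) for column in percentages)
    return 100.0 * (t - lowest) / (highest - lowest)

# Invented percentages for a toy three-position site.
toy_table = [
    {"A": 60, "C": 10, "G": 20, "T": 10},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
]
best = donor_score("AGT", toy_table)      # consensus at every position
partial = donor_score("GGT", toy_table)   # one non-consensus base
worst = donor_score("CAA", toy_table)     # least frequent base everywhere
```

The score is 100 for a site made of the most frequent base at every position and 0 for a site made of the least frequent ones.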
All of the weight matrices and individual information contents of all splice sites are available as supplemental online material.
Prediction of potential regulatory sequences in alternative exons
Phylogenetic footprinting is one of the well-known approaches to predicting regulatory sequences, by which unusually well-conserved sequences among a set of orthologous genes are extracted as candidates for functional regulatory elements (Blanchette and Tompa 2002). This method has been broadly applied to predict potential regulatory sequences, including novel functional sequence motifs in the promoter region (Cliften et al. 2003) and transcription factor binding sites (McCue et al. 2002). To collect orthologous genes, we retrieved cattle, dog, human, and pig genes for M. musculus and barley, maize, and wheat for O. sativa. (Note from Fig. 5 that we used an additional species, chicken for M. musculus and sorghum for O. sativa, for multiple sequence alignment in our examples.) We mapped all M. musculus and O. sativa full-length cDNA clones to TIGR gene indices (http://www.tgi.org/tdb/tgi/) by use of BLAST (E-value: E-50 or less). We then extracted evolutionarily conserved alternative exons and constitutive exons in orthologous genes and aligned the extracted exons using CLUSTAL-W (Thompson et al. 1994). Using the alignment data, we computed the nucleotide substitution rate at each position for all exons (constitutive + alternative).
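A per-position substitution rate of this kind can be computed from pairwise alignments roughly as follows. The sketch assumes that each exon pair is already aligned to equal length with "-" for gaps, that positions are counted from the 5' end of the exon, and that gapped columns are skipped; the gap handling and the function name `substitution_rates` are assumptions, not details given in the text.

```python
def substitution_rates(aligned_pairs, window):
    """Fraction of mismatching bases at each of the first `window` positions.

    aligned_pairs: list of (seq_a, seq_b) tuples of aligned exon sequences.
    Returns None for positions where no ungapped comparison was possible.
    """
    mismatches = [0] * window
    compared = [0] * window
    for seq_a, seq_b in aligned_pairs:
        for i, (a, b) in enumerate(zip(seq_a[:window], seq_b[:window])):
            if a == "-" or b == "-":
                continue  # assumption: gapped columns are not counted
            compared[i] += 1
            if a != b:
                mismatches[i] += 1
    return [m / c if c else None for m, c in zip(mismatches, compared)]

# Hypothetical aligned exon starts from two orthologous gene pairs.
pairs = [("ATGCCA", "ATGCCG"), ("ATGAAA", "ATGAAA")]
rates = substitution_rates(pairs, window=6)
```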
Both a reasonable amount of evolutionary distance and a sufficient number of data sets are necessary to apply a phylogenetic footprinting approach to the prediction of functional regulatory sequences; hence, we chose the M. musculus–H. sapiens and O. sativa–T. aestivum (wheat) comparisons, which had the largest number of evolutionarily conserved exons available for both alternative and constitutive exons. To assess the differences in the distribution of nucleotide substitution rates (Fig. 3), we first divided each exon into four regions: positions +3 to +25 from the acceptor site, positions +26 to +50 from the acceptor site, positions -50 to -26 from the donor site, and positions -25 to -5 from the donor site. We excluded positions +1 and +2 from the acceptor site and -4 to -1 from the donor site to avoid possible bias from splice-site consensus sequences. The expected rate of nucleotide substitutions was computed as the average nucleotide substitution rate over all regions. In addition, Student's unpaired t test was conducted on the average nucleotide substitution rates in the given regions to further validate our observations.
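The four-region split can be expressed as simple index arithmetic. The sketch below assumes a list of per-position substitution rates for one exon, ordered from the first exonic nucleotide after the acceptor site to the last exonic nucleotide before the donor site. The function name and the minimum-length check (100 nt, so that the regions cannot overlap) are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List


def region_means(rates: List[float]) -> Dict[str, float]:
    """Mean substitution rate in the four exon regions described in the text.

    Acceptor position +k maps to index k - 1; donor position -k maps to
    index n - k, where n is the exon length.
    """
    n = len(rates)
    if n < 100:
        # Assumption: require non-overlapping regions.
        raise ValueError("exon too short for four non-overlapping regions")
    return {
        "acceptor +3..+25": mean(rates[2:25]),
        "acceptor +26..+50": mean(rates[25:50]),
        "donor -50..-26": mean(rates[n - 50:n - 25]),
        "donor -25..-5": mean(rates[n - 25:n - 4]),
    }


# Toy input: the "rate" at each index is just the index itself.
toy_region_means = region_means([float(i) for i in range(100)])
```

Positions +1, +2 and -4 to -1 never enter any slice, which matches the exclusion of the splice-site consensus positions described above.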
We then compared the nucleotide substitution rates per exon and performed a Kolmogorov–Smirnov test and Student's unpaired t test to see if any differences existed between the nucleotide substitution rate histograms for constitutive and alternative exons (Fig. 4); the level of significance was p < 0.05 (two-sided). Noting that alternative exons contain more conserved sequences, we extracted the evolutionarily conserved sequences whose lengths ranged from 7 bp to 20 bp from the alignment results of alternative exons and constitutive exons. The maximum length was set to 20 bp to avoid extracting false positives from unusually long conserved sequences. The minimum length was set to 7 bp because known exonic splicing enhancer motifs, identified experimentally, have an average length of approximately 7 bp.
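The extraction of conserved stretches can be sketched as a scan for runs of identical columns in a pairwise alignment. This is an illustrative reading of the text rather than the authors' implementation: the function name is hypothetical, and the sketch assumes that runs longer than the maximum length are discarded entirely rather than trimmed.

```python
from typing import List


def conserved_segments(
    seq_a: str, seq_b: str, min_len: int = 7, max_len: int = 20
) -> List[str]:
    """Return perfectly conserved runs of min_len..max_len bp from two aligned rows.

    A column is conserved when both rows carry the same nucleotide and
    neither is a gap. Runs outside the length window are dropped (assumption).
    """
    segments: List[str] = []
    start = None
    n = min(len(seq_a), len(seq_b))
    for i in range(n + 1):  # one extra step to flush a run ending at position n
        match = i < n and seq_a[i] == seq_b[i] and seq_a[i] != "-"
        if match and start is None:
            start = i
        elif not match and start is not None:
            if min_len <= i - start <= max_len:
                segments.append(seq_a[start:i])
            start = None
    return segments


# Toy alignment: a 9-bp run (kept), a 3-bp run (too short), a 25-bp run (too long).
row_a = "CTGGAGCAT" + "A" + "GGG" + "C" + "A" * 25
row_b = "CTGGAGCAT" + "C" + "GGG" + "T" + "A" * 25
toy_segments = conserved_segments(row_a, row_b)
```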
We then computed the expected value of all possible combinations of 7-bp motifs in the extracted sequences. To account for codon bias in the coding region, we used a second-order Markov model and computed the expected value by the following equation: the expected number of GATCATC is n(G) * p(T|GA) * p(C|AT) * p(A|TC) * p(T|CA) * p(C|AT), where n(G) is the number of nucleotide Gs and p(C|AT) is the probability that nucleotide C comes after dinucleotide AT. The five most frequently observed sequence motifs were extracted as candidates for potential regulatory sequences for alternative splicing. To compute the level of conservation of these potential regulatory sequences, we divided the number of evolutionarily conserved heptamers by the total number of heptamers observed in the exons. For example, heptamer CTGGAGC was observed in 23 M. musculus alternative exons, 13 of which are perfectly conserved in H. sapiens; the level of conservation of this heptamer was therefore 13/23, which is 56.5%.
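The expected-count equation and the conservation ratio can be sketched as follows. The code follows the equation literally as printed: it starts from the count of the motif's first nucleotide and multiplies in one conditional probability for each position from the third onward. How the probabilities were estimated is not stated in the text, so estimating them from trinucleotide counts in the extracted sequences is an assumption, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def train(seqs: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Count single nucleotides, dinucleotide contexts, and trinucleotides."""
    mono: Counter = Counter()
    context: Counter = Counter()  # dinucleotides that are followed by a base
    tri: Counter = Counter()
    for s in seqs:
        mono.update(s)
        for i in range(len(s) - 2):
            context[s[i:i + 2]] += 1
            tri[s[i:i + 3]] += 1
    return mono, context, tri


def expected_count(motif: str, mono: Counter, context: Counter, tri: Counter) -> float:
    """n(first base) * product of p(base | two preceding bases), as in the text."""
    value = float(mono[motif[0]])
    for i in range(2, len(motif)):
        ctx = motif[i - 2:i]
        if context[ctx] == 0:
            return 0.0  # context never seen, so the motif is not expected
        value *= tri[ctx + motif[i]] / context[ctx]
    return value


def conservation_level(n_conserved: int, n_observed: int) -> float:
    """Share of observed heptamer occurrences that are perfectly conserved."""
    return n_conserved / n_observed


# Toy training set (not real exon data).
mono, context, tri = train(["GATC", "GATG"])
toy_expected = expected_count("GATC", mono, context, tri)
```

For the worked example in the text, `conservation_level(13, 23)` gives about 0.565, that is, 56.5%.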
| ACKNOWLEDGMENTS |
|---|
Shimamoto and Assistant Professor Masayuki Isshiki of the Laboratory of Plant Molecular Genetics, Nara Institute of Science and Technology, for many useful discussions, especially of O. sativa alternative splicing. This work was supported by Japan's Ministry of Agriculture, Forestry, and Fisheries (Rice Genome Project SY-1104), and by the 21st Century COE Program of Japan's Ministry of Education, Culture, Sports, Science and Technology.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| Footnotes |
|---|
Article and publication are at http://www.rnajournal.org/cgi/doi/10.1261/rna.5221604.
Received October 31, 2003; accepted April 21, 2004.
| REFERENCES |
|---|
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Blanchette, M. and Tompa, M. 2002. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12: 739–748.
Cartegni, L. and Krainer, A.R. 2002. Disruption of an SF2/ASF-dependent exonic splicing enhancer in SMN2 causes spinal muscular atrophy in the absence of SMN1. Nat. Genet. 30: 377–384.
Claverie, J.M. 2001. Gene number. What if there are only 30,000 human genes? Science 291: 1255–1257.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B.A., and Johnston, M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71–76.
Fairbrother, W.G., Yeh, R.F., Sharp, P.A., and Burge, C.B. 2002. Predictive identification of exonic splicing enhancers in human genes. Science 297: 1007–1013.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967–974.
Graveley, B.R. 2000. Sorting out the complexity of SR protein functions. RNA 6: 1197–1211.
Khan, S.G., Muniz-Medina, V., Shahlavi, T., Baker, C.C., Inui, H., Ueda, T., Emmert, S., Schneider, T.D., and Kraemer, K.H. 2002. The human XPC DNA repair gene: Arrangement, splice site information content and influence of a single nucleotide polymorphism in a splice acceptor site on alternative splicing and function. Nucleic Acids Res. 30: 3624–3631.
Kochiwa, H., Suzuki, R., Washio, T., Saito, R., Bono, H., Carninci, P., Okazaki, Y., Miki, R., Hayashizaki, Y., and Tomita, M. 2002. Inferring alternative splicing patterns in mouse from a full-length cDNA library and microarray data. Genome Res. 12: 1286–1293.
Lim, L.P. and Burge, C.B. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. 98: 11193–11198.
McCue, L.A., Thompson, W., Carmack, C.S., and Lawrence, C.E. 2002. Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 12: 1523–1532.
McCullough, A.J., Lou, H., and Schuler, M.A. 1993. Factors affecting authentic 5' splice site selection in plant nuclei. Mol. Cell. Biol. 13: 1323–1331.
McKeown, M. 1992. Alternative mRNA splicing. Annu. Rev. Cell Biol. 8: 133–155.
Michelet, B. and Boutry, M. 1995. The plasma membrane H+-ATPase (a highly regulated enzyme with multiple physiological functions). Plant Physiol. 108: 1–6.
Mironov, A.A., Fickett, J.W., and Gelfand, M.S. 1999. Frequent alternative splicing of human genes. Genome Res. 9: 1288–1293.
Modrek, B. and Lee, C. 2002. A genomic view of alternative splicing. Nat. Genet. 30: 13–19.
Modrek, B. and Lee, C. 2003. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34: 177–180.
Moller, L.B., Tumer, Z., Lund, C., Petersen, C., Cole, T., Hanusch, R., Seidel, J., Jensen, L.R., and Horn, N. 2000. Similar splice-site mutations of the ATP7A gene lead to different phenotypes: Classical Menkes disease or occipital horn syndrome. Am. J. Hum. Genet. 66: 1211–1220.
Pagani, F., Stuani, C., Tzetis, M., Kanavakis, E., Efthymiadou, A., Doudounakis, S., Casals, T., and Baralle, F.E. 2003. New type of disease causing mutations: The example of the composite exonic regulatory elements of splicing in CFTR exon 12. Hum. Mol. Genet. 12: 1111–1120.
Rogan, P.K. and Schneider, T.D. 1995. Using information content and base frequencies to distinguish mutations from genetic polymorphisms in splice junction recognition sites. Hum. Mutat. 6: 74–76.
Schneider, T.D. 1997. Information content of individual genetic sequences. J. Theor. Biol. 189: 427–441.
Seki, M., Narusaka, M., Kamiya, A., Ishida, J., Satou, M., Sakurai, T., Nakajima, M., Enju, A., Akiyama, K., Oono, Y., et al. 2002. Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141–145.
Shapiro, M.B. and Senapathy, P. 1987. RNA splice junctions of different classes of eukaryotes: Sequence statistics and functional implications in gene expression. Nucleic Acids Res. 15: 7155–7174.
Sorek, R. and Safer, H.M. 2003. A novel algorithm for computational identification of contaminated EST libraries. Nucleic Acids Res. 31: 1067–1074.
Sorek, R., Shamir, R., and Ast, G. 2004. How prevalent is functional alternative splicing in the human genome? Trends Genet. 20: 68–71.
Stamm, S., Zhang, M.Q., Marr, T.G., and Helfman, D.M. 1994. A sequence compilation and comparison of exons that are alternatively spliced in neurons. Nucleic Acids Res. 22: 1515–1526.
Stapleton, M., Carlson, J., Brokstein, P., Yu, C., Champe, M., George, R., Guarin, H., Kronmiller, B., Pacleb, J., Park, S., et al. 2002. A Drosophila full-length cDNA resource. Genome Biol. 3: research0080.0081–0080.0088.
Stephens, R.M. and Schneider, T.D. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. 228: 1124–1136.
Szathmary, E., Jordan, F., and Pal, C. 2001. Molecular biology and evolution. Can genes explain biological complexity? Science 292: 1315–1316.
Tacke, R. and Manley, J.L. 1995. The human splicing factors ASF/SF2 and SC35 possess distinct, functionally significant RNA binding specificities. EMBO J. 14: 3540–3551.
Tacke, R., Tohyama, M., Ogawa, S., and Manley, J. 1998. Human Tra2 proteins are sequence-specific activators of pre-mRNA splicing. Cell 93: 139–148.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Tomita, M., Shimizu, N., and Brutlag, D.L. 1996. Introns and reading frames: Correlation between splicing sites and their codon positions. Mol. Biol. Evol. 13: 1219–1223.
Zavolan, M., van Nimwegen, E., and Gaasterland, T. 2002. Splice variation in mouse full-length cDNAs identified by mapping to the mouse genome. Genome Res. 12: 1377–1385.