|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BIOINFORMATICS |
1 The Jackson Laboratory, Bar Harbor, Maine 04609, USA
2 Functional Genomics Program, University of Maine, Orono, Maine 04473, USA
3 Department of Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder, Colorado 80309, USA
| ABSTRACT |
|---|
|
|
|---|
Keywords: trans-splicing; polyadenylation; bioinformatics; mRNA processing; C. elegans
| INTRODUCTION |
|---|
|
|
|---|
Eukaryotic mRNA transcripts are typically processed at their 3'-ends in a two-step reaction (Minvielle-Sebastia and Keller 1999
) (referred to as 3'-processing or polyadenylation) including cleavage of the precursor transcript, followed by the addition of a non-templated polyadenylate (poly[A]) tail of up to a few hundred nucleotides in length, depending on organism and cellular environment (Colgan and Manley 1997
; Minvielle-Sebastia and Keller 1999
; Zhao et al. 1999
; Edmonds 2002
). Creation and maintenance of the poly(A) tail has significant implications for subsequent stability, processing, and translation of the transcript (Zhao et al. 1999
; Edmonds 2002
; Kuersten and Goodwin 2003
).
Organisms with their genes organized into operons, such as nematodes, primitive chordates, and hydrae (for review, see Blumenthal 2004
), include a significant complication to the 3'-processing procedure, in that it must include mechanisms for polyadenylation of the upstream fragment and trans-splicing of a 5'-capped leader RNA to the downstream fragment (Blumenthal 1998
). While typical eukaryotic 3'-end formation leads to transcription termination, within operons the presence of additional genes downstream requires both extended transcription and protection and subsequent processing of the downstream portion of the transcript. For clarity, we use the terms "internal site" to refer to a 3'-processing site on an upstream gene of a polycistronic transcript and "terminal site" to refer to 3'-processing sites on either monocistronic or 3'-terminal genes in operons.
Previous studies of C. elegans have demonstrated a mechanistic link between the upstream polyadenylation and downstream trans-splicing events (Liu et al. 2001
), using the specialized SL2 small nuclear ribonuclear protein (snRNP). The SL2 snRNP has been shown to interact with the cleavage stimulation factor (CstF) (Kuersten et al. 1997
; Evans et al. 2001
; Huang et al. 2001
; Liu et al. 2001
), a three-polypeptide complex that, together with cleavage polyadenylation specificity factor (CPSF) and the poly(A) polymerase, is minimally required for the complete 3'-processing reaction (Colgan and Manley 1997
; Minvielle-Sebastia and Keller 1999
; Zhao et al. 1999
). Studies of mammalian systems demonstrated that CstF binds to U-rich sequence elements on the precursor mRNA downstream from the 3'-processing site, specifically through an interaction of subunit CSTF2 (previously referred to as CstF64) (MacDonald et al. 1994
). While the homology of the C. elegans CSTF2 is unmistakable, its sequence is diverged significantly with respect to CSTF2 in other metazoan species (Salisbury et al. 2006
). Our recent studies correlated the protein sequence differences with changes in the apparent pre-mRNA binding sequences, both in content and in positioning (Salisbury et al. 2006
).
Complete 3'-processing signals for metazoans other than nematodes minimally include the canonical hexamer polyadenylation signal (PAS) element, most commonly manifested as AAUAAA and positioned 10–30 nucleotides (nt) upstream (5') of the processing site, and two downstream elements (DSEs), a proximal UG-rich element, typically positioned 5–15 nt downstream from the processing site, and a more distal U-rich element, typically positioned 15–25 nt downstream (Zhao et al. 1999
; Hu et al. 2005
; Salisbury et al. 2006
). In contrast, C. elegans sequences show no evidence of a UG-rich element, and the U-rich element shows a broadened positioning range, typically falling between 5 and 25 nt downstream (Hajarnavis et al. 2004
; Salisbury et al. 2006
). Furthermore, the PAS appears to be much more tolerant of sequence divergence than in other animals: more than half of the genes have at least one mismatch in this hexamer, and
6% of genes have no appropriately positioned discernable match to AAUAAA at all (Blumenthal and Steward 1997
; Hajarnavis et al. 2004
).
Operon intergenic regions have a strong tendency to be quite short (
100–120 nt long) (Blumenthal et al. 2002
), which suggests there is an important mechanistic connection between the upstream 3'-end formation site and the SL2-specific trans-splice site. Indeed, when 3'-processing is disabled (e.g., through elimination of the AAUAAA element), SL2 trans-splicing is reduced, and when a weak site is strengthened (e.g., conversion of AGUAAA to AAUAAA), SL2 trans-splicing 100 nt downstream is increased (Evans et al. 2001
; Liu et al. 2001
). Furthermore, studies of the gpd-2/gpd-3 operon revealed an evolutionarily conserved U-rich (Ur) sequence element in the intergenic region that is required for SL2 trans-splicing and for any accumulation of downstream gene RNA (Huang et al. 2001
). When the Ur element was eliminated in the context of an otherwise wild-type operon, no downstream gene expression was detected, suggesting that either the downstream RNA is degraded following 3'-end formation or transcription terminates. When the AAUAAA was deleted to inhibit 3'-end formation in the context of the Ur mutation, downstream mRNA was restored, but without the Ur element none of it was trans-spliced to SL2.
C. elegans also has a second type of operon, trans-spliced exclusively by SL1. These operons, termed SL1-type operons, are characterized by intergenic lengths of zero nt, with the AAUAAA signal for 3'-end formation of the upstream gene only a few nucleotides upstream of the trans-splice site of the downstream gene (Williams et al. 1999
). In addition, these operons have long polypyrimidine (poly[Y]) tracts in the 3'-UTRs of the upstream genes that are required for correct operon pre-mRNA processing. The conserved poly(Y) tract is necessary for 3'-processing of the upstream gene, but not for trans-splicing to the downstream gene. The investigators concluded that the processing of this operon was possibly mutually exclusive in that standard 3'-processing of the upstream gene precluded SL1 processing of the downstream gene, and conversely, SL1 trans-splicing at least inhibits the formation of the 3'-end of the upstream gene.
The present study was initiated to better characterize the sequences responsible for directing and regulating the trans-splicing reactions, including the differentiation between internal polycistronic and terminal 3'-processing sites, and the selection between SL1 and SL2-type spliced leaders. We find that the signal that specifies internal sites can be primarily defined by a U-rich signal (core consensus UAUUUU) positioned 35–65 nt downstream from the 3'-processing site. This element presumably is the Ur element previously implicated in SL2-specific trans-splicing. In addition, we report on a novel element that occurs upstream of SL1 trans-splice sites, possibly defining the function and extent of the outrons. Integrating our findings with previous results, we present updated models for the molecular control of C. elegans trans-splicing and operon pre-mRNA processing.
| RESULTS |
|---|
|
|
|---|
All sequences were initially extracted including 200 nt up and downstream of the processing site. Sequence Logo (Schneider and Stephens 1990
) representations (Fig. 1) of the single-nucleotide frequency profiles near the functional sites reveal several expected phenomena. All classes of 3'-processing sites (Fig. 1A–C) display strong A-rich elements
15–20 nt upstream of the processing site, characteristic of the PAS, and a general U-rich characteristic immediately up- and downstream of the PAS. A comparison of the internal 3'-processing sites with terminal sites reveals several regions of interest, including regions with apparent elevated information content (indicating nonrandom, potentially functional sequence) around positions –45 to –50, –10 to –5, +10 to +20, and +45 to +55 relative to the site of cleavage. The SL1-type operons present a more stark contrast, with strong evidence of the highly conserved 3'-splice site heptamer sequence (Kent and Zahler 2000
; Hollins et al. 2005
) at the joint 3'-processing/trans-splice site, and a seeming absence of any signal in the region downstream from the processing site.
|
Comparative analysis of the internal and terminal 3'-processing sites (Table 1, Part 1) was dominated by a group of words over-represented in the internal sequences in the sequence windows between 40 and 60 nt downstream from the 3'-processing site, including UACUU, UAUCU, UAUUU, CUUUU, and UUUCU. The position and content of these sequence elements are consistent both with the variation in single nucleotide frequencies (Fig. 1) and also with a previously described "Ur" element, demonstrated to be necessary for successful processing of the gpd-2/gpd-3 operon (Huang et al. 2001
). In addition, the pentamers CAGAU (window 90–100), AGAUG (window 100–110), and GCAGA (window 190–200) are all consistent with a downstream 3'-splice site sequence necessary for trans-splicing of the SL2 leader sequence.
|
90 nt upstream of the processing site. The significance of this element is discussed further below.
Nonnegative matrix factorization (NMF)
While DPWC analysis can identify specific sequences that are differentially used between two sequence sets, it cannot give us a detailed picture of the sequence content and positioning constraints of the degenerate elements involved. For that we turn to a novel analysis based on NMF analysis of the PWC matrixes (Materials and Methods). NMF is a dimensional reduction algorithm that assumes that the matrix under analysis can be adequately approximated as a linear superposition of a small number of patterns, represented by basis vectors. The nonnegative constraint on the elements in the basis vectors results in elements that represent discrete components of the underlying data, which are sequence motifs and patterns in our work. In Figures 2 and 4, below, each basis vector is described with a line plot that shows its positioning distribution and a corresponding sequence logo (Schneider and Stephens 1990
) that shows its sequence content.
|
|
Examination of the NMF patterns for the 3'-processing sites reveals several patterns that are consistent with the single-nucleotide and DPWC analysis described above. The classic PAS hexamer is clearly visible as Element 2 in Figure 2, A and B, and to a lesser extent in Figure 2C, where the positioning is as expected, but the sequence content is only a weak match.
The terminal 3'-processing site was described with seven elements, of which three imply strong motifs: Element 1, representing a U-rich element at or near the processing site and upstream of the PAS; Element 2, the canonical poly(A) signal, with optimal sequence AAUAAA and occurring around position –20; and Element 4, an A-rich signal with a peak in positioning probability near position +25. Elements 5–7 are representative of varying background content, given the measured combination of low information content sequence patterns and slowly varying, but skewed positioning probabilities. Element 3 is specifically positioned to occur in the putative 3'-UTR, with a low contribution to total sequence counts (based on the relative magnitude of the line plot), but a U-rich sequence pattern somewhat similar to Element 3 in the SL1 sites (Fig. 4A). Possible implications of this similarity are investigated in the Discussion.
The NMF analysis of the internal 3'-processing sites (Fig. 2B) revealed a more complex pattern compared to the terminal sites. The regions immediately surrounding and upstream of the putative 3'-processing site are similar to that of the terminal sites, with a probable PAS (Element 2) and U-rich signal (Element 1) clear from both sequence content and positioning. Element 4 is the most intriguing additional pattern, since it overlaps in both sequence content (core consensus AGAUAUUU) and positioning (peaking roughly between positions +45 and +65) with the pentamers identified as over-represented in internal sites with the DPWC analysis (Table 1, Part 1). This element is referred to as the "Ur element" for the remainder of this manuscript. In addition, Elements 5 and 6 contain evidence, both in positioning and sequence content, of the trans-splice site and initiation codon, respectively, as would be expected for downstream genes in operons processed by the SL2 snRNP. As with terminal sites, there is a U-rich element (Element 3) with a significant positioning probability in the putative 3'-UTR region.
The motifs generated by NMF analysis of the SL1-type sites are somewhat less precise than those of either the internal or terminal sites, as is to be expected with the smaller sequence set. Interestingly, while the sequence motifs are difficult to interpret, the positioning distributions show patterns similar to both 3'-processing sites (Fig. 2A,B), and also the larger set of SL1 trans-splicing sites (see Fig. 4A below). Besides the probable PAS (Element 2), a number of other putative elements can be seen, including the trans-splice site (Element 6), upstream U-rich sequences (Element 4), and a motif (Element 3) similar to the "Ou element" described in detail below.
Distinguishing classes of trans-splicing sites: SL1 versus SL2
As described in Materials and Methods, we obtained the following training sets of sequences for trans-splice sites:
The three classes of 3'-splice sites (Fig. 3A–C) all display the previously described highly conserved heptamer sequence with consensus UUUUCAG (Kent and Zahler 2000
; Hollins et al. 2005
), along with evidence of a potential branch point, represented by the increase in A frequency around 15–20 nt upstream of the 3'-splice site. SL1 trans-splice sites appear to have two specific regions of interest, including an elevated information content (with an apparent increased pyrimidine content) from positions –50 to –45, and also a significant switch from A to G at position 0, when compared with either SL2 trans-splice or intronic 3'-splice sites. This variation holds even when controlling for AUG start codons immediately following the trans-splice site (data not shown). In addition, the 3'-splice heptamer shows elevated information content, especially reflecting increased occurrence of the preferred uracils at positions –6 and –5. The SL2 trans-splice site regions appear to be characterized by a general increased U-content over a broad range covering at least positions –60 to –20.
|
|
To further characterize these sequences, we also performed the DPWC analysis between each of the trans-splicing site collections and our randomly selected set of 3913 intron 3'-splice sites from standard cis-spliced introns. The difference in nucleotide content, as already seen in the single nucleotide plots (Fig. 3), is strong enough such that several hundred word–window pairs pass the FDR <0.20 threshold. We report here only the top 25 scoring pairs for the SL1-intron and SL2-intron comparisons (Table 2, Parts 2 and 3, respectively). A complete list of all pentamer–window pairs that pass the FDR <0.2 criterion is available as Supplemental Material at http://harlequin.jax.org/nematode/.
As with the SL1–SL2 comparison, the SL1–intron comparison (Table 2, Part 2) reveals both the elevation of cytosine and pyrimidine-rich sequences between positions –60 and –40 near SL1 sites, as well as the preference for a guanosine base immediately following the trans-splice site.
The SL2–intron comparison (Table 2, Part 3) in the –60 to –40 range shows evidence of both the C-rich sequences favored by the SL1 site, as well as several of the pentamers identified as over-represented downstream from internal 3'-processing sites as compared with terminal sites. In addition, there is an excess of pentamers with an AUG initiation codon surrounding the SL2 trans-splice site.
NMF
We generated NMF decompositions of the SL1 (Fig. 4A), SL2 (Fig. 4B) trans-splice sites, and intron (Fig. 4C) 3'-splice sites. The NMF analysis of the three classes of 3'-splice site are not surprisingly dominated by the highly constrained 3'-splice site, shown as Element 2 in Figure 4, A–C. Similar to the single nucleotide analysis (Fig. 3), the NMF sequence motifs show the increased preference for guanosine immediately after the trans-splice site for SL1 when compared with either SL2 trans-splice or intronic 3'-splice sites. In addition, all three classes of 3'-splice sites display putative branch point patterns (Element 1 in Fig. 4A–C). Interestingly, while the AU-rich content of the putative branch point is consistent with previous characterization (Lim and Burge 2001
), the specific arrangement within the putative motif is not. Previous description of the branch point indicated a consensus of UUUAAA, whereas our analysis generates a motif with the adenosines upstream of the uracils.
A striking new finding of our analysis is the characterization of an outron-specific sequence (Element 3 in Fig. 4A), referred to in the rest of this manuscript as the "Ou element." Previous reports have not identified defining characteristics of the outron. Element 4 in the analysis of SL1 trans-splice sites shows clear evidence of the expected AUG initiation codon, and from positioning (Fig. 4A) shows a strong preference for positioning close to the trans-splice site. Elements 5–7 primarily reflect changes in background nucleotide probabilities, although Element 7 has a characteristic CAA pattern that also appears in a similar position near SL2 trans-splice sites (Elements 4 and 8) (Fig. 4B) and intron 3'-splice sites (Element 4) (Fig. 4C).
The NMF analysis of SL2 trans-splice sites (Fig. 4B) includes a strong U-rich motif (Element 3) that appears to be a composite of the Ur element and the 3'-processing DSE, which represents the putative binding site for CSTF2. The positioning distribution of this element is bimodal, with peaks
50 and 90 nt upstream of the trans-splice site, positions that are consistent with the standard 120-nt separation between 3'-processing and trans-splice site. The idea of a mixture is also supported by the sequence content, which is primarily U-rich, with only a small indication of the typical leading A-residue (Fig. 2B). Element 5 represents a U-rich element with a broad positioning distribution consistent with the 3'-UTR of the upstream gene, assuming the standard 100–120-nt intergenic separation. Element 4 shows clear evidence of the AUG initiation codon, while Elements 6–8 reflect changes in the sequence background.
The NMF analysis of the intronic 3'-splice sites (Fig. 4C) reveals a somewhat simpler pattern, with clear evidence of the hexamer (Fig. 4C, Element 2), branch point (Element 1), and upstream 5'-splice site (Fig. 4C, Element 3). The positioning distribution for Element 6 reflects sequences that are much more prevalent in exons than in introns.
The relationship between SL2 trans-splicing and intergenic separation
Although most operon genes are spaced only
100 base pairs (bp) apart, some have much greater intergenic distances. If the spacing between genes is an important instructive element of SL2 trans-splicing, we would expect that operons with greater intergenic spacing would have lower levels of SL2 trans-splicing. Previous studies revealed a class of operon in which the downstream gene can be (and shows molecular evidence of) processing to include either SL1 or SL2 spliced leaders (Blumenthal et al. 2002
). These operons have shown at least anecdotal evidence of longer than expected intergenic separation. This phenomenon can be verified quantitatively with our data. We identified eight genes with trans-splice sites with evidence of both SL1 and SL2 splicing, measured the fraction of the observed EST sequences with SL1 leader sequence, and plotted the resulting values as a function of the intergenic separation (Fig. 5). A strong direct correlation (Pearson r=0.74) is obvious, verifying the previous observations: SL1 processing of downstream genes in operons is highly correlated with intergenic separation. Of particular interest are two pairs of trans-splice sites that arise in genes that lie downstream in two distinct operons (Fig. 5, F59G1.1b, represented by open circles; C30H7.2b, represented by squares). In both cases, the further distal trans-splice site shows increased SL1 processing, demonstrating directly the correlation with intergenic separation within a single operon.
|
| DISCUSSION |
|---|
|
|
|---|
An updated model of the SL2 precursor mRNA interaction
Our studies of internal 3'-processing sites significantly extend the model presented in previous work (Huang et al. 2001
), which proposed that internal sites were specified by an intergenic "Ur element." The Ur element was characterized as an evolutionarily conserved sequence (in comparison with Caenorhabditis briggsae) in the operon that contains genes gpd-2 and gpd-3 (Huang et al. 2001
). The position of the functional sequences identified through a deletion procedure is consistent with the +40 to +70 window we observe in the current analysis. The previous study included analysis of a number of additional operons. However, the positioning was measured with respect to the PAS element, rather than the 3'-processing site, and the short distances reported for a number of genes (as small as
25 nt) led us to conclude that some of these sequences may be 3'-processing elements, rather than Ur elements.
The positioning distribution of the Ur element has its primary mode at
50 nt downstream from the 3'-processing site, and a second potential mode around 90 nt downstream. The second mode is questionable, however, as the positioning and sequence content is also consistent with the most common positioning of the trans-splice site. In general, the evidence presented here, as well as previous work (Huang et al. 2001
), indicates that the Ur element is constrained by the location of the 3'-processing site rather than the trans-splice site. A pattern indicative of the Ur element occurs in both the analysis of the internal 3'-processing sites (Fig. 2B) and the SL2 trans-splice sites (Fig. 4B), but the positioning has a sharper distribution near the 3'-processing sites. This apparent connection with the 3'-processing site has implications for the processing of operons with long intergenic separation, as discussed below.
Our analysis also highlights several other phenomena of interest in the comparison of internal and terminal 3'-processing sites. The PAS element, optimally manifested as AAUAAA, is conserved throughout metazoans, and to a reduced level in other eukaryotes (Graber et al. 1999a
,b
; Zhao et al. 1999
; Brockman et al. 2005
; Lee et al. 2007
). Comparison of exact matches of AAUAAA in internal sites (76/182=0.418) and with terminal sites (487/931=0.523) indicates a statistically significant (p
0.009, large sample test of equal proportions) reduction in internal sites. Previous studies have indicated that near matches to AAUAAA result in less efficient processing (Sheets et al. 1990
), thus the strength of 3'-processing may be modulated, with correspondingly modulated expression of the downstream operon genes. Alternatively, the constraints on the PAS could be reduced due to the stronger constraints on the downstream elements, including the CstF-binding site and the Ur element. This model is consistent with the observation that the overall "strength" of a 3'-processing site is determined as the net effect of all components, such that strong manifestations of one element compensate for weaker manifestations of others (Beyer et al. 1997
; Graber et al. 1999b
). It is reasonable that the sequence requirements for successful polycistronic processing could include strong downstream 3'-processing elements, with relaxed constraints on the PAS element. Indeed, a tetramer-based DPWC analysis (data not shown) demonstrated a significant overabundance of the UUUU tetramer in the 20 nt downstream from internal sites, a sequence and positioning consistent with CstF–mRNA interaction.
We can now delineate specific positioning and sequence content for the Ur element and update the processing model (Fig. 6A). We restrict this model to the standard cases, operons that have
100 nt between the upstream 3'-processing site and downstream SL2 trans-splicing site. Special models for operons with longer and shorter intergenic separation are discussed below.
|
The sequence of the Ur element is intriguingly similar to the Sm protein binding site (Hartmuth et al. 1999
; Achsel et al. 2001
). Sm proteins are integral components of snRNPs, including the SL snRNPs (Blumenthal 2005
). Previous studies produced a consensus Sm binding sequence of RAUUUUUGA (Hartmuth et al. 1999
). We find that (Fig. 2) the Ur motif is characterized as a uracil run following an adenosine. This was further confirmed with standard pattern recognition programs such as the Gibbs Sampler (data not shown; Lawrence et al. 1993
). Clearly the Sm proteins would be interesting candidates for further investigation for a direct role in tethering the pre-mRNA to the SL2 snRNP for polycistronic processing.
A potential novel A-rich downstream signal in terminal 3'-processing sites
Examination of the patterns identified near the terminal 3'-processing sites indicated the potential of an A-rich element with a positioning mode
30 nt downstream from the 3'-processing site (Fig. 2B, Element 4). Such an element was hinted at in a previous study of downstream elements in metazoan 3'-processing sites (Salisbury et al. 2006
).
The Ou element: Defining the outron and SL1 trans-splice sites
Our analysis of SL1 trans-splice sites revealed a novel element that we term the "Ou element" (Fig. 4A, Element 3) in the upstream region that can potentially help to define the boundaries of the outron (Fig. 6B). Positioning of the Ou element shows a mode around 50 nt upstream of the trans-splice site, and is characterized by a UC-rich sequence.
All classes of 3'-splice site are similarly positioned with respect to an upstream element. C. elegans introns have a tight length distribution compared with other metazoans, with the majority of the introns falling near 50 nt in length, which places the 5'-splice site at –50. As stated above, the Ou element is similarly positioned. Finally, while we believe that the Ur element is functionally tied to the 3'-processing site through its interaction with CstF, the majority of known operons have an intergenic spacing around 100–120 nt, a distance that typically places the Ur element around 50 nt upstream of the SL2 trans-splice site. The similarity of the spacing is suggestive of a common origin for these processes, as has been postulated in some models of trans-splicing (Blumenthal and Steward 1997
; Guiliano and Blaxter 2006
).
Large separation operon processing
The models we propose for control of SL1- and SL2-mediated trans-splicing provide insight into the variation in SL1/SL2 ratio with intergenic spacing (Fig. 5). In our model, the interaction of the SL2 snRNP with the pre-mRNA, mediated through CstF, is dominant in operons with standard short spacing (100–120 nt), possibly through steric exclusion. As the distance between the 3'-processing and trans-splicing sites grows, the dominance of SL2 processing wanes. There are several potential causes for the change (Fig. 6C): First, the increased separation reduces the efficiency of the interaction between the SL2 snRNP and the trans-splice site because of increased physical separation. Second, the increased intergenic separation makes possible the insertion of an Ou element into the spacing that can recruit the SL1 processing complex. A third possibility is that long intergenic regions could contain an internal promoter, the product of which would be expected to be entirely SL1-trans-spliced.
SL1-type operon processing
As described in the Introduction, SL1-type operons are characterized by a 0-bp-length intergenic separation, with a clear trans-splice site immediately adjacent to the apparent 3'-processing site (Williams et al. 1999
). Our findings, and the models proposed above for SL1 and SL2 processing, easily accommodate this class (Fig. 6D). The positioning and sequence content that we characterize for the Ou element near SL1 trans-splice sites (Fig. 4A, Element 3) is consistent with the required pyrimidine-rich element located in the 3'-UTR of the upstream gene. NMF analysis of SL1-type processing sites resolves two distinct, but overlapping, signals upstream of the processing site (Fig. 2C). The positioning of Element 3 is consistent with the Ou element, whereas Element 4 is a U-rich element with positioning that extends up to
150 nt upstream of the processing site. It seems reasonable to assume that the extended poly(Y) tract represents an auxiliary 3'-processing element, necessary in SL1-type processing sites to compensate for the absence of the downstream U-rich CstF-binding site (Figs. 1C, 2C).
The processing of these operons is not necessarily mutually exclusive. CstF, which is known to bind downstream from the 3'-processing site, is required only for the cleavage step (Gilmartin and Nevins 1989
; Takagaki et al. 1989
; Zhao et al. 1999
). CPSF, which binds upstream of the 3'-processing site, is required for both cleavage and polyadenylation (Gilmartin and Nevins 1989
; Takagaki et al. 1989
; Keller and Minvielle-Sebastia 1997
; Zhao et al. 1999
). In vitro experiments in yeast demonstrated that these processes could be uncoupled, and that CPSF can bind to a cleaved pre-mRNA and recruit the additional factors necessary for addition of the poly(A) tail. In SL1-type operons, the cleavage step is performed by the SL1 trans-splicing reaction, leaving the 5'-portion of the polycistronic pre-mRNA in a state for binding by CPSF and subsequent polyadenylation. Presumably this RNA would need to be debranched prior to polyadenylation. It is an interesting possibility that the AAUAAA may serve as the branch site in these operons.
Alternative processing of downstream genes based on selection of upstream 3'-processing sites
The EST evidence at hand provides several intriguing examples for further experimental investigation. Specifically, we have evidence of operons in which the upstream gene has alternative 3'-processing sites that imply different operon processing capacity. One interesting case is operon CEOP4304 (http://www.wormbase.org/), which includes the genes SRP54 (F21D5.7) and F21D5.6. Transcript evidence for F21D5.6 supports trans-splicing of both SL1 and SL2. ESTs support two 3'-processing sites for SRP54, a proximal site at +370 (downstream of the stop codon), and a distal site +500. The downstream site is nearly coincident with the trans-splice site for F21D5.6, implying an SL1-type operon. In contrast, the proximal 3'-processing site is situated at the typical separation
120 nt from the trans-splice site, and has reasonable matches to the Ur element. The intervening sequence, which can be included or excluded in the final SRP54 transcript based on 3'-processing site selection, contains four different microRNA-binding sites as predicted by miRanda (Enright et al. 2003
). Such alternative operon processing provides the possibility of subtle variations in expression and subsequent post-transcriptional regulation of genes through control of the processing of the precursor transcript.
| MATERIALS AND METHODS |
|---|
|
|
|---|
We obtained SL1 and SL2 trans-splice sites in a two-step process. First, BLAT (Kent 2002
) was used to align the SL1 and SL2 leader sequences to all C. elegans ESTs. EST sequences with full-length leader-EST alignments were subsequently aligned to the genome, and finally sequence was extracted up and downstream of the putative trans-splicing sites. We obtained 250 and 1217 SL1 and SL2 trans-splice sites, respectively. No attempt was made in this initial analysis to remove sites with evidence of both SL1 and SL2 trans-splicing activity, based on the reasoning that such sites presumably contain regulatory sequence elements for both SL1 and SL2 trans-splicing.
Standard cis-splicing 3'-splice sites were obtained from the ENSEMBL C. elegans annotations (version 43.160a) (Hubbard et al. 2007
). Following reduction to unique sites, we extracted sequence up and downstream of the 3'-splice site. We extracted 107,755 unique 3'-splice sites, and subsequently randomly selected a set of 3913 for comparison with SL1 and SL2 sites.
Differential positional word counting (DPWC)
Functional sequences in pre-mRNA, for example, protein-binding sites, are frequently constrained both by sequence content and positioning with respect to a functional site. We have previously used positional word counting (PWC) techniques to characterize the functional elements across a broad range of metazoan species (Salisbury et al. 2006
). In the present study, we have adapted this technique to identify sequence words that occur at different frequencies with specific positioning in two related sequence sets. In brief, the method includes the following steps:
0.2. |
| (1) |
In practice, we frequently find that groups of related words pass our statistical tests, representing either acceptable substitution (e.g., AAUAA and AAUGA) or positioning offsets (e.g., AAUAA and AUAAA).
Nonnegative matrix factorization (NMF) characterization of RNA regulatory signals
To obtain probabilistic models of the positioning and sequence content of the functional elements that specify RNA processing sites, we used a combination of PWC and nonnegative matrix factorization (NMF), a data reduction approach first developed for image processing (Lee and Seung 1999
), which has subsequently gained popularity in processing high-dimensional biological data such as microarrays (Kim and Tidor 2003
; Carmona-Saez et al. 2006
; Pascual-Montano et al. 2006
). NMF, similar to principal component analysis, generates "basis vectors" to reduce the dimensionality of large data sets. The NMF approach to motif characterization makes three assumptions:
Finally, our method is specifically designed to find motifs with constrained positioning. Motifs with random position distributions will not be detected. We summarize the method here: a more complete description is available elsewhere (L.N. Hutchins, S.P. Murphy, P. Singh, and J.H. Graber, in prep.).
In NMF, the two-dimensional PWC array V (size m words x n position windows) is decomposed into two new matrixes ( V = WH ), where W (size m words x r bases) contains the basis vectors, H (size r bases x n position windows) contains the multiplicative coefficients, and r is the number of basis vectors. We implemented an identical multiplicative update and objective function for maximization as reported previously (Lee and Seung 1999
). Note that in contrast with the DPWC analysis, Vij is the total count of word i occurring in window j, rather than the number of sequences with at least one occurrence. For small sequence sets (
100 or less), we find it useful to smooth the positioning distribution for each individual word. Following NMF, element Wij represents the contribution of the ith k-mer to the jth basis vector. Element Hjk represents the relative contribution of the jth basis vector to position k in the count vector, and as such, the r rows of H can be interpreted as the positioning probability for the regulatory motifs represented by their respective basis vectors.
The original authors reported that the nonnegativity constraint resulted in basis vectors that represented discrete components of the images. We interpret the r basis vectors as representing distinct patterns of defined sequence content and positioning. The resulting basis vectors can reflect both specific elements (such as the canonical polyadenylation hexamer) as well as any positional variation in composition (such as the change from UTR to intergenic sequence at the 3'-processing site). Selection of r, the number of basis vectors, is a complex problem (Brunet et al. 2004
; Pascual-Montano et al. 2006
). Using artificially generated data sets, we found that the plot of mean square difference between the original data and the NMF approximation (MSD=
[Vij –WHij ]2/MN) shows an inflection when r matches the number of clusters used to generate the data (L.N. Hutchins, S.P. Murphy, P. Singh, and J.H. Graber, in prep.). Real data display a similar behavior; therefore we use the MSD versus r plot to select the operational value for r.
The NMF algorithm as implemented is a hill-climbing approach, guaranteed to give the best local solution, given the initial estimates of W and H . We restart the analysis at least several hundred times, keeping the solution with the maximum objective score. While this does not guarantee identification of the globally optimal solution (or even necessarily a unique solution) (Donoho and Stodden 2004
), in practice we find that the dominant patterns (e.g., the AAUAAA-like PAS signal) are stably present in the solutions represented by the local maxima.
In order to convert the weighted list of k-mers for each elements to a probabilistic motif, we use the insight that the weights W can be reasonably interpreted as the expected distribution of counts for each k-mer for each basis vector. Given this assumption, the optimal motif is defined as the motif with the maximum multinomial probability (Eq. [2]) of observing the expected count distribution n as specified by the weights W . Since we are trying to maximize the probability of the motif to generate the observed weights (represented by the counts n in Eq. [2]), the constant C( n ) can be neglected, allowing us to search for a motif that maximizes the logarithmic probability.
|
| (2) |
|
| (3) |
The calculation of pi , the probability of observing the ith k-mer, is shown in Equation (3), where wi is the ith k-mer, L is the length of the motif plus k-1 background positions padding on each side of the motif, and p(wi ,j) is probability of observing wi at position j. As shown, pi is calculated as an average of the probability across the motif, including positions that span the boundary between motif and background, a choice that addresses two specific issues. First, clustered k-mers frequently include sequences that partially overlap, such as AAUAAA and AUAAAA. Second, the underlying motifs can be of variable size, larger or smaller than k. Simulations indicate that we can characterize motifs of nearly arbitrary size, regardless of choice of k (L.N. Hutchins, S.P. Murphy, P. Singh, and J.H. Graber, in prep.). We build a single-nucleotide (0-th order) model, and the final reported motif does not include the flanking background padding positions.
To optimize the search of potential motifs, we use a Markov chain Monte Carlo (MCMC) approach (Gelman et al. 1995
). We start by sampling the length of the motif from a user specified range, and then randomly filling the nucleotide probabilities at each position. The initial motif is scored against the expected multinomial distribution. Following a standard MCMC approach, the motif is updated (L.N. Hutchins, S.P. Murphy, P. Singh, and J.H. Graber, in prep.), and updated motifs are automatically accepted if the multinomial probability increases, and randomly accepted according to a Boltzmann distribution if the multinomial probability decreases (Gelman et al. 1995
). This process iterates until no improvement is observed for a user-specified number of iterations. In practice, we restart the MCMC process between 50 and 500 times, iterating until no improvement is seen for at least 250 iterations, keeping the best overall solution obtained.
The analysis tools described here are implemented as four distinct C++ programs. Linux binaries and source code are available from the authors by request.
| ACKNOWLEDGMENTS |
|---|
|
|
|---|
| Footnotes |
|---|
Article published online ahead of print. Article and publication date are at http://www.rnajournal.org/cgi/doi/10.1261/rna.596707.
Received April 12, 2007; accepted May 17, 2007.
| REFERENCES |
|---|
|
|
|---|
Achsel, T., Stark, H., and Lührmann, R. 2001. The Sm domain is an ancient RNA-binding motif with oligo(U) specificity. Proc. Natl. Acad. Sci. 98: 3685–3689.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 85: 289–300.
Beyer, K., Dandekar, T., and Keller, W. 1997. RNA ligands selected by cleavage stimulation factor contain distinct sequence motifs that function as downstream elements in 3'-end processing of pre-mRNA. J. Biol. Chem. 272: 26769–26779.
Blumenthal, T. 1998. Gene clusters and polycistronic transcription in eukaryotes. Bioessays 20: 480–487.[CrossRef][Medline]
Blumenthal, T. 2004. Operons in eukaryotes. Brief. Funct. Genomic. Proteomic. 3: 199–211.
Blumenthal, T. 2005. Trans-splicing and operons. In Community TCeR (ed. The C. elegans Research Community). WormBook.
Blumenthal, T. and Gleason, K.S. 2003. Caenorhabditis elegans operons: Form and function. Nat. Rev. Genet. 4: 112–120.[Medline]
Blumenthal, T. and Spieth, J. 1996. Gene structure and organization in Caenorhabditis elegans . Curr. Opin. Genet. Dev. 6: 692–698.[CrossRef][Medline]
Blumenthal, T. and Steward, K. 1997. RNA processing and gene structure. In C. elegans II (ed. D.L. Riddle), pp. 117–145. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Blumenthal, T., Evans, D., Link, C.D., Guffanti, A., Lawson, D., Thierry-Mieg, J., Thierry-Mieg, D., Chiu, W.L., Duke, K., Kiraly, M., et al. 2002. A global analysis of Caenorhabditis elegans operons. Nature 417: 851–854.[CrossRef][Medline]
Brockman, J.M., Singh, P., Liu, D., Quinlan, S., Salisbury, J., and Graber, J.H. 2005. PACdb: PolyA cleavage site and 3'-UTR database. Bioinformatics 21: 3691–3693.
Brunet, J.P., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. 101: 4164–4169.
Carmona-Saez, P., Pascual-Marqui, R.D., Tirado, F., Carazo, J.M., and Pascual-Montano, A. 2006. Biclustering of gene expression data by nonsmooth nonnegative matrix factorization. BMC Bioinformatics 7: 78.[CrossRef][Medline]
Colgan, D.F. and Manley, J.L. 1997. Mechanism and regulation of mRNA polyadenylation. Genes & Dev. 11: 2755–2766.
Donoho, M. and Stodden, V. 2004. When does nonnegative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems 16 (eds. S. Thrun et al.), pp. 1141–1148. MIT Press, Cambridge, MA.
Edmonds, M. 2002. A history of poly(A) sequences: From formation to factors to function. Prog. Nucleic Acid Res. Mol. Biol. 71: 285–389.[Medline]
Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C., and Marks, D.S. 2003. MicroRNA targets in Drosophila . Genome Biol. 5: R1. doi: 10.1186/gb-2003-5-1-r1.[CrossRef][Medline]
Evans, D., Perez, I., MacMorris, M., Leake, D., Wilusz, C.J., and Blumenthal, T. 2001. A complex containing CstF-64 and the SL2 snRNP connects mRNA 3' end formation and trans-splicing in C. elegans operons. Genes & Dev. 15: 2562–2571.