STEM CELL GENETICS AND GENOMICS
a Department of Obstetrics and Gynaecology, National University of Singapore, National University Hospital, Singapore;
b Department of Biological Sciences, National University of Singapore, Singapore
Key Words. Reverse serial analysis of gene expression • Human embryonic stem cells • Transcriptome • Antisense transcription • POU5F1 • SOX2 • NANOG
Correspondence: Woon-Khiong Chan, Ph.D., Department of Biological Sciences, National University of Singapore, 14 Science Drive 4, Singapore 117543. Telephone: 65-6516-8096; Fax: 65-6779-2486; e-mail: dbscwk{at}nus.edu.sg. Ariff Bongso, Ph.D., D.Sc., Department of Obstetrics and Gynaecology, National University of Singapore, National University Hospital, Singapore 119074. Telephone: 65-6772-4129; Fax: 65-6779-4753; e-mail: obgbongs{at}nus.edu.sg
Received July 6, 2005;
accepted for publication January 22, 2006.
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
SAGE is a sequence-based transcriptome profiling approach that provides qualitative and quantitative assessment of gene expression [20]. The underlying principle assumes that a short nucleotide sequence, or SAGE tag, located at the last anchoring enzyme (C-most) site contains sufficient information to represent a specific transcript. Often the NlaIII restriction enzyme is used, and the length of the SAGE tag can range from 14 (SAGE) to 21 (LongSAGE) or 26 base pairs (bp) (SuperSAGE), depending on the tagging enzymes used [20–22]. The digital nature of SAGE tags means that cumulative SAGE data can easily be merged, allowing large-scale comparisons between independent libraries. The sequencing of concatemerized SAGE tags also permits a high-throughput determination of the transcriptome compared with EST sequencing. Besides being a robust method that accurately reflects the actual relative levels of mRNA transcripts, SAGE also allows transcripts that are expressed at low levels to be efficiently detected [23, 24]. However, the reliance on short sequence tags imposes limitations on the precision and accuracy of gene identification. For instance, a SAGE tag may match multiple mRNA transcripts, making gene assignment difficult, although with the advent of LongSAGE and SuperSAGE, this problem has been largely solved. A more daunting problem is that many SAGE tags do not appear to match known mRNA transcripts or genes. In poorly characterized transcriptomes, such as those from hESCs [12] and hematopoietic stem cells [18, 19], such orphan SAGE tags can account for as much as 40% of all tags. A recent study has shown that approximately 70% of orphan SAGE tags are indeed derived from bona fide transcripts [24], reinforcing the view that SAGE is a powerful method for novel gene discovery.
This suggests that a large number of the orphan SAGE tags that we have uncovered in the hESC transcriptome are true representatives of novel genes, transcripts, or splice variants [12], although the total number of genes present in the human genome is estimated at a conservative 30,000–40,000 [25, 26].
Another major source of uncertainty in SAGE tag-to-transcript assignment lies in the widespread presence of single-nucleotide polymorphisms (SNPs) within the human genome; SNPs occur as frequently as once every 100–300 bases [27, 28]. Occurrence of SNPs within the SAGE tag sequence or within the tagging restriction enzyme site will result in the assignment of an alternative SAGE tag. In a recent large-scale study of the SAGE database, at least one SNP-associated alternative SAGE tag was observed for 8.6% of all known human genes when the influence of SNPs and small insertion/deletion polymorphisms on SAGE tags was taken into consideration [29]. Indeed, the presence of this class of alternative SAGE tags has led to an underestimation of the expression of certain genes (e.g., GAL) and erroneously identified others (e.g., BTF3) as being specific to hESCs [12].
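The tag-extraction principle and the effect of SNPs on it can be illustrated with a minimal sketch. It assumes a 10-base tag taken immediately downstream of the last NlaIII (CATG) site; the function names and the two toy transcripts are hypothetical and are not real mRNA sequences.

```python
from typing import Optional

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence


def three_prime_most_tag(mrna: str, tag_len: int = 10) -> Optional[str]:
    """Return the tag_len bases following the last CATG site, or None."""
    seq = mrna.upper()
    pos = seq.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no anchoring-enzyme site, so no tag
    start = pos + len(ANCHOR_SITE)
    tag = seq[start:start + tag_len]
    # A site too close to the 3' end gives a truncated tag; report None here.
    return tag if len(tag) == tag_len else None


def apply_snp(seq: str, index: int, base: str) -> str:
    """Return seq with the base at a 0-based index replaced."""
    return seq[:index] + base + seq[index + 1:]


# Made-up toy transcripts, for illustration only.
toy_a = "GGACATGTTTAAACCCGGGTCATGTGTTCTGGAGAAAAAAAA"
toy_b = "AACATGAAACCCGGGTTTCACGAAGGCCTTAAGGAAAA"

print(three_prime_most_tag(toy_a))                      # canonical tag
print(three_prime_most_tag(apply_snp(toy_a, 28, "T")))  # SNP inside the tag
print(three_prime_most_tag(toy_b))                      # canonical tag
print(three_prime_most_tag(apply_snp(toy_b, 20, "T")))  # SNP creates a new CATG site
```

In the first toy case a single base change inside the tag yields a different 10-base tag; in the second, a base change creates a new CATG site downstream, so the 3'-most tag shifts to an entirely different position. Either way, the transcript would be counted under an alternative tag.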
Naturally occurring antisense transcripts (NATs) have been recently reported in a variety of metazoan species [30, 31], and it is likely that a significant portion of the hESC orphan SAGE tags are derived from NATs. There are two main classes of NATs. The cis-encoded NAT (cis-NAT) is transcribed from the opposite strand of the same genomic locus and has the potential to form a long complementary duplex with the sense RNA transcript. In contrast, the trans-encoded NAT (trans-NAT) is transcribed from another distinct genomic locus, possibly a pseudogene [31], and is generally short and forms an imperfect duplex with its sense transcript. The human genome has been shown to express NATs widely [32–34], with as many as 20% of human genes forming sense-antisense (SA) transcript pairs [35]. For instance, hESCs have been reported to express a unique set of microRNAs, which belong to a class of trans-NATs [36]. A recent large-scale EST project has provided an important resource of full-length cDNAs for hESCs [13]. But like the >5 million ESTs that are available [37], they are difficult to use to verify the expression of NATs because many ESTs have not been directionally cloned [31, 32]. In contrast, SAGE tags are directionally reliable, as they are generated from well-defined restriction sites at the 3' end of each RNA transcript. Thus, large SAGE datasets contain latent information on both sense and antisense transcription [38]. Interestingly, tags matching mRNAs or ESTs in antisense orientation were first observed in SAGE libraries constructed from Plasmodium falciparum [39, 40].
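Because a SAGE tag is read in a fixed direction from its CATG site, its orientation relative to a known transcript can be checked by simple string matching. The following is a minimal sketch of that idea, not the procedure used in this study; the function names and the toy sense transcript are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tag_orientation(tag: str, transcript: str, anchor: str = "CATG") -> Optional[str]:
    """Report whether anchor+tag occurs on the transcript strand,
    on the opposite strand, on both, or on neither (None)."""
    query = anchor + tag.upper()
    on_sense = query in transcript.upper()
    on_antisense = query in reverse_complement(transcript)
    if on_sense and on_antisense:
        return "both"
    if on_sense:
        return "sense"
    if on_antisense:
        return "antisense"
    return None


# Toy sense transcript: it carries one tag on its own strand and a second
# tag that is only readable from the opposite strand.
sense = "TTGACATGAAACCCGGGTAC" + reverse_complement("CATGGGTTAACCGA")

print(tag_orientation("AAACCCGGGT", sense))
print(tag_orientation("GGTTAACCGA", sense))
print(tag_orientation("ACGTACGTAC", sense))
```

A tag that matches only the opposite strand is the kind of observation that points to a candidate cis-NAT, although a real analysis would also need to consider matches elsewhere in the genome.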
Without additional sequence information, it is difficult to characterize orphan SAGE tags from hESCs and identify the transcripts they represent. Several polymerase chain reaction (PCR)-based strategies have been developed, including reverse SAGE (rSAGE) [41, 42], generation of longer cDNA fragments from SAGE tags for gene identification (GLGI) [43, 44], and rapid analysis of unknown SAGE-tag-PCR [45]. In this report, we have modified the original rSAGE protocol [41, 42], which is also similar to the GLGI [43, 44], and used it to obtain additional 3' cDNA sequence information for a select group of orphan SAGE tags that are expressed specifically in hESCs. Our results identified novel transcripts unique in their expression to hESCs, transcripts that displayed alternative polyadenylation, and novel splice variants of known genes. More importantly, we found NATs for several pluripotency genes, including POU5F1 and NANOG. Collectively, the unique 3' ESTs derived from orphan hESC SAGE tags (HESTs) will be an important resource in downstream functional analyses and the concerted dissection of molecular pathways critical to the pluripotent phenotype of hESCs.
| MATERIALS AND METHODS |
|---|
Total RNA Isolation
Total RNA was extracted from hESCs using TRIZOL (Invitrogen, Carlsbad, CA, http://www.invitrogen.com), whereas total RNA from the various somatic and fetal tissues was obtained commercially (Clontech, Palo Alto, CA, http://www.clontech.com). Prior to rSAGE library construction or reverse transcription (RT)-PCR, total RNA was treated with DNase I (Ambion, Austin, TX, http://www.ambion.com) to remove any residual genomic DNA contamination, and PCR using β-actin primers (forward, 5'-GATGCAGAAGGAGATCACTGC-3'; reverse, 5'-CACCTTCACCGTTCCAGTTT-3'), designed to span the last intron-exon boundary of the gene, was carried out to confirm the absence of genomic DNA.
cDNA Synthesis, NlaIII Digestion, and Linker Ligation
A schematic for the rSAGE library construction with all primer and linker sequences is depicted in Figure 1. cDNA synthesis was carried out using the Superscript II double-stranded cDNA synthesis kit (Invitrogen) with 10 µg of total RNA from HES3 cells and a biotinylated primer (5'-biotin-ATTGGCGCGCCGCGAGCACTGAGTCAATACGAT30VN-3'; Integrated DNA Technologies, Coralville, IA, http://www.idtdna.com). Double-stranded cDNA was digested with NlaIII (New England Biolabs, Ipswich, MA, http://www.neb.com) to generate 3' overhangs. The biotinylated cDNAs were immobilized on streptavidin-magnetic beads (Invitrogen). Annealed linkers, A1 (5'-AAGCAGTGGTATCAACGCAGAGTCATG-3') and A2 (5'-phosphate-ACTCTGCGTTGATACCACGCTT-aminoC7-3'), were ligated to the 5' end of NlaIII-digested cDNA before AscI (New England Biolabs) digestion was performed to release the 3' cDNA fragments from the streptavidin-magnetic beads.
Selection of Orphan SAGE Tags and Design of Tag-Specific rSAGE Primers
The 200 orphan SAGE tags selected for rSAGE were identified through a pairwise comparison of HES3 SAGE data against pooled data from 21 human SAGE libraries [12]. The SAGE tag-to-gene database used for gene identification was based on UniGene Build 160 (http://www.ncbi.nih.gov/SAGE/). The majority of the orphan SAGE tags selected were upregulated in HES3 compared with the pooled human SAGE libraries (p < .001; fold difference >4). A table describing the SAGE tags, sequences of the SAGE tag-specific rSAGE primers (TSRPs), and their respective frequencies in tags per million (tpm) in the pooled human, HES3, and HES4 SAGE libraries is provided as supplemental online Table 1. For those HES SAGE tags where LongSAGE tags were available, which were obtained through comparison with a HES3 LongSAGE library, the TSRPs were designed using the Primer3 software (http://frodo.wi.mit.edu) [46]. Typically, they included the entire 21 bases of the LongSAGE tag, or they included an additional four to eight bases of the common linker (CGCAGAGT) and up to 19 bp of the LongSAGE tag. If no appropriate LongSAGE tag was available (Tag IDs 1–77), the TSRPs were designed with seven bases of the common linker sequence (GCAGAGT) and the entire 14 bases of the SAGE tag, with the exception of Tag IDs 30 and 72.
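The primer layout for tags without a LongSAGE counterpart can be sketched as follows. This is only an illustration: it assumes that the "14 bases of the SAGE tag" comprise the CATG site plus the 10 variable bases, the function names are hypothetical, and the Wallace rule used for the melting temperature is a crude estimate rather than the Primer3 calculation. The example tag is the NANOG cis-NAT tag quoted in the Results, used here purely as input.

```python
COMMON_LINKER = "CGCAGAGT"  # linker bases immediately upstream of CATG
ANCHOR_SITE = "CATG"        # NlaIII site


def tsrp_from_short_tag(tag10: str, linker_bases: int = 7) -> str:
    """Assemble a primer from the last linker_bases of the common linker,
    the CATG site, and the 10 variable bases of a short SAGE tag."""
    if len(tag10) != 10:
        raise ValueError("expected the 10 variable bases of a SAGE tag")
    return COMMON_LINKER[-linker_bases:] + ANCHOR_SITE + tag10.upper()


def wallace_tm(primer: str) -> int:
    """Rough melting temperature: 2 degC per A/T plus 4 degC per G/C."""
    at = sum(primer.count(b) for b in "AT")
    gc = sum(primer.count(b) for b in "GC")
    return 2 * at + 4 * gc


primer = tsrp_from_short_tag("TCATTACGAT")
print(primer, len(primer), wallace_tm(primer))
```

Under these assumptions the primer is a 21-mer, of which only the last 10 bases are specific to the transcript, which is one reason an AT-rich or self-complementary tag leaves little room for primer optimization.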
rSAGE Amplification Reaction and Characterization of 3' rSAGE Fragments
Touchdown PCRs were performed using an initial denaturation cycle at 94°C for 2 minutes, followed by four cycles at 94°C for 45 seconds, 63°C for 1 minute, and 72°C for 1 minute; four cycles at 94°C for 45 seconds, 60°C for 1 minute, and 72°C for 1 minute; 25 cycles at 94°C for 45 seconds, 58°C for 1 minute, and 72°C for 1 minute; and a final extension step at 72°C for 5 minutes. The reaction setup for rSAGE PCR was as follows: 1 µl of amplified rSAGE library, 1 U of Platinum Taq Polymerase, 350 ng of TSRP and rSAGER1 primer. The PCR products were run on a 1.2% TAE agarose gel, and the bands were excised and purified using the QIAquick Gel Extraction Kit (Qiagen, Valencia, CA, http://www.qiagen.com). Purified PCR products (24 µl) were ligated into the pGEM-T Easy Vector (0.5 µl) (Promega, Madison, WI, http://www.promega.com) using T4 DNA ligase. The ligation reaction was incubated overnight at 16°C and resuspended in 8 µl of sterile water. Electroporation was performed using 1 µl of the ligated products and 25 ml of pTOP10 cells (Invitrogen). The transformants were plated on selective media, and two to four clones were picked for each rSAGE PCR product. Plasmid DNA was extracted using the QIAprep Spin Miniprep Kit (Qiagen). Sequencing reactions were carried out with Big Dye v3.1 (Applied BioSystems, Foster City, CA, http://www.appliedbiosystems.com) and M13 Forward primer. The sequenced products were analyzed on an ABI 3100 DNA Sequencer (Applied BioSystems).
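For readers who wish to transfer the touchdown program to a thermocycler, it can be written down as a small data structure. The sketch below simply restates the temperatures and times given above; the class and function names are hypothetical, and ramp times between steps are ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    cycles: int
    steps: Tuple[Tuple[int, int], ...]  # (temperature in degC, hold in seconds)


TOUCHDOWN_PROGRAM = (
    Stage(1, ((94, 120),)),                      # initial denaturation
    Stage(4, ((94, 45), (63, 60), (72, 60))),    # annealing at 63 degC
    Stage(4, ((94, 45), (60, 60), (72, 60))),    # annealing at 60 degC
    Stage(25, ((94, 45), (58, 60), (72, 60))),   # annealing at 58 degC
    Stage(1, ((72, 300),)),                      # final extension
)


def amplification_cycles(program) -> int:
    """Count cycles in stages that have a denature/anneal/extend triplet."""
    return sum(s.cycles for s in program if len(s.steps) == 3)


def total_hold_seconds(program) -> int:
    """Sum of all hold times, excluding ramping between temperatures."""
    return sum(s.cycles * sum(sec for _, sec in s.steps) for s in program)


print(amplification_cycles(TOUCHDOWN_PROGRAM))
print(total_hold_seconds(TOUCHDOWN_PROGRAM) / 60)
```

The annealing temperature steps down from 63°C to 58°C over the first eight cycles, so the early, more stringent cycles favor perfectly matched TSRP binding before the bulk of the amplification takes place.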
Sequence Analysis and Identification of Genuine rSAGE PCR Products
A bona fide 3' rSAGE product was defined as possessing the entire SAGE tag sequence, the rSAGER1 primer sequence and a poly(A) tract of >10 adenine residues. Sequences that lacked any one of the three were considered nonspecific amplification artifacts and omitted from further analysis. The rSAGE 3' EST sequences were searched against the GenBank Database (NR, dbEST, and human genome) using BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/), the University of California Santa Cruz human genome browser database (May 2004 build) using the BLAT program (http://genome.ucsc.edu/cgi-bin/hgBlat) and the EMBL database using a web interface-based batch BLAST program (http://biomedicum.csc.fi:8010/cgi-bin/batchblast.cgi) [20].
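The three criteria for a bona fide product lend themselves to a simple automated filter. The sketch below is one possible implementation, not the authors' script. The rSAGER1 sequence is not given in the text, so the primer used here is a made-up placeholder, and the check is applied to both strands on the assumption that a T/A-cloned insert may be read in either orientation.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_bona_fide(read: str, tag: str, primer: str, min_a_run: int = 11) -> bool:
    """True if one strand of the read contains the full SAGE tag, the
    universal primer site (either orientation), and more than 10 As in a row."""
    poly_a = re.compile("A{%d,}" % min_a_run)
    primer_sites = (primer.upper(), reverse_complement(primer))
    for strand in (read.upper(), reverse_complement(read)):
        has_tag = tag.upper() in strand
        has_primer = any(site in strand for site in primer_sites)
        if has_tag and has_primer and poly_a.search(strand):
            return True
    return False


# Hypothetical placeholder for the rSAGER1 primer (real sequence not shown here).
PLACEHOLDER_PRIMER = "ACGTTGCAAGCTTGACCTGA"
TAG = "CATGTCATTACGAT"

good = ("GCAGAGT" + TAG + "GATTACAGGC" + "A" * 15
        + reverse_complement(PLACEHOLDER_PRIMER))
short_tail = good.replace("A" * 15, "A" * 10)
no_primer = "GCAGAGT" + TAG + "GATTACAGGC" + "A" * 15

print(is_bona_fide(good, TAG, PLACEHOLDER_PRIMER))
print(is_bona_fide(short_tail, TAG, PLACEHOLDER_PRIMER))
print(is_bona_fide(no_primer, TAG, PLACEHOLDER_PRIMER))
```

Reads failing any one of the three tests would be set aside as nonspecific amplification products before database searching.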
An rSAGE sequence was classified as novel if no matches to a transcript sequence (known gene, mRNA, or EST) were found. A sequence was considered to represent a known gene if it matched a full-length transcript sequence with >95% similarity in the same orientation. A sequence was classified as a known EST if it matched an EST or open reading frame (ORF) with >95% similarity in the same orientation. A sequence was classified as an SNP alternative tag if it contained a single-bp mismatch within the SAGE tag sequence or NlaIII site. A sequence was classified as an insertion/deletion if it contained an insertion or deletion of fewer than three nucleotides within the SAGE tag sequence. A sequence was classified as an antisense transcript if it matched with high similarity to known transcripts in the opposite orientation. A sequence was classified as poly(A) if it was near the end of the poly(A) tract. Finally, a sequence was considered an alternative isoform if it matched the middle of known full-length transcripts in the same orientation and contained a poly(A) tract immediately downstream of the matched region. Genomic coordinates of the 3' SAGE ESTs were annotated based on the University of California Santa Cruz genome browser annotation database (http://genome.ucsc.edu/).
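These rules can be expressed as a small decision function operating on a summary of the best database hit. The sketch below is a hedged illustration only: the `Hit` fields are hypothetical, the order in which the rules are tested is an assumption, the >95% threshold is reused for the antisense class (the text specifies only "high similarity"), and the poly(A) class is omitted because it depends on the position of the tag rather than on the hit.

```python
from dataclasses import dataclass
from typing import Optional

MIN_IDENTITY = 95.0  # percent; hits must exceed this value


@dataclass
class Hit:
    kind: str                     # "gene" (full-length transcript) or "est" (EST/ORF)
    identity: float               # percent similarity of the best match
    same_strand: bool             # True if the match is in the same orientation
    tag_mismatches: int = 0       # single-bp mismatches in the tag or NlaIII site
    tag_indel: int = 0            # length of an indel inside the tag, if any
    internal_polya: bool = False  # mid-transcript match followed by a poly(A) tract


def classify(hit: Optional[Hit]) -> str:
    if hit is None:
        return "novel"
    if hit.identity <= MIN_IDENTITY:
        return "unclassified"  # weak hit; left for manual inspection
    if not hit.same_strand:
        return "antisense"
    if hit.tag_mismatches == 1:
        return "SNP alternative tag"
    if 0 < hit.tag_indel < 3:
        return "insertion/deletion"
    if hit.kind == "gene" and hit.internal_polya:
        return "alternative isoform"
    return "known gene" if hit.kind == "gene" else "known EST"


print(classify(None))
print(classify(Hit("gene", 98.0, same_strand=False)))
print(classify(Hit("est", 97.0, same_strand=True)))
```

In practice the fields of such a record would have to be filled in from BLAST or BLAT output, and borderline cases would still need to be inspected by eye in the genome browser.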
RT-PCR Confirmation of Novel 3' cDNAs
First-strand synthesis was performed using the SuperScript first-strand synthesis system (Invitrogen). One µl of first-strand reaction was used for each PCR together with 50 pmol of forward and reverse primers. Initial denaturation was carried out at 94°C for 2 minutes, followed by 30 cycles of PCR (94°C for 30 seconds, 55°C for 30 seconds, 72°C for 1 minute), and a final extension cycle at 72°C for 5 minutes. PCRs were loaded on a 1.5% agarose gel and size fractionated. In instances where the 3' cDNA sequence obtained was short and no suitable primer pairs could be found, additional 5' genomic sequences were used to anchor the forward primers. In all cases, the reverse primer primed from the rSAGE 3' cDNA sequence. Primers used were as follows. ACTB: product 400 bp, 5'-TGGCACCACACCTTTCTACAATGAGC-3', 5'-GCACAGCTTCTCCTTAATGTCACGC-3'; POU5F1: product 247 bp, 5'-CGRGAAGCTGGAGAAGGAGAAGCTG-3', 5'-CAAGGGCCGCAGCTTACACATGTTC-3'; HEST97: product 160 bp, 5'-CCTTTGTCATGAGCCCTTGT-3', 5'-GGAATGAAAGAATGGTTGCTC-3'; HEST101: product 119 bp, 5'-AAGAGCCTGCTACGGAACTG-3', 5'-TCACTAGAGGTTTCCAACACACTT-3'; HEST120: product 159 bp, 5'-AAATTTGGTGCTGTGACTCG-3', 5'-GCGGGCTGAGTCGGATTT-3'; HEST123: product 200 bp, 5'-GGGTTATGTGTAGAAACCAAGTGA-3', 5'-TCTTAGAACTTATGATACACCCAGTTG-3'; HEST127: product 218 bp, 5'-GGGAAAAGATGGCAAGGTTA-3', 5'-AATATATTCGAGTCACATCATGACA-3'; HEST146: product 171 bp, 5'-GATGCCATCACTCAAACTAGACC-3', 5'-GACGTCCTATGCAGGCATTT-3'; HEST147: product 205 bp, 5'-GGGGATTCGAGGTTCCTGTA-3', 5'-CATTTCAAGGCACAATTTTAATAGC-3'; HEST149: product 196 bp, 5'-CCCAGGCTGAAGTGTAGTGA-3', 5'-CATTTACAATGGTACAAGGAGCA-3'. The universal reference RNA sample was obtained from Stratagene (La Jolla, CA, http://www.stratagene.com), and somatic tissue RNA samples were obtained from Clontech.
Orientation-Specific RT-PCR
To detect the NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, orientation-specific first-strand cDNA synthesis was carried out with the appropriate sense primers. Thereafter, Superscript II RT was heat-inactivated at 95°C for 15 minutes. PCR was performed with 3 µl of the 20-µl first-strand mix as described. Control experiments without reverse transcription (RT controls) for each of the three antisense primers were performed to detect genomic DNA contamination. The primers used were as follows. POU5F1 NAT: product 184 bp, 5'-AGTTTGTGCCAGGGTTTTTG-3', 5'-TGTGTCCCAGGCTTCTTTATTT-3'; NANOG NAT: product 278 bp, 5'-TCGGTATTGTTTGGGATTGG-3', 5'-TCATCGAAACACTCGGTGAA-3'; LIN28 NAT: product 178 bp, 5'-GGAGGCCAAGAAAGGGAATA-3', 5'-CCGCCCCATAAATTCAAGAT-3'; TALE NAT: product 80 bp, 5'-TTTTCAGACTGTGCAATACTTAGAGAA-3', 5'-TTAGACAGTATGTGGGCATCC-3'; TERF1 NAT: product 169 bp, 5'-TGCGGAGTAGATGAGATGGA-3', 5'-AAGGCAATGGAAAACAGGTAAA-3'; TERA NAT: product 131 bp, 5'-TTTTGGCTGCAGTATTGGTG-3', 5'-CATCCTACAGGCAAAGAGAGG-3'.
| RESULTS |
|---|
Of the 200 HES3 orphan SAGE tags that were selected for rSAGE conversion (supplemental online Table 1), 168 (84.0%) yielded PCR amplification products (Fig. 2A). The conversion rate of orphan LongSAGE tags into longer 3' cDNA fragments was much higher (93.4%) than that of the SAGE tags (69.2%). We attributed these improvements to the availability of additional sequences from the LongSAGE tags for the design of TSRPs, as well as better-designed universal primers (rSAGEF1 and rSAGER1) in our strategy (Fig. 1). In particular, we found the universal M13 primer used as the antisense primer in the original rSAGE strategy [41, 42] was unsatisfactory for rSAGE because of its low Tm.
50 tpm). We also managed to obtain genuine rSAGE products for SAGE tags with frequencies of as low as 5 tpm, which is equivalent to the detection of a singleton in the HES3 SAGE library (HESTs 79, 147, and 174; supplemental online Table 1). In conclusion, it appears that our modified rSAGE protocol has some improvements over the original rSAGE protocol [41, 42] and was as efficient as GLGI-SAGE [43, 44] and GLGI-MPSS [47]. From the 168 SAGE tags that yielded PCR amplification products, a total of 196 rSAGE products were cloned and sequenced. Of these, 148 (75.5%) were confirmed as specific rSAGE products following DNA sequencing, BLAST and BLAT confirmation (supplemental online Table 2). These 148 rSAGE 3' cDNA fragments have been submitted to GenBank (accession numbers DN604327–DN604453), and we will refer to these cDNA sequences hereafter as HESTs. When TSRPs were designed using the LongSAGE tags, the overall amplification specificity reached 80.5% compared with GLGI-SAGE specificities that varied between 60% for low-copy SAGE tags and 80% for high-copy SAGE tags [43, 44]. Many of the nonspecific rSAGE fragments lacked a poly(A) tract and the rSAGER1 primer and were generated mainly because of mispriming at the 3' ends (supplemental online Table 3). Finally, although the hESC lines used in our earlier SAGE study [12] and for the present rSAGE library construction were grown on MEF feeders, we did not find contaminating murine RNA transcripts a significant problem in our 3' rSAGE conversion attempts.
Overall, 16.0% of rSAGE reactions failed to give distinct amplification products. Taken together with the nonspecific rSAGE results, our main conclusion is that a SAGE tag does not always provide an ideal sequence for the design of thermodynamically favorable TSRPs for the efficient amplification of 3' cDNA by rSAGE. Thus, orphan SAGE tags that were AT-rich or contained sequences that were self-complementary often failed to generate specific rSAGE 3' cDNA fragments. Although it is possible that when the expression level of targeted templates is very low, partial annealing of the TSRPs with other highly expressed templates may result in nonspecific amplification [44], the availability of additional sequences through the generation of LongSAGE or even SuperSAGE tags [22] would allow most of the remaining orphan SAGE tags to be converted into longer 3' cDNA fragments for gene identification.
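The conversion statistics and the tpm figures quoted above follow from simple arithmetic, which the short sketch below reproduces. The helper names are hypothetical, and the library size is not stated in this section; it is only inferred here from the statement that 5 tpm corresponds to a singleton.

```python
def tpm(count: int, library_size: int) -> float:
    """Tag frequency expressed in tags per million."""
    return count * 1_000_000 / library_size


def implied_library_size(count: int, tpm_value: float) -> float:
    """Library size implied by a tag count and its tpm value."""
    return count * 1_000_000 / tpm_value


def percent(part: int, whole: int) -> float:
    return round(100 * part / whole, 1)


print(percent(168, 200))           # tags yielding amplification products
print(percent(200 - 168, 200))     # tags failing to amplify
print(percent(148, 196))           # sequenced products confirmed as specific
print(implied_library_size(1, 5))  # library size if one tag equals 5 tpm
```

On this reading, a singleton at 5 tpm implies a library of roughly 200,000 tags, which gives a sense of how rare the least abundant successfully converted tags were.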
Analysis of 3' HESTs Generated from HES3 Orphan SAGE Tags
The size distribution of the 148 HESTs ranged from 36 to 538 bp, with 56.7% of them longer than 100 bp, which matched well to the reported data from GLGI-SAGE studies [18, 19, 43, 44]. A small number of the TSRPs [14] gave two or more distinct rSAGE bands. The majority of them were mapped to distinct transcripts (HEST31, 52, 53, 65, 98, 99, 148, and 170; supplemental online Table 2), whereas those for HEST126 and 141 were the result of alternative polyadenylation sites. Previous GLGI-SAGE reports have relied on BLAST searches to determine the identity of the 3' cDNA fragments [18, 19, 43, 44]. We used both BLAT and BLAST searches to establish the identity of rSAGE cDNA sequences (Fig. 3A). Indeed, the BLAT transcript viewer made it easier to visualize and quickly identify NATs, novel introns, and new splice variants of known transcripts and to confirm SNPs within the SAGE tags. For several SAGE tags, rSAGE extension resulted only in poly(A) sequences, as a result of the NlaIII site occurring just adjacent to the poly(A) tract, and would require the use of a different tagging enzyme to reveal their true identity. More importantly, our rSAGE results have clearly identified 59 of these rSAGE 3' cDNA fragments as novel rSAGE 3' ESTs and 30 as NATs, all of which are identified for the first time (Fig. 3A).
Interestingly, four HESTs (112, 120, 128, and 170) showed high sequence similarity to the WiCell hESC ESTs [13]. HEST2 and 146, classified as novel sequences, did not overlap with known hESC ESTs but mapped to genomic regions proximal to chromosomal sites from which several WiCell hESC ESTs appear to be transcribed. Obtaining 3' cDNA sequences that matched WiCell ESTs [13] indicated that our modified rSAGE protocol was working well. In addition, our RT-PCR data also confirmed that the expression of HEST120, 127, and 146 was confined to hESCs, although HEST120 (and to a lesser extent HEST127) was also detected in the fetal brain (Fig. 3B). Unfortunately, although these ESTs are highly restricted in their expression to hESCs, as demonstrated either by RT-PCR or by their representation in human ESC SAGE libraries [12], their exact functional role is unknown.
The impact of SNPs on the correct assignment of SAGE tags to specific transcripts [29] is also illustrated by our rSAGE results. For instance, HEST49 matched CHD8 with almost 100% sequence similarity and is the result of an SNP that created a new NlaIII restriction site upstream of the AATAAA polyadenylation site. The full-length cDNA sequence of CHD8 is 8,160 bp long, and this SNP would generate the C-most SAGE tag. The original C-most SAGE tag for CHD8 is GGCCCCATTG (nts 7311–7320), which is also represented in the HES3 SAGE library (5 tpm). We also detected an SNP within the C-most SAGE tag of GJA1, which encodes the gap junction protein connexin 43. The putative C-most SAGE tag is TGTTCTGGAG (nts 2916–2925). The rSAGE conversion of the orphan SAGE tag, TGTTTTGGAG, resulted in HEST113, which displayed a 97% sequence similarity to the 3' terminal region of the GJA1 coding region. Careful examination of corresponding EST and genomic DNA sequences indicated that this orphan tag most likely represented an SNP in the canonical GJA1 SAGE tag and not the hypothetical protein FLJ10407 as suggested by the predicted tag-to-gene mapping of SAGEGenie. The GJA1 SNP was verified using 6-carboxyfluorescein (FAM)- and VIC-labeled Taqman probes that were specific to the polymorphism (Fig. 3C).
The generation of longer 3' cDNA sequences by rSAGE has also helped to resolve some of the ambiguities in tag-to-gene assignments, at least in HES3 cells. For example, HEST119 (AGTGAGGATA) matched the hypothetical protein FLJ35155 (C3orf21), which is restricted in expression to hESC lines and tissues of cancerous origin. In addition, the SAGE tag for HEST114 (CATCCAAAAA) was incorrectly assigned to NPY and CEP2 by SAGEGenie and SAGEMap, respectively. Instead, rSAGE conversion showed that HEST114 matched FLJ10884, a hypothetical protein restricted in its expression to the testis, placenta, and hESC lines.
Antisense Transcription in hESCs
BLAT and BLAST searches revealed that many of the HESTs were the products of antisense transcription. Interestingly, cis-NATs for several important ES-specific genes, such as NANOG (HEST16), POU5F1 (HEST88), and LIN28 (HEST168), were identified by our rSAGE results (supplemental online Table 2). Analyzing the chromosomal location of these cis-NATs and the corresponding sense tags from the HES3 library revealed the presence of sense-antisense (SA) gene pairs [34, 35, 38]. Table 2 is a list of 18 SA SAGE tag pairs and the corresponding antisense HESTs that were experimentally obtained with rSAGE. Although several SA SAGE tag pairs can be mapped in trans to remote genomic loci, other pairs mapped in cis on contiguous oppositely oriented DNA strands (Fig. 4A). Besides POU5F1, NANOG, and LIN28, a number of other highly expressed hESC-specific genes, like TGIF/TALE (HEST109), ERH (HEST151), TERA (HEST155), and TERF1 (HEST193.2), also expressed cis-NATs. Furthermore, the representation of many of these co-expressed SA SAGE tag pairs decreased upon differentiation of the hESCs (Table 2). The SAGE tags for NANOG (TCATTACGAT) and POU5F1 (ATGTGGGATT) cis-NATs were found only in hESC SAGE libraries, indicating that the expression patterns of the cis-NATs for NANOG and POU5F1 are even more restricted than those of their sense transcript counterparts.
HEST115 and 168 appeared to represent spliced SA transcripts from ILF2 and LIN28, respectively. Nucleotides (nts) 1–42 of HEST115 matched the ILF2 coding region in the antisense orientation (Chr1[+]: 150447872–150447913), whereas nts 24–222 matched the sense orientation (Chr1[−]: 150447587–150447785). Likewise, nts 1–133 of HEST168 matched the LIN28 coding region in the antisense orientation (Chr1[−]: 26439918–26440050), whereas nts 131–171 matched the sense orientation (Chr1[+]: 26440310–26440350). This novel sense-antisense RNA hybrid structure was originally reported for the cardiac troponin I gene in rat hearts [50]. The structure of the cardiac troponin I "hybrid RNA," which the authors themselves have tentatively concluded to be formed from the transcription of the troponin mRNA in the cytoplasm, is very similar to what we have described for ILF2 and LIN28. The functional significance of these hybrid RNAs is currently unknown.
| DISCUSSION |
|---|
Although the human transcriptome is necessarily less complex than the human genome, it is quite apparent that transcriptome complexity has been underestimated [34, 35, 38, 44]. Noncoding RNA, regulatory RNA, NATs, and novel splice variants add to the multifaceted nature of the transcriptome. In the present study, we have used a modified rSAGE strategy to convert selected orphan SAGE tags from hESCs into longer 3' cDNAs. It has facilitated the identification of isoforms due to splicing, alternative polyadenylation and SNPs. A large number of novel hESC-specific genes have also been identified, indicating that the hESC transcriptome is indeed poorly characterized [12]. This is also the first description of cis-NATs from several key pluripotent genes that are involved in the maintenance of hESC self-renewal, suggesting that SA transcript pairing might be a key regulatory mechanism [31].
A recent study reported that 41.5% of SA transcript overlaps occurred in the last exon or untranslated region (UTR) of the coding sequence [34]. We have found that overlaps between the cis-NAT of LIN28, NANOG, and POU5F1 and their corresponding sense transcripts occurred in the 3' UTR of the coding sequence as well. Although the exact significance of this positional overlap is unknown, UTRs are believed to contribute toward the localization, stability, and translational control of mRNA transcripts. Indeed, the finding that >30% of vertebrate mRNAs show orthologue-specific conservation of 3' UTRs suggests a possible functional or regulatory role for UTR sequences [56]. The recent finding that many of the human SA gene pairs are also detected in mouse, rat, and fugu and are probably conserved throughout the course of vertebrate evolution [57] lends some support to the notion that cis-NATs are not due to a "leakage" of the transcriptional apparatus but rather that their abundance is the result of active transcription. For POU5F1 and NANOG, we have ruled out the possibility that their cis-NATs are due to the insertion of L1 retrotransposon [58]. However, because there are several pseudogenes for POU5F1 and NANOG, the possibility of trans-NATs from these genomic loci remains to be determined.
Several reports have hinted that the contribution of NATs in the human genome has been underestimated [34, 35] and that up to 25% of human transcripts might form natural SA pairs. Although initial studies indicated that there was no correlation between NATs and their function or localization [34], a more recent survey of SA pairs confirmed that they are predominant for genes involved in translation regulator activity, DNA damage response, and cell growth, whereas non-SA transcripts were found to have a significantly different functional distribution [35]. Several of the human ES NATs and SA gene pairs we have identified are representative of genes that code for transcription factors and RNA-binding proteins, whereas SA gene pairs for ubiquitously expressed genes, such as glyceraldehyde-3-phosphate dehydrogenase and ACTB, were not present in the HES3 SAGE library. The fact that SA transcripts have a significantly higher probability of involvement in translation regulator activity and are more frequently located in both the nucleus and cytoplasm [35] is compatible with a role in antisense-mediated gene regulation occurring in both the nucleus and cytoplasm and at the transcription and translation levels [31].
Although certain human miRNAs (miR-1 and miR-124) have been recently demonstrated to influence and define tissue-specific gene expression profiles in HeLa cells [59], the functional roles of the cis-NATs in a similar context have not been previously reported. Since cis-NATs are also capable of regulating gene expression through RNA masking, transcriptional or RNA interference [31, 32], the identification of cis-NATs for POU5F1 and NANOG prompted us to determine whether cis-NATs might be commonly expressed for other key regulators that are involved in the maintenance of pluripotency in hESCs. Both the mouse and the human SAGE libraries were searched for the presence of SAGE tags representing the cis-NATs for ES-specific genes [12, 60]. We failed to find SAGE tags representing UTF1, REX1, LEFTB, and GDF3 cis-NATs in human and mouse SAGE libraries. However, we detected cis-NATs for a number of key ES-specific genes (e.g., FGFR1, FGFR2, TDGF1, SOX2) in HES3 and SAGE libraries constructed from other hESC lines (Table 3). In addition, SAGE tags representing pou5f1, nanog, tera, and lin28 were also detected in mouse embryonic stem cells (mESCs). In summary, cis-NATs for a number of ES-specific genes, such as POU5F1 and NANOG, were shown to be expressed in both hESCs and mESCs, and it is possible that some of these cis-NATs might have a role in maintaining the "stemness" phenotype of ES cells.
| ACKNOWLEDGMENTS |
|---|
DISCLOSURES
The authors indicate no potential conflicts of interest.
| REFERENCES |
|---|