Strand-specific libraries for high throughput RNA sequencing (RNA-Seq) prepared without poly(A) selection

Background: High throughput DNA sequencing technology has enabled quantification of all the RNAs in a cell or tissue, a method widely known as RNA sequencing (RNA-Seq). However, non-coding RNAs such as rRNA are highly abundant and can consume >70% of sequencing reads. A common approach is to extract only polyadenylated mRNA; however, such approaches are blind to RNAs with short or no poly(A) tails, leading to an incomplete view of the transcriptome. Another challenge of preparing RNA-Seq libraries is to preserve the strand information of the RNAs.

Design: Here, we describe a procedure for preparing RNA-Seq libraries from 1 to 4 μg total RNA without poly(A) selection. Our method combines the deoxyuridine triphosphate (dUTP)/uracil-DNA glycosylase (UDG) strategy to achieve strand specificity with AMPure XP magnetic beads to perform size selection. Together, these steps eliminate gel purification, allowing a library to be made in less than two days. We barcode each library during the final PCR amplification step, allowing several samples to be sequenced in a single lane without sacrificing read length. Libraries prepared using this protocol are compatible with Illumina GAII, GAIIx and HiSeq 2000 platforms.

Discussion: The RNA-Seq protocol described here yields strand-specific transcriptome libraries without poly(A) selection, which provide approximately 90% mappable sequences. Typically, more than 85% of mapped reads correspond to protein-coding genes and only 6% derive from non-coding RNAs. The protocol has been used to measure RNA transcript identity and abundance in tissues from flies, mice, rats, chickens, and frogs, demonstrating its general applicability.


Background
Strand-specific RNA sequencing (RNA-Seq) provides a powerful tool for transcriptome analysis. Besides measuring transcript abundance across the entire transcriptome, RNA-Seq facilitates de novo transcript annotation and assembly, quantification of splice site usage, and identification of mutations or polymorphisms between samples [1][2][3]. Ribosomal RNAs compose an overwhelming fraction of the total RNA population (>70%) and can occupy most of the sequencing space, leaving little room for investigating other transcripts [4]. The most widely used strategy employs poly(A) selection to enrich RNA polymerase II transcripts, but this strategy cannot be used to study RNAs lacking poly(A) tails or precursor transcripts processed into fragments that have lost their poly(A) tails; for example, 7SL RNA, 7SK RNA, the 5′ fragment of Argonaute cleavage products, processed products of PIWI-interacting RNA (piRNA) precursors, and long non-coding RNAs such as Kcnq1ot1 in mammals [5]. Another strategy removes rRNA by hybridization while retaining other non-adenylated RNAs for sequencing [5].
Although RNA can be sequenced directly, without conversion to cDNA, current high throughput technologies for direct RNA sequencing have short read lengths (25 to 55 nt; median 33 nt) and high error rates (4%) [6,7]. Therefore, current strategies for transcriptome analysis typically convert RNA to cDNA before sequencing [8][9][10][11], notwithstanding the artifacts that may result from template switching or structural RNA self-priming [8][9][10]. The deoxyuridine triphosphate (dUTP) method, one of the leading cDNA-based strategies, provides excellent library complexity, strand specificity, coverage evenness, agreement with known annotation and accuracy for expression profiling [12]. In this method, the RNA is first reverse transcribed into cDNA:RNA using random primers. To synthesize the second cDNA strand, dUTP instead of deoxythymidine triphosphate (dTTP) is used, marking the second cDNA strand for subsequent degradation with uracil-DNA glycosylase (UDG) to preserve strand information [13][14][15].
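Because only the first cDNA strand survives UDG treatment, the orientation of each aligned read reveals the strand of the RNA it came from. The following Python sketch illustrates that inference. It is not part of the protocol, and it assumes the standard Illumina paired-end orientation for dUTP libraries (the convention that TopHat calls fr-firststrand), in which read 1 is antisense to the RNA and read 2 is sense. The function name is hypothetical; the flag bits are those defined by the SAM format.

```python
# Illustrative sketch (assumption: dUTP/fr-firststrand paired-end library).
# SAM flag bits used here are defined by the SAM specification.
FLAG_REVERSE = 0x10  # read aligned to the reverse (-) genomic strand
FLAG_READ1 = 0x40    # first read of the pair
FLAG_READ2 = 0x80    # second read of the pair


def transcript_strand_from_flag(flag: int) -> str:
    """Return '+' or '-' for the genomic strand of the originating RNA."""
    is_read1 = bool(flag & FLAG_READ1)
    is_read2 = bool(flag & FLAG_READ2)
    if is_read1 == is_read2:
        raise ValueError("flag must mark the read as exactly one of read 1 or read 2")

    read_strand = "-" if flag & FLAG_REVERSE else "+"
    if is_read2:
        # Read 2 has the same sense as the RNA.
        return read_strand
    # Read 1 derives from the first-strand cDNA, so it is antisense to the RNA.
    return "+" if read_strand == "-" else "-"


if __name__ == "__main__":
    for flag in (83, 163, 99, 147):
        print(flag, transcript_strand_from_flag(flag))
```

Both mates of a properly paired fragment (for example, flags 83 and 163, or 99 and 147) should yield the same transcript strand, which offers a quick consistency check on an alignment file.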
Here, we describe a protocol for preparing strand-specific RNA-Seq libraries that combines rRNA removal using the Ribo-Zero rRNA Removal Kit (Epicentre, Madison, WI, USA) and the dUTP method for ensuring strand specificity (Figure 1). Our protocol shows advantages in time saved, cost and performance (Table 1). We replace laborious, time-consuming gel purification steps with AMPure XP beads (Beckman Coulter, Brea, CA, USA), whose size-selectivity and efficiency of DNA recovery allow the use of small amounts of starting RNA [16,17]. The high sequencing depth of the Illumina HiSeq 2000 platform (Illumina, San Diego, CA, USA) can easily generate >170 million reads per lane, allowing multiple barcoded samples to be pooled and sequenced in a single lane. One common method to index a library is to add barcodes during adapter ligation, so that the first five or six nucleotides of each read are the barcode. However, this strategy sacrifices read length, can increase the error rates at the 5′ or 3′ ends of reads [14], can perturb the calibration of the Illumina base calling algorithm (the HiSeq 2000 platform uses the first five nucleotides for calibration), and may lead to differential ligation efficiency and specificity among barcoded samples. Introducing barcodes during the final PCR amplification (Figure 1) bypasses these problems. The barcodes are then read using a separate primer and additional sequencing cycles after the insert has been sequenced (Figure 2). We modified the Illumina Multiplexing Sample Preparation Oligonucleotide Kit and used 12 barcoded primers to index 12 libraries at the final PCR step (Figure 1). Our protocol requires only 1 to 4 μg total RNA as starting material and takes no more than two days to complete.

Luria-Bertani (LB) agar kanamycin plates: 1% (w/v) tryptone, 0.5% (w/v) yeast extract, 1% (w/v) NaCl, 1.5% (w/v) agar and 50 μg/ml kanamycin.

Figure 2
Library and sequencing primer sequences.

Equipment
Water bath or heat block. Magnetic stand for 1.5 ml centrifuge tubes. Bench top centrifuge for 1.5 ml centrifuge tubes (17,000 × g required).

DNA oligonucleotides
Multiplexing adapters

Procedure

rRNA depletion
High quality total RNA is essential for efficient rRNA removal. For example, in our hands, Drosophila RNA subjected to repeated freeze-thawing or treated with DNase cannot be efficiently depleted of ribosomal RNA.

3. Place the tube in the magnetic stand for 5 minutes until the supernatant appears clear.
4. Discard the supernatant.
5. Keep the tube in the stand and add 180 μl of 70% (v/v) ethanol into the tube without disturbing the beads.
6. Wait for 30 seconds and discard the ethanol supernatant.
7. Repeat steps 5 and 6.
8. To remove any ethanol remaining on the sides of the tube, centrifuge the tube at 1000 × g for 1 minute.
9. Place the tube in the magnetic stand for 30 seconds, and then remove any residual ethanol using a 10 μl pipette.
10. Add the specified volume of elution buffer to the beads and pipette to mix.
11. Wait 3 minutes, and then place the tube in the magnetic stand for 3 minutes.
12. Use a 10 μl pipette to carefully transfer the eluted DNA to a new tube (sacrifice 1 to 2 μl to avoid carrying over any beads).

Quality control
Based on our experience, a good library will range in size from 300 to 500 bp, including 122 bp from the PCR primers plus 200 to 350 bp from the RNA inserts. Bioanalyzer analysis should show a peak at 320 to 330 bp (Figure 3A). Small-scale colony sequencing can reveal the size and sequence of the inserts and barcodes for a small but representative sample of the library. When preparing libraries for the first time, sequencing 10 to 20 colonies per library serves to validate successful library construction. The PCR amplification products should be approximately 600 bp (244 bp from the pCRII-Blunt-TOPO vector, 122 bp from the PCR primers, plus the RNA insert; Figure 3B). Expect one or two colonies to lack inserts, giving a 366 bp PCR product (Figure 3B). Of the remaining 15 to 18 successfully sequenced colony PCR products, one or two may derive from rRNA.
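The expected product sizes quoted above follow from simple addition, which the short Python sketch below makes explicit. The constants come from the text (244 bp of vector sequence and 122 bp of PCR primer sequence); the function names are hypothetical and serve only to illustrate the arithmetic.

```python
# Illustrative arithmetic only; sizes are taken from the quality control text.
VECTOR_BP = 244  # pCRII-Blunt-TOPO sequence included in the colony PCR product
PRIMER_BP = 122  # sequence contributed by the library PCR primers


def library_size(insert_bp: int) -> int:
    """Size of a library molecule: PCR primer sequence plus RNA-derived insert."""
    if insert_bp < 0:
        raise ValueError("insert_bp must be non-negative")
    return PRIMER_BP + insert_bp


def colony_pcr_size(insert_bp: int) -> int:
    """Size of the colony PCR product: vector plus primer sequence plus insert."""
    return VECTOR_BP + library_size(insert_bp)


if __name__ == "__main__":
    print(colony_pcr_size(0))    # clone lacking an insert
    for insert in (200, 350):    # insert range given in the text
        print(insert, library_size(insert), colony_pcr_size(insert))
```

A clone without an insert gives 244 + 122 = 366 bp, and a 200 bp insert gives a 322 bp library molecule, consistent with the Bioanalyzer peak at 320 to 330 bp.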

High throughput sequencing
The number of samples mixed in one sequencing lane depends on the genome size of the organism and the purpose of the research. To study low abundance RNAs from the repetitive region of the Drosophila genome, we usually pool four barcoded samples in a single lane of the HiSeq 2000 instrument, and sequence the libraries as 100 nt paired-end reads. We typically obtain >170,000,000 fragments per lane. For example, in one experiment, we obtained 175,991,972 fragments (for paired-end sequencing, each fragment has two reads, a total of 351,983,944 reads). Among them, 349,247,868 (99.2%) reads were successfully sorted by the barcodes.
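The read accounting in the example above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, using the fragment and sorted-read counts reported in the text; the function name and the dictionary keys are hypothetical.

```python
# Illustrative arithmetic only; counts are those reported for one lane in the text.
def lane_summary(fragments: int, sorted_reads: int, reads_per_fragment: int = 2) -> dict:
    """Total reads in a lane and the fraction assigned to a barcode."""
    total_reads = fragments * reads_per_fragment  # paired-end: two reads per fragment
    if not 0 <= sorted_reads <= total_reads:
        raise ValueError("sorted_reads must lie between 0 and the total read count")
    return {
        "total_reads": total_reads,
        "sorted_fraction": sorted_reads / total_reads,
    }


summary = lane_summary(fragments=175_991_972, sorted_reads=349_247_868)

if __name__ == "__main__":
    print(summary["total_reads"])               # 351983944
    print(f"{summary['sorted_fraction']:.1%}")  # 99.2%
```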
Using TopHat [20][21][22] to map reads to the fly genome (parameters: tophat -i 50 -p 24 --library-type fr-firststrand -G gene.gtf --coverage-search --segment-length 25 -o output_directory_name), we typically achieve approximately 90% mappability. For example, for a typical library, 91.7% of reads mapped to the fly genome. Among the mapped reads, only 4.03% were singletons (that is, only one of the paired reads in the fragment mapped); both reads mapped for the rest. Finally, more than 85% of mapped reads corresponded to protein-coding genes, and only 6.20% derived from non-coding RNAs such as rRNA, tRNA, snRNA or snoRNA. We have used this protocol to produce libraries of similar quality from wild-type and mutant mouse tissues, as well as tissues from wild-type rat, chicken, and frog, demonstrating its general applicability.

Competing interests
PDZ is a cofounder and member of the scientific advisory board of Alnylam Pharmaceuticals, Inc., and a member of the scientific advisory board of Regulus Therapeutics, LLC.

Authors' contributions
ZZ and PDZ planned the experiments and wrote the manuscript. ZZ performed the experiments. All authors read and approved the final manuscript.