RNA Sequencing Quality Control


At A Glance

01 Overview of RNA-Seq QC
02 RNA QC
03 Sequencing QC
04 Raw Read Data QC
05 Alignment QC
06 Gene Expression Data QC
07 Statistical and Computational QC

In the field of genomics, RNA-Seq has revolutionized the study of gene expression by providing a comprehensive and quantitative snapshot of the transcriptome. However, the success of RNA-Seq studies relies heavily on the quality control (QC) measures implemented throughout the experimental workflow. This article provides a comprehensive overview of RNA-Seq QC, emphasizing its significance in producing reliable and accurate results.

Overview of RNA-Seq QC

Quality control in RNA-Seq involves a series of measures to assess and address the challenges arising from biological and technical variability. These challenges can lead to biased or erroneous results if not adequately controlled. By implementing QC measures at various stages of the RNA-Seq workflow, researchers can mitigate potential issues and enhance data quality.

The overall workflow of RNA-seq QC. (Sheng et al., 2017)

RNA QC

A crucial aspect of RNA-Seq QC begins before sequencing, during sample collection and handling. The quality of the biological material significantly impacts downstream analysis. Therefore, best practices for sample collection and preservation must be followed meticulously. Additionally, RNA extraction and purification methods should be carefully selected, considering their potential limitations. Assessing RNA integrity and purity ensures that only high-quality samples proceed to the library preparation stage.

RNA quality can be evaluated by measuring the ratio of 28S:18S ribosomal RNAs (rRNAs), which indicates RNA integrity. Integrity is commonly reported as an RNA integrity number (RIN) on a scale from 1 (fully degraded) to 10 (intact). Higher RIN values correspond to better RNA integrity and are considered desirable for downstream analysis. However, formalin-fixed paraffin-embedded (FFPE) tissues, known for low-quality RNA, often yield RIN values ranging from 2 to 5. Unfortunately, there is no reliable assay yet to predict the success of FFPE tissue-based RNA-seq experiments; in practice, the likelihood of success is judged from factors such as storage time and conditions, fixation time, and specimen size.

Sequencing QC

The choice of sequencing technology and platform plays a pivotal role in RNA-Seq QC. Different platforms have varying error profiles, read lengths, and sequencing depths. Researchers must carefully consider these factors based on their specific study objectives. During sequencing, QC measures focus on evaluating the sequence quality, such as base call quality scores and per-base sequence quality metrics. Additionally, assessing sequencing depth and coverage uniformity ensures adequate representation of the transcriptome.
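To make the base call quality metrics concrete, the following is a minimal sketch of how per-cycle quality could be summarized from FASTQ quality strings. It assumes Phred+33 encoding, and the function names and toy quality strings are hypothetical, chosen only for illustration.

```python
from typing import List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode one FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q: int) -> float:
    """Convert a Phred score Q to a base-call error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_quality_per_cycle(quality_lines: List[str]) -> List[float]:
    """Average Phred score at each cycle (read position) across all reads."""
    max_len = max(len(q) for q in quality_lines)
    totals = [0] * max_len
    counts = [0] * max_len
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings, invented for illustration ('I' = Q40, '5' = Q20, '#' = Q2)
toy_qualities = ["IIII", "II5#"]
per_cycle = mean_quality_per_cycle(toy_qualities)
print(per_cycle)
```

A drop in the per-cycle mean toward the 3' end of the read, as in the toy data, is the kind of pattern these metrics are meant to expose.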

Raw Read Data QC

Once the sequencing is complete, post-sequencing QC steps are crucial for data cleaning and analysis. Raw sequencing data undergoes quality control procedures, including adapter trimming and removal of low-quality bases. Contaminating reads are filtered out to prevent false positives or erroneous results. Alignment and mapping QC evaluate the efficiency and quality of the alignment process, with a particular focus on detecting batch effects or sample mix-ups. Gene expression quantification is assessed to ensure accurate measurement and identification of potential outliers.
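The adapter trimming and low-quality base removal steps can be sketched as follows. This is a simplified illustration rather than the method of any particular tool: it assumes exact adapter matching (dedicated trimmers also tolerate mismatches), Phred+33 quality encoding, and an adapter sequence that must be replaced with the one used by the library kit. All function names and thresholds are hypothetical.

```python
from typing import Tuple


def trim_adapter(seq: str, qual: str, adapter: str, min_overlap: int = 3) -> Tuple[str, str]:
    """Remove a 3' adapter: a full match anywhere, or a partial match at the read end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the longest adapter prefix hanging off the 3' end of the read
        for overlap in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                idx = len(seq) - overlap
                break
    if idx == -1:
        return seq, qual  # no adapter found, read unchanged
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q (assumes Phred+33)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Assumed example adapter; substitute the sequence for your library kit
ADAPTER = "AGATCGGAAGAGC"

# Toy read ending in the first six bases of the adapter
seq, qual = trim_adapter("ACGTACGTAGATCG", "IIIIII##IIIIII", ADAPTER)
seq, qual = trim_low_quality_tail(seq, qual)
print(seq, qual)
```

Trimming the adapter first and the low-quality tail second is a design choice of this sketch; the order matters little here but keeps each step easy to check in isolation.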

Raw Read Data QC. (Sheng et al., 2017)

Adapter Trimming: Raw reads often contain adapter sequences that were used during library preparation. These adapters are removed to avoid interference with downstream analysis.
Removal of Low-Quality Bases: Bases with low quality scores, indicating a higher likelihood of errors, are filtered out. This helps to improve the overall data quality and accuracy of subsequent analyses.
Contaminating Read Filtering: Any reads originating from contaminants or foreign species are identified and filtered out. This step is crucial to prevent false positives and ensure the reliability of the data.
Alignment and Mapping QC: The alignment process is assessed to evaluate its efficiency and accuracy. Special attention is given to detecting batch effects or sample mix-ups, which could impact the interpretation of the results.
Gene Expression Quantification: Accurate quantification of gene expression is essential in RNA-seq analysis. QC measures are employed to ensure the reliability of gene expression measurements and to identify any potential outliers in the data.
Nucleotide Distribution: The per-cycle nucleotide distribution helps identify outlier samples with unusual patterns. Ideally, the proportions of the four nucleotides (A, T, C, and G) remain stable across all cycles, and significant fluctuations indicate lower data quality. Such fluctuations can occur when a particular small RNA (sRNA) is overrepresented, causing the distribution of bases at each position to reflect the sequence of that sRNA. Conversely, an even distribution of bases at each position indicates high-quality sequencing in which no single sequence, such as an adapter, dominates.
GC Content: The expected GC content varies across species and genomic regions, and a deviation of more than 10% from the expected value may indicate contamination. For example, in total RNA-seq, the expected GC content falls between the expected GC content of long noncoding RNA (lncRNA) and coding RNA. However, it is important to note that nucleotide distribution by cycle and GC content are not reliable parameters for assessing sequencing quality in sRNA-seq data, as the library includes all species of sRNAs, including miRNAs with known overrepresentation issues.
Others: Overrepresented sequences can indicate biologically relevant RNA sequences or contamination from adapters or other sources. Adapters, which are short sequences ligated to sRNAs during library preparation, may be partially or completely sequenced, leading to adapter contamination. K-mer analysis, which counts the occurrences of every nucleotide sequence of a fixed length K in the reads, can reveal short duplicated sequences that may be missed by overrepresented-sequence analysis, which targets longer sequences.
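The nucleotide distribution, GC content, and K-mer checks in the list above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and toy reads. One assumption to note: the "more than 10%" GC deviation is interpreted here as 10 percentage points (0.10 as a fraction), which the text does not specify.

```python
from collections import Counter
from typing import Dict, List


def nucleotide_fraction_by_cycle(reads: List[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, T at each cycle; other symbols (e.g. N) count toward the total."""
    max_len = max(len(r) for r in reads)
    fractions = []
    for i in range(max_len):
        counts = Counter(r[i] for r in reads if i < len(r))
        total = sum(counts.values())
        fractions.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return fractions


def gc_content(reads: List[str]) -> float:
    """Overall fraction of G and C bases across all reads."""
    gc = sum(r.count("G") + r.count("C") for r in reads)
    return gc / sum(len(r) for r in reads)


def gc_deviates(observed: float, expected: float, tolerance: float = 0.10) -> bool:
    """True if observed GC differs from expected by more than the tolerance (assumed absolute)."""
    return abs(observed - expected) > tolerance


def kmer_counts(reads: List[str], k: int) -> Counter:
    """Count every K-mer (substring of length k) occurring in the reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts


# Toy reads, invented for illustration
toy_reads = ["ACGT", "ACGG", "ACTT", "AGGT"]
by_cycle = nucleotide_fraction_by_cycle(toy_reads)
observed_gc = gc_content(toy_reads)
top_kmer = kmer_counts(toy_reads, 2).most_common(1)
print(by_cycle[0], observed_gc, top_kmer)
```

In the toy data every read starts with A, so the first cycle is fully dominated by one base, which is the kind of skew the nucleotide distribution check is designed to flag.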

Alignment QC

Alignment QC involves screening sequencing data after alignment to the reference genome. Alignment determines the best location for each read, minimizing mismatches. Raw data QC may not detect certain issues like capture efficiency and RNA contamination, which can be identified through alignment QC.

Capture Efficiency: This measures the percentage of sequenced reads mapped to the target region, which is typically between 50% and 80% for high-throughput sequencing. Low capture efficiency can be due to low sample quality, inefficient removal of rRNAs, or RNA contamination from another source. Low-quality samples are challenging to salvage using bioinformatics methods, so it is usually best to prepare a new library.
Insert Size & Mapping Quality Score (MAPQ): In BWA, MAPQ is a Phred-scaled estimate of the probability that the read is mapped to the wrong position, so higher values indicate more confident placements, while in RNA-seq aligners like TopHat and STAR, MAPQ indicates the uniqueness of the alignment. Insert size, often overlooked for paired-end sequencing data, reflects the length of RNA fragments. It should follow a non-normal distribution with a peak at the targeted size and a long right tail.

Insert size distribution (Sheng et al., 2017)
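The MAPQ and insert size checks can be illustrated with a short sketch. The Phred-scaled conversion applies only to aligners that report MAPQ as a mapping error estimate (BWA in the text above), not to aligners where MAPQ encodes uniqueness. The function names, bin width, and toy insert sizes are hypothetical, and comparing the mean with the median is only a crude proxy for a long right tail.

```python
from collections import Counter
from statistics import mean, median
from typing import List


def mapq_to_error_probability(mapq: int) -> float:
    """Phred-scaled MAPQ to probability of a wrong placement, p = 10^(-MAPQ/10)."""
    return 10 ** (-mapq / 10)


def peak_insert_size_bin(insert_sizes: List[int], bin_width: int = 10) -> int:
    """Start of the most populated bin, as a rough estimate of the distribution peak."""
    bins = Counter((size // bin_width) * bin_width for size in insert_sizes)
    return bins.most_common(1)[0][0]


def has_right_tail(insert_sizes: List[int]) -> bool:
    """Crude skew check: a long right tail pulls the mean above the median."""
    return mean(insert_sizes) > median(insert_sizes)


# Toy insert sizes in base pairs, invented for illustration
toy_insert_sizes = [150, 155, 158, 160, 162, 170, 185, 210, 260, 340]
peak = peak_insert_size_bin(toy_insert_sizes)
skewed = has_right_tail(toy_insert_sizes)
print(peak, median(toy_insert_sizes), mean(toy_insert_sizes), skewed)
```

In practice the insert sizes would come from the alignment file, and the estimated peak would be compared with the fragment size targeted during library preparation.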

Bias: Nonuniform coverage of transcripts is common in poly(A)-selected RNA-seq, where the 3' end is overrepresented. Bias at the 5' end can also occur due to fragmentation methods and library construction protocols. GC content affects sequencing coverage, and both high and low GC content can result in lower coverage depth. Biases can be corrected using specific methods.
Strand Specificity: Strand orientation in long RNA library construction is important for obtaining reads from the intended strand. A small percentage of anti-sense RNA may still be sequenced. Strand-specific library construction efficiency can be assessed using tools like RNA-SeQC.
Batch Effects: As in raw data QC, batch effects can be detected in alignment QC. Associations between alignment rate or capture efficiency and factors like machine, run, lane, and flow cell can indicate batch effects. Tools like QC3 provide functionality for batch effect detection using alignment files.
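As a rough illustration of the capture efficiency and batch effect checks above, the sketch below computes capture efficiency per sample, flags samples below the 50% lower bound mentioned in the text, and averages the metric by lane. It is not how QC3 or any other tool works internally. The sample names, read counts, and lane labels are invented, and a difference in group means is only a prompt for closer inspection, not a statistical test.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def capture_efficiency(on_target_reads: int, total_reads: int) -> float:
    """Fraction of sequenced reads that map to the target region."""
    return on_target_reads / total_reads


def flag_low_capture(efficiencies: Dict[str, float], lower_bound: float = 0.50) -> List[str]:
    """Samples whose capture efficiency falls below the lower bound."""
    return [sample for sample, eff in efficiencies.items() if eff < lower_bound]


def mean_by_batch(values: Dict[str, float], batch_of: Dict[str, str]) -> Dict[str, float]:
    """Average a per-sample metric within each batch (e.g. lane, run, or flow cell)."""
    groups = defaultdict(list)
    for sample, value in values.items():
        groups[batch_of[sample]].append(value)
    return {batch: mean(vals) for batch, vals in groups.items()}


# Hypothetical samples: (reads on target, total reads) and lane assignment
toy_counts: Dict[str, Tuple[int, int]] = {
    "S1": (700, 1000), "S2": (650, 1000), "S3": (400, 1000), "S4": (450, 1000),
}
toy_lanes = {"S1": "lane1", "S2": "lane1", "S3": "lane2", "S4": "lane2"}

efficiencies = {s: capture_efficiency(on, total) for s, (on, total) in toy_counts.items()}
low_samples = flag_low_capture(efficiencies)
lane_means = mean_by_batch(efficiencies, toy_lanes)
print(low_samples, lane_means)
```

Here both flagged samples come from the same lane, which is the kind of association between a QC metric and a technical factor that suggests a batch effect.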

Gene Expression Data QC

Expression QC involves quality control for gene expression data obtained using tools like HTSeq and Cufflinks. Clustering analysis helps identify samples with similar gene expression patterns, but unbiased methods and data normalization are crucial. Persistent outliers may indicate misrepresentation, weak phenotypes, mislabeling, or cross-contamination, which can be validated through heterozygous genotype consistency analysis.
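One simple way to look for outlier samples, offered here as a sketch and not as the method of any specific tool, is to compute pairwise Pearson correlations on log-transformed expression values and look for the sample that correlates least with the rest. The function names and the toy expression matrix are hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_correlation_per_sample(expression: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean correlation of each sample with all others, on log2(x + 1) values."""
    logged = {s: [math.log2(v + 1) for v in vals] for s, vals in expression.items()}
    result = {}
    for sample in logged:
        others = [pearson(logged[sample], logged[t]) for t in logged if t != sample]
        result[sample] = sum(others) / len(others)
    return result


# Toy expression values for five genes in four samples, invented for illustration
toy_expression = {
    "A": [10, 200, 50, 5, 1000],
    "B": [12, 180, 60, 4, 900],
    "C": [9, 220, 45, 6, 1100],
    "D": [900, 5, 300, 400, 8],  # deliberately unlike the others
}
mean_corr = mean_correlation_per_sample(toy_expression)
candidate_outlier = min(mean_corr, key=mean_corr.get)
print(candidate_outlier, mean_corr)
```

A sample flagged this way is only a candidate. As the text notes, persistent outliers still need to be checked for mislabeling or cross-contamination before being excluded.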

In loss- or gain-of-function experiments, whether the intended change in gene function was achieved should be checked in the expression data; overexpression designs typically produce larger fold changes. A minimum RPKM (reads per kilobase of transcript per million mapped reads) detection threshold of 1.0 is recommended to avoid misleading results. Gene detection bias in RNA-seq arises from uneven read counts, emphasizing the need for batch effect correction.
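The RPKM detection threshold can be made concrete with a short worked example. RPKM is computed as read count × 10^9 / (gene length in bp × total mapped reads). The gene names, lengths, and counts below are invented for illustration.

```python
from typing import Dict, List, Tuple


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def detected_genes(
    genes: Dict[str, Tuple[int, int]], total_mapped_reads: int, threshold: float = 1.0
) -> List[str]:
    """Genes whose RPKM meets the detection threshold (1.0 as recommended in the text)."""
    return [
        name
        for name, (count, length) in genes.items()
        if rpkm(count, length, total_mapped_reads) >= threshold
    ]


# Hypothetical genes: (read count, gene length in bp), with 10 million mapped reads
toy_genes = {"geneA": (500, 2000), "geneB": (6, 1500)}
TOTAL_MAPPED = 10_000_000

gene_a_rpkm = rpkm(500, 2000, TOTAL_MAPPED)   # 25.0
gene_b_rpkm = rpkm(6, 1500, TOTAL_MAPPED)     # 0.4
kept = detected_genes(toy_genes, TOTAL_MAPPED)
print(gene_a_rpkm, gene_b_rpkm, kept)
```

With these numbers geneB falls below the threshold and would be treated as not reliably detected.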

Statistical and Computational QC

Beyond the raw data analysis, statistical and computational QC methods are essential for data normalization and exploratory analysis. Normalization techniques address technical biases, such as batch effects, and ensure comparability across samples. Various visualization and exploratory analysis approaches, such as principal component analysis (PCA), heatmaps, and clustering, aid in identifying patterns and structure within the data. Detecting and addressing confounding factors, for example by identifying sample outliers and adjusting for covariates, is critical to maintaining data integrity.
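As a minimal example of normalization for comparability across samples, the sketch below scales raw counts to counts per million (CPM) and applies a log transform. CPM is used here only because it is the simplest library-size correction. It is an assumption of this sketch rather than a method named in the text, and it does not by itself remove batch effects. Function names and counts are hypothetical.

```python
import math
from typing import Dict, List


def counts_per_million(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Scale each sample's gene counts by its library size (total counts), per million."""
    normalized = {}
    for sample, values in counts.items():
        library_size = sum(values)
        normalized[sample] = [v * 1e6 / library_size for v in values]
    return normalized


def log2_transform(values: List[float], pseudocount: float = 1.0) -> List[float]:
    """log2(x + pseudocount), commonly applied before PCA, heatmaps, or clustering."""
    return [math.log2(v + pseudocount) for v in values]


# Toy counts for three genes; S2 was sequenced twice as deeply as S1
toy_counts = {"S1": [100, 300, 600], "S2": [200, 600, 1200]}
cpm = counts_per_million(toy_counts)
log_cpm = {sample: log2_transform(values) for sample, values in cpm.items()}
print(cpm)
```

After scaling, the two toy samples have identical profiles, showing that their raw differences were due to sequencing depth alone.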

Reference:

Sheng, Quanhu, et al. "Multi-perspective quality control of Illumina RNA sequencing data analysis." Briefings in Functional Genomics 16.4 (2017): 194-204.
