3 Quality Control

Phred Quality Score

The Phred quality score \(Q\) measures the quality of each base call in a sequencing read. \(P\) is the probability that the base was called incorrectly, and \(Q\) is logarithmically related to \(P\). The formula is:

\[Q = -10 \log_{10} P\]
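
As a quick numeric check of this relationship, the following minimal Python sketch converts between \(Q\) and \(P\). The function names are illustrative and are not part of the pipeline that produced this report.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the base-calling error probability P for a Phred score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return the Phred score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: P = {phred_to_error_prob(q):g}")
```

Under this definition, Q20 corresponds to one expected error per 100 bases and Q30 to one per 1000 bases.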

The following table summarizes the sequencing quality scores for all samples.

Table 3.1: Summary of sequencing quality by sample
data total.bases q20.bases q30.bases q20.percents q30.percents sample
green_0.R1.fastq 1.179e+09 1.152e+09 1.104e+09 97.66 93.65 green_0
green_0.R2.fastq 1.179e+09 1.126e+09 1.055e+09 95.46 89.43 green_0
green_1.R1.fastq 1.180e+09 1.152e+09 1.105e+09 97.66 93.65 green_1
green_1.R2.fastq 1.180e+09 1.126e+09 1.055e+09 95.46 89.44 green_1
green_2.R1.fastq 1.180e+09 1.152e+09 1.105e+09 97.66 93.64 green_2
green_2.R2.fastq 1.180e+09 1.126e+09 1.055e+09 95.46 89.44 green_2
green_3.R1.fastq 1.180e+09 1.152e+09 1.105e+09 97.66 93.65 green_3
green_3.R2.fastq 1.180e+09 1.126e+09 1.055e+09 95.46 89.44 green_3
yellow_0.R1.fastq 8.834e+08 8.608e+08 8.243e+08 97.44 93.31 yellow_0
yellow_0.R2.fastq 8.834e+08 8.401e+08 7.861e+08 95.09 88.98 yellow_0
yellow_1.R1.fastq 8.833e+08 8.607e+08 8.242e+08 97.44 93.31 yellow_1
yellow_1.R2.fastq 8.833e+08 8.400e+08 7.860e+08 95.09 88.99 yellow_1
yellow_2.R1.fastq 8.833e+08 8.607e+08 8.242e+08 97.44 93.31 yellow_2
yellow_2.R2.fastq 8.833e+08 8.400e+08 7.860e+08 95.09 88.98 yellow_2
yellow_3.R1.fastq 8.833e+08 8.607e+08 8.242e+08 97.44 93.31 yellow_3
yellow_3.R2.fastq 8.833e+08 8.400e+08 7.860e+08 95.09 88.98 yellow_3

Column Descriptions

sample : The sample name/ID
total.bases : Total number of bases in the raw sequencing data
q20(q30).bases : Number of bases with a quality score of at least Q20 (Q30)
q20(q30).percents : Percentage of bases with a quality score of at least Q20 (Q30)

R1 and R2 denote the two mates of paired-end data.
Q30: error rate = 1/1000; Q20: error rate = 1/100
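
The columns above can be illustrated with a short Python sketch that tallies Q20 and Q30 bases from FASTQ quality strings. It assumes Phred+33 (Sanger / Illumina 1.8+) encoding, which this report does not state, and it is not the tool that generated Table 3.1; the function name and the dictionary layout are illustrative only.

```python
from typing import Dict, Iterable


def summarize_quality(quality_lines: Iterable[str], offset: int = 33) -> Dict[str, float]:
    """Count total, Q>=20 and Q>=30 bases from FASTQ quality strings.

    `offset` is the ASCII offset of the encoding (33 is an assumption here).
    """
    total = q20 = q30 = 0
    for line in quality_lines:
        for ch in line.rstrip("\r\n"):
            q = ord(ch) - offset  # decode one character to a Phred score
            total += 1
            if q >= 20:
                q20 += 1
            if q >= 30:
                q30 += 1
    return {
        "total.bases": total,
        "q20.bases": q20,
        "q30.bases": q30,
        "q20.percents": 100.0 * q20 / total if total else 0.0,
        "q30.percents": 100.0 * q30 / total if total else 0.0,
    }


if __name__ == "__main__":
    # 'I' decodes to Q40, '5' to Q20, '+' to Q10 under Phred+33.
    print(summarize_quality(["II5+"]))
```

In a real FASTQ file the quality string is every fourth line of each record, so a caller would pass only those lines to the function.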

3.1 FastQC Check

Method

FastQC (v0.11.8) was used to check the quality of the raw reads, and MultiQC (v1.8) was used to aggregate the quality reports of the individual sample files.

FastQC quickly assesses the quality of sequencing data and generates one quality report per sample. When there are many samples, inspecting the reports one by one is inconvenient. MultiQC merges the reports generated by FastQC into a single report, which makes it easy to view and compare the quality of all sequencing data.
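
The two-step workflow described here can be sketched in Python as follows. The exact options used for this report are not stated, so the sketch only assumes the standard `--outdir` option of both tools; the helper name `build_qc_commands` and the example paths are hypothetical.

```python
from typing import List, Sequence


def build_qc_commands(fastq_files: Sequence[str], out_dir: str) -> List[List[str]]:
    """Build the FastQC and MultiQC command lines (hypothetical helper)."""
    # FastQC writes one report per input file into out_dir.
    fastqc_cmd = ["fastqc", "--outdir", out_dir, *fastq_files]
    # MultiQC scans out_dir for FastQC results and writes one combined report.
    multiqc_cmd = ["multiqc", out_dir, "--outdir", out_dir]
    return [fastqc_cmd, multiqc_cmd]


if __name__ == "__main__":
    files = ["green_0.R1.fastq", "green_0.R2.fastq"]
    for cmd in build_qc_commands(files, "qc_reports"):
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where both tools are installed and the output directory already exists.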

The figure below summarizes the overall distribution of sequencing quality across all reads.

Mean quality scores of bases in reads : pdf

  • More details of the FastQC results : summary