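10 Differential Expression

This chapter identifies the transcripts and genes that are differentially expressed between groups. The differentially expressed genes are then selected for further analysis.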

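During the transcript assembly described earlier, Trinity uses read evidence and sequence overlap to define a loose notion of “gene”, and each “gene” may correspond to several transcripts. Because of paralogous genes, overlapping genes, sequencing errors, sequencing depth, and similar factors, the assembled transcripts may include a number of spurious “genes”.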

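To improve the analysis of differentially expressed genes, the assembled transcripts are first regrouped into clusters with Corset. Each cluster represents one gene and replaces the loose “gene” concept defined during transcriptome assembly.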

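Corset is a tool officially recommended by Trinity. Building on the Trinity assembly, it first groups transcripts into clusters according to the reads they share (shared reads). It then combines transcript expression levels across samples with a hierarchical clustering (H-Cluster) algorithm to separate transcripts whose expression differs between samples from their original cluster into new clusters. Each final cluster is defined as a “Gene”. This approach merges redundant transcripts and improves the detection rate of differentially expressed genes.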
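The first grouping step described above can be illustrated with a deliberately simplified sketch. It assumes that any shared read links two transcripts and returns the resulting connected groups; it does not reproduce Corset's actual distance measure or its expression-based splitting, and the read and transcript names are hypothetical.

```python
from collections import defaultdict


def group_by_shared_reads(read_to_transcripts):
    """Group transcripts linked, directly or transitively, by shared reads."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for transcripts in read_to_transcripts.values():
        if not transcripts:
            continue
        first = transcripts[0]
        find(first)  # register transcripts that share no read with others
        for other in transcripts[1:]:
            union(first, other)

    groups = defaultdict(set)
    for transcript in list(parent):
        groups[find(transcript)].add(transcript)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy input: each read lists the transcripts it maps to.
reads = {
    "read1": ["t1", "t2"],
    "read2": ["t2", "t3"],
    "read3": ["t4"],
}
clusters = group_by_shared_reads(reads)
print(clusters)
```

Here t1, t2 and t3 end up in one group because read1 and read2 chain them together, while t4 stays on its own.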

Method

Transcripts were quantified with salmon (v0.15.0) using the “--dumpEq” option. Corset (v1.09) processed the eq_classes.txt file from the salmon output to produce the clusters and the corresponding matrix of read counts. edgeR (v3.28.0) was used to identify differentially expressed transcripts and genes. In both cases the abundance filter was: CPM greater than 1 in at least 2 samples.
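The abundance filter can be sketched in a few lines. This is an illustration under stated assumptions, not the edgeR code used for the report: CPM is computed from raw library sizes without any normalization factor, every library is assumed to be non-empty, and the feature names and counts are hypothetical.

```python
def counts_to_cpm(counts):
    """Convert {feature: [read count per sample]} to counts per million (CPM)."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads counted in each sample (assumed non-zero).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        feature: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for feature, row in counts.items()
    }


def filter_by_abundance(counts, min_cpm=1.0, min_samples=2):
    """Keep features whose CPM exceeds min_cpm in at least min_samples samples."""
    cpm = counts_to_cpm(counts)
    return {
        feature: counts[feature]
        for feature, row in cpm.items()
        if sum(value > min_cpm for value in row) >= min_samples
    }


# Hypothetical toy counts for three samples; each library sums to 1,000,000 reads.
toy_counts = {
    "gene_a": [500000, 400000, 600000],
    "gene_b": [1, 0, 0],
    "gene_c": [499999, 600000, 400000],
}
kept = filter_by_abundance(toy_counts)
print(sorted(kept))
```

In this toy input gene_b reaches a CPM of exactly 1 in a single sample, so it fails the "greater than 1 in at least 2 samples" rule and is dropped.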

10.1 Format of the differential expression result files

Table 10.1: File format of the DEG results
	sampleA	sampleB	logFC	logCPM
TRINITY_DN3_c0_g1_i2	green	yellow	-18.87	12.526
TRINITY_DN10_c0_g1_i9	green	yellow	17.49	11.309
TRINITY_DN10809_c1_g1_i1	green	yellow	-17.24	10.828
TRINITY_DN30_c8_g1_i1	green	yellow	-16.50	10.057
TRINITY_DN51_c0_g1_i4	green	yellow	16.14	10.003
TRINITY_DN1217_c0_g1_i2	green	yellow	-16.05	9.588
TRINITY_DN70385_c0_g1_i1	green	yellow	-15.57	9.085
TRINITY_DN51_c0_g1_i9	green	yellow	-15.51	9.025
TRINITY_DN59_c0_g1_i10	green	yellow	15.42	9.305
TRINITY_DN18_c0_g2_i1	green	yellow	-15.37	8.883

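Column descriptions

Transcript/gene ID;
sampleA/B: sample group A and sample group B being compared; each row gives the result of the differential expression comparison of sample group A versus sample group B;
logFC: log2(fold change), the base-2 logarithm of the fold change; for example, logFC = 2 corresponds to a 2^2 = 4-fold difference in expression;
logCPM: log2(CPM), the logarithm of the mean CPM; CPM: counts per million;
Pvalue: P-value of the differential expression test;
FDR: FDR-adjusted P-value of the differential expression test, i.e. the q-value;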
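As a small illustration of the logarithmic columns, the sketch below parses one row of the table and converts logFC and logCPM back to the linear scale. The function names and dictionary keys are hypothetical, and only the first five tab-separated fields are read.

```python
def parse_de_row(line):
    """Parse the first five tab-separated fields of one result row."""
    feature, sample_a, sample_b, log_fc, log_cpm = line.rstrip("\n").split("\t")[:5]
    return {
        "id": feature,
        "sampleA": sample_a,
        "sampleB": sample_b,
        "logFC": float(log_fc),
        "logCPM": float(log_cpm),
    }


def log2_to_linear(value):
    """Invert a base-2 logarithm, e.g. logFC = 2 gives a 4-fold change."""
    return 2.0 ** value


row = parse_de_row("TRINITY_DN3_c0_g1_i2\tgreen\tyellow\t-18.87\t12.526")
fold = log2_to_linear(abs(row["logFC"]))  # magnitude of the fold change
mean_cpm = log2_to_linear(row["logCPM"])  # mean CPM on the linear scale
print(row["id"], fold, mean_cpm)
```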

MA-plot

An MA-plot is an application of the Bland–Altman plot, used mainly to visualise genomic data and show how the data are distributed. It transforms the data onto the M (log ratio) and A (mean average) scales and plots these values to visualise the differences between measurements taken in two samples.

Each point represents a gene. The x-axis is calculated as \(log_2(\frac{TMM_A+1}{TMM_B+1})\) and the y-axis as \(\frac{1}{2}[log_2(counts_A+1)+log_2(counts_B+1)]\). Points with q-value < 0.001 are marked in red.
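The two quantities in the formulas above can be computed directly. The sketch below is only an illustration: the function names are hypothetical, and the caller decides which expression values to pass to each function.

```python
import math


def log_ratio(value_a, value_b):
    """M value: log2 ratio of two measurements, with a pseudocount of 1."""
    return math.log2((value_a + 1) / (value_b + 1))


def mean_log(value_a, value_b):
    """A value: mean of the two log2 measurements, with a pseudocount of 1."""
    return 0.5 * (math.log2(value_a + 1) + math.log2(value_b + 1))


# Hypothetical values: 7 in sample A and 1 in sample B.
m_value = log_ratio(7, 1)  # log2(8 / 2) = 2
a_value = mean_log(7, 1)   # (log2(8) + log2(2)) / 2 = 2
print(m_value, a_value)
```

The pseudocount of 1 keeps both values defined when a gene has zero abundance in one of the samples.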

Volcano plot

A volcano plot is a type of scatter plot used to quickly identify changes in large data sets composed of replicate data. It plots significance against fold change, on the y-axis and x-axis respectively.

Each point represents a gene. The x-axis is \(log_2(\frac{TMM_A+1}{TMM_B+1})\) and the y-axis is \(log_{10}(FDR)\). Colours are used in the same way as in the MA-plot.
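A minimal sketch of how one point of such a plot could be derived is given below. It follows the y-axis definition stated above, \(log_{10}(FDR)\); many volcano plots instead show the negated value so that significant genes appear at the top. The `min_fdr` floor and the colour "black" for non-significant genes are assumptions made for the example and are not taken from the report.

```python
import math


def volcano_point(log2_fc, fdr, q_cutoff=0.001, min_fdr=1e-300):
    """Return (x, y, colour) for one gene on a volcano plot."""
    # Floor the FDR (assumed value) so that log10 stays defined when FDR is 0.
    y = math.log10(max(fdr, min_fdr))
    # Red marks q-value < q_cutoff; "black" for the rest is an assumption.
    colour = "red" if fdr < q_cutoff else "black"
    return log2_fc, y, colour


significant = volcano_point(2.0, 1e-5)
background = volcano_point(0.5, 0.2)
print(significant, background)
```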

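10.2 Differential expression of transcripts

10.2.1 Transcript differential expression results

Differentially expressed transcripts were selected using the criteria p < 0.001 and fold change > 4, giving 18255 differentially expressed transcripts in total: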

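Abundance of the differentially expressed transcripts in each sample: counts matrix
Normalized abundance matrix of the differentially expressed transcripts in each sample: centered matrix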
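The report does not spell out how the centered matrix is derived. One common approach, shown here purely as an assumption, is to log2-transform each value with a pseudocount of 1 and then subtract the row mean, so that every transcript is centred on zero. The transcript names and values are hypothetical.

```python
import math


def center_rows(matrix):
    """log2(x + 1) transform each row, then subtract the row mean."""
    centered = {}
    for feature, values in matrix.items():
        logged = [math.log2(v + 1) for v in values]
        mean = sum(logged) / len(logged)
        centered[feature] = [x - mean for x in logged]
    return centered


# Hypothetical abundances for two samples.
toy_matrix = {"transcript_1": [1, 3], "transcript_2": [7, 7]}
centered = center_rows(toy_matrix)
print(centered)
```

After centring, positive values mean the transcript is above its own average in that sample and negative values mean it is below.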

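10.2.2 Clustering of differentially expressed transcripts

Transcripts with P < 0.001 and a 4-fold expression difference (fold change > 4) were used to draw a sample-by-transcript heatmap. The heatmap gives an overview of the differentially expressed transcripts. Each row represents one transcript isoform, each column represents one sample, and the colour indicates the abundance of that transcript in that sample.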

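10.2.3 Sample similarity based on differentially expressed transcripts

Transcripts with P < 0.001 and fold change > 4 were used to draw a sample-by-sample heatmap of correlation coefficients. The heatmap shows how the samples correlate with respect to the differentially expressed transcripts. Both rows and columns are samples, and the colour indicates the correlation between samples.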
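A sample-by-sample correlation matrix of this kind can be sketched as follows. The text does not state which correlation coefficient was used, so Pearson correlation is an assumption here, and the sample names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sample_correlation(samples):
    """Return {(sample_i, sample_j): r} for every pair of samples."""
    names = list(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in names for b in names}


# Hypothetical abundances of three transcripts in three samples.
toy_samples = {
    "sample_1": [1.0, 2.0, 3.0],
    "sample_2": [2.0, 4.0, 6.0],
    "sample_3": [3.0, 2.0, 1.0],
}
corr = sample_correlation(toy_samples)
print(corr[("sample_1", "sample_2")], corr[("sample_1", "sample_3")])
```

Each entry of the returned dictionary corresponds to one cell of the heatmap.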

10.2.4 green_vs_yellow

Differential expression results: DEGs result

MA plot

Volcano plot

10.3 Corset processing

Mapping between transcripts and clusters: transcript map to clusters
Read counts for all clusters across all samples: counts for clusters

Number of input isoforms: 189471

Number of isoforms with more than 10 reads in total across the samples: 72304

Number of output clusters: 61112
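The three summary numbers can be derived from a per-transcript count table and a transcript-to-cluster map, as in the sketch below. The in-memory data structures, the function name, and the transcript and cluster names are hypothetical; the threshold of 10 reads is the one quoted above.

```python
def summarize_corset(transcript_counts, transcript_to_cluster, min_reads=10):
    """Return (input isoforms, isoforms above the read threshold, clusters)."""
    n_input = len(transcript_counts)
    # An isoform passes when its reads summed over all samples exceed min_reads.
    n_passing = sum(
        1 for counts in transcript_counts.values() if sum(counts) > min_reads
    )
    n_clusters = len(set(transcript_to_cluster.values()))
    return n_input, n_passing, n_clusters


# Hypothetical toy data: read counts per sample and cluster assignments.
toy_transcript_counts = {
    "t1": [8, 5],   # 13 reads in total, passes
    "t2": [20, 0],  # 20 reads in total, passes
    "t3": [4, 6],   # exactly 10 reads, does not pass
    "t4": [0, 1],   # 1 read, does not pass
}
toy_map = {"t1": "cluster_a", "t2": "cluster_a"}
summary = summarize_corset(toy_transcript_counts, toy_map)
print(summary)
```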

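10.4 Differential expression of genes

10.4.1 Gene differential expression results

Differentially expressed genes were selected using the criteria p < 0.001 and fold change > 4, giving 7185 differentially expressed genes in total.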

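Abundance of the differentially expressed genes in each sample: counts matrix
Normalized abundance matrix of the differentially expressed genes in each sample: centered matrix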
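The selection criteria used in this section (p < 0.001, fold change > 4) can be expressed as a simple filter. The sketch assumes that "fold change > 4" applies in both directions, that is |logFC| > 2, and that the significance value is stored under a key chosen by the caller. The feature names, key names, and values are hypothetical.

```python
import math


def select_differential(rows, p_key="p", max_p=0.001, min_fold=4.0):
    """Keep rows with p below max_p and |logFC| above log2(min_fold)."""
    min_abs_log_fc = math.log2(min_fold)  # fold change 4 -> |logFC| of 2
    return [
        row
        for row in rows
        if row[p_key] < max_p and abs(row["logFC"]) > min_abs_log_fc
    ]


# Hypothetical result rows.
toy_rows = [
    {"id": "feature_1", "logFC": -3.1, "p": 1e-6},  # kept: strong and significant
    {"id": "feature_2", "logFC": 2.5, "p": 0.02},   # dropped: p too large
    {"id": "feature_3", "logFC": 1.2, "p": 1e-8},   # dropped: fold change too small
    {"id": "feature_4", "logFC": 4.0, "p": 5e-4},   # kept
]
selected = [row["id"] for row in select_differential(toy_rows)]
print(selected)
```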

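10.4.2 Clustering of differentially expressed genes

Genes with P < 0.001 and a 4-fold expression difference (fold change > 4) were used to draw a gene-by-sample heatmap. The heatmap gives an overview of the differentially expressed genes. Each row represents one gene cluster, each column represents one sample, and the colour indicates the abundance of that gene in that sample.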

10.4.3 Sample similarity based on differentially expressed genes

10.4.4 green_vs_yellow

Differential expression results: DEGs result

MA plot

Volcano plot