5 Coding Region Identification

Method

TransDecoder (v5.5.0) is used, together with known protein sequences from the UniProt Swiss-Prot database and the Pfam protein database, to predict open reading frames (ORFs) in the transcripts. The predicted ORFs are then translated into protein sequences for subsequent protein function annotation.

First, TransDecoder screens potential ORFs according to ORF coding rules and a minimum length. It then trains a Markov model on the longest ORFs and uses that model to score all candidate ORFs. In parallel, a homology search is performed on the candidate ORFs. The final result includes the candidate ORFs that either have sequence characteristics similar to coding regions or have a match in the homology search.

By default, TransDecoder reports only ORFs that are at least 100 amino acids long; shorter ORFs are ignored.
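
The first screening step (coding rules plus a length cutoff) can be illustrated with a minimal Python sketch. It is not TransDecoder's implementation: it only finds complete ORFs (ATG to stop codon) in the six reading frames and applies a 100-amino-acid minimum, whereas TransDecoder also reports partial ORFs and then applies Markov-model scoring and homology evidence. The names `find_orfs`, `scan_strand` and `Orf` are hypothetical and used for illustration only.

```python
from typing import Iterator, List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


class Orf(NamedTuple):
    strand: str  # "+" or "-"
    start: int   # 0-based offset of the ATG on the scanned strand
    end: int     # offset just past the stop codon on the scanned strand
    aa_len: int  # number of amino acids, stop codon excluded


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq: str, strand: str, min_aa: int) -> Iterator[Orf]:
    """Yield complete ORFs (ATG ... stop) in the three frames of one strand."""
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if aa_len >= min_aa:
                    yield Orf(strand, start, i + 3, aa_len)
                start = None


def find_orfs(seq: str, min_aa: int = 100) -> List[Orf]:
    """Collect complete ORFs of at least min_aa residues on both strands."""
    seq = seq.upper()
    orfs = list(scan_strand(seq, "+", min_aa))
    orfs += scan_strand(reverse_complement(seq), "-", min_aa)
    return orfs
```

Calling `find_orfs(transcript)` on a transcript string returns the candidate ORFs that pass the length filter; lowering `min_aa` admits shorter candidates at the cost of more false positives.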

5.1 ORF Recognition Results

Prediction results:
ORF position annotation: GFF3
Protein sequences of the ORFs: pep, clean pep
CDS sequences of the ORFs: CDS, clean CDS

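The "clean pep/cds" files contain the same sequences as the pep/cds files, but with the prediction evidence information removed, which makes them easier to use in downstream analyses such as BLAST.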
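
As an illustration of what such a cleaning step might look like, the sketch below reduces each FASTA header to its sequence ID. It assumes that the ID is the first whitespace-delimited token of the header and that everything after it is prediction evidence; the function name `clean_fasta_headers` is hypothetical, and the pipeline's actual cleaning procedure may differ.

```python
from typing import Iterable, Iterator


def clean_fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the sequence ID (first token) in each FASTA header line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited field.
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            # Sequence lines are passed through unchanged.
            yield line
```

The function accepts any iterable of lines, so an open file handle can be passed directly and the cleaned lines written to a new file one at a time.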

5.2 Statistics of ORF Identification

Table 5.1: Summary of samples
input seq	ORF	complete ORF
189471	53932	20075

Column descriptions:

Input seq:
Input sequences. Here, the input consists of the isoform sequences of all transcripts assembled by Trinity.

ORF:
Total number of ORFs, including both complete and incomplete ORFs.

complete ORF:
Number of complete ORFs. Based on sequence information and alignment to known protein sequences, the TransDecoder annotation yields some incomplete ORFs in addition to the complete ones; an incomplete ORF may correspond to a 5' end or 3' end fragment of a protein-coding sequence.
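
The distinction between complete and incomplete ORFs can be sketched from the predicted peptide sequences. The example below assumes that a peptide translated from an ORF with a start codon begins with `M` and that a peptide from an ORF with a stop codon ends with `*`. The labels follow the ORF type names that TransDecoder uses, but the classification rule itself is an approximation written for illustration, and `classify_orf` and `summarize` are hypothetical names.

```python
from collections import Counter
from typing import Iterable


def classify_orf(peptide: str) -> str:
    """Label an ORF by whether its peptide has a start (M) and a stop (*)."""
    has_start = peptide.startswith("M")
    has_stop = peptide.endswith("*")
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"  # start codon missing
    if has_start:
        return "3prime_partial"  # stop codon missing
    return "internal"            # both ends missing


def summarize(peptides: Iterable[str]) -> Counter:
    """Count ORFs per completeness class."""
    return Counter(classify_orf(p) for p in peptides)
```

Under these assumptions, the "ORF" column of Table 5.1 corresponds to the sum over all classes, and the "complete ORF" column to the `complete` count alone.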