Convert Counts to CPM, FPKM, TPM, UQ, CUF, TMM, and CTF with a Single Command

Original post by 西瓜站长, 生信益站, 2023-09-02 08:00

生信益站: one click and you benefit! Wishing all our friends happiness every day and an early CNS paper~

Today's tool is RNAnorm, a Python implementation of RNA-seq normalization methods, including:

CPM (Counts per million)
FPKM (Fragments per kilobase million)
TPM (Transcripts per million)
UQ (Upper quartile)
CUF (Counts adjusted with UQ factors)
TMM (Trimmed mean of M-values)
CTF (Counts adjusted with TMM factors)
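For orientation, the three library-size and length-based methods in this list are usually written as follows. This is a sketch of the standard textbook definitions rather than a quotation from the RNAnorm documentation, and the notation is an assumption made here: $c_{ij}$ is the raw count of gene $j$ in sample $i$, $L_j$ is the length of gene $j$ in base pairs, and $N_i$ is the total count of sample $i$.

```latex
\begin{align}
N_i &= \sum_{j} c_{ij} \\
\mathrm{CPM}_{ij} &= \frac{c_{ij}}{N_i} \times 10^{6} \\
\mathrm{FPKM}_{ij} &= \frac{c_{ij}}{(L_j / 10^{3})\,(N_i / 10^{6})}
                    = \frac{c_{ij} \times 10^{9}}{L_j \, N_i} \\
\mathrm{TPM}_{ij} &= \frac{c_{ij} / L_j}{\sum_{k} c_{ik} / L_k} \times 10^{6}
\end{align}
```

UQ and TMM instead estimate one scaling factor per sample from the count distribution; the project page linked below is the place to look for how those factors are defined.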

For an in-depth description of the methods, see:

https://github.com/genialis/RNAnorm

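Features

Pure Python implementation (no R required)
Compatible with scikit-learn
Command-line interface
Detailed documentation
Validated method implementations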

Installation

The author recommends installing RNAnorm with pip:

pip install rnanorm

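Quick start

The implemented methods can be run from Python or from the command line.

Normalizing from Python

The most common use case is to run normalization from Python: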
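As a rough illustration of what such a normalization computes, here is a minimal pure-Python sketch of CPM on a small samples-by-genes matrix. It deliberately does not call RNAnorm; the function name `cpm` and the two-sample matrix are assumptions made for this example.

```python
def cpm(counts):
    """Scale each sample (row) so that its counts sum to one million."""
    normalized = []
    for row in counts:
        library_size = sum(row)  # total raw counts in this sample
        normalized.append([value * 1_000_000 / library_size for value in row])
    return normalized


# Samples in rows, genes in columns (raw counts).
counts = [
    [200, 300, 500, 2000, 7000],    # Sample_1, library size 10000
    [400, 600, 1000, 4000, 14000],  # Sample_2, library size 20000
]

cpm_values = cpm(counts)
for sample_name, row in zip(["Sample_1", "Sample_2"], cpm_values):
    print(sample_name, row)
```

Sample_2 is exactly twice Sample_1, so both rows come out identical after CPM: sequencing depth is the only thing this method corrects for.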

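Normalizing from the command line

Normalization is also supported from the command line. To list the available methods and get help: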

rnanorm --help

# usage: rnanorm [-h] [--gene-lengths GENE_LENGTHS] [--annotation ANNOTATION]
#                [--gene-id-attr GENE_ID_ATTR] [--tpm-output TPM_OUTPUT]
#                [--fpkm-output FPKM_OUTPUT] [--cpm-output CPM_OUTPUT]
#                [--quantile-output QUANTILE_OUTPUT]
#                expression
#
# TPM normalization. The gene expressions file should include genes in rows and
# samples in columns. The gene ID column should be named FEATURE_ID. The gene
# lengths file should have two columns, FEATURE_ID and GENE_LENGTHS. Gene IDs in
# expressions file should match the gene IDs in gene lengths file.
#
# positional arguments:
#   expression            tab-delimited file with gene expression data (genes in
#                         rows, samples in cols)
#
# optional arguments:
#   -h, --help            show this help message and exit
#   --gene-lengths GENE_LENGTHS
#                         tab-delimited file with gene lengths
#   --annotation ANNOTATION
#                         Annotation file in GTF format
#   --gene-id-attr GENE_ID_ATTR
#                         Gene ID attribute for annotation file
#   --tpm-output TPM_OUTPUT
#                         TPM output file name
#   --fpkm-output FPKM_OUTPUT
#                         FPKM output file name
#   --cpm-output CPM_OUTPUT
#                         CPM output file name
#   --quantile-output QUANTILE_OUTPUT
#                         Quantile-normalized expression output file name

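Note that the help text above describes a single-command interface with one output flag per method (--tpm-output, --fpkm-output, --cpm-output, ...), whereas the commands below use one subcommand per method. The two appear to come from different RNAnorm releases, so run rnanorm --help on your own installation to see which form applies.

To get information on a specific method, for example CPM: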

rnanorm cpm --help

Normalize with CPM:

rnanorm cpm exp.csv --out exp_cpm.csv

The file exp.csv must be comma-separated, with genes in columns and samples in rows. Values should be raw counts. The output is saved to exp_cpm.csv. Example input file:

cat exp.csv
,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Sample_1,200,300,500,2000,7000
Sample_2,400,600,1000,4000,14000
Sample_3,200,300,500,2000,17000
Sample_4,200,300,500,2000,2000
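Count matrices from many quantification tools arrive with genes in rows and samples in columns, which is the opposite of the layout shown above. A minimal sketch, using only the Python standard library, of flipping such a table before passing it on; the helper name `transpose_csv` and the tiny input are hypothetical.

```python
import csv
import io


def transpose_csv(text):
    """Swap the rows and columns of a CSV table given as a string."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    # zip(*rows) turns the i-th column of the input into the i-th output row.
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()


# Genes in rows, samples in columns.
genes_in_rows = (
    ",Sample_1,Sample_2\n"
    "Gene_1,200,400\n"
    "Gene_2,300,600\n"
)

samples_in_rows = transpose_csv(genes_in_rows)
print(samples_in_rows)
```

The sketch assumes every row has the same number of fields; `zip` would silently truncate a ragged table.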

The input can also be supplied on standard input:

cat exp.csv | rnanorm cpm --out exp_cpm.csv

If the file given to --out already exists, the command fails. If you are sure you want to overwrite it, add the --force flag:

cat exp.csv | rnanorm cpm --force --out exp_cpm.csv

If no file is given with --out, the output is printed to standard output:

cat exp.csv | rnanorm cpm > exp_cpm.csv

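The TPM and FPKM methods need gene lengths. These can be supplied either as a GTF file or as a gene lengths file. The latter is a two-column file: the first column should contain the genes listed in the header of exp.csv, and the second the gene lengths computed with the union exon model: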

# Use GTF file
rnanorm tpm exp.csv --gtf annotations.gtf > exp_out.csv
# Use gene lengths file
rnanorm tpm exp.csv --gene-lengths lengths.csv > exp_out.csv
# Example of gene lengths file
cat lengths.csv
gene_id,gene_length
Gene_1,200
Gene_2,300
Gene_3,500
Gene_4,1000
Gene_5,1000
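To see why gene lengths matter, here is a small pure-Python sketch of TPM and FPKM applied to Sample_1 with the lengths listed above. It follows the standard definitions of the two methods and is not RNAnorm's own code; the function names `tpm` and `fpkm` are assumptions made for this example.

```python
def tpm(counts, lengths):
    """Length-normalize each gene, then scale each sample to one million."""
    result = []
    for row in counts:
        rates = [count / length for count, length in zip(row, lengths)]
        total_rate = sum(rates)
        result.append([rate * 1_000_000 / total_rate for rate in rates])
    return result


def fpkm(counts, lengths):
    """Counts per kilobase of gene length per million counts in the sample."""
    result = []
    for row in counts:
        library_size = sum(row)
        result.append([
            count * 1_000_000_000 / (length * library_size)
            for count, length in zip(row, lengths)
        ])
    return result


lengths = [200, 300, 500, 1000, 1000]    # Gene_1 .. Gene_5, in base pairs
counts = [[200, 300, 500, 2000, 7000]]   # Sample_1 raw counts

tpm_values = tpm(counts, lengths)
fpkm_values = fpkm(counts, lengths)
print("TPM :", [round(v, 2) for v in tpm_values[0]])
print("FPKM:", [round(v, 2) for v in fpkm_values[0]])
```

Gene_1, Gene_2 and Gene_3 have different raw counts but the same count-to-length ratio, so they end up with equal TPM and equal FPKM values. TPM values always sum to one million within a sample, while FPKM values generally do not.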

❝
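Friendly reminder: in both the raw counts matrix file and the gene lengths file, the header of the first column (the gene ID column) must be renamed to "FEATURE_ID", which is the column name the rnanorm --help text above asks for.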
❞

OK, that's all for today; I hope you find it useful.

Your follows, likes, and shares are the biggest encouragement and support for 益站.
