以下文章来源于生信菜鸟团 ,作者Quasimodo
生信菜鸟团荣誉归来,让所有想分析生物信息学数据的小伙伴找到归属,你值得拥有!
snakemake基于python实现,安装也很简单直接用pip就行
pip install snakemake
此外,本套使用snakemake管理的流程中的其他软件安装,参考往期推文
samples = ["SRR20653290","SRR20653291","SRR20653292","SRR20653293","SRR20653294","SRR20653295"]
rule target:
input:
"end_signal_SRR20653290",
"end_signal_SRR20653291",
"end_signal_SRR20653292",
"end_signal_SRR20653293",
"end_signal_SRR20653294",
"end_signal_SRR20653295"
rule get_fa_index:
output:
"ref/Homo_sapiens.GRCh38.dna.toplevel.fa",
"ref/human_hisat2"
shell:
"""
mkdir ref
axel -n20 https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -o ref/
gunzip ref/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
hisat2-build -p 12 ref/Homo_sapiens.GRCh38.dna.toplevel.fa ref/human_hisat2
"""
rule get_gtf:
output:
"ref/gencode.v44.annotation_RmChr.gtf"
shell:
"""
axel -n20 https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz -o ref/
gunzip gencode.v44.annotation.gtf.gz
sed 's/^chr//' ref/gencode.v44.annotation.gtf > ref/gencode.v44.annotation_RmChr.gtf
"""
rule get_fq:
output:
"raw_fq/{id}_1.fastq.gz",
"raw_fq/{id}_2.fastq.gz"
shell:
"""
mkdir raw_fq
cd raw_fq/ && kingfisher get -p PRJNA862537 -m ena-ascp ena-ftp prefetch aws-http && cd ..
"""
rule fastp:
input:
"raw_fq/{id}_1.fastq.gz",
"raw_fq/{id}_2.fastq.gz"
output:
"clean_fq/{id}_1.fastp.fq.gz",
"clean_fq/{id}_2.fastp.fq.gz",
"{id}.html"
shell:
"""
mkdir clean_fq
fastp -i {input[0]} -o {output[0]} -I {input[1]} -O {output[1]} -l 36 -q 20 --compression=6 -R {wildcards.id} -h {output[2]}
"""
rule hisat2_samtools:
input:
"ref/human_hisat2",
"clean_fq/{id}_1.fastp.fq.gz",
"clean_fq/{id}_2.fastp.fq.gz"
output:
"bam/{id}.bam"
shell:
"""
mkdir bam
hisat2 -p 12 -x {input[0]} -1 {input[1]} -2 {input[2]} | samtools sort -@ 4 -o {output}
"""
rule assembly_gtf:
input:
"ref/gencode.v44.annotation_RmChr.gtf",
"bam/{id}.bam"
output:
"assembly_gtf/{id}.gtf"
shell:
"""
mkdir assembly_gtf
stringtie -p 12 -G {input[0]} -o {output} -l {wildcards.id} {input[1]}
"""
rule merge_gtf:
input:
["assembly_gtf/{sample_id}.gtf".format(sample_id=sample_id) for sample_id in samples]
output:
"merge_gtf/merged.gtf"
shell:
"""
mkdir merge_gtf
ls {input} > merge_gtf/gtf.list
stringtie --merge -p 12 -G {input} -o {output} merge_gtf/gtf.list
"""
rule quantity:
input:
"merge_gtf/merged.gtf",
"bam/{id}.bam"
output:
"quantity/{id}/gtf",
"quantity/{id}_gene_abundences.tsv"
shell:
"""
mkdir quantity
stringtie -p 12 -e -G {input[0]} -o {output[0]} -A {output[1]} {input[1]}
"""
rule end_sig:
input:
"quantity/{id}/gtf",
"quantity/{id}_gene_abundences.tsv"
output:
"end_signal_{id}"
shell:
"touch end_signal_{wildcards.id}"
我们dry run一下:
$snakemake --help | grep "dry"
usage: snakemake [-h] [--dry-run] [--profile PROFILE]
--dry-run, --dryrun, -n
done. If you have a very large workflow, use --dry-run
together with --dry-run to list files without actually
Use together with --dry-run to list files without
$snakemake -n
Building DAG of jobs...
Job stats:
job count
--------------- -------
assembly_gtf 6
end_sig 6
fastp 6
get_fa_index 1
get_fq 6
get_gtf 1
hisat2_samtools 6
merge_gtf 1
quantity 6
target 1
total 40
可视化之前需要安装graphviz
mamba install graphviz
可视化
snakemake --dag | dot -Tpdf > test1.pdf
流程输入输出非常清晰
下面我们就看看编写snakemake代码时有哪些tricks,非常基础的部分参考上面给出的推文即可
①一个snakemake流程由多个rule组成,每个rule都可以由input、output和处理(shell、script)三个部分组成,一开始编写的时候很容易忘记在shell中将输出输出部分改为{input} {output},而是像一般软件执行那样输入具体的文件路径名称
②每个rule的输入输出需要串得起来,snakemake正是通过rule的input和output部分完成流程的搭建,并匹配通配符
③在shell部分使用通配符时,需要注意加上wildcards,如{wildcards.id}
如
有的时候我们会进行选择,比如是去下ENSEMBL的参考基因组,还是UCSC的
当然这不需要我们手动选择
如
-R
参数指定使用的ruleSnakefile中定义的规则中自上而下的进行匹配
在我上面编写的snakemake中merge_gtf部分一度让我头疼
因为你如果把input写成"assembly_gtf/{id}.gtf"
,就会报错:
rule merge_gtf:
input:
"assembly_gtf/{id}.gtf"
output:
"merge_gtf/merged.gtf"
shell:
"""
mkdir merge_gtf
ls {input} > merge_gtf/gtf.list
stringtie --merge -p 12 -G {input} -o {output} merge_gtf/gtf.list
"""
$snakemake -n
Building DAG of jobs...
WildcardError in rule merge_gtf in file /home/data/t120455/snakemake_demo/snakemake_tt/Snakefile, line 96:
Wildcards in input files cannot be determined from output files:
'id'
merge_gtf上一条rule:assembly_gtf的输出文件就是assembly_gtf/{id}.gtf
,所以很容易写成这样
这里我们其实需要的是聚合操作
rule merge_gtf:
input:
["assembly_gtf/{sample_id}.gtf".format(sample_id=sample_id) for sample_id in samples]
...
可以看出类似python列表推导式的格式,①
②还可以使用其他函数来实现这个功能
参考:
Snakemake:简单好用的生信流程搭建工具(http://www.biomarker.com.cn/archives/19623)
以及 Snakemake 搭建流程 - 聚合
以上就是本期全部内容
强烈建议你推荐给身边的博士后以及年轻生物学PI,多一点数据认知,让他们的科研上一个台阶:
微信扫一扫
关注该公众号