sambambaでmpileupを並列化
snp callを行うのにBAMファイルをsamtools mpileupで処理するのは以前にやった。(4年前)
しかしこのmpileupは4コアあっても有効に使ってくれないのですごく時間がかかった記憶がある。
これを高速に処理する方法はないものかと検索したところ、sambambaなるプログラムにたどり着いた。
http://kazumaxneo.hatenablog.com/entry/2017/12/16/124313
これは使えるかもしれない。変な名前だけど。
samtoolsで行えることを代行して、高速に処理してくれるのかと思いきやsamtoolsには依存しているらしく、裏で何らかの並列処理をかましているものと思われる。
と思ったんだけど、なんかbrewでインストールができない。
$ brew tap Homebrew/homebrew-core $ brew install sambamba Updating Homebrew... Error: No available formula with the name "sambamba" ==> Searching for a previously deleted formula (in the last month)... Warning: homebrew/core is shallow clone. To get complete history run: git -C "$(brew --repo homebrew/core)" fetch --unshallow Error: No previously deleted formula found. ==> Searching for similarly named formulae... ==> Searching local taps... Error: No similarly named formulae found. ==> Searching taps... ==> Searching taps on GitHub... Error: No formulae found in taps.
https://github.com/biod/sambamba/releases
から直接バイナリをダウンロードして起動してみる。
$ sambamba view
Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]]
Options: -F, --filter=FILTER
set custom filter for alignments
--num-filter=NUMFILTER
filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments;
either of the numbers can be omitted
-f, --format=sam|bam|cram|json
specify which format to use for output (default is SAM)
-h, --with-header
print header before reads (always done for BAM output)
-H, --header
output only header to stdout (if format=bam, the header is printed as SAM)
-I, --reference-info
output to stdout only reference names and lengths in JSON
-L, --regions=FILENAME
output only reads overlapping one of regions from the BED file
-c, --count
output to stdout only count of matching records, hHI are ignored
-v, --valid
output only valid alignments
-S, --sam-input
specify that input is in SAM format
-C, --cram-input
specify that input is in CRAM format
-T, --ref-filename=FASTA
specify reference for writing CRAM
-p, --show-progress
show progressbar in STDERR (works only for BAM files with no regions specified)
-l, --compression-level
specify compression level (from 0 to 9, works only for BAM output)
-o, --output-filename
specify output filename
-t, --nthreads=NTHREADS
maximum number of threads to use
-s, --subsample=FRACTION
subsample reads (read pairs)
--subsampling-seed=SEED
set seed for subsamplingいけそうな感じ。
$ sambamba mpileup
と打ってみると、bcftoolsがないというエラーが出た。
$ sambamba mpileup
usage: sambamba-pileup [options] input.bam [input2.bam [...]]
[--samtools <samtools mpileup args>]
[--bcftools <bcftools call args>]
This subcommand relies on external tools and acts as a multi-core
implementation of samtools and bcftools.
Therefore, the following tools should be present in $PATH:
* samtools
* bcftools (when used)
If --samtools is skipped, samtools mpileup is called with default arguments
If --bcftools is used without parameters, samtools is called with
switch '-gu' and bcftools is called as 'bcftools view -'
If --bcftools is skipped, bcftools is not called
Sambamba splits input BAM files into chunks and feeds them
to samtools mpileup and, optionally, bcftools in parallel.
The chunks are slightly overlapping so that variant calling
should not be impacted by these manipulations. The obtained results
from the multiple processes are combined as ordered output.
Sambamba options:
-L, --regions=FILENAME
provide BED file with regions
(no need to duplicate it in samtools args);
all input files must be indexed
-o, --output-filename=<STDOUT>
specify output filename
--tmpdir=TMPDIR
directory for temporary files
-t, --nthreads=NTHREADS
maximum number of threads to use
-b, --buffer-size=64_000_000
chunk size (in bytes)
-B, --output-buffer-size=512_000_000
output buffer size (in bytes)
Sambamba paths:
samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.4.1)
sambamba-pileup: failed to locate bcftools executable in PATHbcftoolsも入れておく必要があるらしいのでbrewでインストールする。
こちらはちゃんと普通にインストールできた。
$ bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.7 (using htslib 1.7)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
polysomy detect number of chromosomal copies
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.で、もう一度テスト
$ sambamba mpileup
usage: sambamba-pileup [options] input.bam [input2.bam [...]]
[--samtools <samtools mpileup args>]
[--bcftools <bcftools call args>]
This subcommand relies on external tools and acts as a multi-core
implementation of samtools and bcftools.
Therefore, the following tools should be present in $PATH:
* samtools
* bcftools (when used)
If --samtools is skipped, samtools mpileup is called with default arguments
If --bcftools is used without parameters, samtools is called with
switch '-gu' and bcftools is called as 'bcftools view -'
If --bcftools is skipped, bcftools is not called
Sambamba splits input BAM files into chunks and feeds them
to samtools mpileup and, optionally, bcftools in parallel.
The chunks are slightly overlapping so that variant calling
should not be impacted by these manipulations. The obtained results
from the multiple processes are combined as ordered output.
Sambamba options:
-L, --regions=FILENAME
provide BED file with regions
(no need to duplicate it in samtools args);
all input files must be indexed
-o, --output-filename=<STDOUT>
specify output filename
--tmpdir=TMPDIR
directory for temporary files
-t, --nthreads=NTHREADS
maximum number of threads to use
-b, --buffer-size=64_000_000
chunk size (in bytes)
-B, --output-buffer-size=512_000_000
output buffer size (in bytes)
Sambamba paths:
samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.7)
sambamba-pileup: Can not find version in
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.7 (using htslib 1.7)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
polysomy detect number of chromosomal copies
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.$ sambamba mpileup input.sorted.bam -t 4 -p --samtools -uf reference.fasta --bcftools call -c -v -o output_variant.vcf.txt
こんな感じで良いのだろうか?