snp callを行うのにBAMファイルをsamtools mpileupで処理するのは以前にやった。(4年前)
しかしこのmpileupは4コアあっても有効に使ってくれないのですごく時間がかかった記憶がある。
これを高速に処理する方法はないものかと検索したところ、sambambaなるプログラムにたどり着いた。
http://kazumaxneo.hatenablog.com/entry/2017/12/16/124313
これは使えるかもしれない。変な名前だけど。
samtoolsで行えることを代行して、高速に処理してくれるのかと思いきやsamtoolsには依存しているらしく、裏で何らかの並列処理をかましているものと思われる。
と思ったんだけど、なんかbrewでインストールができない。
$ brew tap Homebrew/homebrew-core $ brew install sambamba Updating Homebrew... Error: No available formula with the name "sambamba" ==> Searching for a previously deleted formula (in the last month)... Warning: homebrew/core is shallow clone. To get complete history run: git -C "$(brew --repo homebrew/core)" fetch --unshallow Error: No previously deleted formula found. ==> Searching for similarly named formulae... ==> Searching local taps... Error: No similarly named formulae found. ==> Searching taps... ==> Searching taps on GitHub... Error: No formulae found in taps.
https://github.com/biod/sambamba/releases
から直接バイナリをダウンロードして起動してみる。
$ sambamba view Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]] Options: -F, --filter=FILTER set custom filter for alignments --num-filter=NUMFILTER filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments; either of the numbers can be omitted -f, --format=sam|bam|cram|json specify which format to use for output (default is SAM) -h, --with-header print header before reads (always done for BAM output) -H, --header output only header to stdout (if format=bam, the header is printed as SAM) -I, --reference-info output to stdout only reference names and lengths in JSON -L, --regions=FILENAME output only reads overlapping one of regions from the BED file -c, --count output to stdout only count of matching records, hHI are ignored -v, --valid output only valid alignments -S, --sam-input specify that input is in SAM format -C, --cram-input specify that input is in CRAM format -T, --ref-filename=FASTA specify reference for writing CRAM -p, --show-progress show progressbar in STDERR (works only for BAM files with no regions specified) -l, --compression-level specify compression level (from 0 to 9, works only for BAM output) -o, --output-filename specify output filename -t, --nthreads=NTHREADS maximum number of threads to use -s, --subsample=FRACTION subsample reads (read pairs) --subsampling-seed=SEED set seed for subsampling
いけそうな感じ。
$ sambamba mpileup
と打ってみると、bcftoolsがないというエラーが出た。
$ sambamba mpileup usage: sambamba-pileup [options] input.bam [input2.bam [...]] [--samtools <samtools mpileup args>] [--bcftools <bcftools call args>] This subcommand relies on external tools and acts as a multi-core implementation of samtools and bcftools. Therefore, the following tools should be present in $PATH: * samtools * bcftools (when used) If --samtools is skipped, samtools mpileup is called with default arguments If --bcftools is used without parameters, samtools is called with switch '-gu' and bcftools is called as 'bcftools view -' If --bcftools is skipped, bcftools is not called Sambamba splits input BAM files into chunks and feeds them to samtools mpileup and, optionally, bcftools in parallel. The chunks are slightly overlapping so that variant calling should not be impacted by these manipulations. The obtained results from the multiple processes are combined as ordered output. Sambamba options: -L, --regions=FILENAME provide BED file with regions (no need to duplicate it in samtools args); all input files must be indexed -o, --output-filename=<STDOUT> specify output filename --tmpdir=TMPDIR directory for temporary files -t, --nthreads=NTHREADS maximum number of threads to use -b, --buffer-size=64_000_000 chunk size (in bytes) -B, --output-buffer-size=512_000_000 output buffer size (in bytes) Sambamba paths: samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.4.1) sambamba-pileup: failed to locate bcftools executable in PATH
bcftoolsも入れておく必要があるらしいのでbrewでインストールする。
こちらはちゃんと普通にインストールできた。
$ bcftools Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs) License: GNU GPLv3+, due to use of the GNU Scientific Library Version: 1.7 (using htslib 1.7) Usage: bcftools [--version|--version-only] [--help] <command> <argument> Commands: -- Indexing index index VCF/BCF files -- VCF/BCF manipulation annotate annotate and edit VCF/BCF files concat concatenate VCF/BCF files from the same set of samples convert convert VCF/BCF files to different formats and back isec intersections of VCF/BCF files merge merge VCF/BCF files files from non-overlapping sample sets norm left-align and normalize indels plugin user-defined plugins query transform VCF/BCF into user-defined formats reheader modify VCF/BCF header, change sample names sort sort VCF/BCF file view VCF/BCF conversion, view, subset and filter VCF/BCF files -- VCF/BCF analysis call SNP/indel calling consensus create consensus sequence by applying VCF variants cnv HMM CNV calling csq call variation consequences filter filter VCF/BCF files using fixed thresholds gtcheck check sample concordance, detect sample swaps and contamination mpileup multi-way pileup producing genotype likelihoods polysomy detect number of chromosomal copies roh identify runs of autozygosity (HMM) stats produce VCF/BCF stats Most commands accept VCF, bgzipped VCF, and BCF with the file type detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most but not all situations.
で、もう一度テスト
$ sambamba mpileup usage: sambamba-pileup [options] input.bam [input2.bam [...]] [--samtools <samtools mpileup args>] [--bcftools <bcftools call args>] This subcommand relies on external tools and acts as a multi-core implementation of samtools and bcftools. Therefore, the following tools should be present in $PATH: * samtools * bcftools (when used) If --samtools is skipped, samtools mpileup is called with default arguments If --bcftools is used without parameters, samtools is called with switch '-gu' and bcftools is called as 'bcftools view -' If --bcftools is skipped, bcftools is not called Sambamba splits input BAM files into chunks and feeds them to samtools mpileup and, optionally, bcftools in parallel. The chunks are slightly overlapping so that variant calling should not be impacted by these manipulations. The obtained results from the multiple processes are combined as ordered output. Sambamba options: -L, --regions=FILENAME provide BED file with regions (no need to duplicate it in samtools args); all input files must be indexed -o, --output-filename=<STDOUT> specify output filename --tmpdir=TMPDIR directory for temporary files -t, --nthreads=NTHREADS maximum number of threads to use -b, --buffer-size=64_000_000 chunk size (in bytes) -B, --output-buffer-size=512_000_000 output buffer size (in bytes) Sambamba paths: samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.7) sambamba-pileup: Can not find version in Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs) License: GNU GPLv3+, due to use of the GNU Scientific Library Version: 1.7 (using htslib 1.7) Usage: bcftools [--version|--version-only] [--help] <command> <argument> Commands: -- Indexing index index VCF/BCF files -- VCF/BCF manipulation annotate annotate and edit VCF/BCF files concat concatenate VCF/BCF files from the same set of samples convert convert VCF/BCF files to different formats and back isec intersections of VCF/BCF files merge merge VCF/BCF files files from non-overlapping sample sets norm left-align and normalize indels plugin user-defined plugins query transform VCF/BCF into user-defined formats reheader modify VCF/BCF header, change sample names sort sort VCF/BCF file view VCF/BCF conversion, view, subset and filter VCF/BCF files -- VCF/BCF analysis call SNP/indel calling consensus create consensus sequence by applying VCF variants cnv HMM CNV calling csq call variation consequences filter filter VCF/BCF files using fixed thresholds gtcheck check sample concordance, detect sample swaps and contamination mpileup multi-way pileup producing genotype likelihoods polysomy detect number of chromosomal copies roh identify runs of autozygosity (HMM) stats produce VCF/BCF stats Most commands accept VCF, bgzipped VCF, and BCF with the file type detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most but not all situations.
$ sambamba mpileup input.sorted.bam -t 4 -p --samtools -uf reference.fasta --bcftools call -c -v -o output_variant.vcf.txt
こんな感じで良いのだろうか?