kuroの覚え書き

96の個人的覚え書き

sambambaでmpileupを並列化

snp callを行うのにBAMファイルをsamtools mpileupで処理するのは以前にやった。(4年前)
しかしこのmpileupは4コアあっても有効に使ってくれないのですごく時間がかかった記憶がある。

これを高速に処理する方法はないものかと検索したところ、sambambaなるプログラムにたどり着いた。
http://kazumaxneo.hatenablog.com/entry/2017/12/16/124313

これは使えるかもしれない。変な名前だけど。

samtoolsで行えることを代行して、高速に処理してくれるのかと思いきやsamtoolsには依存しているらしく、裏で何らかの並列処理をかましているものと思われる。

と思ったんだけど、なんかbrewでインストールができない。

$ brew tap Homebrew/homebrew-core
$ brew install sambamba
Updating Homebrew...
Error: No available formula with the name "sambamba" 
==> Searching for a previously deleted formula (in the last month)...
Warning: homebrew/core is shallow clone. To get complete history run:
  git -C "$(brew --repo homebrew/core)" fetch --unshallow

Error: No previously deleted formula found.
==> Searching for similarly named formulae...
==> Searching local taps...
Error: No similarly named formulae found.
==> Searching taps...
==> Searching taps on GitHub...
Error: No formulae found in taps.

https://github.com/biod/sambamba/releases
から直接バイナリをダウンロードして起動してみる。

$ sambamba view
Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]]

Options: -F, --filter=FILTER
                    set custom filter for alignments
         --num-filter=NUMFILTER
                    filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments;
                    either of the numbers can be omitted
         -f, --format=sam|bam|cram|json
                    specify which format to use for output (default is SAM)
         -h, --with-header
                    print header before reads (always done for BAM output)
         -H, --header
                    output only header to stdout (if format=bam, the header is printed as SAM)
         -I, --reference-info
                    output to stdout only reference names and lengths in JSON
         -L, --regions=FILENAME
                    output only reads overlapping one of regions from the BED file
         -c, --count
                    output to stdout only count of matching records, hHI are ignored
         -v, --valid
                    output only valid alignments
         -S, --sam-input
                    specify that input is in SAM format
         -C, --cram-input
                    specify that input is in CRAM format
         -T, --ref-filename=FASTA
                    specify reference for writing CRAM
         -p, --show-progress
                    show progressbar in STDERR (works only for BAM files with no regions specified)
         -l, --compression-level
                    specify compression level (from 0 to 9, works only for BAM output)
         -o, --output-filename
                    specify output filename
         -t, --nthreads=NTHREADS
                    maximum number of threads to use
         -s, --subsample=FRACTION
                    subsample reads (read pairs)
         --subsampling-seed=SEED
                    set seed for subsampling

いけそうな感じ。

$ sambamba mpileup

と打ってみると、bcftoolsがないというエラーが出た。

$ sambamba mpileup
usage: sambamba-pileup [options] input.bam [input2.bam [...]]
                       [--samtools <samtools mpileup args>]
                       [--bcftools <bcftools call args>]

This subcommand relies on external tools and acts as a multi-core
implementation of samtools and bcftools.

Therefore, the following tools should be present in $PATH:
    * samtools
    * bcftools (when used)

If --samtools is skipped, samtools mpileup is called with default arguments
If --bcftools is used without parameters, samtools is called with
     switch '-gu' and bcftools is called as 'bcftools view -'
If --bcftools is skipped, bcftools is not called

Sambamba splits input BAM files into chunks and feeds them
to samtools mpileup and, optionally, bcftools in parallel.
The chunks are slightly overlapping so that variant calling
should not be impacted by these manipulations. The obtained results
from the multiple processes are combined as ordered output.

Sambamba options:
         -L, --regions=FILENAME
                    provide BED file with regions
                    (no need to duplicate it in samtools args);
                    all input files must be indexed
         -o, --output-filename=<STDOUT>
                    specify output filename
         --tmpdir=TMPDIR
                    directory for temporary files
         -t, --nthreads=NTHREADS
                    maximum number of threads to use
         -b, --buffer-size=64_000_000
                    chunk size (in bytes)
         -B, --output-buffer-size=512_000_000
                    output buffer size (in bytes)

Sambamba paths:

         samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.4.1)
sambamba-pileup: failed to locate bcftools executable in PATH

bcftoolsも入れておく必要があるらしいのでbrewでインストールする。
こちらはちゃんと普通にインストールできた。

$ bcftools

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.7 (using htslib 1.7)

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

 -- Indexing
    index        index VCF/BCF files

 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    sort         sort VCF/BCF file
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 -- VCF/BCF analysis
    call         SNP/indel calling
    consensus    create consensus sequence by applying VCF variants
    cnv          HMM CNV calling
    csq          call variation consequences
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    mpileup      multi-way pileup producing genotype likelihoods
    polysomy     detect number of chromosomal copies
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

で、もう一度テスト

$ sambamba mpileup
usage: sambamba-pileup [options] input.bam [input2.bam [...]]
                       [--samtools <samtools mpileup args>]
                       [--bcftools <bcftools call args>]

This subcommand relies on external tools and acts as a multi-core
implementation of samtools and bcftools.

Therefore, the following tools should be present in $PATH:
    * samtools
    * bcftools (when used)

If --samtools is skipped, samtools mpileup is called with default arguments
If --bcftools is used without parameters, samtools is called with
     switch '-gu' and bcftools is called as 'bcftools view -'
If --bcftools is skipped, bcftools is not called

Sambamba splits input BAM files into chunks and feeds them
to samtools mpileup and, optionally, bcftools in parallel.
The chunks are slightly overlapping so that variant calling
should not be impacted by these manipulations. The obtained results
from the multiple processes are combined as ordered output.

Sambamba options:
         -L, --regions=FILENAME
                    provide BED file with regions
                    (no need to duplicate it in samtools args);
                    all input files must be indexed
         -o, --output-filename=<STDOUT>
                    specify output filename
         --tmpdir=TMPDIR
                    directory for temporary files
         -t, --nthreads=NTHREADS
                    maximum number of threads to use
         -b, --buffer-size=64_000_000
                    chunk size (in bytes)
         -B, --output-buffer-size=512_000_000
                    output buffer size (in bytes)

Sambamba paths:

         samtools: /usr/local/bin/samtools Version: 1.4.1 (using htslib 1.7)
sambamba-pileup: Can not find version in 
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.7 (using htslib 1.7)

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

 -- Indexing
    index        index VCF/BCF files

 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    sort         sort VCF/BCF file
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 -- VCF/BCF analysis
    call         SNP/indel calling
    consensus    create consensus sequence by applying VCF variants
    cnv          HMM CNV calling
    csq          call variation consequences
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    mpileup      multi-way pileup producing genotype likelihoods
    polysomy     detect number of chromosomal copies
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.
$ sambamba mpileup input.sorted.bam -t 4 -p --samtools -uf reference.fasta --bcftools call -c -v -o output_variant.vcf.txt

こんな感じで良いのだろうか?