次世代シークエンサーによるSNPの検出

次世代シークエンサー（NGS)で全ゲノムをリシークエンスし、リファレンスゲノムと比較してSNPを検出する。
NGSから吐き出されたデータから.bamという拡張子のついたバイナリファイルと、そこからvariant callerで解析済みのexcelファイルを手渡されたが、いまいち気に入らないので、自分でvariant callingをやり直すことにした。

用意するもの。
SAMtools　http://samtools.sourceforge.net/
unixのソフトウェアなんで、自分でmakeして実行ファイルを作成する。
zlibがいるのでOSXにはXcodeを入れておく。
環境が整っていたら、ダウンロードして解凍してできたフォルダに入って
$sudo make
するだけ。
もう一つ、bcftoolsフォルダに入ってやはりmakeしてbcftoolsをコンパイルしておく。
その後の面倒を回避すべく.bash_profileに該当するPATHを追加しておくとよい。

referenceのfasta形式ファイル
今回はイネゲノムなので
http://rapdb.dna.affrc.go.jp/download/irgsp1.html
ここからGenome assemblies (12 chromosomes)をダウンロードして解凍しておく。

さて、では.bamファイルからvcfファイル（テキストファイル）にvariant callingを始める。
貰ってきた.bamファイルにはindexファイルがセットになっていたから直接vcfに書き出そうとすると、エラーが出るので、ちゃんと手順を踏むことにする。
まずはsort
$ samtools sort original.bam sorted.bam
original.bamをsortしてsorted.bamを生成
ちなみにこれにはものごっつい時間がかかると思っていい。core2duo 1.6GのMacBook Airでは今回は7時間くらいかかった模様。(Core i7 3.6Gで3時間くらい）待ちきれないので途中で帰っちゃったから詳細は不明。45GBのbamファイルから250MBくらいのファイルが193個一旦出来て、あとでマージされた模様。

次にインデックスを作成
$ samtools index sorted.bam
これでsorted.bam.baiというインデックスファイルが生成される。
これは比較的すぐに終わる。

最後にvcf形式のvariant callされたファイルを書き出す。
$ samtools mpileup -uf reference.fasta sorted.bam | bcftools view -vcg - > variant.vcf
これでreference.fastaにsorted.bamのreedを貼り付けてvariant callingし、variant.vcfというファイルに書き出すことになる。これまた途方も無い時間がかかりそう。イネゲノムを400Mbaseとすると4コアのCore i7 3.6G積んだiMacでもざっと100時間はかかる見込み。アクティビティモニタを見ると１〜2コアしか使ってないようで、最適化がなされていない？

mpileupのオプションは

Input Options:
-6	 Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling.
-B	 Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.
-b FILE	 List of input BAM files, one file per line [null]
-C INT	 Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0]
-d INT	 At a position, read maximally INT reads per input BAM. [250]
-E	 Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit.
-f FILE	 The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null]
-l FILE	 BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null]
-q INT	 Minimum mapping quality for an alignment to be used [0]
-Q INT	 Minimum base quality for a base to be considered [13]
-r STR	 Only generate pileup in region STR [all sites]
Output Options:
 	
-D	 Output per-sample read depth
-g	 Compute genotype likelihoods and output them in the binary call format (BCF).
-S	 Output per-sample Phred-scaled strand bias P-value
-u	 Similar to -g except that the output is uncompressed BCF, which is preferred for piping.
Options for Genotype Likelihood Computation (for -g or -u):
 	
-e INT	 Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20]
-h INT	 Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]
-I	 Do not perform INDEL calling
-L INT	 Skip INDEL calling if the average per-sample depth is above INT. [250]
-o INT	 Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]
-P STR	 Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]

ということらしい
おそらく-ufはほぼ必須でしょうか。bcftoolsにパイプするときはoutputオプションの-uで非圧縮が推奨だそうで。-fでreferenceのfastaファイルを指定。-dでreadの上限を制限できるようです。また-rで解析する領域を絞れるようなので、とりあえず狭い範囲で実行してみて様子を見てみるのがいいかもしれません。-r chr01:1,000,000-2,000,000という感じの書式です。

bcftoolsのオプションは

Input/Output Options: 
-A
 	 Retain all possible alternate alleles at variant sites. By default, the view command discards unlikely alleles.
-b	 Output in the BCF format. The default is VCF.
-D FILE	 Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null]
-F	 Indicate PL is generated by r921 or before (ordering is different).
-G	 Suppress all individual genotype information.
-l FILE	 List of sites at which information are outputted [all sites]
-N	 Skip sites where the REF field is not A/C/G/T
-Q	 Output the QCALL likelihood format
-s FILE	 List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null]
-S	 The input is VCF instead of BCF.
-u	 Uncompressed BCF output (force -b).
Consensus/Variant Calling Options: 
-c
 	 Call variants using Bayesian inference. This option automatically invokes option -e.
-d FLOAT	 When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0]
-e	 Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT.
-g	 Call per-sample genotypes at variant sites (force -c)
-i FLOAT	 Ratio of INDEL-to-SNP mutation rate [0.15]
-p FLOAT	 A site is considered to be a variant if P(ref|D)<FLOAT [0.5]
-P STR	 Prior or initial allele frequency spectrum. If STR can be full, cond2, flat or the file consisting of error output from a previous variant calling run.
-t FLOAT	 Scaled muttion rate for variant calling [0.001]
-T STR	 Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null]
-v	 Output variant sites only (force -c)
Contrast Calling and Association Test Options: 
-1 INT
 	 Number of group-1 samples. This option is used for dividing the samples into two groups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0]
-U INT	 Number of permutations for association test (effective only with -1) [0]
-X FLOAT	 Only perform permutations for P(chi^2)<FLOAT (effective only with -U) [0.01]

だそうで。

bをつけて一旦bcfという形式に落としておいてからvcfutils.plというのでさらにオプションを付ける方法があるらしい。

現在のところbcftoolsにはvcfファイルを読み込んでbcfに書き戻す機能はないらしい。vcfutilsを単体で利用できるかは調査中。

なお吐き出されたvcfファイルのサンプルは以下の様な感じ

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

参考にしたページ
http://samtools.sourceforge.net/mpileup.shtml
https://wikis.utexas.edu/display/bioiteam/Variant+calling+using+SAMtools
http://ged.msu.edu/angus/tutorials-2013/snp_tutorial.html