環境構築の覚え
基本的には次世代シーケンサーDRY解析教本に従うが、違うとこ、気づいたことだけ書いておく。
- Javaのバージョンチェック
#El Capitanのデフォルトは1.6→1.7以上にバージョンアップ(今回は現時点の最新バージョン1.8.0_131-b11)
以前GATKが1.8でじゃ動かなかったらしいが、今は動くらしいので最新版でよさげ。
- MACSとCEASはsudoでインストール→/usr/local/bin/以下にインストールされているが、ngsplotはダウンロードしたところで展開されているだけなのでパスをexportしておく必要がある。
- 教本とは違うディレクトリにGenomeJackを置いた。
練習として比較するサンプルは
ChIP-seq (H3K4me3化されている領域)
FAIRE-seq (オープンクロマチン領域)
それぞれ脂肪化誘導day0, day8のマウス3T3-L1細胞サンプル
レファレンスゲノムはUCSC mm9
さて、まずはマッピング
bowtieでマッピング→samtools viewでSAMをBAMに変換→samtools sortでソート→samtools indexでインデックスを作る。
ここまでプロセッサ数くらいしか弄る項目はないので、スクリプトで一気に処理してしまって問題なし。1ファイルあたり1時間もかからないはず。
fastqファイルのあるディレクトリに移動して
スクリプトbowtie_bam_iに引数としてfastqファイルをまとめてD&Dしてリターン
#!/bin/sh mkdir -p ./sam mkdir -p ./bam #Pre-built indexesを指定 read -p "Pre-built indexes (full path) = " pbi ebwt=${pbi%.*.ebwt} for file in $@ do #ファイル名を取り出す fname="${file##*/}" #拡張子を除く name_pref="${fname%.*}" echo "bowtie -m 1 -p 4 --sam "$ebwt" -q "$fname" "$name_pref".sam" bowtie -m 1 -p 4 --sam "$ebwt" -q "$fname" sam/"$name_pref".sam echo "samtools view -@ 4 -S -b sam/"$name_pref".sam > bam/"$name_pref".bam" samtools view -@ 4 -S -b sam/"$name_pref".sam > bam/"$name_pref".bam echo "samtools sort -@ 4 bam/"$name_pref".bam bam/"$name_pref"_sorted" samtools sort -@ 4 bam/"$name_pref".bam -o bam/"$name_pref"_sorted.bam echo "samtools index -@ 4 bam/"$name_pref"_sorted.bam" samtools index -@ 4 bam/"$name_pref"_sorted.bam done
スクリプトmacs_ceas
最初のfastqファイルのあるディレクトリで実行するとその階層にresultsディレクトリを生成
前の手順で出来上がった*_sorted.bamを処理する。(スクリプトにD&D)
#!/bin/sh #実験名指定 read -p "experient name = " exp_name #reference file read -p "reference file (.refgene) = " refgene file=$@ #ファイル名を取り出す fname="${file##*/}" #拡張子を除く name_pref="${fname%.*}" mkdir -p results/"$name_pref"_"$exp_name" cd results/"$name_pref"_"$exp_name" #MACSでピークコール macs14 -t "$file" --name="$name_pref"_"$exp_name" -f BAM -g mm -S --wig #CEASでアノテーション ceas --name="$name_pref"_"$exp_name"_CEAS -g "$refgene" -b "$name_pref"_"$exp_name"_peaks.bed cd ../../
ChIP-seqピークコール
MACSを使って処理
並列処理はできないっぽいので単純に複数jobを同時進行でやってもらおう。ただし手作業で順番にやっても大して時間はかからない。
オプションについてはもうちょっと調査が必要か。
- -pで検出できるピーク領域に対するp値の閾値を与えられるらしい?(デフォルトは1e^-5)
- -tで解析したいファイルを指定
- -n (又は--name)で実験の名前→結果に表示
- -fがフォーマット。今回の場合BAMで行っている
- -g ゲノムサイズ指定 mmはマウス、デフォルトはヒトでhsになっている
- -S --wig このあたり良くわからない
$ macs14 -help Usage: macs14 <-t tfile> [-n name] [-g genomesize] [options] Example: macs14 -t ChIP.bam -c Control.bam -f BAM -g h -n test -w --call-subpeaks macs14 -- Model-based Analysis for ChIP-Sequencing Options: --version show program's version number and exit -h, --help show this help message and exit. -t TFILE, --treatment=TFILE ChIP-seq treatment files. REQUIRED. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_1_1_eland_multi.txt,s_1_2_eland_multi.txt -c CFILE, --control=CFILE Control files. When ELANDMULTIPET is selected, you must provide two files separated by comma, e.g. s_2_1_eland_multi.txt,s_2_2_eland_multi.txt -n NAME, --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA" -f FORMAT, --format=FORMAT Format of tag file, "AUTO", "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "ELANDEXPORT" or "SAM" or "BAM" or "BOWTIE". The default AUTO option will let MACS decide which format the file is. Please check the definition in 00README file if you choose EL AND/ELANDMULTI/ELANDMULTIPET/ELANDEXPORT/SAM/BAM/BOWTI E. DEFAULT: "AUTO" --petdist=PETDIST Best distance between Pair-End Tags. Only available when format is 'ELANDMULTIPET'. DEFAULT: 200 -g GSIZE, --gsize=GSIZE Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs -s TSIZE, --tsize=TSIZE Tag size. This will overide the auto detected tag size. --bw=BW Band width. This value is only used while building the shifting model. DEFAULT: 300 -p PVALUE, --pvalue=PVALUE Pvalue cutoff for peak detection. DEFAULT: 1e-5 -m MFOLD, --mfold=MFOLD Select the regions within MFOLD range of high- confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30 --nolambda If True, MACS will use fixed background lambda as local lambda for every peak region. Normally, MACS calculates a dynamic local lambda to reflect the local bias due to potential chromatin structure. --slocal=SMALLLOCAL The small nearby region in basepairs to calculate dynamic lambda. This is used to capture the bias near the peak summit region. Invalid if there is no control data. DEFAULT: 1000 --llocal=LARGELOCAL The large nearby region in basepairs to calculate dynamic lambda. This is used to capture the surround bias. DEFAULT: 10000. --on-auto Whether turn on the auto pair model process. If set, when MACS failed to build paired model, it will use the nomodel settings, the '--shiftsize' parameter to shift and extend each tags. DEFAULT: False --nomodel Whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100, try to set shiftsize to change it. DEFAULT: False --shiftsize=SHIFTSIZE The arbitrary shift size in bp. When nomodel is true, MACS will use this value as 1/2 of fragment size. DEFAULT: 100 --keep-dup=KEEPDUPLICATES It controls the MACS behavior towards duplicate tags at the exact same location -- the same coordination and the same strand. The 'auto' option makes MACS calculate the maximum tags at the exact same location based on binomal distribution using 1e-5 as pvalue cutoff; and the 'all' option keeps every tags. If an integer is given, at most this number of tags will be kept at the same location. Default: 1. To only keep one performs the best in terms of detecting enriched regions, from our internal study. --to-large When set, scale the small sample up to the bigger sample. By default, the bigger dataset will be scaled down towards the smaller dataset, which will lead to smaller pvalues and more specific results. Keep in mind that scaling down will bring down background noise more. DEFAULT: False -w, --wig Whether or not to save extended fragment pileup at every WIGEXTEND bps into a wiggle file. When --single- profile is on, only one file for the whole genome is saved. WARNING: this process is time/space consuming!! -B, --bdg Whether or not to save extended fragment pileup at every bp into a bedGraph file. When it's on, -w, --space and --call-subpeaks will be ignored. When --single-profile is on, only one file for the whole genome is saved. WARNING: this process is time/space consuming!! -S, --single-profile When set, a single wiggle file will be saved for treatment and input. Default: False --space=SPACE The resoluation for saving wiggle files, by default, MACS will save the raw tag count every 10 bps. Usable only with '--wig' option. --call-subpeaks If set, MACS will invoke Mali Salmon's PeakSplitter soft through system call. If PeakSplitter can't be found, an instruction will be shown for downloading and installing the PeakSplitter package. -w option needs to be on and -B should be off to let it work. DEFAULT: False --verbose=VERBOSE Set verbose level. 0: only show critical message, 1: show additional warning message, 2: show process information, 3: show debug messages. DEFAULT:2 --diag Whether or not to produce a diagnosis report. It's up to 9X time consuming. Please check 00README file for detail. DEFAULT: False --fe-min=FEMIN For diagnostics, min fold enrichment to consider. DEFAULT: 0 --fe-max=FEMAX For diagnostics, max fold enrichment to consider. DEFAULT: maximum fold enrichment --fe-step=FESTEP For diagnostics, fold enrichment step. DEFAULT: 20
CEASによるピーク領域へのアノテーション
オプションはほとんどデフォルトで
$ ceas -help /Library/Python/2.7/site-packages/CEAS/inout.py:64: UserWarning: sqlite3 is used instead of MySQLdb because MySQLdb is not installed warnings.warn("sqlite3 is used instead of MySQLdb because MySQLdb is not installed") Usage: ceas < input files > [options] CEAS (Cis-regulatory Element Annotation System) Options: --version show program's version number and exit -h, --help Show this help message and exit. -b BED, --bed=BED BED file of ChIP regions. -w WIG, --wig=WIG WIG file for either wig profiling or genome background annotation. WARNING: --bg flag must be set for genome background re-annotation. -e EBED, --ebed=EBED BED file of extra regions of interest (eg, non-coding regions) -g GDB, --gt=GDB Gene annotation table (eg, a refGene table in sqlite3 db format provided through the CEAS web, http://liulab.dfci.harvard.edu/CEAS/download.html). --name=NAME Experiment name. This will be used to name the output files. If an experiment name is not given, the stem of the input BED file name will be used instead (eg, if 'peaks.bed', 'peaks' will be used as a name.) --sizes=SIZES Promoter (also dowsntream) sizes for ChIP region annotation. Comma-separated three values or a single value can be given. If a single value is given, it will be segmented into three equal fractions (ie, 3000 is equivalent to 1000,2000,3000), DEFAULT: 1000,2000,3000. WARNING: Values > 10000bp are automatically set to 10000bp. --bisizes=BISIZES Bidirectional-promoter sizes for ChIP region annotation Comma-separated two values or a single value can be given. If a single value is given, it will be segmented into two equal fractions (ie, 5000 is equivalent to 2500,5000) DEFAULT: 2500,5000bp. WARNING: Values > 20000bp are automatically set to 20000bp. --bg Run genome BG annotation again. WARNING: This flag is effective only if a WIG file is given through -w (--wig). Otherwise, ignored. --span=SPAN Span from TSS and TTS in the gene-centered annotation. ChIP regions within this range from TSS and TTS are considered when calculating the coverage rates in promoter and downstream, DEFAULT=3000bp --pf-res=PF_RES Wig profiling resolution, DEFAULT: 50bp. WARNING: Value smaller than the wig interval (resolution) may cause aliasing error. --rel-dist=REL_DIST Relative distance to TSS/TTS in wig profiling, DEFAULT: 3000bp --gn-groups=GN_GROUPS Gene-groups of particular interest in wig profiling. Each gene group file must have gene names in the 1st column. The file names are separated by commas w/ no space (eg, --gn-groups=top10.txt,bottom10.txt) --gn-group-names=GN_NAMES The names of the gene groups in --gn-groups. The gene group names are separated by commas. (eg, --gn-group- names='top 10%,bottom 10%'). These group names appear in the legends of the wig profiling plots. If no group names given, the groups are represented as 'Group 1, Group2,...Group n'. --gname2 Whether or not use the 'name2' column of the gene annotation table when reading the gene IDs in the files given through --gn-groups. This flag is meaningful only with --gn-groups. --dump Whether to save the raw profiles of near TSS, TTS, and gene body. The file names have a suffix of 'TSS', 'TTS', and 'gene' after the name.