kuroの覚え書き

96の個人的覚え書き

ChIP-seq

環境構築の覚え

基本的には次世代シーケンサーDRY解析教本に従うが、違うとこ、気づいたことだけ書いておく。

  • Javaのバージョンチェック 

#El Capitanのデフォルトは1.6→1.7以上にバージョンアップ(今回は現時点の最新バージョン1.8.0_131-b11)
以前GATKが1.8でじゃ動かなかったらしいが、今は動くらしいので最新版でよさげ。

  • MACSとCEASはsudoでインストール→/usr/local/bin/以下にインストールされているが、ngsplotはダウンロードしたところで展開されているだけなのでパスをexportしておく必要がある。

練習として比較するサンプルは
ChIP-seq (H3K4me3化されている領域)
FAIRE-seq (オープンクロマチン領域)
それぞれ脂肪化誘導day0, day8のマウス3T3-L1細胞サンプル
レファレンスゲノムはUCSC mm9


さて、まずはマッピング
bowtieでマッピング→samtools viewでSAMをBAMに変換→samtools sortでソート→samtools indexでインデックスを作る。
ここまでプロセッサ数くらいしか弄る項目はないので、スクリプトで一気に処理してしまって問題なし。1ファイルあたり1時間もかからないはず。
fastqファイルのあるディレクトリに移動して
スクリプトbowtie_bam_iに引数としてfastqファイルをまとめてD&Dしてリターン

#!/bin/sh
mkdir -p ./sam
mkdir -p ./bam
#Pre-built indexesを指定
read -p "Pre-built indexes (full path) = " pbi
ebwt=${pbi%.*.ebwt}
for file in $@
do
#ファイル名を取り出す
fname="${file##*/}"
#拡張子を除く
name_pref="${fname%.*}"
echo "bowtie -m 1 -p 4 --sam "$ebwt" -q "$fname" "$name_pref".sam"
bowtie -m 1 -p 4 --sam "$ebwt" -q "$fname" sam/"$name_pref".sam
echo "samtools view -@ 4 -S -b sam/"$name_pref".sam > bam/"$name_pref".bam" 
samtools view -@ 4 -S -b sam/"$name_pref".sam > bam/"$name_pref".bam
echo "samtools sort -@ 4 bam/"$name_pref".bam bam/"$name_pref"_sorted"
samtools sort -@ 4 bam/"$name_pref".bam -o bam/"$name_pref"_sorted.bam
echo "samtools index -@ 4 bam/"$name_pref"_sorted.bam"
samtools index -@ 4 bam/"$name_pref"_sorted.bam
done

スクリプトmacs_ceas
最初のfastqファイルのあるディレクトリで実行するとその階層にresultsディレクトリを生成
前の手順で出来上がった*_sorted.bamを処理する。(スクリプトD&D)

#!/bin/sh
#実験名指定
read -p "experient name = " exp_name
#reference file
read -p "reference file (.refgene) = " refgene
file=$@
#ファイル名を取り出す
fname="${file##*/}"
#拡張子を除く
name_pref="${fname%.*}"

mkdir -p results/"$name_pref"_"$exp_name"
cd results/"$name_pref"_"$exp_name"

#MACSでピークコール
macs14 -t "$file" --name="$name_pref"_"$exp_name" -f BAM -g mm -S --wig

#CEASでアノテーション
ceas --name="$name_pref"_"$exp_name"_CEAS -g "$refgene" -b "$name_pref"_"$exp_name"_peaks.bed

cd ../../

ChIP-seqピークコール
MACSを使って処理
並列処理はできないっぽいので単純に複数jobを同時進行でやってもらおう。ただし手作業で順番にやっても大して時間はかからない。



オプションについてはもうちょっと調査が必要か。

  • -pで検出できるピーク領域に対するp値の閾値を与えられるらしい?(デフォルトは1e^-5)
  • -tで解析したいファイルを指定
  • -n (又は--name)で実験の名前→結果に表示
  • -fがフォーマット。今回の場合BAMで行っている
  • -g ゲノムサイズ指定 mmはマウス、デフォルトはヒトでhsになっている
  • -S --wig このあたり良くわからない
$ macs14 -help
Usage: macs14 <-t tfile> [-n name] [-g genomesize] [options]

Example: macs14 -t ChIP.bam -c Control.bam -f BAM -g h -n test -w --call-subpeaks


macs14 -- Model-based Analysis for ChIP-Sequencing

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit.
  -t TFILE, --treatment=TFILE
                        ChIP-seq treatment files. REQUIRED. When ELANDMULTIPET
                        is selected, you must provide two files separated by
                        comma, e.g.
                        s_1_1_eland_multi.txt,s_1_2_eland_multi.txt
  -c CFILE, --control=CFILE
                        Control files. When ELANDMULTIPET is selected, you
                        must provide two files separated by comma, e.g.
                        s_2_1_eland_multi.txt,s_2_2_eland_multi.txt
  -n NAME, --name=NAME  Experiment name, which will be used to generate output
                        file names. DEFAULT: "NA"
  -f FORMAT, --format=FORMAT
                        Format of tag file, "AUTO", "BED" or "ELAND" or
                        "ELANDMULTI" or "ELANDMULTIPET" or "ELANDEXPORT" or
                        "SAM" or "BAM" or "BOWTIE". The default AUTO option
                        will let MACS decide which format the file is. Please
                        check the definition in 00README file if you choose EL
                        AND/ELANDMULTI/ELANDMULTIPET/ELANDEXPORT/SAM/BAM/BOWTI
                        E. DEFAULT: "AUTO"
  --petdist=PETDIST     Best distance between Pair-End Tags. Only available
                        when format is 'ELANDMULTIPET'. DEFAULT: 200
  -g GSIZE, --gsize=GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs
  -s TSIZE, --tsize=TSIZE
                        Tag size. This will overide the auto detected tag
                        size.
  --bw=BW               Band width. This value is only used while building the
                        shifting model. DEFAULT: 300
  -p PVALUE, --pvalue=PVALUE
                        Pvalue cutoff for peak detection. DEFAULT: 1e-5
  -m MFOLD, --mfold=MFOLD
                        Select the regions within MFOLD range of high-
                        confidence enrichment ratio against background to
                        build model. The regions must be lower than upper
                        limit, and higher than the lower limit. DEFAULT:10,30
  --nolambda            If True, MACS will use fixed background lambda as
                        local lambda for every peak region. Normally, MACS
                        calculates a dynamic local lambda to reflect the local
                        bias due to potential chromatin structure.
  --slocal=SMALLLOCAL   The small nearby region in basepairs to calculate
                        dynamic lambda. This is used to capture the bias near
                        the peak summit region. Invalid if there is no control
                        data. DEFAULT: 1000
  --llocal=LARGELOCAL   The large nearby region in basepairs to calculate
                        dynamic lambda. This is used to capture the surround
                        bias. DEFAULT: 10000.
  --on-auto             Whether turn on the auto pair model process. If set,
                        when MACS failed to build paired model, it will use
                        the nomodel settings, the '--shiftsize' parameter to
                        shift and extend each tags. DEFAULT: False
  --nomodel             Whether or not to build the shifting model. If True,
                        MACS will not build model. by default it means
                        shifting size = 100, try to set shiftsize to change
                        it. DEFAULT: False
  --shiftsize=SHIFTSIZE
                        The arbitrary shift size in bp. When nomodel is true,
                        MACS will use this value as 1/2 of fragment size.
                        DEFAULT: 100
  --keep-dup=KEEPDUPLICATES
                        It controls the MACS behavior towards duplicate tags
                        at the exact same location -- the same coordination
                        and the same strand. The 'auto' option makes MACS
                        calculate the maximum tags at the exact same location
                        based on binomal distribution using 1e-5 as pvalue
                        cutoff; and the 'all' option keeps every tags. If an
                        integer is given, at most this number of tags will be
                        kept at the same location. Default: 1. To only keep
                        one performs the best in terms of detecting enriched
                        regions, from our internal study.
  --to-large            When set, scale the small sample up to the bigger
                        sample. By default, the bigger dataset will be scaled
                        down towards the smaller dataset, which will lead to
                        smaller pvalues and more specific results. Keep in
                        mind that scaling down will bring down background
                        noise more. DEFAULT: False
  -w, --wig             Whether or not to save extended fragment pileup at
                        every WIGEXTEND bps into a wiggle file. When --single-
                        profile is on, only one file for the whole genome is
                        saved. WARNING: this process is time/space consuming!!
  -B, --bdg             Whether or not to save extended fragment pileup at
                        every bp into a bedGraph file. When it's on, -w,
                        --space and --call-subpeaks will be ignored. When
                        --single-profile is on, only one file for the whole
                        genome is saved. WARNING: this process is time/space
                        consuming!!
  -S, --single-profile  When set, a single wiggle file will be saved for
                        treatment and input. Default: False
  --space=SPACE         The resoluation for saving wiggle files, by default,
                        MACS will save the raw tag count every 10 bps. Usable
                        only with '--wig' option.
  --call-subpeaks       If set, MACS will invoke Mali Salmon's PeakSplitter
                        soft through system call. If PeakSplitter can't be
                        found, an instruction will be shown for downloading
                        and installing the PeakSplitter package. -w option
                        needs to be on and -B should be off to let it work.
                        DEFAULT: False
  --verbose=VERBOSE     Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  --diag                Whether or not to produce a diagnosis report. It's up
                        to 9X time consuming. Please check 00README file for
                        detail. DEFAULT: False
  --fe-min=FEMIN        For diagnostics, min fold enrichment to consider.
                        DEFAULT: 0
  --fe-max=FEMAX        For diagnostics, max fold enrichment to consider.
                        DEFAULT: maximum fold enrichment
  --fe-step=FESTEP      For diagnostics, fold enrichment step.  DEFAULT: 20


CEASによるピーク領域へのアノテーション
オプションはほとんどデフォルトで

$ ceas  -help
/Library/Python/2.7/site-packages/CEAS/inout.py:64: UserWarning: sqlite3 is used instead of MySQLdb because MySQLdb is not installed
  warnings.warn("sqlite3 is used instead of MySQLdb because MySQLdb is not installed")
Usage: ceas < input files > [options]

CEAS (Cis-regulatory Element Annotation System)

Options:
  --version             show program's version number and exit
  -h, --help            Show this help message and exit.
  -b BED, --bed=BED     BED file of ChIP regions.
  -w WIG, --wig=WIG     WIG file for either wig profiling or genome background
                        annotation. WARNING: --bg flag must be set for genome
                        background re-annotation.
  -e EBED, --ebed=EBED  BED file of extra regions of interest (eg, non-coding
                        regions)
  -g GDB, --gt=GDB      Gene annotation table (eg, a refGene table in sqlite3
                        db format provided through the CEAS web,
                        http://liulab.dfci.harvard.edu/CEAS/download.html).
  --name=NAME           Experiment name. This will be used to name the output
                        files. If an experiment name is not given, the stem of
                        the input BED file name will be used instead (eg, if
                        'peaks.bed', 'peaks' will be used as a name.)
  --sizes=SIZES         Promoter (also dowsntream) sizes for ChIP region
                        annotation. Comma-separated three values or a single
                        value can be given. If a single value is given, it
                        will be segmented into three equal fractions (ie, 3000
                        is equivalent to 1000,2000,3000), DEFAULT:
                        1000,2000,3000. WARNING: Values > 10000bp are
                        automatically set to 10000bp.
  --bisizes=BISIZES     Bidirectional-promoter sizes for ChIP region
                        annotation Comma-separated two values or a single
                        value can be given. If a single value is given, it
                        will be segmented into two equal fractions (ie, 5000
                        is equivalent to 2500,5000) DEFAULT: 2500,5000bp.
                        WARNING: Values > 20000bp are automatically set to
                        20000bp.
  --bg                  Run genome BG annotation again. WARNING: This flag is
                        effective only if a WIG file is given through -w
                        (--wig). Otherwise, ignored.
  --span=SPAN           Span from TSS and TTS in the gene-centered annotation.
                        ChIP regions within this range from TSS and TTS are
                        considered when calculating the coverage rates in
                        promoter and downstream, DEFAULT=3000bp
  --pf-res=PF_RES       Wig profiling resolution, DEFAULT: 50bp. WARNING:
                        Value smaller than the wig interval (resolution) may
                        cause aliasing error.
  --rel-dist=REL_DIST   Relative distance to TSS/TTS in wig profiling,
                        DEFAULT: 3000bp
  --gn-groups=GN_GROUPS
                        Gene-groups of particular interest in wig profiling.
                        Each gene group file must have gene names in the 1st
                        column. The file names are separated by commas w/ no
                        space (eg, --gn-groups=top10.txt,bottom10.txt)
  --gn-group-names=GN_NAMES
                        The names of the gene groups in --gn-groups. The gene
                        group names are separated by commas. (eg, --gn-group-
                        names='top 10%,bottom 10%'). These group names appear
                        in the legends of the wig profiling plots. If no group
                        names given, the groups are represented as 'Group 1,
                        Group2,...Group n'.
  --gname2              Whether or not use the 'name2' column of the gene
                        annotation table when reading the gene IDs in the
                        files given through --gn-groups. This flag is
                        meaningful only with --gn-groups.
  --dump                Whether to save the raw profiles of near TSS, TTS, and
                        gene body. The file names have a suffix of 'TSS',
                        'TTS', and 'gene' after the name.