DEXSeqもう一度 - kuroの覚え書き

今回は完全にDEXSeqだけのレシピでまずやってみる。
http://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf
環境はHGCのスパコン。

pythonの環境は構築したのでRについてちょっと整備。
これまでRはローカルのMacでやっていたが、この際すべてスパコン上でやってしまいたい。

ということでDEXSeqをHGCにインストールする。
ホームディレクトリにインストールするため~/.Rディレクトリを作成
http://bioconductor.org/packages/release/bioc/src/contrib/DEXSeq_1.22.0.tar.gz
からソースをダウンロードして

$ R CMD INSTALL DEXSeq_1.22.0.tar.gz
* installing to library ‘/yshare1/home/hoge/.R’
ERROR: dependency ‘statmod’ is not available for package ‘DEXSeq’
* removing ‘/yshare1/home/hoge/.R/DEXSeq’

とエラーが出る。statmodがないからのようなので

$ wget https://cran.rstudio.com/src/contrib/statmod_1.4.30.tar.gz
--2017-07-19 12:59:06--  https://cran.rstudio.com/src/contrib/statmod_1.4.30.tar.gz
cran.rstudio.com をDNSに問いあわせています... 52.85.1.11
cran.rstudio.com|52.85.1.11|:443 に接続しています... 接続しました。
HTTP による接続要求を送信しました、応答を待っています... 200 OK
長さ: 57309 (56K) [application/x-gzip]
`statmod_1.4.30.tar.gz' に保存中

100%[============================================================================>] 57,309      --.-K/s 時間 0.001s  

2017-07-19 12:59:06 (81.9 MB/s) - `statmod_1.4.30.tar.gz' へ保存完了 [57309/57309]

$ R CMD INSTALL statmod_1.4.30.tar.gz 
* installing to library ‘/yshare1/home/hoge/.R’
* installing *source* package ‘statmod’ ...
**  パッケージ ‘statmod’ の解凍および MD5 サムの検証に成功しました 
** libs
/usr/local/bin/gfortran   -fpic  -I/usr/local/include  -c gaussq2.f -o gaussq2.o
/usr/local/bin/gcc -std=gnu99 -I/usr/local/package/r/3.4.1/lib64/R/include -DNDEBUG   -I/usr/local/include   -fpic  -I/usr/local/include  -c init.c -o init.o
/usr/local/bin/gcc -std=gnu99 -shared -L/usr/local/package/r/3.4.1/lib64/R/lib -L/usr/local/lib64 -L/usr/local/lib -o statmod.so gaussq2.o init.o -lgfortran -lm -lquadmath -L/usr/local/package/r/3.4.1/lib64/R/lib -lR
installing to /yshare1/home/hoge/.R/statmod/libs
** R
** data
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (statmod)

$ R CMD INSTALL DEXSeq_1.22.0.tar.gz 
* installing to library ‘/yshare1/home/hoge/.R’
* installing *source* package ‘DEXSeq’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (DEXSeq)

これでインストールできた。

$ ls .R
DEXSeq  statmod
$ ls .R/DEXSeq/
CITATION  DESCRIPTION  INDEX  Meta  NAMESPACE  NEWS  R  doc  help  html  python_scripts

Rのパッケージに含まれるpythonのスクリプト

$ ls .R/DEXSeq/python_scripts/
dexseq_count.py  dexseq_prepare_annotation.py

このdexseq_prepare_annotation.pyを使ってアノテーションファイルの作成を行う。

#!/bin/sh

#$ -S /bin/bash
#$ -cwd

python /home/hoge/.R/DEXSeq/python_scripts/dexseq_prepare_annotation.py \
        /home/hoge/genome_reference/Homo_sapiens/Ensembl/GRCh37/Annotation/Archives/archive-current/Genes/genes.gtf \
        /home/hoge/genome_reference/Homo_sapiens/Ensembl/GRCh37/genes.gff

そしてリードカウントをcontrolとknockdownについて実施

#!/bin/sh

#$ -S /bin/bash
#$ -cwd

python /home/hoge/.R/DEXSeq/python_scripts/dexseq_count.py -f bam -s yes \
        /home/hoge/genome_reference/Homo_sapiens/Ensembl/GRCh37/genes.gff \
        /home/hoge/SRSF10/hisat_results/control/control.sort.bam \
        /home/hoge/SRSF10/hisat_results/control/control_fb170719.txt

$ head SRSF10/hisat_results/control/control_fb170719.txt 
ENSG00000000003:001	288
ENSG00000000003:002	128
ENSG00000000003:003	108
ENSG00000000003:004	79
ENSG00000000003:005	34
ENSG00000000003:006	25
ENSG00000000003:007	14
ENSG00000000003:008	17
ENSG00000000003:009	122
ENSG00000000003:010	5

#!/bin/sh

#$ -S /bin/bash
#$ -cwd

python /home/hoge/.R/DEXSeq/python_scripts/dexseq_count.py -f bam -p yes\
        /home/hoge/genome_reference/Homo_sapiens/Ensembl/GRCh37/genes.gff \
        /home/hoge/SRSF10/hisat_results/knockdown/knockdown.sort.bam \
        /home/hoge/SRSF10/hisat_results/knockdown/knockdown_fb170719.txt

$ head SRSF10/dexseq/knockdown_fb170719.txt 
ENSG00000000003:001	2
ENSG00000000003:002	0
ENSG00000000003:003	0
ENSG00000000003:004	0
ENSG00000000003:005	0
ENSG00000000003:006	0
ENSG00000000003:007	0
ENSG00000000003:008	0
ENSG00000000003:009	0
ENSG00000000003:010	0

controlの方はそこそこ大きい数値が出てきたのに、knockdownの方はほとんど０。なにかおかしいような感じがする。

controlの方ペアエンドのオプション　-pを-sと書き間違えている。やり直したらcontrolのほうも殆どゼロのファイルとなった。
それはそれでいいのか謎。

とにかくスパコンでRをバッチで実行させる

r_batch.Rというスクリプトに

inDir <- "~/SRSF10/dexseq"
countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE)
basename(countFiles)
flattenedFile <- list.files("~/genome_reference/Homo_sapiens/Ensembl/GRCh37", pattern="gff$", full.names=TRUE)
basename(flattenedFile)
samp <- data.frame(row.names = c("SRR1271845","SRR1271846"), condition = rep(c("control","knockdown"),each=1))
suppressPackageStartupMessages( library( "DEXSeq" ) )
dxd <- DEXSeqDataSetFromHTSeq(countFiles, sampleData=samp, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )

colData(dxd)

#access the first 5 rows from the count data
head( counts(dxd), 5 )

#Normalisation
dxd <- estimateSizeFactors( dxd )

#Dispersion estimation
dxd <- estimateDispersions( dxd )

png(filename = "test.png", width = 600, height = 600)
plotDispEsts(dxd)
dev.off()

# Test for differential exon usage in each gene
dxd <- testForDEU(dxd)

# Estimate relative exon usage fold changes
dxd <- estimateExonFoldChanges(dxd, fitExpToVar="condition")

# Output DEXSeq result
dxr1 <- DEXSeqResults(dxd)
write.table(dxr1, quote = F, sep = "\t", file = "DEXSeq_test_result.txt")

このように書いておき、
qsubで

#!bin/sh
#$ -S /bin/bash
#$ -cwd

R CMD BATCH ~/r_batch.R

を実行する。

すると、またまたエラー・・・

R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> inDir <- "~/SRSF10/dexseq"
> countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE)
> basename(countFiles)
[1] "control_170720fb.txt"   "knockdown_170719fb.txt"
> flattenedFile <- list.files("~/genome_reference/Homo_sapiens/Ensembl/GRCh37", pattern="gff$", full.names=TRUE)
> basename(flattenedFile)
[1] "genes.gff"
> samp <- data.frame(row.names = c("SRR1271845","SRR1271846"), condition = rep(c("control","knockdown"),each=1))
> suppressPackageStartupMessages( library( "DEXSeq" ) )
> dxd <- DEXSeqDataSetFromHTSeq(countFiles, sampleData=samp, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )
converting counts to integer mode
> 
> colData(dxd)
DataFrame with 4 rows and 3 columns
      sample condition     exon
    <factor>  <factor> <factor>
1 SRR1271845   control     this
2 SRR1271846 knockdown     this
3 SRR1271845   control   others
4 SRR1271846 knockdown   others
> 
> #access the first 5 rows from the count data
> head( counts(dxd), 5 )
                     [,1] [,2] [,3] [,4]
ENSG00000000003:E001    0    2    0    0
ENSG00000000003:E002    0    0    0    2
ENSG00000000003:E003    0    0    0    2
ENSG00000000003:E004    0    0    0    2
ENSG00000000003:E005    0    0    0    2
>

ここまでは順調。

> #Normalisation
> dxd <- estimateSizeFactors( dxd )
Note: method with signature 'RangedSummarizedExperiment#DataFrame' chosen for function 'colData<-',
 target signature 'DEXSeqDataSet#DataFrame'.
 "SummarizedExperiment#DataFrame" would also be valid
> 
> #Dispersion estimation
> dxd <- estimateDispersions( dxd )
using supplied model matrix
Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) : 
  all gene-wise dispersion estimates are within 2 orders of magnitude
  from the minimum value, and so the standard curve fitting techniques will not work.
  One can instead use the gene-wise estimates as final estimates:
  dds <- estimateDispersionsGeneEst(dds)
  dispersions(dds) <- mcols(dds)$dispGeneEst
  ...then continue with testing using nbinomWaldTest or nbinomLRT
Calls: estimateDispersions ... estimateDispersions -> .local -> estimateDispersionsFit
Execution halted

ここで終了。

結局のところread countがちゃんとできていないことが根本要因のような。
マニュアルをよーく読むとBAM(SAM)はソートされていて、何でソートしたかを-rオプションで入れる必要がありそうな。
samtools sortは-nをつけない限りポジションでソートしているはずなので-r posが必要なのかも。
しかしこのオプションつけて実行するとすごい数のエラーを吐かれたのだが。

逆にこのオプションを付けてないと

/usr/local/package/python2.7/current/lib/python2.7/site-packages/HTSeq/__init__.py:589: UserWarning: Read SRR1271845.20831815 claims to have an aligned mate which could not be found in an adjacent line.
  "which could not be found in an adjacent line." )
100000 reads processed.
200000 reads processed.
300000 reads processed.
400000 reads processed.
500000 reads processed.
600000 reads processed.
/usr/local/package/python2.7/current/lib/python2.7/site-packages/HTSeq/__init__.py:622: UserWarning: 62096308 reads with missing mate encountered.
  warnings.warn( "%d reads with missing mate encountered." % mate_missing_count[0] )

というログが残っている。よく見ると60万リードしか処理できてなくて残りの6200万リードあまりはmissing mateとなっている。
これではいけない。大量にエラーが出ても6200万リードを全部処理してもらわないと。

R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R は、自由なソフトウェアであり、「完全に無保証」です。 
一定の条件に従えば、自由にこれを再配布することができます。 
配布条件の詳細に関しては、'license()' あるいは 'licence()' と入力してください。 

R は多くの貢献者による共同プロジェクトです。 
詳しくは 'contributors()' と入力してください。 
また、R や R のパッケージを出版物で引用する際の形式については 
'citation()' と入力してください。 

'demo()' と入力すればデモをみることができます。 
'help()' とすればオンラインヘルプが出ます。 
'help.start()' で HTML ブラウザによるヘルプがみられます。 
'q()' と入力すれば R を終了します。 

> inDir <- "~/SRSF10/dexseq"
> countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE)
> basename(countFiles)
[1] "SRR1271845fb.txt" "SRR1271846fb.txt"
> flattenedFile <- list.files("~/genome_reference/Homo_sapiens/Ensembl/GRCh37", pattern="gff$", full.names=TRUE)
> basename(flattenedFile)
[1] "genes.gff"
> samp <- data.frame(row.names = c("SRR1271845","SRR1271846"), condition = rep(c("control","knockdown"),each=1))
> suppressPackageStartupMessages( library( "DEXSeq" ) )
> dxd <- DEXSeqDataSetFromHTSeq(countFiles, sampleData=samp, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile )
converting counts to integer mode
> 
> colData(dxd)
DataFrame with 4 rows and 3 columns
      sample condition     exon
    <factor>  <factor> <factor>
1 SRR1271845   control     this
2 SRR1271846 knockdown     this
3 SRR1271845   control   others
4 SRR1271846 knockdown   others
> 
> #access the first 5 rows from the count data
> head( counts(dxd), 5 )
                     [,1] [,2] [,3] [,4]
ENSG00000000003:E001  197  165  537  387
ENSG00000000003:E002   76   55  658  497
ENSG00000000003:E003   62   40  672  512
ENSG00000000003:E004   46   23  688  529
ENSG00000000003:E005   19    5  715  547
> 
> #Normalisation
> dxd <- estimateSizeFactors( dxd )
Note: method with signature ‘RangedSummarizedExperiment#DataFrame’ chosen for function ‘colData<-’,
 target signature ‘DEXSeqDataSet#DataFrame’.
 "SummarizedExperiment#DataFrame" would also be valid
> 
> #Dispersion estimation
> dxd <- estimateDispersions( dxd )
using supplied model matrix
 estimateDispersionsFit(object, fitType = fitType, quiet = quiet) でエラー: 
  all gene-wise dispersion estimates are within 2 orders of magnitude
  from the minimum value, and so the standard curve fitting techniques will not work.
  One can instead use the gene-wise estimates as final estimates:
  dds <- estimateDispersionsGeneEst(dds)
  dispersions(dds) <- mcols(dds)$dispGeneEst
  ...then continue with testing using nbinomWaldTest or nbinomLRT
 呼び出し:  estimateDispersions ... estimateDispersions -> .local -> estimateDispersionsFit
 実行が停止されました

やはりエラーで止まる。

結局のところ今回もちいているデータセットでは統計的に有意となるものが全く引っかからないようで、それが統計処理まえのマッピングに原因があるのか、マッピングされたリードのカウント方法に問題があるのか、アノテーションファイルに問題があるのか、等の切り分けができていないことに最も重大な問題があるようだ。