RNA-seq解析方法検討 - kuroの覚え書き

散漫的にやっていても埒が明かないので
実績のあるサンプルデータを使って、論文に書かれた方法を一からトレースしてみることにする。

https://academic.oup.com/bib/article/doi/10.1093/bib/bbx034/3108818/CASH-a-constructing-comprehensive-splice-site

この論文を参考にしてみる。
CASHというAlternative splicing解析ツールを他のツール（MISO、rMAT、DEXSeq、cufflinks)と比較しているようだ。
使用Cell line: Human RKO cell
Target gene: SRSF10

解析に用いるデータセットは
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE%2057325
SRR1271845: control
SRR1271846: knockdown mutant

NCBIからダウンロードしたファイルは破損しているのかSRAtoolsのfastq-dumpで展開できなかったので
http://www.ebi.ac.uk/ena/data/view/SRX533524
http://www.ebi.ac.uk/ena/data/view/SRX533525
ここからfastqファイルをダウンロードしてきた。
DDBJからもダウンロードできるようだ。
http://trace.ddbj.nig.ac.jp/dra/index.html

QC & Trimming

それぞれペアエンドをセットでTrimmomaticでトリミング実施（クオリティ20以上30b以上）
SRR1271845_1
before Total Sequences 32601421
after Total Sequences 32277191
SRR1271845_2
before Total Sequences 32601421
after Total Sequences 32277191

SRR1271846_1
before Total Sequences 32052020
after Total Sequences 31760211
SRR1271846_2
before Total Sequences 32052020
after Total Sequences 31760211

HISAT2でマッピング（リファレンスはGRCh37を使用する）

ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch37_snp_tran.tar.gz
トリミングしないバージョンも同時に試すことにする。
同時進行でcufflinksも実施しておく。

cuffmerge -> cuffdiff

これまでにやってきた手順で問題なく終了。
データをどう読み解くか
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3869392/
この論文をよく読む。
cuffdiff_resultsに生成するファイル

- differential expression test

gene_exp_diff, cds_exp.diff, isoform_exp.diff　発現量の差があるかの統計解析
今回のテスト実験ではそれぞれ28, 15, 0の有意判定がついていた。

- Differential splicing tests - splicing.diff

今回のテスト実験では有意判定の付いたものは0
やっぱり検出力が低いということか。
なお、トリミングなしでマッピングしたとき発現量で有意となるものは若干増えるがsplicingに関しては0で変わらず。

DEXSeq(http://bioconductor.org/packages/release/bioc/html/DEXSeq.html)による解析

http://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf
に従うものとする。（ただし、マッピングはHISAT2)
DEXSeqにかける前のステップはfeatureCountで試してみているが、エラーが解消されないので以下の方法も試してみる。
HTseqのインストール
前にアノテーションファイルを作るときにdexseq_prepare_annotation2.py（Subread_to_DEXSeq-master　https://github.com/vivekbhr/Subread_to_DEXSeq）を使用しようとしたらHTseqを要求されたのでインストール済みである。

- Preparing the annotation

/Library/Frameworks/R.framework/Versions/3.3/Resources/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py
前回dexseq_prepare_annotation2.pyで作成したのでそれを使ってみる。だめならここに戻る。2のほうはfeatureCountがgffでなくgtfを使うので、gtfも書き出せるように拡張したものと思われる。gffは同じものができると思う。

- Counting read

/Library/Frameworks/R.framework/Versions/3.3/Resources/library/DEXSeq/python_scripts/dexseq_count.py
ペアエンドのとき -p yes
BAMファイルのとき -f bam
（デフォルトはSAM)　BAMを使うときはpysamをインストールしておかなければならない。
BAM（SAM)はsortされていなければならない。samtoolsでsortしたファイルを使う。

$ python /Library/Frameworks/R.framework/Versions/3.3/Resources/library/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam ~/local/bin/Subread_to_DEXSeq-master/genesDEXSeq.gff ./hisat_results/SRR1271846/SRR1271846.sort.bam SRR1271846_fb.txt
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg/HTSeq/__init__.py:589: UserWarning: Read SRR1271846.7154720 claims to have an aligned mate which could not be found in an adjacent line.
  "which could not be found in an adjacent line." )
100000 reads processed.
200000 reads processed.
300000 reads processed.
400000 reads processed.
500000 reads processed.
600000 reads processed.
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg/HTSeq/__init__.py:622: UserWarning: 62096308 reads with missing mate encountered.
  warnings.warn( "%d reads with missing mate encountered." % mate_missing_count[0] )

ここからR

> inDir = "~/expression/SRSF10"
> countFiles = list.files(inDir, pattern="fb.txt$", full.names=TRUE)
> basename(countFiles)
[1] "SRR1271845_fb.txt" "SRR1271846_fb.txt"
> flattenedFile = list.files("~/local/bin/Subread_to_DEXSeq-master", pattern="gff$", full.names=TRUE)
> basename(flattenedFile)
[1] "genesDEXSeq.gff"
> samp <- data.frame(row.names = c("control","mutant"),
+ condition = rep(c("control","knockdown"),each=1))
> suppressPackageStartupMessages( library( "DEXSeq" ) )
> dxd = DEXSeqDataSetFromHTSeq(
+     countFiles,
+     sampleData=samp,
+     design= ~ sample + exon + condition:exon,
+     flattenedfile=flattenedFile )
converting counts to integer mode

> colData(dxd)
DataFrame with 4 rows and 3 columns
    sample condition     exon
  <factor>  <factor> <factor>
1  control   control     this
2   mutant knockdown     this
3  control   control   others
4   mutant knockdown   others
> head( counts(dxd), 2 )
                     [,1] [,2] [,3] [,4]
ENSG00000000003:E001    0    2    0    0
ENSG00000000003:E002    0    0    0    2
> head( counts(dxd), 5 )
                     [,1] [,2] [,3] [,4]
ENSG00000000003:E001    0    2    0    0
ENSG00000000003:E002    0    0    0    2
ENSG00000000003:E003    0    0    0    2
ENSG00000000003:E004    0    0    0    2
ENSG00000000003:E005    0    0    0    2
> split( seq_len(ncol(dxd)), colData(dxd)$exon )
$this
[1] 1 2

$others
[1] 3 4

> head( featureCounts(dxd), 5 )
                     control mutant
ENSG00000000003:E001       0      2
ENSG00000000003:E002       0      0
ENSG00000000003:E003       0      0
ENSG00000000003:E004       0      0
ENSG00000000003:E005       0      0
> head( rowRanges(dxd), 3 )
GRanges object with 3 ranges and 5 metadata columns:
                       seqnames               ranges strand |   featureID
                          <Rle>            <IRanges>  <Rle> | <character>
  ENSG00000000003:E001        X [99883667, 99884983]      - |        E001
  ENSG00000000003:E002        X [99885756, 99885863]      - |        E002
  ENSG00000000003:E003        X [99887482, 99887537]      - |        E003
                               groupID exonBaseMean exonBaseVar
                           <character>    <numeric>   <numeric>
  ENSG00000000003:E001 ENSG00000000003            1           2
  ENSG00000000003:E002 ENSG00000000003            0           0
  ENSG00000000003:E003 ENSG00000000003            0           0
                           transcripts
                                <list>
  ENSG00000000003:E001 ENST00000373020
  ENSG00000000003:E002 ENST00000373020
  ENSG00000000003:E003 ENST00000373020
  -------
  seqinfo: 265 sequences from an unspecified genome; no seqlengths

なんか数字が小さすぎるような・・・
やっぱりどこかおかしいのか？