kuroの覚え書き

96の個人的覚え書き

featureCount→DEXSeq

featureCountの方もまだ不完全だがDEXSeqも手付かずなのでまずは環境構築

$ dexseq_prepare_annotation2.py -f DEXSeq.gtf expression/Homo_sapiens/NCBI/build37.2/Annotation/Archives/archive-2014-06-02-13-47-29/Genes/genes.gtf genes.DEXSeq.gtf
Could not import HTSeq. Please install the HTSeq Python framework
available from http://www-huber.embl.de/users/anders/HTSeq

早速エラー
HTseqがないと。

$ pip install HTseq
Collecting HTseq
  Downloading HTSeq-0.8.0.tar.gz (561kB)
    100% |========================================| 563kB 2.3MB/s 
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.

インストールできないって言うし。
結局手動でダウンロードしてきてそのディレクトリで

$ python setup.py build
symlinking folders for python2
Could not import 'setuptools', falling back to 'distutils'.
Traceback (most recent call last):
  File "setup.py", line 193, in <module>
    **kwargs
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 111, in setup
    _setup_distribution = dist = klass(attrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 259, in __init__
    getattr(self.metadata, "set_" + key)(val)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 1220, in set_requires
    distutils.versionpredicate.VersionPredicate(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/versionpredicate.py", line 113, in __init__
    raise ValueError("expected parenthesized list: %r" % paren)
ValueError: expected parenthesized list: '>=0.9.0'

今度はなんだ?setuptoolsがインポートできない?
ValueErrorってのも気になるが。setup.pyの中身をざっと見ると

        setup_requires=[
              'Cython',
              'numpy',
              'pysam>=0.9.0',
        ],

こんな記述があって、エラーの中で0.9.0がどうのとなっているということは、この3つのパッケージもいるってことかな。
というわけでpip installで入れておく。

$ pip install Cython numpy pysum setuptools

$ python setup.py build
symlinking folders for python2
running build
running build_py
running preprocess
cythonizing
Cython version 0.25.2
SWIGging
/bin/sh: swig: command not found
SWIG not found, but transpiled files found
moving swigged .py module
done
creating build
creating build/lib.macosx-10.6-intel-2.7
creating build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/__init__.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/_HTSeq_internal.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/StepVector.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/_version.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
creating build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/__init__.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/qa.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/count.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
running build_ext
building 'HTSeq._HTSeq' extension
creating build/temp.macosx-10.6-intel-2.7
creating build/temp.macosx-10.6-intel-2.7/src
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/_HTSeq.c -o build/temp.macosx-10.6-intel-2.7/src/_HTSeq.o -w
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/src/_HTSeq.o -o build/lib.macosx-10.6-intel-2.7/HTSeq/_HTSeq.so
building 'HTSeq._StepVector' extension
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/StepVector_wrap.cxx -o build/temp.macosx-10.6-intel-2.7/src/StepVector_wrap.o -w
/usr/bin/clang++ -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/src/StepVector_wrap.o -o build/lib.macosx-10.6-intel-2.7/HTSeq/_StepVector.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9
running build_scripts
creating build/scripts-2.7
copying and adjusting scripts/htseq-qa -> build/scripts-2.7
copying and adjusting scripts/htseq-count -> build/scripts-2.7
changing mode of build/scripts-2.7/htseq-qa from 644 to 755
changing mode of build/scripts-2.7/htseq-count from 644 to 755

HTseqは入ったな。
では最初に戻って

$ python local/bin/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f DEXSeq.gtf expression/Homo_sapiens/NCBI/build37.2/Annotation/Archives/archive-2014-06-02-13-47-29/Genes/genes.gtf genes.DEXSeq.gtf
/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python: can't open file 'local/bin/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py': [Errno 2] No such file or directory

まだだめらしい。

$ python setup.py install
symlinking folders for python2
running install
running bdist_egg
running egg_info
writing requirements to HTSeq.egg-info/requires.txt
writing HTSeq.egg-info/PKG-INFO
writing top-level names to HTSeq.egg-info/top_level.txt
writing dependency_links to HTSeq.egg-info/dependency_links.txt
file HTSeq/StepVector.py (for module HTSeq.StepVector) not found
reading manifest file 'HTSeq.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'README.md'
warning: no previously-included files found matching 'todo.txt'
no previously-included directories found matching 'example_data'
no previously-included directories found matching 'doc'
no previously-included directories found matching 'test'
writing manifest file 'HTSeq.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-intel/egg
running install_lib
running build_py
running preprocess
cythonizing
Cython version 0.25.2
SWIGging
/bin/sh: swig: command not found
SWIG not found, but transpiled files found
moving swigged .py module
done
creating build
creating build/lib.macosx-10.6-intel-2.7
creating build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/__init__.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/_HTSeq_internal.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/StepVector.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
copying HTSeq/_version.py -> build/lib.macosx-10.6-intel-2.7/HTSeq
creating build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/__init__.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/qa.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
copying HTSeq/scripts/count.py -> build/lib.macosx-10.6-intel-2.7/HTSeq/scripts
running build_ext
building 'HTSeq._HTSeq' extension
creating build/temp.macosx-10.6-intel-2.7
creating build/temp.macosx-10.6-intel-2.7/src
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/include -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/_HTSeq.c -o build/temp.macosx-10.6-intel-2.7/src/_HTSeq.o -w
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/src/_HTSeq.o -o build/lib.macosx-10.6-intel-2.7/HTSeq/_HTSeq.so
building 'HTSeq._StepVector' extension
/usr/bin/clang -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/StepVector_wrap.cxx -o build/temp.macosx-10.6-intel-2.7/src/StepVector_wrap.o -w
/usr/bin/clang++ -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-2.7/src/StepVector_wrap.o -o build/lib.macosx-10.6-intel-2.7/HTSeq/_StepVector.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9
creating build/bdist.macosx-10.6-intel
creating build/bdist.macosx-10.6-intel/egg
creating build/bdist.macosx-10.6-intel/egg/HTSeq
copying build/lib.macosx-10.6-intel-2.7/HTSeq/__init__.py -> build/bdist.macosx-10.6-intel/egg/HTSeq
copying build/lib.macosx-10.6-intel-2.7/HTSeq/_HTSeq.so -> build/bdist.macosx-10.6-intel/egg/HTSeq
copying build/lib.macosx-10.6-intel-2.7/HTSeq/_HTSeq_internal.py -> build/bdist.macosx-10.6-intel/egg/HTSeq
copying build/lib.macosx-10.6-intel-2.7/HTSeq/_StepVector.so -> build/bdist.macosx-10.6-intel/egg/HTSeq
copying build/lib.macosx-10.6-intel-2.7/HTSeq/_version.py -> build/bdist.macosx-10.6-intel/egg/HTSeq
creating build/bdist.macosx-10.6-intel/egg/HTSeq/scripts
copying build/lib.macosx-10.6-intel-2.7/HTSeq/scripts/__init__.py -> build/bdist.macosx-10.6-intel/egg/HTSeq/scripts
copying build/lib.macosx-10.6-intel-2.7/HTSeq/scripts/count.py -> build/bdist.macosx-10.6-intel/egg/HTSeq/scripts
copying build/lib.macosx-10.6-intel-2.7/HTSeq/scripts/qa.py -> build/bdist.macosx-10.6-intel/egg/HTSeq/scripts
copying build/lib.macosx-10.6-intel-2.7/HTSeq/StepVector.py -> build/bdist.macosx-10.6-intel/egg/HTSeq
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/_HTSeq_internal.py to _HTSeq_internal.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/_version.py to _version.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/scripts/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/scripts/count.py to count.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/scripts/qa.py to qa.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/StepVector.py to StepVector.pyc
creating stub loader for HTSeq/_HTSeq.so
creating stub loader for HTSeq/_StepVector.so
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/_HTSeq.py to _HTSeq.pyc
byte-compiling build/bdist.macosx-10.6-intel/egg/HTSeq/_StepVector.py to _StepVector.pyc
creating build/bdist.macosx-10.6-intel/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/scripts-2.7
copying and adjusting scripts/htseq-qa -> build/scripts-2.7
copying and adjusting scripts/htseq-count -> build/scripts-2.7
changing mode of build/scripts-2.7/htseq-qa from 644 to 755
changing mode of build/scripts-2.7/htseq-count from 644 to 755
creating build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts
copying build/scripts-2.7/htseq-count -> build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts
copying build/scripts-2.7/htseq-qa -> build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts/htseq-count to 755
changing mode of build/bdist.macosx-10.6-intel/egg/EGG-INFO/scripts/htseq-qa to 755
copying HTSeq.egg-info/PKG-INFO -> build/bdist.macosx-10.6-intel/egg/EGG-INFO
copying HTSeq.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-intel/egg/EGG-INFO
copying HTSeq.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-intel/egg/EGG-INFO
copying HTSeq.egg-info/requires.txt -> build/bdist.macosx-10.6-intel/egg/EGG-INFO
copying HTSeq.egg-info/top_level.txt -> build/bdist.macosx-10.6-intel/egg/EGG-INFO
writing build/bdist.macosx-10.6-intel/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
HTSeq.StepVector: module references __file__
creating dist
creating 'dist/HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg' and adding 'build/bdist.macosx-10.6-intel/egg' to it
removing 'build/bdist.macosx-10.6-intel/egg' (and everything under it)
Processing HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg
creating /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg
Extracting HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg to /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
Adding HTSeq 0.8.0 to easy-install.pth file
Installing htseq-count script to /Library/Frameworks/Python.framework/Versions/2.7/bin
Installing htseq-qa script to /Library/Frameworks/Python.framework/Versions/2.7/bin

Installed /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/HTSeq-0.8.0-py2.7-macosx-10.6-intel.egg
Processing dependencies for HTSeq==0.8.0
Searching for pysam==0.11.2.2
Best match: pysam 0.11.2.2
Adding pysam 0.11.2.2 to easy-install.pth file

Using /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
Searching for numpy==1.13.1
Best match: numpy 1.13.1
Adding numpy 1.13.1 to easy-install.pth file

Using /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
Finished processing dependencies for HTSeq==0.8.0

これでどや

$ python local/bin/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f DEXSeq.gtf expression/Homo_sapiens/NCBI/build37.2/Annotation/Archives/archive-2014-06-02-13-47-29/Genes/genes.gtf genes.DEXSeq.gtf
Traceback (most recent call last):
  File "local/bin/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py", line 136, in <module>
    assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
AssertionError: <GenomicFeature: exonic_part 'IL3RA' at Y: 1405508 -> 1405819 (strand '+')> starts too early

うーむ

gtfファイルとして今回iGenomeからダウンロードしたNCBIのbuild37.2を使用していたが、このプログラムはどうやらEnsemblのGRCh37を想定してスクリプトが書かれているとのこと。微妙な書式のブレかデータのエラーが引っかかるようです。

仕方がないのでEnsemblのデータをダウンロードし、展開
改めてやり直すと

$ python ~/local/bin/Subread_to_DEXSeq-master/dexseq_prepare_annotation2.py -f ./genesDEXSeq.gtf ~/expression/Homo_sapiens/Ensembl/GRCh37/Annotation/Archives/archive-2015-07-17-14-31-42/Genes/genes.gtf ./genesDEXSeq.gff
Done!

無事完了。
アプリごとにアノテーションファイルを独自で持ってないといけないとはなんとも非効率的な。

ではfeatureCountをおもむろにやり直し。

$ ~/local/bin/subread-1.5.2-MaxOSX-x86_64/bin/featureCounts -T 4 -p -s 2 -f -O -a genesDEXSeq.gtf -t exonic_part -o counts_PE.txt SRR1234567_th_PE.bam SRR7654321_th_PE.bam -F GTF

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
	  v1.5.2

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 2 BAM files                                      ||
||                           P SRR1234567_th_PE.bam                           ||
||                           P SRR7654321_th_PE.bam                           ||
||                                                                            ||
||             Output file : counts_PE.txt                                    ||
||                 Summary : counts_PE.txt.summary                            ||
||              Annotation : genesDEXSeq.gtf (GTF)                            ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 4                                                ||
||                   Level : feature level                                    ||
||              Paired-end : yes                                              ||
||         Strand specific : reversely stranded                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : counted                                          ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file genesDEXSeq.gtf ...                                   ||
||    Features : 644354                                                       ||
||    Meta-features : 60153                                                   ||
||    Chromosomes/contigs : 265                                               ||
||                                                                            ||
|| Process BAM file SRR1234567_th_PE.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Both single-end and paired-end reads were found.                        ||
||    Total fragments : 15002128                                              ||
||    Successfully assigned fragments : 10670180 (71.1%)                      ||
||    Running time : 1.11 minutes                                             ||
||                                                                            ||
|| Process BAM file SRR7654321_th_PE.bam...                                   ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 16073300                                              ||
||    Successfully assigned fragments : 9769107 (60.8%)                       ||
||    Running time : 2.58 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
|| Summary of counting results can be found in file "counts_PE.txt"           ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

これでいいのだろうか?

ここからはR

> source("https://bioconductor.org/biocLite.R")
Bioconductor version 3.4 (BiocInstaller 1.24.0), ?biocLite for help
A new version of Bioconductor is available after installing the most
  recent version of R; see http://bioconductor.org/install

> biocLite("Rsubread")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.4 (BiocInstaller 1.24.0), R 3.3.3 (2017-03-06).
Installing package(s) ‘Rsubread’
 URL 'https://bioconductor.org/packages/3.4/bioc/bin/macosx/mavericks/contrib/3.3/Rsubread_1.24.2.tgz' を試しています 
Content type 'application/x-gzip' length 9320101 bytes (8.9 MB)
==================================================
downloaded 8.9 MB


The downloaded binary packages are in
	/var/folders/sf/3lk5x37135586w0jgqyjq_yc0000gn/T//RtmpyMsqr7/downloaded_packages

> source("~/local/bin/Subread_to_DEXSeq-master/load_SubreadOutput.R")
> suppressPackageStartupMessages({
+     require(dplyr)
+ })
 警告メッセージ: 
 library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  で: 
   ‘dplyr’ という名前のパッケージはありません 

でた。

> install.packages("dplyr")
also installing the dependencies ‘bindr’, ‘bindrcpp’, ‘glue’

 URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/bindr_0.1.tgz' を試しています 
Content type 'application/x-gzip' length 12121 bytes (11 KB)
==================================================
downloaded 11 KB

 URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/bindrcpp_0.2.tgz' を試しています 
Content type 'application/x-gzip' length 380129 bytes (371 KB)
==================================================
downloaded 371 KB

 URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/glue_1.1.1.tgz' を試しています 
Content type 'application/x-gzip' length 30626 bytes (29 KB)
==================================================
downloaded 29 KB

 URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.7.1.tgz' を試しています 
Content type 'application/x-gzip' length 5932799 bytes (5.7 MB)
==================================================
downloaded 5.7 MB


The downloaded binary packages are in
	/var/folders/sf/3lk5x37135586w0jgqyjq_yc0000gn/T//RtmpyMsqr7/downloaded_packages
> setwd("~/expression/BAMs/")

> samp <- data.frame(row.names = c("Patient_7","control_10"), 
+                    condition = rep(c("patient","control"),each=1))

> dxd.fc <- DEXSeqDataSetFromFeatureCounts("counts_PE.txt",
+                                          flattenedfile = "./genesDEXSeq.gtf",sampleData = samp)
Reading and adding Exon IDs for DEXSeq
 DEXSeqDataSetFromFeatureCounts("counts_PE.txt", flattenedfile = "./genesDEXSeq.gtf",  でエラー: 
  Count files do not correspond to the flattened annotation file
 追加情報:  警告メッセージ: 
0 aggregate geneIDs were found truncated in featureCounts output 

またまたエラー