PythonでGO解析 - kuroの覚え書き

これまでGO解析はDAVIDを使ってきた。特に理由はないが、まあ老舗だし。しかしインターフェースがあんまり使いやすいとは言い難く、植物のデータベースもシロイヌナズナにしか対応していない。ということでAgriGOとか他のサービスを利用してもいいのだけれど、せっかくなのでローカル（もしくは自前サーバ）で使えるツールを構築してしまおう、ということにした。

PythonでGO解析をするツールとしてはGOATOOLSくらいしか見つからなかったし、結構使えそうなので、まずはこちらを試してみる。

github.com

インストールはpython moduleなので

$ python3 -m pip install goatools

とするだけ。
ただし、これだけではだめで、GO解析に必要なGOデータを記述したOBOファイルというものがいる

wget http://geneontology.org/ontology/go-basic.obo
wget http://www.geneontology.org/ontology/subsets/goslim_generic.obo

go-basic.oboがあればとりあえず良くて、goslim_generic.oboを使うと選択されるGOが抑えられるらしい。詳細は未調査。
このOBOファイルはデフォルトではスクリプトを実行するカレントディレクトリに置いておく必要があるが、オプションでファイルを指定することも可能そうだ。

スクリプトもインストールされるので

$ find_enrichment.py
usage: 
python find_enrichment.py study.file population.file gene-association.file

This program returns P-values for functional enrichment in a cluster of study
genes using Fisher's exact test, and corrected for multiple testing (including
Bonferroni, Holm, Sidak, and false discovery rate).

About significance cutoff:
--alpha: test-wise alpha; for each GO term, what significance level to apply
        (most often you don't need to change this other than 0.05 or 0.01)
--pval: experiment-wise alpha; for the entire experiment, what significance
        level to apply after Bonferroni correction

       [-h] [--annofmt {gene2go,gaf,gpad,id2gos}] [--taxid TAXID] [--alpha ALPHA] [--pval PVAL] [--pval_field PVAL_FIELD]
       [--outfile OUTFILE] [--ns NS] [--id2sym ID2SYM] [--sections SECTIONS] [--outfile_detail OUTFILE_DETAIL] [--compare]
       [--ratio RATIO] [--prt_study_gos_only] [--indent] [--obo OBO] [--no_propagate_counts] [-r]
       [--relationships [RELATIONSHIPS ...]] [--method METHOD] [--pvalcalc {fisher_scipy_stats}] [--min_overlap MIN_OVERLAP]
       [--goslim GOSLIM] [--ev_inc EV_INC] [--ev_exc EV_EXC] [--ev_help] [--ev_help_short]
       filenames filenames filenames

positional arguments:
  filenames             data/study data/population data/association

optional arguments:
  -h, --help            show this help message and exit
  --annofmt {gene2go,gaf,gpad,id2gos}
                        Annotation file format. Not needed if type can be determined using filename (default: None)
  --taxid TAXID         When using NCBI's gene2go annotation file, specify desired taxid (default: 9606)
  --alpha ALPHA         Test-wise alpha for multiple testing (default: 0.05)
  --pval PVAL           Only print results with uncorrected p-value < PVAL. Print all results, significant and otherwise, by setting
                        --pval=1.0 (default: 0.05)
  --pval_field PVAL_FIELD
                        Only print results when PVAL_FIELD < PVAL. (default: None)
  --outfile OUTFILE     Write enrichment results into xlsx or tsv file (default: None)
  --ns NS               Limit GOEA to specified branch categories. BP=Biological Process; MF=Molecular Function; CC=Cellular
                        Component (default: BP,MF,CC)
  --id2sym ID2SYM       ASCII file containing one geneid and its symbol per line (default: None)
  --sections SECTIONS   Use sections file for printing grouped GOEA results. Example SECTIONS values:
                        goatools.test_data.sections.gjoneska_pfenning goatools/test_data/sections/gjoneska_pfenning.py
                        data/gjoneska_pfenning/sections_in.txt (default: None)
  --outfile_detail OUTFILE_DETAIL
                        Write enrichment results into a text file containing the following information: 1) GOEA GO terms, grouped
                        into sections 2) List of genes and ASCII art showing section membership 3) Detailed list of each gene and GO
                        terms w/their P-values (default: None)
  --compare             the population file as a comparison group. if this flag is specified, the population is used as the study
                        plus the `population/comparison` (default: False)
  --ratio RATIO         only show values where the difference between study and population ratios is greater than this. useful for
                        excluding GO categories with small differences, but containing large numbers of genes. should be a value
                        between 1 and 2. (default: None)
  --prt_study_gos_only  Print GO terms only if they are associated with study genes. This is useful if printng all GO results
                        regardless of their significance (--pval=1.0). (default: False)
  --indent              indent GO terms (default: False)
  --obo OBO             Specifies location and name of the obo file (default: go-basic.obo)
  --no_propagate_counts
                        Do not propagate counts to parent terms (default: False)
  -r, --relationship    Propagate counts up all relationships, (default: False)
  --relationships [RELATIONSHIPS ...]
                        Propagate counts up user-specified relationships, which include: part_of regulates negatively_regulates
                        positively_regulates (default: None)
  --method METHOD       Available methods: local( bonferroni sidak holm fdr ) statsmodels( sm_bonferroni sm_sidak holm_sidak sm_holm
                        simes_hochberg hommel fdr_bh fdr_by fdr_tsbh fdr_tsbky fdr_gbs ) (default: bonferroni,sidak,holm,fdr_bh)
  --pvalcalc {fisher_scipy_stats}
                        fisher_scipy_stats (default: fisher_scipy_stats)
  --min_overlap MIN_OVERLAP
                        Check that a minimum amount of study genes are in the population (default: 0.7)
  --goslim GOSLIM       The GO slim file is used when grouping GO terms. (default: goslim_generic.obo)
  --ev_inc EV_INC       Include specified evidence codes and groups separated by commas (default: None)
  --ev_exc EV_EXC       Exclude specified evidence codes and groups separated by commas (default: None)
  --ev_help             Print all Evidence codes, with descriptions (default: True)
  --ev_help_short       Print all Evidence codes (default: True)

このような感じ。
ちなみにスクリプトは

/home/linuxbrew/.linuxbrew/bin/find_enrichment.py

ここに入っている。
オプションが色々あるが基本的には

$ find_enrichment.py ./data/at_gene_list.txt ./data/at_all_gene_list.txt ./data/at_go_association.txt  --outfile=at_go_enrichment.tsv

がデフォルトセッティングで投げるフォーマットになる。
カレントディレクトリから一つ下がったdataディレクトリに入力する遺伝子のリストat_all_gene_list.txtと全遺伝子が一列に入っているat_all_gene_list.txtと遺伝子→GOの対応付けされたファイルat_go_association.txtを入れてある。
基本的に入力する遺伝子は全遺伝子リストに載っていないとだめで、全遺伝子リストに載っている遺伝子全部についてGOが対応付けされている必要があるらしい。
シロイヌナズナならまあ問題はないが情報の充実していない種ではすべての遺伝子にGOが割り振られているとは限らないので、まずはGOのついている遺伝子だけのリストを作成し、そのリストに載っている遺伝子だけを抽出して解析にかける必要がある。

import pandas as pd
df1 = pd.read_csv('/home/kuro/goatools/data/at_gene_list.txt', header=None)
df2 = pd.read_csv('/home/kuro/goatools/data/at_all_gene_list.txt', header=None)
df = pd.concat([df1,df2],axis=0)
df3 = df[df.duplicated()]
df3.to_csv('/home/kuro/goatools/data/at_gene_list1.txt', header=None, index=False)

このようなかんたんなPythonによる処理でGOのついている遺伝子だけを取り出してやる。

あとはhelpを見ながらオプションを色々いじってみる。p-valueを設定してリストするにしてもいろいろな統計方法のp-valueが選べるし、リストに表示する統計結果も増やしたり減らしたりできる。
デフォルトの統計方法はbonferroni,sidak,holm,fdr_bhの４つとなっており、まあデフォルトでほぼ問題はなさそうではある。
デフォルトではp_uncorrected<0.05となっているようで、ここをfdr_bhのp value <0.05にしてみたり、もっと数値を上げ下げしてみるといい場合があるだろう。

$ find_enrichment.py ./data/at_gene_list.txt ./data/at_all_gene_list.txt ./data/at_go_association.txt--pval=0.01 --method bonferroni,fdr_bh --pval_field fdr_bh --outfile=at_go_enrichment.tsv, at_go_enrichment.xlsx

みたいな感じだ。
ちなみに結果はtsv、xlsxをえらべ、両方を出力することも可能。結構柔軟だ。