kuro's notes

96's personal notes

AlphaFold2 follow-up 2.1

Last time it looked like the TensorFlow versions were somehow conflicting, and I suspected that installing the environment by hand was the mistake, so I went back to Docker for the moment. I don't think Docker itself is the source of the error, and inside the container the versions shouldn't be able to mismatch... or so I reasoned:

・・・・
I0827 15:11:45.201521 139638022334272 run_docker.py:193] I0827 06:11:45.200996 139677094938432 pipeline.py:207] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0827 15:11:45.240777 139638022334272 run_docker.py:193] I0827 06:11:45.240107 139677094938432 run_alphafold.py:142] Running model model_1
I0827 15:11:51.314033 139638022334272 run_docker.py:193] I0827 06:11:51.313426 139677094938432 model.py:132] Running predict with shape(feat) = {'aatype': (4, 177), 'residue_index': (4, 177), 'seq_length': (4,), 'template_aatype': (4, 4, 177), 'template_all_atom_masks': (4, 4, 177, 37), 'template_all_atom_positions': (4, 4, 177, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 177), 'msa_mask': (4, 508, 177), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 177, 3), 'template_pseudo_beta_mask': (4, 4, 177), 'atom14_atom_exists': (4, 177, 14), 'residx_atom14_to_atom37': (4, 177, 14), 'residx_atom37_to_atom14': (4, 177, 37), 'atom37_atom_exists': (4, 177, 37), 'extra_msa': (4, 5120, 177), 'extra_msa_mask': (4, 5120, 177), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 177), 'true_msa': (4, 508, 177), 'extra_has_deletion': (4, 5120, 177), 'extra_deletion_value': (4, 5120, 177), 'msa_feat': (4, 508, 177, 49), 'target_feat': (4, 177, 22)}
I0827 15:11:52.923670 139638022334272 run_docker.py:193] 2021-08-27 06:11:52.922207: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:235] Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 8.6
I0827 15:11:52.923956 139638022334272 run_docker.py:193] 2021-08-27 06:11:52.922279: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:238] Used ptxas at ptxas
I0827 15:11:52.979863 139638022334272 run_docker.py:193] 2021-08-27 06:11:52.979194: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:625] failed to get PTX kernel "shift_right_logical_3" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
I0827 15:11:52.990681 139638022334272 run_docker.py:193] 2021-08-27 06:11:52.990152: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Internal: Could not find the corresponding function
I0827 15:11:53.107557 139638022334272 run_docker.py:193] Traceback (most recent call last):
I0827 15:11:53.107832 139638022334272 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 310, in <module>
I0827 15:11:53.108034 139638022334272 run_docker.py:193] app.run(main)
I0827 15:11:53.108211 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0827 15:11:53.108422 139638022334272 run_docker.py:193] _run_main(main, args)
I0827 15:11:53.108574 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0827 15:11:53.108716 139638022334272 run_docker.py:193] sys.exit(main(argv))
I0827 15:11:53.108857 139638022334272 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 292, in main
I0827 15:11:53.109008 139638022334272 run_docker.py:193] random_seed=random_seed)
I0827 15:11:53.109150 139638022334272 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 149, in predict_structure
I0827 15:11:53.109290 139638022334272 run_docker.py:193] prediction_result = model_runner.predict(processed_feature_dict)
I0827 15:11:53.109431 139638022334272 run_docker.py:193] File "/app/alphafold/alphafold/model/model.py", line 133, in predict
I0827 15:11:53.109569 139638022334272 run_docker.py:193] result = self.apply(self.params, jax.random.PRNGKey(0), feat)
I0827 15:11:53.109706 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/_src/random.py", line 75, in PRNGKey
I0827 15:11:53.109843 139638022334272 run_docker.py:193] k1 = convert(lax.shift_right_logical(seed_arr, lax._const(seed_arr, 32)))
I0827 15:11:53.109997 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 386, in shift_right_logical
I0827 15:11:53.110138 139638022334272 run_docker.py:193] return shift_right_logical_p.bind(x, y)
I0827 15:11:53.110274 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 265, in bind
I0827 15:11:53.110411 139638022334272 run_docker.py:193] out = top_trace.process_primitive(self, tracers, params)
I0827 15:11:53.110548 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 610, in process_primitive
I0827 15:11:53.110686 139638022334272 run_docker.py:193] return primitive.impl(*tracers, **params)
I0827 15:11:53.110825 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 274, in apply_primitive
I0827 15:11:53.110984 139638022334272 run_docker.py:193] return compiled_fun(*args)
I0827 15:11:53.111125 139638022334272 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 390, in _execute_compiled_primitive
I0827 15:11:53.111263 139638022334272 run_docker.py:193] out_bufs = compiled.execute(input_bufs)
I0827 15:11:53.111400 139638022334272 run_docker.py:193] RuntimeError: Internal: Could not find the corresponding function


No good.
That said, the message seems different from last time, so I'll just have to break it down in detail.
Incidentally, this time I submitted a protein with fewer amino acids, but it still took about six hours, almost unchanged.
So perhaps the bottleneck isn't the protein length but the time spent reading through the reference databases?
That must be why SSDs are recommended (the full database set reportedly unpacks to around 2.2 TB, hence the 3 TB figure). How much does a 3 TB SSD even cost? Even striping a pair of 1.5 TB drives.

If I'm going that far, rather than this makeshift server it might be quicker to build a dedicated machine:
motherboard (probably better to pick a mature, well-debugged platform)
CPU (defect to AMD this time?)
memory (around 128 GB?)
SSD (3 TB)
case (reusing an old miniATX case would do)
With those it would be perfect.

The other option is to rework the script so a run can skip everything up to the point where the error occurred.
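One idea (a hypothetical sketch from memory of the AlphaFold source, not something I have run): predict_structure() in run_alphafold.py already pickles the processed features to features.pkl in the output directory, so the feature step could be patched to reuse that file when it exists and jump straight to model inference:

# Hypothetical patch inside predict_structure() in run_alphafold.py.
# Relies on the script's existing habit of dumping features.pkl into the
# output directory; os and pickle are already imported at the top of the file.
features_output_path = os.path.join(output_dir, 'features.pkl')
if os.path.exists(features_output_path):
    # A previous (crashed) run already finished the MSA/template stage: reuse it.
    with open(features_output_path, 'rb') as f:
        feature_dict = pickle.load(f)
else:
    feature_dict = data_pipeline.process(
        input_fasta_path=fasta_path,
        msa_output_dir=msa_output_dir)
    with open(features_output_path, 'wb') as f:
        pickle.dump(feature_dict, f, protocol=4)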

AlphaFold2 follow-up 2

I figured out one of the problems.

WARNING: Ignoring invalid symbol '*' at pos. 492 in line 2 of /tmp/tmp2hwa2yuo/query.a3m

This is it.
I had been making the FASTA file myself from amino-acid data I had on hand,
and out of habit I marked the stop-codon position with an asterisk.
That apparently was not appreciated: after preparing a FASTA file with the asterisk removed, it got past the point where it errored out last time.

Since it was a WARNING rather than an ERROR, I had simply ignored it.
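For the record, stripping the asterisks needs nothing fancy; a few lines of Python (my own throwaway script, so treat it as illustrative) will do:

# strip_stops.py: drop stop-codon '*' symbols from a FASTA file
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], 'w') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)                   # header lines pass through
        else:
            fout.write(line.replace('*', ''))  # remove '*' from sequence lines

Run it as: python3 strip_stops.py test5.fasta test5_clean.fasta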

So, with renewed spirits,
I'm also setting aside the Docker environment for now and trying a direct install. I got hold of a GPU, so I'll use that too.

www.af2anatomia.jp

/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
I0826 20:34:58.412197 140720169187136 templates.py:836] Using precomputed obsolete pdbs /mnt/HDD2/af_database//pdb_mmcif/obsolete.dat.
I0826 20:34:59.711968 140720169187136 xla_bridge.py:236] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: 
I0826 20:34:59.772896 140720169187136 xla_bridge.py:236] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0826 20:35:00.382656 140720169187136 run_alphafold.py:267] Have 1 models: ['model_1']
I0826 20:35:00.382773 140720169187136 run_alphafold.py:280] Using random seed 8088228927505236060 for the data pipeline
I0826 20:35:00.383199 140720169187136 jackhmmer.py:130] Launching subprocess "/home/kuro/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpwx4x1v4e/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/kuro/test5.fasta /mnt/HDD2/af_database//uniref90/uniref90.fasta"
I0826 20:35:00.406429 140720169187136 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0826 20:40:26.839800 140720169187136 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 326.433 seconds
I0826 20:40:26.901696 140720169187136 jackhmmer.py:130] Launching subprocess "/home/kuro/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmp86y_pxlt/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/kuro/test5.fasta /mnt/HDD2/af_database//mgnify/mgy_clusters.fa"
I0826 20:40:26.917324 140720169187136 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query
I0826 20:45:57.057168 140720169187136 utils.py:40] Finished Jackhmmer (mgy_clusters.fa) query in 330.140 seconds
I0826 20:45:58.580264 140720169187136 hhsearch.py:76] Launching subprocess "/home/kuro/miniconda3/envs/alphafold/bin/hhsearch -i /tmp/tmpkl7ayr4a/query.a3m -o /tmp/tmpkl7ayr4a/output.hhr -maxseq 1000000 -d /mnt/HDD2/af_database//pdb70/pdb70"
I0826 20:45:58.635203 140720169187136 utils.py:36] Started HHsearch query
I0826 20:50:30.194164 140720169187136 utils.py:40] Finished HHsearch query in 271.558 seconds
I0826 20:50:33.453487 140720169187136 hhblits.py:128] Launching subprocess "/home/kuro/miniconda3/envs/alphafold/bin/hhblits -i /home/kuro/test5.fasta -cpu 4 -oa3m /tmp/tmpkobbonvm/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /mnt/HDD2/af_database//bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /mnt/HDD2/af_database//uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0826 20:50:33.492181 140720169187136 utils.py:36] Started HHblits query
I0827 02:29:46.357933 140720169187136 utils.py:40] Finished HHblits query in 20352.865 seconds

Smooth sailing up to this point.
Although it's taken about six hours to get here. These stages are said to be CPU-bound, so maybe I should raise the CPU spec a little. (Worth noting: the hhblits call in the log above only got -cpu 4, which appears to be the default n_cpu in alphafold/data/tools/hhblits.py, so raising that might be the cheaper win.)

And after this the template search begins:

I0827 02:29:46.503336 140720169187136 templates.py:848] Searching for template for: None
W0827 02:29:46.504062 140720169187136 templates.py:131] Template structure not in release dates dict: 1g87
I0827 02:29:46.504361 140720169187136 templates.py:715] Reading PDB entry from /mnt/HDD2/af_database//pdb_mmcif/mmcif_files/1g87.cif. Query: MAFR・・・・
I0827 02:29:56.245360 140720169187136 templates.py:270] Found an exact template match 3rx5_A.
I0827 02:29:57.180096 140720169187136 pipeline.py:200] Uniref90 MSA size: 8133 sequences.
I0827 02:29:57.180236 140720169187136 pipeline.py:201] BFD MSA size: 2683 sequences.
I0827 02:29:57.180323 140720169187136 pipeline.py:202] MGnify MSA size: 501 sequences.
I0827 02:29:57.180396 140720169187136 pipeline.py:203] Final (deduplicated) MSA size: 11174 sequences.
I0827 02:29:57.180535 140720169187136 pipeline.py:205] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0827 02:29:58.202671 140720169187136 run_alphafold.py:142] Running model model_1
I0827 02:30:05.643637 140720169187136 model.py:131] Running predict with shape(feat) = {'aatype': (4, 492), 'residue_index': (4, 492), 'seq_length': (4,), 'template_aatype': (4, 4, 492), 'template_all_atom_masks': (4, 4, 492, 37), 'template_all_atom_positions': (4, 4, 492, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 492), 'msa_mask': (4, 508, 492), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 492, 3), 'template_pseudo_beta_mask': (4, 4, 492), 'atom14_atom_exists': (4, 492, 14), 'residx_atom14_to_atom37': (4, 492, 14), 'residx_atom37_to_atom14': (4, 492, 37), 'atom37_atom_exists': (4, 492, 37), 'extra_msa': (4, 5120, 492), 'extra_msa_mask': (4, 5120, 492), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 492), 'true_msa': (4, 508, 492), 'extra_has_deletion': (4, 5120, 492), 'extra_deletion_value': (4, 5120, 492), 'msa_feat': (4, 508, 492, 49), 'target_feat': (4, 492, 22)}
2021-08-27 02:30:06.870769: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:81] Couldn't get ptxas version string: Internal: Running ptxas --version returned 32512
2021-08-27 02:30:06.893457: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:479] ptxas returned an error during compilation of ptx to sass: 'Internal: ptxas exited with non-zero error code 32512, output: '  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Fatal Python error: Aborted

Thread 0x00007ffbf7b27740 (most recent call first):
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 385 in backend_compile
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 322 in xla_primitive_callable
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/util.py", line 179 in cached
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/util.py", line 186 in wrapper
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 273 in apply_primitive
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 610 in process_primitive
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 265 in bind
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/lax/lax.py", line 386 in shift_right_logical
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/random.py", line 75 in PRNGKey
  File "/home/kuro/alphafold/alphafold/model/model.py", line 133 in predict
  File "/home/kuro/alphafold/run_alphafold.py", line 149 in predict_structure
  File "/home/kuro/alphafold/run_alphafold.py", line 284 in main
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258 in _run_main
  File "/home/kuro/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312 in run
  File "/home/kuro/alphafold/run_alphafold.py", line 310 in <module>

Argh, it errored out and stopped at the very last step again. At first glance it smells like a TensorFlow version mismatch, but exit code 32512 is 127 << 8, i.e. "command not found", so more likely ptxas simply isn't on PATH; a missing or mis-pathed CUDA toolkit rather than TensorFlow itself.
So close.
Still, six hours of waiting just to hit an error is brutal.
I should test with a smaller molecule.
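If the "command not found" reading is right, the thing to try next time (untested as of this post) is simply making sure a full CUDA toolkit's bin directory, which ships ptxas, is on PATH before launching the run:

$ export PATH=/usr/local/cuda/bin:$PATH   # adjust to wherever the toolkit lives
$ which ptxas                             # should print a path rather than nothing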

Turning the Fujitsu Primergy TX1310M3 into a deep-learning machine

The plan was to put together a deep-learning machine from hardware I already own, and the most idle box, the TX1310M3, drew the short straw.
It does have a PCI Express 3.0 x16 slot, the CPU has been upgraded to a Xeon E3-1230v6, there's 64 GB of memory, and I can even run an 8TB+1TB RAID.
So I went out and got a GPU.

However, the TX1310M3 is a little unusual on the power side: it's a desktop, yet it doesn't use an ATX power supply.
Power for the HDDs and the like comes off cables sprouting from the motherboard, and the supply is rated at only 250 W.
That caps things at roughly a GTX 1050 with no auxiliary power connector, and that kind of GPU is hardly worth buying these days.

So I went for it: an RTX 3060.

f:id:k-kuro:20210825204023j:plain

Rather than the internal supply, power comes from a separate ATX power supply I added.

f:id:k-kuro:20210825204041j:plain

The lid won't close, so it runs half-open lol

Honestly I'd have liked to go for an RTX 3090 or so, but no model physically fits in this chassis.
Even for the 3060 I had to hunt for a fairly compact dual-fan model. Around 200 mm just about works.

Well, once the PSU is dangling outside anyway, I did toy with the idea of pulling the graphics card out of the case too if it didn't fit,
but if I go that far I might as well build an open-frame rig like a mining machine, and there's no end down that road.

By the way, the power supply is

this one. I meant to buy a modular unit where the unused cables can be left out, but I picked wrong. It's a rat's nest of cables, not elegant at all.

Since it's an ATX supply, I cope by putting a small jumper across the connector pins and flipping its switch on before powering up the server.
I did also order a power-sync board, but it hasn't arrived yet.
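For reference, the jumper is the standard ATX trick (double-check the pinout of your own unit before shorting anything):

24-pin ATX connector: bridge PS_ON# (pin 16, the green wire) to any COM/ground
(a black wire), and the supply will start without a motherboard attached.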

Testing whether an x8 RAID card becomes useless in an x1 slot

I keep a used Haswell-Xeon server around for experiments, but its PCIe slots are a fairly weak x1/x1/x16/x4 lineup.
An N8103-150 RAID card sits in the x4 slot, with three 2.5-inch 450 GB SAS HDDs attached to it in RAID 5.
But with that card there, a dual-slot graphics card can't go into the precious PCIe 3.0 x16 slot.
Dropping the RAID card and running SATA disks off the onboard ports doesn't work either: a mini-SAS connector for SATA sticks up vertically in line with the x4 slot, so a larger graphics card collides with it and can't be installed.
Hence this test: does the RAID card become unusable in an x1 slot?

The OS is CentOS 7, the CPU a Xeon 1320v3, and the machine has 8 GB of memory.

First, boot with the card in its usual x4 slot.
A quick-and-dirty measurement: reads with hdparm, writes with dd.

[kuro@E5800-T110f-E ~]$ sudo hdparm -t /dev/sda1
[sudo] kuro のパスワード:

/dev/sda1:
 Timing buffered disk reads: 598 MB in  3.01 seconds = 198.72 MB/sec

 Timing buffered disk reads: 618 MB in  3.00 seconds = 205.97 MB/sec

 Timing buffered disk reads: 622 MB in  3.01 seconds = 206.76 MB/sec

 Timing buffered disk reads: 612 MB in  3.00 seconds = 203.91 MB/sec

 Timing buffered disk reads: 554 MB in  3.01 seconds = 184.29 MB/sec

[kuro@E5800-T110f-E ~]$ sudo time dd if=/dev/zero of=/tmp/hdparm_write.tmp ibs=1M obs=1M count=1024
1024+0 レコード入力
1024+0 レコード出力
1073741824 バイト (1.1 GB) コピーされました、 0.363617 秒、 3.0 GB/秒
0.07user 0.29system 0:00.39elapsed 91%CPU (0avgtext+0avgdata 2972maxresident)k
152inputs+2097152outputs (1major+789minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、  0.360279 秒、 3.0 GB/秒
0.07user 0.42system 0:00.49elapsed 99%CPU (0avgtext+0avgdata 2968maxresident)k
0inputs+2097152outputs (0major+789minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、  0.360702 秒、 3.0 GB/秒
0.07user 0.43system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+791minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、  0.364864 秒、 2.9 GB/秒
0.06user 0.41system 0:00.48elapsed 99%CPU (0avgtext+0avgdata 2968maxresident)k
0inputs+2097152outputs (0major+789minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、  1.07666秒、 997 MB/秒
0.05user 0.44system 0:01.36elapsed 36%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+791minor)pagefaults 0swaps

1073741824  バイト (1.1 GB) コピーされました、  0.365009 秒、 2.9 GB/秒
0.07user 0.41system 0:00.48elapsed 99%CPU (0avgtext+0avgdata 2976maxresident)k
0inputs+2097152outputs (0major+792minor)pagefaults 0swaps

That's the baseline.
Next, moving the card to the x1 slot and booting:

[kuro@E5800-T110f-E ~]$ sudo hdparm -t /dev/sda1
[sudo] kuro のパスワード:

/dev/sda1:
 Timing buffered disk reads: 620 MB in  3.00 seconds = 206.60 MB/sec

 Timing buffered disk reads: 622 MB in  3.00 seconds = 207.02 MB/sec

 Timing buffered disk reads: 620 MB in  3.00 seconds = 206.58 MB/sec

 Timing buffered disk reads: 634 MB in  3.00 seconds = 211.00 MB/sec

 Timing buffered disk reads: 620 MB in  3.00 seconds = 206.53 MB/sec

[kkuro@E5800-T110f-E ~]$ sudo hdparm -t /dev/sda1

/dev/sda1:
 Timing buffered disk reads: 632 MB in  3.01 seconds = 210.27 MB/sec

[kuro@E5800-T110f-E ~]$ sudo time dd if=/dev/zero of=/tmp/hdparm_write.tmp ibs=1M obs=1M count=1024
1024+0 レコード入力
1024+0 レコード出力
1073741824 バイト (1.1 GB) コピーされました、 0.458587 秒、 2.3 GB/秒
0.08user 0.38system 0:00.52elapsed 87%CPU (0avgtext+0avgdata 2976maxresident)k
152inputs+2097152outputs (1major+791minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、 0.352292 秒、 3.0 GB/秒
0.06user 0.44system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+790minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、 0.352826 秒、 3.0 GB/秒
0.05user 0.43system 0:00.49elapsed 99%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+791minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、 0.353325 秒、 3.0 GB/秒
0.06user 0.42system 0:00.48elapsed 100%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+791minor)pagefaults 0swaps

1073741824 バイト (1.1 GB) コピーされました、 0.353314 秒、 3.0 GB/秒
0.06user 0.42system 0:00.49elapsed 99%CPU (0avgtext+0avgdata 2972maxresident)k
0inputs+2097152outputs (0major+791minor)pagefaults 0swaps

As you can see, for reading and writing a single 1 GB file there's no difference at all. I wonder whether random access or writing a large number of files would show a gap. (One caveat I should note: the dd runs above use no oflag=direct or fdatasync, so the 3 GB/s figures are mostly the page cache, not the card; the ~200 MB/s hdparm reads are closer to what the array really does.)
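If I redo this, a write test that bypasses the cache would be more honest; one way (not what was run above):

$ dd if=/dev/zero of=/tmp/hdparm_write.tmp bs=1M count=1024 oflag=direct

oflag=direct makes dd hit the device instead of the page cache.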

The card's own spec reads:
PCI-Express Specification Revision 2.0
x8 PCI Express bus operating at 5 Gb/s or 2.5 Gb/s serial transfer rate
so does that mean even an x1 slot can deliver that much? As a sanity check: at PCIe 2.0 rates, 5 GT/s with 8b/10b encoding works out to roughly 500 MB/s per lane, so x1 ≈ 500 MB/s, x4 ≈ 2 GB/s, x8 ≈ 4 GB/s; against spinning disks reading at ~200 MB/s, even a single lane has headroom.
Then again, at x4 it was already running below the card's native spec.

It's probably worth comparing with a proper benchmark tool.

Let's try something called fio.
fioを使ってストレージの性能を計測してみた - Qiita
I used this as a reference.
The software itself is already installed.

[kuro@E5800-T110f-E ~]$ fio -filename=/tmp/test2g -direct=1 -rw=write -bs=4k -size=2G -numjobs=64 -runtime=10 -group_reporting -name=file1
file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
file1: Laying out IO file (1 file / 2048MiB)
Jobs: 64 (f=64): [W(64)][30.0%][r=0KiB/s,w=369MiB/s][r=0,w=94.6k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][40.0%][r=0KiB/s,w=369MiB/s][r=0,w=94.4k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][50.0%][r=0KiB/s,w=370MiB/s][r=0,w=94.6k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][60.0%][r=0KiB/s,w=369MiB/s][r=0,w=94.6k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][70.0%][r=0KiB/s,w=370MiB/s][r=0,w=94.6k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][80.0%][r=0KiB/s,w=369MiB/s][r=0,w=94.6k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][90.0%][r=0KiB/s,w=366MiB/s][r=0,w=93.7k IOPS][eta 00m:0Jobs: 64 (f=64): [W(64)][100.0%][r=0KiB/s,w=370MiB/s][r=0,w=94.7k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=64): err= 0: pid=2151: Mon Aug 16 19:59:39 2021
  write: IOPS=94.3k, BW=368MiB/s (386MB/s)(3684MiB/10001msec)
    clat (usec): min=61, max=11002, avg=676.60, stdev=93.70
     lat (usec): min=61, max=11002, avg=676.86, stdev=93.70
    clat percentiles (usec):
     |  1.00th=[  660],  5.00th=[  668], 10.00th=[  668], 20.00th=[  668],
     | 30.00th=[  676], 40.00th=[  676], 50.00th=[  676], 60.00th=[  676],
     | 70.00th=[  676], 80.00th=[  676], 90.00th=[  685], 95.00th=[  685],
     | 99.00th=[  693], 99.50th=[  693], 99.90th=[ 1221], 99.95th=[ 1467],
     | 99.99th=[ 2606]
   bw (  KiB/s): min= 5632, max= 5960, per=1.56%, avg=5891.27, stdev=55.58, samples=1230
   iops        : min= 1408, max= 1490, avg=1472.82, stdev=13.89, samples=1230
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.05%, 750=99.62%, 1000=0.12%
  lat (msec)   : 2=0.18%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=0.49%, sys=2.53%, ctx=978109, majf=0, minf=2200
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,942994,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=368MiB/s (386MB/s), 368MiB/s-368MiB/s (386MB/s-386MB/s), io=3684MiB (3863MB), run=10001-10001msec

Disk stats (read/write):
    dm-0: ios=0/930312, merge=0/0, ticks=0/613785, in_queue=617670, util=99.15%, aggrios=0/942996, aggrmerge=0/0, aggrticks=0/622132, aggrin_queue=622984, aggrutil=98.67%
  sda: ios=0/942996, merge=0/0, ticks=0/622132, in_queue=622984, util=98.67%

[kuro@E5800-T110f-E ~]$ fio -filename=/tmp/test2g -direct=1 -rw=read -bs=4k -size=2G -numjobs=64 -runtime=10 -group_reporting -name=file1
file1: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 64 (f=64): [R(64)][100.0%][r=370MiB/s,w=0KiB/s][r=94.8k,w=0 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=64): err= 0: pid=2326: Mon Aug 16 20:08:43 2021
   read: IOPS=101k, BW=393MiB/s (412MB/s)(3932MiB/10013msec)
    clat (nsec): min=964, max=24291k, avg=634621.44, stdev=231967.90
     lat (nsec): min=984, max=24291k, avg=634806.78, stdev=231996.96
    clat percentiles (nsec):
     |  1.00th=[   1080],  5.00th=[   1160], 10.00th=[ 651264],
     | 20.00th=[ 659456], 30.00th=[ 667648], 40.00th=[ 667648],
     | 50.00th=[ 675840], 60.00th=[ 675840], 70.00th=[ 684032],
     | 80.00th=[ 684032], 90.00th=[ 692224], 95.00th=[ 692224],
     | 99.00th=[ 700416], 99.50th=[ 708608], 99.90th=[1253376],
     | 99.95th=[1351680], 99.99th=[7372800]
   bw (  KiB/s): min= 5592, max=28552, per=1.55%, avg=6244.59, stdev=2411.70, samples=1275
   iops        : min= 1398, max= 7138, avg=1561.15, stdev=602.93, samples=1275
  lat (nsec)   : 1000=0.03%
  lat (usec)   : 2=6.21%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.07%, 750=93.34%, 1000=0.15%
  lat (msec)   : 2=0.14%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.45%, sys=2.37%, ctx=944329, majf=0, minf=2418
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1006564,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=393MiB/s (412MB/s), 393MiB/s-393MiB/s (412MB/s-412MB/s), io=3932MiB (4123MB), run=10013-10013msec

Disk stats (read/write):
    dm-0: ios=932627/2, merge=0/0, ticks=617264/0, in_queue=619520, util=99.19%, aggrios=943587/2, aggrmerge=0/0, aggrticks=624350/0, aggrin_queue=624896, aggrutil=98.60%
  sda: ios=943587/2, merge=0/0, ticks=624350/0, in_queue=624896, util=98.60%

[kuro@E5800-T110f-E ~]$ fio -filename=/tmp/test2g -direct=1 -rw=randwrite -bs=4k -size=2G -numjobs=64 -runtime=10 -group_reporting -name=file1
file1: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 64 (f=64): [w(64)][100.0%][r=0KiB/s,w=1425KiB/s][r=0,w=356 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=64): err= 0: pid=2413: Mon Aug 16 20:09:57 2021
  write: IOPS=803, BW=3213KiB/s (3290kB/s)(31.9MiB/10175msec)
    clat (usec): min=112, max=1163.6k, avg=79013.12, stdev=90407.13
     lat (usec): min=113, max=1163.6k, avg=79013.47, stdev=90407.06
    clat percentiles (usec):
     |  1.00th=[    519],  5.00th=[    865], 10.00th=[   1037],
     | 20.00th=[   1319], 30.00th=[   1598], 40.00th=[   2008],
     | 50.00th=[   2999], 60.00th=[ 149947], 70.00th=[ 166724],
     | 80.00th=[ 179307], 90.00th=[ 193987], 95.00th=[ 206570],
     | 99.00th=[ 227541], 99.50th=[ 240124], 99.90th=[ 291505],
     | 99.95th=[ 549454], 99.99th=[1166017]
   bw (  KiB/s): min=    8, max=  672, per=1.58%, avg=50.71, stdev=124.96, samples=1278
   iops        : min=    2, max=  168, avg=12.64, stdev=31.25, samples=1278
  lat (usec)   : 250=0.15%, 500=0.65%, 750=2.29%, 1000=5.66%
  lat (msec)   : 2=30.98%, 4=14.11%, 10=2.23%, 20=0.06%, 50=0.23%
  lat (msec)   : 100=0.34%, 250=43.09%, 500=0.15%, 750=0.04%, 1000=0.01%
  cpu          : usr=0.00%, sys=0.07%, ctx=15864, majf=0, minf=2288
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8173,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3213KiB/s (3290kB/s), 3213KiB/s-3213KiB/s (3290kB/s-3290kB/s), io=31.9MiB (33.5MB), run=10175-10175msec

Disk stats (read/write):
    dm-0: ios=0/8173, merge=0/0, ticks=0/634467, in_queue=638149, util=98.41%, aggrios=0/8173, aggrmerge=0/0, aggrticks=0/638522, aggrin_queue=638522, aggrutil=98.23%
  sda: ios=0/8173, merge=0/0, ticks=0/638522, in_queue=638522, util=98.23%

[kuro@E5800-T110f-E ~]$ fio -filename=/tmp/test2g -direct=1 -rw=randread -bs=4k -size=2G -numjobs=64 -runtime=10 -group_reporting -name=file1
file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.7
Starting 64 processes
Jobs: 64 (f=64): [r(64)][100.0%][r=128MiB/s,w=0KiB/s][r=32.8k,w=0 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=64): err= 0: pid=2490: Mon Aug 16 20:10:22 2021
   read: IOPS=10.8k, BW=42.0MiB/s (44.1MB/s)(425MiB/10113msec)
    clat (nsec): min=1070, max=1343.7M, avg=5903426.77, stdev=34468847.98
     lat (nsec): min=1094, max=1343.7M, avg=5903467.66, stdev=34468861.48
    clat percentiles (nsec):
     |  1.00th=[     1112],  5.00th=[     1112], 10.00th=[     1128],
     | 20.00th=[     1144], 30.00th=[     1144], 40.00th=[     1160],
     | 50.00th=[     1160], 60.00th=[     1176], 70.00th=[     1192],
     | 80.00th=[     1224], 90.00th=[    57600], 95.00th=[ 21626880],
     | 99.00th=[166723584], 99.50th=[233832448], 99.90th=[438304768],
     | 99.95th=[549453824], 99.99th=[859832320]
   bw (  KiB/s): min=    7, max= 6496, per=1.53%, avg=659.90, stdev=1078.10, samples=1235
   iops        : min=    1, max= 1624, avg=164.94, stdev=269.53, samples=1235
  lat (usec)   : 2=85.50%, 4=2.85%, 10=0.15%, 20=0.01%, 50=0.34%
  lat (usec)   : 100=2.70%, 250=0.13%, 500=0.05%, 750=0.03%, 1000=0.03%
  lat (msec)   : 2=0.13%, 4=0.43%, 10=1.30%, 20=1.24%, 50=1.75%
  lat (msec)   : 100=1.43%, 250=1.51%, 500=0.36%, 750=0.06%, 1000=0.01%
  cpu          : usr=0.01%, sys=0.04%, ctx=12651, majf=0, minf=2533
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=108831,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=42.0MiB/s (44.1MB/s), 42.0MiB/s-42.0MiB/s (44.1MB/s-44.1MB/s), io=425MiB (446MB), run=10113-10113msec

Disk stats (read/write):
    dm-0: ios=12281/0, merge=0/0, ticks=625279/0, in_queue=631423, util=98.91%, aggrios=12509/0, aggrmerge=0/0, aggrticks=642278/0, aggrin_queue=642274, aggrutil=98.76%
  sda: ios=12509/0, merge=0/0, ticks=642278/0, in_queue=642274, util=98.76%


In the x1 slot.
2 GB file / 4k block size
Sequential write:
write: IOPS=94.3k, BW=368MiB/s (386MB/s)(3684MiB/10001msec), latency ≈ 750 usec
Sequential read:
read: IOPS=101k, BW=393MiB/s (412MB/s)(3932MiB/10013msec), latency ≈ 750 usec
Random write:
write: IOPS=803, BW=3213KiB/s (3290kB/s)(31.9MiB/10175msec), latency ≈ 250 msec
Random read:
read: IOPS=10.8k, BW=42.0MiB/s (44.1MB/s)(425MiB/10113msec), latency ≈ 2 usec

In the x4 slot.
Sequential write:
write: IOPS=146k, BW=569MiB/s (596MB/s)(5688MiB/10001msec), latency ≈ 500 usec
Sequential read:
read: IOPS=95.5k, BW=373MiB/s (391MB/s)(3732MiB/10001msec), latency ≈ 750 usec
Random write:
write: IOPS=782, BW=3132KiB/s (3207kB/s)(31.2MiB/10213msec), latency ≈ 250 msec
Random read:
read: IOPS=8749, BW=34.2MiB/s (35.8MB/s)(347MiB/10155msec), latency ≈ 2 usec

After all, x1 and x4 don't seem to differ that much (sequential write does drop from 569 to 368 MiB/s, right around a single PCIe 2.0 lane's ceiling, but the random I/O that dominates real workloads is disk-bound either way). Leaving it in the x1 slot will be fine.

Is the Jetson Nano any use for the training phase?

With only 128 CUDA cores, the safe prediction is that it's unsuited to training, but let's try MNIST anyway.
First, as a reference point, I run it on a CentOS 7 server with a 192-CUDA-core GT710, a Xeon E3-1230v6, and 64 GB of RAM.
For software, build a TensorFlow environment on Docker:

$ docker run --gpus all -it --rm --name tensorflow-gpu -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter

Then in the Jupyter notebook,

import tensorflow as tf
import time

# MNIST: 60,000 training and 10,000 test images of handwritten digits
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# Simple MLP: flatten 28x28 -> 512-unit ReLU layer -> dropout -> 10-way softmax
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Time 20 training epochs plus one evaluation pass
time_start = time.time()
model.fit(x_train, y_train, batch_size=1024, epochs=20)
model.evaluate(x_test, y_test)
time_end = time.time()
print("time:",(time_end - time_start))

I run a script like the above.

Train on 60000 samples
Epoch 1/20
60000/60000 [==============================] - 1s 24us/sample - loss: 0.5912 - accuracy: 0.8328
Epoch 2/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.2486 - accuracy: 0.9295
Epoch 3/20
60000/60000 [==============================] - 1s 21us/sample - loss: 0.1904 - accuracy: 0.9465s - loss: 0.200
Epoch 4/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.1543 - accuracy: 0.9566
Epoch 5/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.1296 - accuracy: 0.9642
Epoch 6/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.1108 - accuracy: 0.9690
Epoch 7/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0968 - accuracy: 0.9725
Epoch 8/20
60000/60000 [==============================] - 1s 21us/sample - loss: 0.0848 - accuracy: 0.9763
Epoch 9/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0752 - accuracy: 0.9790
Epoch 10/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0679 - accuracy: 0.9811
Epoch 11/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0601 - accuracy: 0.9837s - loss: 0.0606 - accuracy: 
Epoch 12/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0554 - accuracy: 0.9838
Epoch 13/20
60000/60000 [==============================] - 1s 21us/sample - loss: 0.0496 - accuracy: 0.9863
Epoch 14/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0452 - accuracy: 0.9876
Epoch 15/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0418 - accuracy: 0.9884
Epoch 16/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0375 - accuracy: 0.9895
Epoch 17/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0346 - accuracy: 0.9906s - loss: 0.0346 - accuracy: 0.99
Epoch 18/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0325 - accuracy: 0.9910
Epoch 19/20
60000/60000 [==============================] - 1s 22us/sample - loss: 0.0293 - accuracy: 0.9926s - loss: 0
Epoch 20/20
60000/60000 [==============================] - 1s 21us/sample - loss: 0.0279 - accuracy: 0.9928
10000/10000 [==============================] - 1s 57us/sample - loss: 0.0600 - accuracy: 0.9814
time: 27.059566020965576

On the first run the dataset download happens first, but the timer only starts once that's done.
27.06 seconds, then.

In the same way, on a Jetson Nano B01,

$ docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3

I set up the Docker environment like that, ran the same script, and got

time: 44.97859978675842

44.98 seconds.

The GT710 has 1.5x the cores and came out 1.66x faster, about as unsurprising a result as they come.
If it can't even hold its own against the bottom-of-the-barrel GT710, the Nano really is no use for training.
I'll let it concentrate on inference.

The GT710 isn't really usable either, so I want to sort out a somewhat more decent graphics card.

The AlphaFold2 shock

I have to at least give it a try.

qiita.com

The one problem: I don't have a decent GPU.

The environment I tried:
CPU: Xeon E3-1230V6
Memory: 64GB
Storage: 2TB (2TB x2, RAID1) + 8TB (4TB x2, RAID0)
GPU: GeForce GT710 (1GB)

First, running it without thinking anything through:

$ python3 docker/run_docker.py --fasta_paths=/mnt/fasta/test.fasta --max_template_date=2021-07-23
I0724 13:05:47.793096 139963287299840 run_docker.py:114] Mounting /mnt/ts5400r/rnaseq/fasta -> /mnt/fasta_path_0
I0724 13:05:47.793220 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniref90 -> /mnt/uniref90_database_path
I0724 13:05:47.793289 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/mgnify -> /mnt/mgnify_database_path
I0724 13:05:47.793348 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniclust30/uniclust30_2018_08 -> /mnt/uniclust30_database_path
I0724 13:05:47.793409 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/bfd -> /mnt/bfd_database_path
I0724 13:05:47.793467 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb70 -> /mnt/pdb70_database_path
I0724 13:05:47.793522 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold -> /mnt/data_dir
I0724 13:05:47.793577 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/template_mmcif_dir
I0724 13:05:47.793634 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0724 13:05:50.641787 139963287299840 run_docker.py:180] /opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:206: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
I0724 13:05:50.641920 139963287299840 run_docker.py:180] 'command line!' % flag_name)
I0724 13:05:52.431665 139963287299840 run_docker.py:180] I0724 13:05:52.431034 140038903707456 templates.py:837] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0724 13:05:53.667072 139963287299840 run_docker.py:180] I0724 13:05:53.666388 140038903707456 tpu_client.py:54] Starting the local TPU driver.
I0724 13:05:53.667418 139963287299840 run_docker.py:180] I0724 13:05:53.666860 140038903707456 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0724 13:05:53.907798 139963287299840 run_docker.py:180] I0724 13:05:53.907007 140038903707456 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0724 13:05:54.334778 139963287299840 run_docker.py:180] 2021-07-24 13:05:54.334259: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 4114612224 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:54.919834 139963287299840 run_docker.py:180] 2021-07-24 13:05:54.919301: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3703150848 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:55.505137 139963287299840 run_docker.py:180] 2021-07-24 13:05:55.504621: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3332835584 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:56.089438 139963287299840 run_docker.py:180] 2021-07-24 13:05:56.088938: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2999552000 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:56.677316 139963287299840 run_docker.py:180] 2021-07-24 13:05:56.676777: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2699596800 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:57.265036 139963287299840 run_docker.py:180] 2021-07-24 13:05:57.264540: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2429637120 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:57.851714 139963287299840 run_docker.py:180] 2021-07-24 13:05:57.851303: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2186673408 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:58.437325 139963287299840 run_docker.py:180] 2021-07-24 13:05:58.436849: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1968006144 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:59.023176 139963287299840 run_docker.py:180] 2021-07-24 13:05:59.022710: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1771205632 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:59.607919 139963287299840 run_docker.py:180] 2021-07-24 13:05:59.607428: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1594085120 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:00.190412 139963287299840 run_docker.py:180] 2021-07-24 13:06:00.189992: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1434676736 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:00.775824 139963287299840 run_docker.py:180] 2021-07-24 13:06:00.775335: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1291209216 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:01.346099 139963287299840 run_docker.py:180] 2021-07-24 13:06:01.345547: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1162088448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:01.931352 139963287299840 run_docker.py:180] 2021-07-24 13:06:01.930887: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1045879552 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:05.746232 139963287299840 run_docker.py:180] 2021-07-24 13:06:05.745756: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:05.808895 139963287299840 run_docker.py:180] 2021-07-24 13:06:05.808401: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.837229 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.836805: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.873901 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.873463: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.874069 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.873510: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 192.0KiB (rounded to 196608)requested by op
I0724 13:06:15.874446 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.874185: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:468] ****************************************************************************************************
I0724 13:06:15.876205 139963287299840 run_docker.py:180] Traceback (most recent call last):
I0724 13:06:15.876372 139963287299840 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 303, in <module>
I0724 13:06:15.876452 139963287299840 run_docker.py:180] app.run(main)
I0724 13:06:15.876526 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0724 13:06:15.876597 139963287299840 run_docker.py:180] _run_main(main, args)
I0724 13:06:15.876669 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0724 13:06:15.876739 139963287299840 run_docker.py:180] sys.exit(main(argv))
I0724 13:06:15.876816 139963287299840 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 256, in main
I0724 13:06:15.876888 139963287299840 run_docker.py:180] model_name=model_name, data_dir=FLAGS.data_dir)
I0724 13:06:15.876960 139963287299840 run_docker.py:180] File "/app/alphafold/alphafold/model/data.py", line 41, in get_model_haiku_params
I0724 13:06:15.877031 139963287299840 run_docker.py:180] return utils.flat_params_to_haiku(params)
I0724 13:06:15.877101 139963287299840 run_docker.py:180] File "/app/alphafold/alphafold/model/utils.py", line 79, in flat_params_to_haiku
I0724 13:06:15.877170 139963287299840 run_docker.py:180] hk_params[scope][name] = jnp.array(array)
I0724 13:06:15.877241 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py", line 3044, in array
I0724 13:06:15.877312 139963287299840 run_docker.py:180] out = _device_put_raw(object, weak_type=weak_type)
I0724 13:06:15.877383 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 1607, in _device_put_raw
I0724 13:06:15.877454 139963287299840 run_docker.py:180] return xla.array_result_handler(None, aval)(*xla.device_put(x))
I0724 13:06:15.877524 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 146, in device_put
I0724 13:06:15.877595 139963287299840 run_docker.py:180] return device_put_handlers[type(x)](x, device)
I0724 13:06:15.877666 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 154, in _device_put_array
I0724 13:06:15.877737 139963287299840 run_docker.py:180] return (backend.buffer_from_pyval(x, device),)
I0724 13:06:15.877811 139963287299840 run_docker.py:180] RuntimeError: Resource exhausted: Out of memory while trying to allocate 196608 bytes.

The GT710 has only 1 GB of memory, so it overflows and grinds to a halt.
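(As far as I recall from the Docker wrapper, run_docker.py exports roughly the following environment, which is why the log shows repeated multi-gigabyte "unified memory" allocation attempts: JAX is allowed to claim about 4x the GPU's memory and spill into host RAM. The exact names and values are from memory, so verify against your checkout:)

TF_FORCE_UNIFIED_MEMORY=1
XLA_PYTHON_CLIENT_MEM_FRACTION=4.0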

So I looked into how to run it without a GPU, but couldn't find a definitive writeup. (Naturally; there can't be many people eccentric enough to deliberately do deep learning without a GPU.)

I hunted through the script for something plausible and rewrote it from True to False (since it's an absl flag, passing --use_gpu=false on the command line should presumably work too, without editing the file):

flags.DEFINE_bool('use_gpu', False, 'Enable NVIDIA runtime to run with GPUs.')
$ python3 docker/run_docker.py --fasta_paths=/mnt/fasta/test.fasta --max_template_date=2021-07-24
I0724 14:36:03.430523 140479510705920 run_docker.py:114] Mounting /mnt/ts5400r/rnaseq/fasta -> /mnt/fasta_path_0
I0724 14:36:03.430647 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniref90 -> /mnt/uniref90_database_path
I0724 14:36:03.430718 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/mgnify -> /mnt/mgnify_database_path
I0724 14:36:03.430775 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniclust30/uniclust30_2018_08 -> /mnt/uniclust30_database_path
I0724 14:36:03.430836 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/bfd -> /mnt/bfd_database_path
I0724 14:36:03.430891 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb70 -> /mnt/pdb70_database_path
I0724 14:36:03.430948 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold -> /mnt/data_dir
I0724 14:36:03.431000 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/template_mmcif_dir
I0724 14:36:03.431055 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0724 14:36:07.806614 140479510705920 run_docker.py:180] /opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:206: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
I0724 14:36:07.806749 140479510705920 run_docker.py:180] 'command line!' % flag_name)
I0724 14:36:08.187804 140479510705920 run_docker.py:180] I0724 14:36:08.187144 140639490905920 templates.py:837] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0724 14:36:09.709877 140479510705920 run_docker.py:180] I0724 14:36:09.709128 140639490905920 tpu_client.py:54] Starting the local TPU driver.
I0724 14:36:09.710193 140479510705920 run_docker.py:180] I0724 14:36:09.709654 140639490905920 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0724 14:36:09.710691 140479510705920 run_docker.py:180] 2021-07-24 14:36:09.710255: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I0724 14:36:09.710975 140479510705920 run_docker.py:180] 2021-07-24 14:36:09.710307: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0724 14:36:09.711264 140479510705920 run_docker.py:180] I0724 14:36:09.710498 140639490905920 xla_bridge.py:214] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0724 14:36:09.711437 140479510705920 run_docker.py:180] I0724 14:36:09.710731 140639490905920 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0724 14:36:09.711596 140479510705920 run_docker.py:180] W0724 14:36:09.710883 140639490905920 xla_bridge.py:217] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0724 14:36:14.730792 140479510705920 run_docker.py:180] I0724 14:36:14.730233 140639490905920 run_alphafold.py:261] Have 5 models: ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
I0724 14:36:14.730963 140479510705920 run_docker.py:180] I0724 14:36:14.730385 140639490905920 run_alphafold.py:273] Using random seed 6370842927624696923 for the data pipeline
I0724 14:36:14.732356 140479510705920 run_docker.py:180] I0724 14:36:14.732125 140639490905920 jackhmmer.py:130] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmppl2_77oq/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/test.fasta /mnt/uniref90_database_path/uniref90.fasta"
I0724 14:36:14.762296 140479510705920 run_docker.py:180] I0724 14:36:14.761586 140639490905920 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0724 14:42:12.103540 140479510705920 run_docker.py:180] I0724 14:42:12.101704 140639490905920 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 357.340 seconds
I0724 14:42:12.168576 140479510705920 run_docker.py:180] I0724 14:42:12.168016 140639490905920 jackhmmer.py:130] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmphgl1pgx2/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/test.fasta /mnt/mgnify_database_path/mgy_clusters.fa"
I0724 14:42:12.208694 140479510705920 run_docker.py:180] I0724 14:42:12.208099 140639490905920 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query
I0724 14:49:14.646541 140479510705920 run_docker.py:180] I0724 14:49:14.644667 140639490905920 utils.py:40] Finished Jackhmmer (mgy_clusters.fa) query in 422.436 seconds
I0724 14:49:17.353891 140479510705920 run_docker.py:180] I0724 14:49:17.353327 140639490905920 hhsearch.py:76] Launching subprocess "/usr/bin/hhsearch -i /tmp/tmp4yri1ytb/query.a3m -o /tmp/tmp4yri1ytb/output.hhr -maxseq 1000000 -d /mnt/pdb70_database_path/pdb70"
I0724 14:49:17.381958 140479510705920 run_docker.py:180] I0724 14:49:17.381289 140639490905920 utils.py:36] Started HHsearch query
I0724 14:49:17.668585 140479510705920 run_docker.py:180] I0724 14:49:17.667990 140639490905920 utils.py:40] Finished HHsearch query in 0.286 seconds
I0724 14:49:17.670548 140479510705920 run_docker.py:180] Traceback (most recent call last):
I0724 14:49:17.670647 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 303, in <module>
I0724 14:49:17.670720 140479510705920 run_docker.py:180] app.run(main)
I0724 14:49:17.670816 140479510705920 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0724 14:49:17.670883 140479510705920 run_docker.py:180] _run_main(main, args)
I0724 14:49:17.670947 140479510705920 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0724 14:49:17.671009 140479510705920 run_docker.py:180] sys.exit(main(argv))
I0724 14:49:17.671072 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 285, in main
I0724 14:49:17.671134 140479510705920 run_docker.py:180] random_seed=random_seed)
I0724 14:49:17.671196 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 129, in predict_structure
I0724 14:49:17.671258 140479510705920 run_docker.py:180] msa_output_dir=msa_output_dir)
I0724 14:49:17.671319 140479510705920 run_docker.py:180] File "/app/alphafold/alphafold/data/pipeline.py", line 141, in process
I0724 14:49:17.671381 140479510705920 run_docker.py:180] hhsearch_result = self.hhsearch_pdb70_runner.query(uniref90_msa_as_a3m)
I0724 14:49:17.671443 140479510705920 run_docker.py:180] File "/app/alphafold/alphafold/data/tools/hhsearch.py", line 87, in query
I0724 14:49:17.671504 140479510705920 run_docker.py:180] stdout.decode('utf-8'), stderr[:100_000].decode('utf-8')))
I0724 14:49:17.671566 140479510705920 run_docker.py:180] RuntimeError: HHSearch failed:
I0724 14:49:17.671627 140479510705920 run_docker.py:180] stdout:
I0724 14:49:17.671689 140479510705920 run_docker.py:180] 
I0724 14:49:17.671751 140479510705920 run_docker.py:180] 
I0724 14:49:17.671818 140479510705920 run_docker.py:180] stderr:
I0724 14:49:17.671881 140479510705920 run_docker.py:180] - 14:49:17.586 INFO: /tmp/tmp4yri1ytb/query.a3m is in A2M, A3M or FASTA format
I0724 14:49:17.671943 140479510705920 run_docker.py:180] 
I0724 14:49:17.672005 140479510705920 run_docker.py:180] - 14:49:17.587 WARNING: Ignoring invalid symbol '*' at pos. 492 in line 2 of /tmp/tmp4yri1ytb/query.a3m
I0724 14:49:17.672068 140479510705920 run_docker.py:180] 
I0724 14:49:17.672129 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: [subseq from] Endoglucanase (Fragment) n=2 Tax=Citrus unshiu TaxID=55188 RepID=A0A2H5PNG1_CITUN
I0724 14:49:17.672191 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:
I0724 14:49:17.672252 140479510705920 run_docker.py:180] 
I0724 14:49:17.672313 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: 	sequences in /tmp/tmp4yri1ytb/query.a3m do not all have the same number of columns,
I0724 14:49:17.672375 140479510705920 run_docker.py:180] 
I0724 14:49:17.672436 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR:
I0724 14:49:17.672498 140479510705920 run_docker.py:180] e.g. first sequence and sequence UniRef90_A0A2H5PNG1/69-549.
I0724 14:49:17.672560 140479510705920 run_docker.py:180] 
I0724 14:49:17.672621 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: Check input format for '-M a2m' option and consider using '-M first' or '-M 50'
I0724 14:49:17.672683 140479510705920 run_docker.py:180] 
I0724 14:49:17.672744 140479510705920 run_docker.py:180] 
I0724 14:49:17.672822 140479510705920 run_docker.py:180] 

This just runs into a different error, and at the time I couldn't troubleshoot it. (In hindsight, it's the stop-codon asterisk problem sorted out in the follow-up post above: the ignored '*' leaves the query one column short, which matches HHsearch's complaint that the sequences don't all have the same number of columns.)

Without at least an RTX 2070 or so this probably isn't going anywhere. But getting there means building up from the PSU and motherboard; I have no machine for it.
For now I'll have to make do with the simplified Google Colab service.

Incidentally,
colab.research.google.com
a new version is already out.

f:id:k-kuro:20210726004237p:plain
f:id:k-kuro:20210726004307p:plain
f:id:k-kuro:20210726004323p:plain
It looks like this.

Some components differ from the version published on GitHub, so in terms of accuracy it's said to be slightly worse.

Installing Linuxbrew per user

Up to now, one way or another, I was the only one installing programs on the servers and they were effectively single-user, so there was no particular problem. But now I want to do something convoluted: manage users with an NIS server, use the machines as cluster compute nodes, and have some local users coexist as well, so I prepared a separate home-directory area for the local users. While I'm at it, I'd like each local user to be able to install packages individually with Linuxbrew.

The method is just to point the install at each user's own home directory; as on the official site,

git clone https://github.com/Homebrew/brew ~/.linuxbrew/Homebrew
mkdir ~/.linuxbrew/bin
ln -s ~/.linuxbrew/Homebrew/bin/brew ~/.linuxbrew/bin
eval $(~/.linuxbrew/bin/brew shellenv)

and that's all it takes.

echo "eval $(~/.linuxbrew/bin/brew shellenv)">>~/.bash_profile

Set that up and you're golden.

Or not.

This doesn't work. Why? Needs investigation.

Got it: ruby wasn't installed.

No, the problem doesn't end there. Because the prefix was moved, the bottled binary installs can't be used and everything gets built from source, and on CentOS 7 there's a bug where gcc won't install, so nothing works out. What a mess.
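Incidentally, separate from the gcc problem, the echo line above also has a quoting bug: with double quotes the $(...) is expanded once, at echo time, and the expanded output is what lands in .bash_profile. Single quotes store the literal eval so it runs at every login, which is presumably the intent:

echo 'eval $(~/.linuxbrew/bin/brew shellenv)' >> ~/.bash_profile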

So, a change of plan: create a directory called linuxbrew in the local users' home area, and overlay-mount it from above:

$ sudo mount --bind /mnt/home/linuxbrew /home/linuxbrew

Something like that; how about it?
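(If this pans out, the bind mount could be made persistent with the usual /etc/fstab entry, something along these lines:)

/mnt/home/linuxbrew  /home/linuxbrew  none  bind  0 0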

Well, in the end, whenever I want to install something I can just su to the NIS user and install it, so the conclusion I reached is: only where I really need a separate local environment, overwrite the mount, install, and run.


Addendum
With the method above, each user can't freely choose which versions to install, and there's a risk of polluting other users' environments, so I'd still rather keep things separate.
It's a desperate workaround, but copying /home/linuxbrew/.linuxbrew/Cellar/gcc@5 wholesale into my own $HOME/.linuxbrew/Cellar/ gets things working for now. I have a feeling this will become a source of headaches later, though.

CentOS7上のDockerでDeep learning環境を構築

とりあえずDockerをインストールしてrunできる所まで来たのでいよいよDeep learning環境を構築していこうと思う。

The first step is to install the NVIDIA driver on CentOS 7.

This is basically the same procedure as before:
k-kuro.hatenadiary.jp

[kkuro@E5800-T110f-E ~]$ su
パスワード:
[root@E5800-T110f-E kkuro]# yum -y install kernel-devel-$(uname -r) kernel-header-$(uname -r) gcc make
読み込んだプラグイン:fastestmirror, langpacks
Loading mirror speeds from cached hostfile
epel/x86_64/metalink                                     | 3.8 kB     00:00     
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
CityFan                                                  | 3.0 kB     00:00     
base                                                     | 3.6 kB     00:00     
docker-ce-stable                                         | 3.5 kB     00:00     
epel                                                     | 4.7 kB     00:00     
extras                                                   | 2.9 kB     00:00     
updates                                                  | 2.9 kB     00:00     
(1/2): epel/x86_64/updateinfo                              | 1.0 MB   00:00     
(2/2): epel/x86_64/primary_db                              | 6.9 MB   00:00     
パッケージ kernel-header-3.10.0-1160.31.1.el7.x86_64 は利用できません。
パッケージ gcc-4.8.5-44.el7.x86_64 はインストール済みか最新バージョンです
パッケージ 1:make-3.82-24.el7.x86_64 はインストール済みか最新バージョンです
依存性の解決をしています
--> トランザクションの確認を実行しています。
---> パッケージ kernel-devel.x86_64 0:3.10.0-1160.31.1.el7 を インストール
--> 依存性解決を終了しました。

依存性を解決しました

================================================================================
 Package            アーキテクチャー
                                 バージョン                 リポジトリー   容量
================================================================================
インストール中:
 kernel-devel       x86_64       3.10.0-1160.31.1.el7       updates        18 M

トランザクションの要約
================================================================================
インストール  1 パッケージ

総ダウンロード容量: 18 M
インストール容量: 38 M
Downloading packages:
No Presto metadata available for updates
kernel-devel-3.10.0-1160.31.1.el7.x86_64.rpm               |  18 MB   00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  インストール中          : kernel-devel-3.10.0-1160.31.1.el7.x86_64        1/1 
  検証中                  : kernel-devel-3.10.0-1160.31.1.el7.x86_64        1/1 

インストール:
  kernel-devel.x86_64 0:3.10.0-1160.31.1.el7                                    

完了しました!
[root@E5800-T110f-E kkuro]# lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
39:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 02)
[root@E5800-T110f-E kkuro]# wget http://jp.download.nvidia.com/XFree86/Linux-x86_64/460.84/NVIDIA-Linux-x86_64-460.84.run
--2021-07-11 18:22:23--  http://jp.download.nvidia.com/XFree86/Linux-x86_64/460.84/NVIDIA-Linux-x86_64-460.84.run
jp.download.nvidia.com (jp.download.nvidia.com) をDNSに問いあわせています... 2606:2800:247:2063:46e:21d:825:102e, 192.229.232.112
jp.download.nvidia.com (jp.download.nvidia.com)|2606:2800:247:2063:46e:21d:825:102e|:80 に接続しています... 接続しました。
HTTP による接続要求を送信しました、応答を待っています... 200 OK
長さ: 177840337 (170M) [application/octet-stream]
`NVIDIA-Linux-x86_64-460.84.run' に保存中

100%[======================================>] 177,840,337 31.2MB/s 時間 6.3s   

2021-07-11 18:22:30 (26.9 MB/s) - `NVIDIA-Linux-x86_64-460.84.run' へ保存完了 [177840337/177840337]

[root@E5800-T110f-E kkuro]# bash NVIDIA-Linux-x86_64-460.84.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.84........................................................................................

Something like that.
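One aside about the yum log above: kernel-header is reported as unavailable because the package is actually named kernel-headers (plural). The corrected line would be:

$ sudo yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc make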

[kkuro@E5800-T110f-E ~]$ sudo nvidia-smi
[sudo] kkuro のパスワード:
Sun Jul 11 18:25:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:01:00.0 N/A |                  N/A |
| 50%   57C    P0    N/A /  N/A |      0MiB /   980MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Now, from here on,
Installation Guide — NVIDIA Cloud Native Technologies documentation
I'll proceed with the installation following this guide from the folks at NVIDIA.

[kkuro@E5800-T110f-E ~]$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
>    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/stable/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[libnvidia-container-experimental]
name=libnvidia-container-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/stable/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-runtime-experimental]
name=nvidia-container-runtime-experimental
baseurl=https://nvidia.github.io/nvidia-container-runtime/experimental/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-docker]
name=nvidia-docker
baseurl=https://nvidia.github.io/nvidia-docker/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-docker/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[kkuro@E5800-T110f-E ~]$ sudo yum clean expire-cache
読み込んだプラグイン:fastestmirror, langpacks
リポジトリーを清掃しています: CityFan base docker-ce-stable epel extras
     ...: libnvidia-container nvidia-container-runtime nvidia-docker updates
9 個の metadata ファイルを削除しました
[kkuro@E5800-T110f-E ~]$ sudo yum install -y nvidia-docker2
読み込んだプラグイン:fastestmirror, langpacks
Loading mirror speeds from cached hostfile
epel/x86_64/metalink                                     | 3.8 kB     00:00     
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
CityFan                                                  | 3.0 kB     00:00     
base                                                     | 3.6 kB     00:00     
docker-ce-stable                                         | 3.5 kB     00:00     
extras                                                   | 2.9 kB     00:00     
libnvidia-container/x86_64/signature                     |  833 B     00:00     
https://nvidia.github.io/libnvidia-container/gpgkey から鍵を取得中です。
Importing GPG key 0xF796ECB0:
 Userid     : "NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>"
 Fingerprint: c95b 321b 61e8 8c18 09c4 f759 ddca e044 f796 ecb0
 From       : https://nvidia.github.io/libnvidia-container/gpgkey
libnvidia-container/x86_64/signature                     | 2.1 kB     00:00 !!! 
nvidia-container-runtime/x86_64/signature                |  833 B     00:00     
https://nvidia.github.io/nvidia-container-runtime/gpgkey から鍵を取得中です。
Importing GPG key 0xF796ECB0:
 Userid     : "NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>"
 Fingerprint: c95b 321b 61e8 8c18 09c4 f759 ddca e044 f796 ecb0
 From       : https://nvidia.github.io/nvidia-container-runtime/gpgkey
nvidia-container-runtime/x86_64/signature                | 2.1 kB     00:00 !!! 
nvidia-docker/x86_64/signature                           |  833 B     00:00     
https://nvidia.github.io/nvidia-docker/gpgkey から鍵を取得中です。
Importing GPG key 0xF796ECB0:
 Userid     : "NVIDIA CORPORATION (Open Source Projects) <cudatools@nvidia.com>"
 Fingerprint: c95b 321b 61e8 8c18 09c4 f759 ddca e044 f796 ecb0
 From       : https://nvidia.github.io/nvidia-docker/gpgkey
nvidia-docker/x86_64/signature                           | 2.1 kB     00:00 !!! 
updates                                                  | 2.9 kB     00:00     
(1/3): libnvidia-container/x86_64/primary                  |  17 kB   00:00     
(2/3): nvidia-docker/x86_64/primary                        | 8.0 kB   00:00     
(3/3): nvidia-container-runtime/x86_64/primary             |  11 kB   00:00     
libnvidia-container                                                     105/105
nvidia-container-runtime                                                  71/71
nvidia-docker                                                             54/54
依存性の解決をしています
--> トランザクションの確認を実行しています。
---> パッケージ nvidia-docker2.noarch 0:2.6.0-1 を インストール
--> 依存性の処理をしています: nvidia-container-runtime >= 3.5.0 のパッケージ: nvidia-docker2-2.6.0-1.noarch
--> トランザクションの確認を実行しています。
---> パッケージ nvidia-container-runtime.x86_64 0:3.5.0-1 を インストール
--> 依存性の処理をしています: nvidia-container-toolkit < 2.0.0 のパッケージ: nvidia-container-runtime-3.5.0-1.x86_64
--> 依存性の処理をしています: nvidia-container-toolkit >= 1.5.0 のパッケージ: nvidia-container-runtime-3.5.0-1.x86_64
--> トランザクションの確認を実行しています。
---> パッケージ nvidia-container-toolkit.x86_64 0:1.5.1-2 を インストール
--> 依存性の処理をしています: libnvidia-container-tools < 2.0.0 のパッケージ: nvidia-container-toolkit-1.5.1-2.x86_64
--> 依存性の処理をしています: libnvidia-container-tools >= 1.4.0 のパッケージ: nvidia-container-toolkit-1.5.1-2.x86_64
--> トランザクションの確認を実行しています。
---> パッケージ libnvidia-container-tools.x86_64 0:1.4.0-1 を インストール
--> 依存性の処理をしています: libnvidia-container1(x86-64) >= 1.4.0-1 のパッケージ: libnvidia-container-tools-1.4.0-1.x86_64
--> 依存性の処理をしています: libnvidia-container.so.1(NVC_1.0)(64bit) のパッケージ: libnvidia-container-tools-1.4.0-1.x86_64
--> 依存性の処理をしています: libnvidia-container.so.1()(64bit) のパッケージ: libnvidia-container-tools-1.4.0-1.x86_64
--> トランザクションの確認を実行しています。
---> パッケージ libnvidia-container1.x86_64 0:1.4.0-1 を インストール
--> 依存性解決を終了しました。

依存性を解決しました

================================================================================
 Package                    アーキテクチャー
                                    バージョン  リポジトリー               容量
================================================================================
インストール中:
 nvidia-docker2             noarch  2.6.0-1     nvidia-docker             9.0 k
依存性関連でのインストールをします:
 libnvidia-container-tools  x86_64  1.4.0-1     libnvidia-container        43 k
 libnvidia-container1       x86_64  1.4.0-1     libnvidia-container        87 k
 nvidia-container-runtime   x86_64  3.5.0-1     nvidia-container-runtime  827 k
 nvidia-container-toolkit   x86_64  1.5.1-2     nvidia-container-runtime  764 k

トランザクションの要約
================================================================================
インストール  1 パッケージ (+4 個の依存関係のパッケージ)

総ダウンロード容量: 1.7 M
インストール容量: 4.6 M
Downloading packages:
(1/5): libnvidia-container-tools-1.4.0-1.x86_64.rpm        |  43 kB   00:00     
(2/5): nvidia-docker2-2.6.0-1.noarch.rpm                   | 9.0 kB   00:00     
(3/5): libnvidia-container1-1.4.0-1.x86_64.rpm             |  87 kB   00:00     
(4/5): nvidia-container-toolkit-1.5.1-2.x86_64.rpm         | 764 kB   00:00     
(5/5): nvidia-container-runtime-3.5.0-1.x86_64.rpm         | 827 kB   00:05     
--------------------------------------------------------------------------------
合計                                               288 kB/s | 1.7 MB  00:06     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  インストール中          : libnvidia-container1-1.4.0-1.x86_64             1/5 
  インストール中          : libnvidia-container-tools-1.4.0-1.x86_64        2/5 
  インストール中          : nvidia-container-toolkit-1.5.1-2.x86_64         3/5 
  インストール中          : nvidia-container-runtime-3.5.0-1.x86_64         4/5 
  インストール中          : nvidia-docker2-2.6.0-1.noarch                   5/5 
  検証中                  : nvidia-container-toolkit-1.5.1-2.x86_64         1/5 
  検証中                  : nvidia-container-runtime-3.5.0-1.x86_64         2/5 
  検証中                  : nvidia-docker2-2.6.0-1.noarch                   3/5 
  検証中                  : libnvidia-container1-1.4.0-1.x86_64             4/5 
  検証中                  : libnvidia-container-tools-1.4.0-1.x86_64        5/5 

インストール:
  nvidia-docker2.noarch 0:2.6.0-1                                               

依存性関連をインストールしました:
  libnvidia-container-tools.x86_64 0:1.4.0-1                                    
  libnvidia-container1.x86_64 0:1.4.0-1                                         
  nvidia-container-runtime.x86_64 0:3.5.0-1                                     
  nvidia-container-toolkit.x86_64 0:1.5.1-2                                     

完了しました!
[kkuro@E5800-T110f-E ~]$ sudo systemctl restart docker
[kkuro@E5800-T110f-E ~]$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete 
f7bfea53ad12: Pull complete 
46d371e02073: Pull complete 
b66c17bbf772: Pull complete 
3642f1a6dfb3: Pull complete 
e5ce55b8b4b9: Pull complete 
155bc0332b0a: Pull complete 
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.84       Driver Version: 460.84       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:01:00.0 N/A |                  N/A |
| 50%   57C    P0    N/A /  N/A |      0MiB /   980MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Good, it works just fine.

Let's try the TensorFlow Docker image too.

[kkuro@E5800-T110f-E ~]$ docker run --gpus all -it --rm --name tensorflow-gpu -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter
Unable to find image 'tensorflow/tensorflow:latest-gpu-py3-jupyter' locally
latest-gpu-py3-jupyter: Pulling from tensorflow/tensorflow
7ddbc47eeb70: Pulling fs layer 
c1bbdc448b72: Pulling fs layer 
8c3b70e39044: Pulling fs layer 
45d437916d57: Pulling fs layer 
d8f1569ddae6: Pulling fs layer 
85386706b020: Pulling fs layer 
ee9b457b77d0: Pulling fs layer 
bebfcc1316f7: Pulling fs layer 
644140fd95a9: Pull complete 
d6c0f989e873: Pull complete 
7a8e64f26211: Pull complete 
c33b03e4dd22: Pull complete 
bca93af797c1: Pull complete 
47f6c197be35: Pull complete 
e5da48aa9554: Pull complete 
ca68d98a90c4: Pull complete 
2059de27f7c8: Pull complete 
55d02aea1458: Pull complete 
32162ecb0c59: Pull complete 
47520dc72e8e: Pull complete 
3dafed94e1f2: Pull complete 
dc228e76e4f0: Pull complete 
2c6922dc5a5f: Pull complete 
a960e6d108fd: Pull complete 
6818a780ae00: Pull complete 
06dfbeeed7ba: Pull complete 
5890e026a0a0: Pull complete 
eeddfe30f3d2: Pull complete 
187170305445: Pull complete 
2e20a8960c42: Pull complete 
9f1bf726c909: Pull complete 
Digest: sha256:901b827b19d14aa0dd79ebbd45f410ee9dbfa209f6a4db71041b5b8ae144fea5
Status: Downloaded newer image for tensorflow/tensorflow:latest-gpu-py3-jupyter

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

[I 09:46:31.559 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
jupyter_http_over_ws extension initialized. Listening on /http_over_websocket
[I 09:46:31.739 NotebookApp] Serving notebooks from local directory: /tf
[I 09:46:31.739 NotebookApp] The Jupyter Notebook is running at:
[I 09:46:31.739 NotebookApp] http://dea93aef777e:8888/?token=95c9c68d822e8403e3a886e916eaf1ba3c7981e5cbb5f789
[I 09:46:31.739 NotebookApp]  or http://127.0.0.1:8888/?token=95c9c68d822e8403e3a886e916eaf1ba3c7981e5cbb5f789
[I 09:46:31.739 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:46:31.743 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://dea93aef777e:8888/?token=95c9c68d822e8403e3a886e916eaf1ba3c7981e5cbb5f789
     or http://127.0.0.1:8888/?token=95c9c68d822e8403e3a886e916eaf1ba3c7981e5cbb5f789

f:id:k-kuro:20210711193603p:plain

It starts up properly, Jupyter Notebook comes up in the web browser, and commands run.
f:id:k-kuro:20210711193315p:plain
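To double-check from inside the container that TensorFlow actually sees the GPU, a one-liner like this works (a sketch; tf.config.list_physical_devices assumes a TF 2.x image):

$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# expect something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]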

Incidentally, as the hostname suggests, this is a junk Haswell-era server (picked up for 2,000 yen) with a GeForce GT 710, about the lowest-end GPU there is, so in practice it's useless for real work.

Docker on CentOS 7

So far I've been trying to build deep learning environments on Ubuntu on a desktop server, but the times clearly call for Docker, and I had even gotten as far as installing Docker on Ubuntu. Still, Ubuntu is awfully hard to use as a server: unless it's a desktop machine with its own monitor attached, it's just awkward to work with remotely.
So I've come back to the familiar CentOS. I keep going back and forth between the two.
For the record, CentOS 7 was installed from the minimal image, then "GNOME Desktop" via groupinstall, plus linuxbrew just in case; that's the bare minimum of preparation, and everything else I plan to handle in Docker. As for installing linuxbrew on CentOS 7,
k-kuro.hatenadiary.jp
the procedure I wrote up here is still required.

[kkuro@E5800-T110f-E ~]$ su
Password: 
[root@E5800-T110f-E kkuro]# yum update
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
Resolving Dependencies
--> Running transaction check

〜〜中略〜〜

Complete!
[root@E5800-T110f-E kkuro]# yum upgrade
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
No packages marked for update
[root@E5800-T110f-E kkuro]# yum install -y yum-utils device-mapper-persistent-data lvm2
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Package device-mapper-persistent-data-0.8.5-3.el7_9.2.x86_64 already installed and latest version
Package 7:lvm2-2.02.187-6.el7_9.5.x86_64 already installed and latest version
Nothing to do
[root@E5800-T110f-E kkuro]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Loaded plugins: fastestmirror, langpacks
adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
grabbing file https://download.docker.com/linux/centos/docker-ce.repo to /etc/yum.repos.d/docker-ce.repo
repo saved to /etc/yum.repos.d/docker-ce.repo
[root@E5800-T110f-E kkuro]# yum install -y docker-ce docker-ce-cli containerd.io
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirrors.cat.net
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.cat.net
 * updates: mirrors.cat.net
docker-ce-stable                                         | 3.5 kB     00:00     
(1/2): docker-ce-stable/7/x86_64/updateinfo                |   55 B   00:00     
(2/2): docker-ce-stable/7/x86_64/primary_db                |  62 kB   00:00     
Resolving Dependencies
--> Running transaction check
---> Package containerd.io.x86_64 0:1.4.6-3.1.el7 will be installed
--> Processing Dependency: container-selinux >= 2:2.74 for package: containerd.io-1.4.6-3.1.el7.x86_64
---> Package docker-ce.x86_64 3:20.10.7-3.el7 will be installed
--> Processing Dependency: docker-ce-rootless-extras for package: 3:docker-ce-20.10.7-3.el7.x86_64
---> Package docker-ce-cli.x86_64 1:20.10.7-3.el7 will be installed
--> Processing Dependency: docker-scan-plugin(x86-64) for package: 1:docker-ce-cli-20.10.7-3.el7.x86_64
--> Running transaction check
---> Package container-selinux.noarch 2:2.119.2-1.911c772.el7_8 will be installed
---> Package docker-ce-rootless-extras.x86_64 0:20.10.7-3.el7 will be installed
--> Processing Dependency: fuse-overlayfs >= 0.7 for package: docker-ce-rootless-extras-20.10.7-3.el7.x86_64
--> Processing Dependency: slirp4netns >= 0.4 for package: docker-ce-rootless-extras-20.10.7-3.el7.x86_64
---> Package docker-scan-plugin.x86_64 0:0.8.0-3.el7 will be installed
--> Running transaction check
---> Package fuse-overlayfs.x86_64 0:0.7.2-6.el7_8 will be installed
--> Processing Dependency: libfuse3.so.3(FUSE_3.2)(64bit) for package: fuse-overlayfs-0.7.2-6.el7_8.x86_64
--> Processing Dependency: libfuse3.so.3(FUSE_3.0)(64bit) for package: fuse-overlayfs-0.7.2-6.el7_8.x86_64
--> Processing Dependency: libfuse3.so.3()(64bit) for package: fuse-overlayfs-0.7.2-6.el7_8.x86_64
---> Package slirp4netns.x86_64 0:0.4.3-4.el7_8 will be installed
--> Running transaction check
---> Package fuse3-libs.x86_64 0:3.6.1-4.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package                Arch   Version                   Repository        Size
================================================================================
Installing:
 containerd.io          x86_64 1.4.6-3.1.el7             docker-ce-stable  34 M
 docker-ce              x86_64 3:20.10.7-3.el7           docker-ce-stable  27 M
 docker-ce-cli          x86_64 1:20.10.7-3.el7           docker-ce-stable  33 M
Installing for dependencies:
 container-selinux      noarch 2:2.119.2-1.911c772.el7_8 extras            40 k
 docker-ce-rootless-extras
                        x86_64 20.10.7-3.el7             docker-ce-stable 9.2 M
 docker-scan-plugin     x86_64 0.8.0-3.el7               docker-ce-stable 4.2 M
 fuse-overlayfs         x86_64 0.7.2-6.el7_8             extras            54 k
 fuse3-libs             x86_64 3.6.1-4.el7               extras            82 k
 slirp4netns            x86_64 0.4.3-4.el7_8             extras            81 k

Transaction Summary
================================================================================
Install  3 Packages (+6 Dependent packages)

Total download size: 107 M
Installed size: 438 M
Downloading packages:
(1/9): container-selinux-2.119.2-1.911c772.el7_8.noarch.rp |  40 kB   00:00     
warning: /var/cache/yum/x86_64/7/docker-ce-stable/packages/containerd.io-1.4.6-3.1.el7.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Public key for containerd.io-1.4.6-3.1.el7.x86_64.rpm is not installed
(2/9): containerd.io-1.4.6-3.1.el7.x86_64.rpm              |  34 MB   00:01     
(3/9): docker-ce-20.10.7-3.el7.x86_64.rpm                  |  27 MB   00:01     
(4/9): docker-ce-rootless-extras-20.10.7-3.el7.x86_64.rpm  | 9.2 MB   00:00     
(5/9): fuse-overlayfs-0.7.2-6.el7_8.x86_64.rpm             |  54 kB   00:00     
(6/9): fuse3-libs-3.6.1-4.el7.x86_64.rpm                   |  82 kB   00:00     
(7/9): slirp4netns-0.4.3-4.el7_8.x86_64.rpm                |  81 kB   00:00     
(8/9): docker-scan-plugin-0.8.0-3.el7.x86_64.rpm           | 4.2 MB   00:00     
(9/9): docker-ce-cli-20.10.7-3.el7.x86_64.rpm              |  33 MB   00:01     
--------------------------------------------------------------------------------
Total                                               38 MB/s | 107 MB  00:02     
Retrieving key from https://download.docker.com/linux/centos/gpg
Importing GPG key 0x621E9F35:
 Userid     : "Docker Release (CE rpm) <docker@docker.com>"
 Fingerprint: 060a 61c5 1b55 8a7f 742b 77aa c52f eb6b 621e 9f35
 From       : https://download.docker.com/linux/centos/gpg
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : 2:container-selinux-2.119.2-1.911c772.el7_8.noarch           1/9 
  Installing : containerd.io-1.4.6-3.1.el7.x86_64                           2/9 
  Installing : 1:docker-ce-cli-20.10.7-3.el7.x86_64                         3/9 
  Installing : docker-scan-plugin-0.8.0-3.el7.x86_64                        4/9 
  Installing : slirp4netns-0.4.3-4.el7_8.x86_64                             5/9 
  Installing : fuse3-libs-3.6.1-4.el7.x86_64                                6/9 
  Installing : fuse-overlayfs-0.7.2-6.el7_8.x86_64                          7/9 
  Installing : docker-ce-rootless-extras-20.10.7-3.el7.x86_64               8/9 
  Installing : 3:docker-ce-20.10.7-3.el7.x86_64                             9/9 
  Verifying  : containerd.io-1.4.6-3.1.el7.x86_64                           1/9 
  Verifying  : fuse3-libs-3.6.1-4.el7.x86_64                                2/9 
  Verifying  : docker-scan-plugin-0.8.0-3.el7.x86_64                        3/9 
  Verifying  : slirp4netns-0.4.3-4.el7_8.x86_64                             4/9 
  Verifying  : 2:container-selinux-2.119.2-1.911c772.el7_8.noarch           5/9 
  Verifying  : 3:docker-ce-20.10.7-3.el7.x86_64                             6/9 
  Verifying  : 1:docker-ce-cli-20.10.7-3.el7.x86_64                         7/9 
  Verifying  : docker-ce-rootless-extras-20.10.7-3.el7.x86_64               8/9 
  Verifying  : fuse-overlayfs-0.7.2-6.el7_8.x86_64                          9/9 

Installed:
  containerd.io.x86_64 0:1.4.6-3.1.el7     docker-ce.x86_64 3:20.10.7-3.el7    
  docker-ce-cli.x86_64 1:20.10.7-3.el7    

Dependency Installed:
  container-selinux.noarch 2:2.119.2-1.911c772.el7_8                            
  docker-ce-rootless-extras.x86_64 0:20.10.7-3.el7                              
  docker-scan-plugin.x86_64 0:0.8.0-3.el7                                       
  fuse-overlayfs.x86_64 0:0.7.2-6.el7_8                                         
  fuse3-libs.x86_64 0:3.6.1-4.el7                                               
  slirp4netns.x86_64 0:0.4.3-4.el7_8                                            

Complete!
[root@E5800-T110f-E kkuro]# systemctl start docker
[root@E5800-T110f-E kkuro]# systemctl enable docker
Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.
[root@E5800-T110f-E kkuro]# docker --version
Docker version 20.10.7, build f0df350
[root@E5800-T110f-E kkuro]# docker version
Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:58:10 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:56:35 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
[root@E5800-T110f-E kkuro]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete 
Digest: sha256:df5f5184104426b65967e016ff2ac0bfcd44ad7899ca3bbcf8e44e4461491a9e
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

With that, Docker is ready to use.
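A common follow-up, not shown above, is adding the regular user to the docker group so that every docker command doesn't need sudo or a root shell:

$ sudo usermod -aG docker kkuro    # takes effect after logging in again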

I bought a reference book on Docker. It may be a bit too beginner-oriented; for anyone who uses Linux daily, the explanations start from such basics that they can feel tedious. It is essentially aimed at Windows users (or Mac users who don't use their Mac in a Unix-like way).
There are some Linux-based instructions, but basically only for Ubuntu; CentOS/RHEL-type distributions aren't covered, so I had to look for that on the net.
【CentOS7】Dockerインストールと起動 | インフラエンジニアの技術LOG
I used this blog post as a reference.

Getting started with React JS

I just finally got the front end and back end talking with JSON and Ajax, and, feeling encouraged, I'm now dipping into React as well.

First question: what even is npm?
Turns out you start by installing something called Node.js.

$ sudo yum install centos-release-scl-rh
$ sudo yum install rh-nodejs10
$ scl enable rh-nodejs10 bash
$ which node
/opt/rh/rh-nodejs10/root/usr/bin/node
$ node --version
v10.21.0
$ node
> console.log('Node is running');
Node is running
undefined
> .help
.break    Sometimes you get stuck, this gets you out
.clear    Alias for .break
.editor   Enter editor mode
.exit     Exit the repl
.help     Print this help message
.load     Load JS from a file into the REPL session
.save     Save all evaluated commands in this REPL session to a file
> .exit
$ which npm
/opt/rh/rh-nodejs10/root/usr/bin/npm
$ npm --version
6.14.4

That should do it.

Let's try installing a module.

$ npm install --save react-calendar-timeline
npm WARN saveError ENOENT: no such file or directory, open '/home/kkuro/database/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/home/kkuro/database/package.json'
npm WARN react-calendar-timeline@0.27.0 requires a peer of interactjs@^1.3.4 but none is installed. You must install peer dependencies yourself.
npm WARN react-calendar-timeline@0.27.0 requires a peer of moment@* but none is installed. You must install peer dependencies yourself.
npm WARN react-calendar-timeline@0.27.0 requires a peer of prop-types@^15.6.2 but none is installed. You must install peer dependencies yourself.
npm WARN react-calendar-timeline@0.27.0 requires a peer of react@>=16.3 but none is installed. You must install peer dependencies yourself.
npm WARN react-calendar-timeline@0.27.0 requires a peer of react-dom@>=16.3 but none is installed. You must install peer dependencies yourself.
npm WARN create-react-context@0.3.0 requires a peer of prop-types@^15.0.0 but none is installed. You must install peer dependencies yourself.
npm WARN create-react-context@0.3.0 requires a peer of react@^0.14.0 || ^15.0.0 || ^16.0.0 but none is installed. You must install peer dependencies yourself.
npm WARN database No description
npm WARN database No repository field.
npm WARN database No README data
npm WARN database No license field.

react-calendar-timeline
This is the one I want to use.
Is this really how you install it? It's spewing warnings everywhere.
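The warnings themselves point at the fix: there is no package.json in this directory (hence the ENOENT warnings), and the listed peer dependencies are not installed. A sketch of the usual sequence:

$ npm init -y                      # create a package.json so npm has somewhere to record dependencies
$ npm install --save react react-dom prop-types moment interactjs
$ npm install --save react-calendar-timeline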

In any case, I should start with a simpler example first.