I figured I had to at least give it a try.
The problem is, I don't have a decent GPU.
The environment I tested on is as follows:
CPU: Xeon E3-1230V6
Memory: 64GB
Storage: 2TB (2TB x2, RAID1) + 8TB (4TB x2, RAID0)
GPU: GeForce GT710 (1GB)
First, running it without thinking too hard about anything:
$ python3 docker/run_docker.py --fasta_paths=/mnt/fasta/test.fasta --max_template_date=2021-07-23
I0724 13:05:47.793096 139963287299840 run_docker.py:114] Mounting /mnt/ts5400r/rnaseq/fasta -> /mnt/fasta_path_0
I0724 13:05:47.793220 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniref90 -> /mnt/uniref90_database_path
I0724 13:05:47.793289 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/mgnify -> /mnt/mgnify_database_path
I0724 13:05:47.793348 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniclust30/uniclust30_2018_08 -> /mnt/uniclust30_database_path
I0724 13:05:47.793409 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/bfd -> /mnt/bfd_database_path
I0724 13:05:47.793467 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb70 -> /mnt/pdb70_database_path
I0724 13:05:47.793522 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold -> /mnt/data_dir
I0724 13:05:47.793577 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/template_mmcif_dir
I0724 13:05:47.793634 139963287299840 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0724 13:05:50.641787 139963287299840 run_docker.py:180] /opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:206: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
I0724 13:05:50.641920 139963287299840 run_docker.py:180] 'command line!' % flag_name)
I0724 13:05:52.431665 139963287299840 run_docker.py:180] I0724 13:05:52.431034 140038903707456 templates.py:837] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0724 13:05:53.667072 139963287299840 run_docker.py:180] I0724 13:05:53.666388 140038903707456 tpu_client.py:54] Starting the local TPU driver.
I0724 13:05:53.667418 139963287299840 run_docker.py:180] I0724 13:05:53.666860 140038903707456 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0724 13:05:53.907798 139963287299840 run_docker.py:180] I0724 13:05:53.907007 140038903707456 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0724 13:05:54.334778 139963287299840 run_docker.py:180] 2021-07-24 13:05:54.334259: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 4114612224 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:54.919834 139963287299840 run_docker.py:180] 2021-07-24 13:05:54.919301: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3703150848 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:55.505137 139963287299840 run_docker.py:180] 2021-07-24 13:05:55.504621: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3332835584 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:56.089438 139963287299840 run_docker.py:180] 2021-07-24 13:05:56.088938: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2999552000 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:56.677316 139963287299840 run_docker.py:180] 2021-07-24 13:05:56.676777: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2699596800 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:57.265036 139963287299840 run_docker.py:180] 2021-07-24 13:05:57.264540: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2429637120 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:57.851714 139963287299840 run_docker.py:180] 2021-07-24 13:05:57.851303: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 2186673408 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:58.437325 139963287299840 run_docker.py:180] 2021-07-24 13:05:58.436849: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1968006144 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:59.023176 139963287299840 run_docker.py:180] 2021-07-24 13:05:59.022710: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1771205632 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:05:59.607919 139963287299840 run_docker.py:180] 2021-07-24 13:05:59.607428: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1594085120 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:00.190412 139963287299840 run_docker.py:180] 2021-07-24 13:06:00.189992: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1434676736 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:00.775824 139963287299840 run_docker.py:180] 2021-07-24 13:06:00.775335: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1291209216 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:01.346099 139963287299840 run_docker.py:180] 2021-07-24 13:06:01.345547: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1162088448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:01.931352 139963287299840 run_docker.py:180] 2021-07-24 13:06:01.930887: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 1045879552 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:05.746232 139963287299840 run_docker.py:180] 2021-07-24 13:06:05.745756: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:05.808895 139963287299840 run_docker.py:180] 2021-07-24 13:06:05.808401: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.837229 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.836805: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.873901 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.873463: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 3173320448 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
I0724 13:06:15.874069 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.873510: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 192.0KiB (rounded to 196608)requested by op
I0724 13:06:15.874446 139963287299840 run_docker.py:180] 2021-07-24 13:06:15.874185: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:468] ****************************************************************************************************
I0724 13:06:15.876205 139963287299840 run_docker.py:180] Traceback (most recent call last):
I0724 13:06:15.876372 139963287299840 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 303, in <module>
I0724 13:06:15.876452 139963287299840 run_docker.py:180] app.run(main)
I0724 13:06:15.876526 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0724 13:06:15.876597 139963287299840 run_docker.py:180] _run_main(main, args)
I0724 13:06:15.876669 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0724 13:06:15.876739 139963287299840 run_docker.py:180] sys.exit(main(argv))
I0724 13:06:15.876816 139963287299840 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 256, in main
I0724 13:06:15.876888 139963287299840 run_docker.py:180] model_name=model_name, data_dir=FLAGS.data_dir)
I0724 13:06:15.876960 139963287299840 run_docker.py:180] File "/app/alphafold/alphafold/model/data.py", line 41, in get_model_haiku_params
I0724 13:06:15.877031 139963287299840 run_docker.py:180] return utils.flat_params_to_haiku(params)
I0724 13:06:15.877101 139963287299840 run_docker.py:180] File "/app/alphafold/alphafold/model/utils.py", line 79, in flat_params_to_haiku
I0724 13:06:15.877170 139963287299840 run_docker.py:180] hk_params[scope][name] = jnp.array(array)
I0724 13:06:15.877241 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py", line 3044, in array
I0724 13:06:15.877312 139963287299840 run_docker.py:180] out = _device_put_raw(object, weak_type=weak_type)
I0724 13:06:15.877383 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 1607, in _device_put_raw
I0724 13:06:15.877454 139963287299840 run_docker.py:180] return xla.array_result_handler(None, aval)(*xla.device_put(x))
I0724 13:06:15.877524 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 146, in device_put
I0724 13:06:15.877595 139963287299840 run_docker.py:180] return device_put_handlers[type(x)](x, device)
I0724 13:06:15.877666 139963287299840 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 154, in _device_put_array
I0724 13:06:15.877737 139963287299840 run_docker.py:180] return (backend.buffer_from_pyval(x, device),)
I0724 13:06:15.877811 139963287299840 run_docker.py:180] RuntimeError: Resource exhausted: Out of memory while trying to allocate 196608 bytes.
The GT710 has only 1GB of memory, so it runs out of GPU memory and the job dies.
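Reading the traceback, the crash happens while the model parameters are being copied onto the device (get_model_haiku_params -> flat_params_to_haiku -> jnp.array), and the ~4GB "unified memory" allocations it tries first look like roughly 4x the card's 1GB, which, if I'm reading docker/run_docker.py right, comes from the TF_FORCE_UNIFIED_MEMORY / XLA_PYTHON_CLIENT_MEM_FRACTION settings it passes into the container. Either way this card is hopeless; here's a quick sanity check of my own (not part of AlphaFold, paths are from my setup) just to see how much parameter data has to fit:

# Rough check: total size of the downloaded model parameter files that have to
# be loaded onto the device. data_dir is the --data_dir mounted above.
import os

data_dir = "/mnt/RAID0_8T/alphafold"
params_dir = os.path.join(data_dir, "params")  # params_model_1.npz etc.

total = 0
for name in sorted(os.listdir(params_dir)):
    size = os.path.getsize(os.path.join(params_dir, name))
    total += size
    print(f"{name}: {size / 2**20:.1f} MiB")
print(f"total: {total / 2**30:.2f} GiB")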
So I looked into how to run it without a GPU, but couldn't find a definitive answer. (Then again, there probably aren't many people eccentric enough to insist on doing deep learning without a GPU.)
I hunted through the launcher script (docker/run_docker.py) for a likely-looking setting and flipped it from True to False:
flags.DEFINE_bool('use_gpu', False, 'Enable NVIDIA runtime to run with GPUs.')
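For the record, since use_gpu is an absl flag, it should also be possible to pass it on the command line instead of editing the script, something like:

$ python3 docker/run_docker.py --use_gpu=False --fasta_paths=/mnt/fasta/test.fasta --max_template_date=2021-07-24

but I just went with the True -> False edit above.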
$ python3 docker/run_docker.py --fasta_paths=/mnt/fasta/test.fasta --max_template_date=2021-07-24
I0724 14:36:03.430523 140479510705920 run_docker.py:114] Mounting /mnt/ts5400r/rnaseq/fasta -> /mnt/fasta_path_0
I0724 14:36:03.430647 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniref90 -> /mnt/uniref90_database_path
I0724 14:36:03.430718 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/mgnify -> /mnt/mgnify_database_path
I0724 14:36:03.430775 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/uniclust30/uniclust30_2018_08 -> /mnt/uniclust30_database_path
I0724 14:36:03.430836 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/bfd -> /mnt/bfd_database_path
I0724 14:36:03.430891 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb70 -> /mnt/pdb70_database_path
I0724 14:36:03.430948 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold -> /mnt/data_dir
I0724 14:36:03.431000 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/template_mmcif_dir
I0724 14:36:03.431055 140479510705920 run_docker.py:114] Mounting /mnt/RAID0_8T/alphafold/database/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0724 14:36:07.806614 140479510705920 run_docker.py:180] /opt/conda/lib/python3.7/site-packages/absl/flags/_validators.py:206: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
I0724 14:36:07.806749 140479510705920 run_docker.py:180] 'command line!' % flag_name)
I0724 14:36:08.187804 140479510705920 run_docker.py:180] I0724 14:36:08.187144 140639490905920 templates.py:837] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0724 14:36:09.709877 140479510705920 run_docker.py:180] I0724 14:36:09.709128 140639490905920 tpu_client.py:54] Starting the local TPU driver.
I0724 14:36:09.710193 140479510705920 run_docker.py:180] I0724 14:36:09.709654 140639490905920 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0724 14:36:09.710691 140479510705920 run_docker.py:180] 2021-07-24 14:36:09.710255: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I0724 14:36:09.710975 140479510705920 run_docker.py:180] 2021-07-24 14:36:09.710307: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0724 14:36:09.711264 140479510705920 run_docker.py:180] I0724 14:36:09.710498 140639490905920 xla_bridge.py:214] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0724 14:36:09.711437 140479510705920 run_docker.py:180] I0724 14:36:09.710731 140639490905920 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0724 14:36:09.711596 140479510705920 run_docker.py:180] W0724 14:36:09.710883 140639490905920 xla_bridge.py:217] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0724 14:36:14.730792 140479510705920 run_docker.py:180] I0724 14:36:14.730233 140639490905920 run_alphafold.py:261] Have 5 models: ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
I0724 14:36:14.730963 140479510705920 run_docker.py:180] I0724 14:36:14.730385 140639490905920 run_alphafold.py:273] Using random seed 6370842927624696923 for the data pipeline
I0724 14:36:14.732356 140479510705920 run_docker.py:180] I0724 14:36:14.732125 140639490905920 jackhmmer.py:130] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmppl2_77oq/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/test.fasta /mnt/uniref90_database_path/uniref90.fasta"
I0724 14:36:14.762296 140479510705920 run_docker.py:180] I0724 14:36:14.761586 140639490905920 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0724 14:42:12.103540 140479510705920 run_docker.py:180] I0724 14:42:12.101704 140639490905920 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 357.340 seconds
I0724 14:42:12.168576 140479510705920 run_docker.py:180] I0724 14:42:12.168016 140639490905920 jackhmmer.py:130] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmphgl1pgx2/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/test.fasta /mnt/mgnify_database_path/mgy_clusters.fa"
I0724 14:42:12.208694 140479510705920 run_docker.py:180] I0724 14:42:12.208099 140639490905920 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query
I0724 14:49:14.646541 140479510705920 run_docker.py:180] I0724 14:49:14.644667 140639490905920 utils.py:40] Finished Jackhmmer (mgy_clusters.fa) query in 422.436 seconds
I0724 14:49:17.353891 140479510705920 run_docker.py:180] I0724 14:49:17.353327 140639490905920 hhsearch.py:76] Launching subprocess "/usr/bin/hhsearch -i /tmp/tmp4yri1ytb/query.a3m -o /tmp/tmp4yri1ytb/output.hhr -maxseq 1000000 -d /mnt/pdb70_database_path/pdb70"
I0724 14:49:17.381958 140479510705920 run_docker.py:180] I0724 14:49:17.381289 140639490905920 utils.py:36] Started HHsearch query
I0724 14:49:17.668585 140479510705920 run_docker.py:180] I0724 14:49:17.667990 140639490905920 utils.py:40] Finished HHsearch query in 0.286 seconds
I0724 14:49:17.670548 140479510705920 run_docker.py:180] Traceback (most recent call last):
I0724 14:49:17.670647 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 303, in <module>
I0724 14:49:17.670720 140479510705920 run_docker.py:180] app.run(main)
I0724 14:49:17.670816 140479510705920 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0724 14:49:17.670883 140479510705920 run_docker.py:180] _run_main(main, args)
I0724 14:49:17.670947 140479510705920 run_docker.py:180] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0724 14:49:17.671009 140479510705920 run_docker.py:180] sys.exit(main(argv))
I0724 14:49:17.671072 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 285, in main
I0724 14:49:17.671134 140479510705920 run_docker.py:180] random_seed=random_seed)
I0724 14:49:17.671196 140479510705920 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 129, in predict_structure
I0724 14:49:17.671258 140479510705920 run_docker.py:180] msa_output_dir=msa_output_dir)
I0724 14:49:17.671319 140479510705920 run_docker.py:180] File "/app/alphafold/alphafold/data/pipeline.py", line 141, in process
I0724 14:49:17.671381 140479510705920 run_docker.py:180] hhsearch_result = self.hhsearch_pdb70_runner.query(uniref90_msa_as_a3m)
I0724 14:49:17.671443 140479510705920 run_docker.py:180] File "/app/alphafold/alphafold/data/tools/hhsearch.py", line 87, in query
I0724 14:49:17.671504 140479510705920 run_docker.py:180] stdout.decode('utf-8'), stderr[:100_000].decode('utf-8')))
I0724 14:49:17.671566 140479510705920 run_docker.py:180] RuntimeError: HHSearch failed:
I0724 14:49:17.671627 140479510705920 run_docker.py:180] stdout:
I0724 14:49:17.671689 140479510705920 run_docker.py:180]
I0724 14:49:17.671751 140479510705920 run_docker.py:180]
I0724 14:49:17.671818 140479510705920 run_docker.py:180] stderr:
I0724 14:49:17.671881 140479510705920 run_docker.py:180] - 14:49:17.586 INFO: /tmp/tmp4yri1ytb/query.a3m is in A2M, A3M or FASTA format
I0724 14:49:17.671943 140479510705920 run_docker.py:180]
I0724 14:49:17.672005 140479510705920 run_docker.py:180] - 14:49:17.587 WARNING: Ignoring invalid symbol '*' at pos. 492 in line 2 of /tmp/tmp4yri1ytb/query.a3m
I0724 14:49:17.672068 140479510705920 run_docker.py:180]
I0724 14:49:17.672129 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: [subseq from] Endoglucanase (Fragment) n=2 Tax=Citrus unshiu TaxID=55188 RepID=A0A2H5PNG1_CITUN
I0724 14:49:17.672191 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:
I0724 14:49:17.672252 140479510705920 run_docker.py:180]
I0724 14:49:17.672313 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: sequences in /tmp/tmp4yri1ytb/query.a3m do not all have the same number of columns,
I0724 14:49:17.672375 140479510705920 run_docker.py:180]
I0724 14:49:17.672436 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR:
I0724 14:49:17.672498 140479510705920 run_docker.py:180] e.g. first sequence and sequence UniRef90_A0A2H5PNG1/69-549.
I0724 14:49:17.672560 140479510705920 run_docker.py:180]
I0724 14:49:17.672621 140479510705920 run_docker.py:180] - 14:49:17.662 ERROR: Check input format for '-M a2m' option and consider using '-M first' or '-M 50'
I0724 14:49:17.672683 140479510705920 run_docker.py:180]
I0724 14:49:17.672744 140479510705920 run_docker.py:180]
I0724 14:49:17.672822 140479510705920 run_docker.py:180]
This gets much further (the jackhmmer searches run fine on the CPU), but then hits a different error: HHsearch fails, complaining that the sequences in query.a3m do not all have the same number of columns. I couldn't troubleshoot this one.
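The one hint in the stderr is the warning about the invalid '*' at position 492 of the query, so my (unverified) guess is that test.fasta is a translated sequence that still ends in a stop-codon '*', which jackhmmer passes through but which throws the a3m column count off for hhsearch. If that is the cause, stripping it before rerunning would be trivial; a minimal sketch of my own (the output path is just an example):

# Remove '*' characters (stop codons) from the sequence lines of a protein FASTA.
in_path = "/mnt/fasta/test.fasta"
out_path = "/mnt/fasta/test_noasterisk.fasta"

with open(in_path) as src, open(out_path, "w") as dst:
    for line in src:
        if line.startswith(">"):
            dst.write(line)  # keep headers as-is
        else:
            dst.write(line.strip().replace("*", "") + "\n")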
I guess this just isn't going to work without at least something like an RTX 2070. But to get there I'd have to rebuild the machine starting from the power supply and motherboard; my current box can't take it.
For now I'll have to make do with the simplified service on Google Colab.
Incidentally:
colab.research.google.com
A new version is already out.
It looks like this.