kuro's notes


Building a deep-learning training environment on a Fujitsu PRIMERGY RX300S7 (2)

Continued from the previous post.

Next, install Anaconda.

At first I simply downloaded the installer from the Anaconda site and installed it with

$ bash Anaconda3-5.3.1-Linux-x86_64.sh

and then installed tensorflow with pip... but that led straight into a pit: CentOS 7 ships glibc 2.17, while tensorflow demands glibc 2.23. I tried supplying a separate glibc and pointing LD_LIBRARY_PATH at it, but it never felt clean and looked likely to wreck the system, so I decided to isolate Anaconda itself under pyenv.

Anaconda goes under pyenv, kept separate from the system python2.7 / python3:

$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
$ source ~/.bashrc

List the available versions and install the latest Anaconda:

$ pyenv install -l | grep anaconda

$ pyenv install anaconda3-5.3.1
$ pyenv rehash

$ pyenv global anaconda3-5.3.1 # make anaconda the default python
$ echo 'export PATH="$PYENV_ROOT/versions/anaconda3-5.3.1/bin/:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
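To confirm that the pyenv-managed interpreter is now the one on PATH, a quick check (my own throwaway snippet, not from any install guide) is:

```python
import sys

# After `source ~/.bashrc` the active interpreter should live under
# ~/.pyenv rather than /usr/bin.
print(sys.executable)
```

If this still prints a /usr/bin path, the PATH lines above did not take effect in the current shell.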

$ conda update conda

Apparently the CUDA toolkit and cuDNN can be installed straight from conda:

$ conda install cudatoolkit
$ conda install cudnn

Now for tensorflow-gpu. Apparently it doesn't run on Python 3.7 (see toshioblog.com), so drop the Python that Anaconda uses down to 3.6:

$ conda install python=3.6
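A quick sanity check that the downgrade took (again a throwaway snippet of mine, not part of the official procedure):

```python
import sys

# The conda tensorflow-gpu build used here wants Python 3.6.
version = ".".join(map(str, sys.version_info[:2]))
print("running on Python", version)
if sys.version_info[:2] > (3, 6):
    print("warning: tensorflow-gpu may not have a build for this version")
```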

Install tensorflow via conda:

$ conda install tensorflow-gpu

Start python and check that tensorflow actually went in:

$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2019-04-18 09:27:07.652574: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2019-04-18 09:27:07.786044: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000084999 Hz
2019-04-18 09:27:07.792889: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564bc4bd80c0 executing computations on platform Host. Devices:
2019-04-18 09:27:07.793038: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-18 09:27:08.049509: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-18 09:27:08.050256: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564bc4cc3680 executing computations on platform CUDA. Devices:
2019-04-18 09:27:08.050307: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GT 710, Compute Capability 3.5
2019-04-18 09:27:08.050639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GT 710 major: 3 minor: 5 memoryClockRate(GHz): 0.954
pciBusID: 0000:03:00.0
totalMemory: 980.94MiB freeMemory: 958.69MiB
2019-04-18 09:27:08.050682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-18 09:27:08.060169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-18 09:27:08.060206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-18 09:27:08.060226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-18 09:27:08.060495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 733 MB memory) -> physical GPU (device: 0, name: GeForce GT 710, pci bus id: 0000:03:00.0, compute capability: 3.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2232806157292722847
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 16136033629533297001
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1820044920749343627
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 769327104
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12295195253859629574
physical_device_desc: "device: 0, name: GeForce GT 710, pci bus id: 0000:03:00.0, compute capability: 3.5"
]

The GPU is recognized correctly.

Now that the environment looks ready, let's put in keras and test it.
Installing keras:

$ pip install keras

Clone the keras repository and run the mnist_cnn.py example:

$ git clone https://github.com/fchollet/keras.git
$ cd keras/examples
$ python mnist_cnn.py
Using TensorFlow backend.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
11493376/11490434 [==============================] - 5s 0us/step
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
WARNING:tensorflow:From /home/kkuro2/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/kkuro2/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /home/kkuro2/.pyenv/versions/anaconda3-5.3.1/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2019-04-18 09:35:51.539905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2019-04-18 09:35:51.551448: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000084999 Hz
2019-04-18 09:35:51.552561: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56298dfd6eb0 executing computations on platform Host. Devices:
2019-04-18 09:35:51.552623: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-18 09:35:51.683502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-18 09:35:51.684277: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56298e0c23d0 executing computations on platform CUDA. Devices:
2019-04-18 09:35:51.684383: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GT 710, Compute Capability 3.5
2019-04-18 09:35:51.684947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GT 710 major: 3 minor: 5 memoryClockRate(GHz): 0.954
pciBusID: 0000:03:00.0
totalMemory: 980.94MiB freeMemory: 958.69MiB
2019-04-18 09:35:51.685056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-18 09:35:51.686590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-18 09:35:51.686655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-18 09:35:51.686696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-18 09:35:51.687099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 733 MB memory) -> physical GPU (device: 0, name: GeForce GT 710, pci bus id: 0000:03:00.0, compute capability: 3.5)
2019-04-18 09:35:57.798268: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
  128/60000 [..............................] - ETA: 1:30:39 - loss: 2.2983 - acc
  256/60000 [..............................] - ETA: 45:50 - loss: 2.2962 - acc:

(snip)

59776/60000 [============================>.] - ETA: 0s - loss: 0.0267 - acc: 0.9
59904/60000 [============================>.] - ETA: 0s - loss: 0.0269 - acc: 0.9
60000/60000 [==============================] - 76s 1ms/step - loss: 0.0271 - acc: 0.9917 - val_loss: 0.0344 - val_acc: 0.9881
Test loss: 0.03438230963442984
Test accuracy: 0.9881

And with that the test completes. Everything seems to be working.

The 12 epochs of training brought the loss down from
loss: 2.2962
to
loss: 0.0271

All's well that ends well.

batch_size = 128
num_classes = 10
epochs = 12

With GPU: 00:15:28

I'll need to test how much difference the GPU actually makes.
That said, the GT 710's 1 GB of memory seems a bit short: a "Running low on GPU memory" warning kept appearing, which probably dragged things down considerably.
f:id:k-kuro:20190419101052p:plain


Addendum
Incidentally, the CPU has 6 cores / 12 threads, but only about one core's worth was in use. The server itself carries only 4 GB of RAM, and usage sat around 64%, so that seems to be plenty.
f:id:k-kuro:20190419101752p:plain

To deliberately run on the CPU alone, skipping the GPU:

$ export CUDA_VISIBLE_DEVICES=""
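The same thing can be done from inside a script, as long as the variable is set before tensorflow is first imported (a sketch, not from the original run):

```python
import os

# Hide all CUDA devices. This must run before the first `import tensorflow`,
# because TensorFlow enumerates GPUs at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf  # would now see no GPUs
```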

A test run with epoch=1 gave:
With GPU: 79 s
Without GPU: 108 s
A 36% speedup (heh). As you'd expect from a low-end GPU, the gap is only about that much.
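For the record, that speedup figure works out as:

```python
gpu_s, cpu_s = 79, 108

# GPU speedup relative to the CPU-only run time
speedup = (cpu_s - gpu_s) / gpu_s * 100
print(f"{speedup:.1f}%")  # → 36.7%
```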
# At this rate, wouldn't CPU-only on my other server with two 6-core/12-thread CPUs actually be faster?
The GPU memory shortage is probably the fatal part. Some tuning may be needed, such as lowering batch_size.
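One standard knob for a cramped GPU is to stop TensorFlow from grabbing all of the memory up front. This is a TF 1.x session-config fragment (matching the versions in the logs above) that I have not benchmarked on this box:

```python
import tensorflow as tf
from keras import backend as K

# Allocate GPU memory on demand instead of reserving all ~1 GB at once.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))

# In mnist_cnn.py, also try e.g. batch_size = 64 instead of 128.
```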


Incidentally, when run without the GPU,
f:id:k-kuro:20190419105829p:plain
the CPU is indeed running flat out.

When finished,

$ unset CUDA_VISIBLE_DEVICES

to switch back to using the GPU.