Background

I was recently experimenting with some models implemented in TensorFlow 1.x. However, when trying to run them on a machine with CUDA 10.1, TensorFlow had trouble locating the libcu*.so.10.0 shared libraries:

2020-05-09 00:33:15.100129: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.100216: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.100317: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.100408: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.100470: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.100541: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-05-09 00:33:15.120235: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-09 00:33:15.120269: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
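Before rebuilding anything, it is worth confirming which CUDA runtime the machine actually has. A minimal check (the output will of course depend on your machine):

```shell
# List the libcudart versions the dynamic linker knows about.
# On a CUDA 10.1 machine this typically shows libcudart.so.10.1
# and no libcudart.so.10.0, matching the warnings above.
ldconfig -p | grep libcudart || echo "no libcudart registered"

# Print the toolkit version string, if the standard install path exists.
cat /usr/local/cuda/version.txt 2>/dev/null || true
```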

This article provides some information on compatibility between TensorFlow, CUDA, and cuDNN versions; the table below is taken from that page.

[Image: tf_compat.png — TensorFlow tested build configurations table]

From the table, the prebuilt TF 1.15 packages are built against CUDA 10.0, so they will not work with CUDA 10.1. Hence, we will have to build TF 1.15 ourselves.

Installing Prerequisites

To build TensorFlow 1.15, we begin with the development docker image nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04, which we can run with the following command:

docker run --gpus all \
        -v tensorflow_build:/mnt \
        -v tmp:/root \
        --shm-size=8G \
        -it nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 \
        bash

We mount two directories: one for storing the build files, and one to which bazel's cache files will be written. Under the default configuration, the container had 10GB allocated, and the directory /root/.cache/bazel alone can grow beyond 6GB, which can easily exceed the container's limit and cause the build to fail.

Inside the container, we first install all the required packages:

apt update
apt install -y python3 python3-pip python3-dev git wget unzip
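Besides the apt packages, TensorFlow's configure and build scripts need a handful of Python packages. The list below follows the official build-from-source guide for TF 1.x; treat the exact set as a sketch that may vary between releases:

```shell
# Python-level build dependencies (per the TF 1.x build guide; adjust as needed).
python3 -m pip install -U pip six numpy wheel setuptools mock
# keras_applications/keras_preprocessing are imported during the build,
# but must be installed without pulling in a prebuilt tensorflow dependency.
python3 -m pip install -U keras_applications --no-deps
python3 -m pip install -U keras_preprocessing --no-deps
```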

We can then clone the TensorFlow repository into the directory we mounted and check out the r1.15 branch.

cd /mnt
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r1.15

Then, we download and install bazel.

BAZEL_VERSION="0.26.0"
wget https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
chmod +x bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh
./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh

Part of the build process requires a python binary, but installing python3 does not provide one under that name. To solve this, we symlink the binary as follows:

ln -s /usr/bin/python3 /usr/bin/python

Building TensorFlow

Now, we can start the actual build process. We first run ./configure inside the tensorflow directory. It will ask whether we want to build with CUDA support, to which we answer yes; all other options can be left at their defaults.

root@2be0159ae22a:/mnt/tensorflow# ./configure
....
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

At this point, the configuration script should find the paths to the cuda libraries:

Found CUDA 10.1 in:
    /usr/local/cuda/lib64
    /usr/local/cuda/include
Found cuDNN 7 in:
    /usr/lib/x86_64-linux-gnu
    /usr/include

Now, to begin the actual compilation, we run the following command:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

The build should begin now. For me, on a 16-core EPYC machine, it took around 6 hours; the duration will vary greatly depending on your hardware.

When the compilation completes, you should see something like this:

Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 21506.854s, Critical Path: 1549.28s
INFO: 23749 processes: 23749 local.
INFO: Build completed successfully, 30612 total actions

Generating the whl file

According to the official documentation, bazel creates a binary called build_pip_package, which we now run to generate the whl file.

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Our whl file should be created inside /tmp/tensorflow_pkg. We can now copy this file elsewhere or install it on our system.

At this point, we are done. You can install the whl file with:

pip install tensorflow*.whl

After installing, you can verify GPU support with the tf.test.is_gpu_available function in TensorFlow.
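A minimal smoke test, assuming the wheel has been installed on a machine where the CUDA 10.1 libraries are visible:

```python
# Smoke test: confirm the freshly built wheel imports and sees the GPU.
# tf.test.is_gpu_available is the TF 1.x API (deprecated in TF 2.x).
import tensorflow as tf

print(tf.__version__)              # expect 1.15.x for this build
print(tf.test.is_gpu_available())  # True if the CUDA libraries now load
```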