TensorFlow does not see all available GPUs in my system #252

@lu4

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): N/A
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS x86_64
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): ComputeCpp-v0.6.0-4212-gb29ac8a 1.8.0-rc1
  • Python version: 3.5.2
  • Bazel version (if compiling from source): 0.15.0, build timestamp - 1530015019
  • GCC/Compiler version (if compiling from source): 5.4.0 20160609
  • CUDA/cuDNN version: N/A
  • GPU model and memory: Sapphire Radeon RX470, 8Gbytes
  • Exact command to reproduce: see below
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Here's the output of the environment capture script:

== cat /etc/issue ===============================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy               1.14.5
protobuf            3.6.0
tensorflow          1.8.0rc1

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.8.0-rc1
tf.GIT_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
tf.COMPILER_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/lib:/usr/local/computecpp/lib:/usr/local/lib:/usr/local/computecpp/lib:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


== cuda libs  ===================================================
/usr/local/lib/libcudart.so.9.0.103

Describe the problem

TensorFlow built on top of SYCL does not list or use all of the GPUs available in the system. I'm using the following commands to get the list of devices:

(Please note that TensorFlow's inline log reports 8 OpenCL GPU devices, but the returned list contains just two entries: the CPU and a single GPU exposed under the name "/device:SYCL:0".)

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2018-07-21 14:21:08.328612: I ./tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-07-21 14:21:09.308907: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-07-21 14:21:09.308981: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309001: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 1, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309019: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 2, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309034: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 3, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309052: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 4, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309068: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 5, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309085: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 6, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309101: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 7, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 911408516298923653
, name: "/device:SYCL:0"
device_type: "SYCL"
memory_limit: 268435456
locality {
}
incarnation: 161138719697210983
physical_device_desc: "id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE"
]
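To make the mismatch concrete, here is a small self-contained Python sketch (it just restates the counts from the session above by parsing log-like strings, it does not call the TensorFlow API):

```python
# Count GPUs announced in TensorFlow's SYCL startup log versus devices
# actually returned by device_lib.list_local_devices(). The strings below
# mirror the session output above; the parsing itself is illustrative.
log_lines = [
    "id: %d, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc." % i
    for i in range(8)
]
returned_devices = ["/device:CPU:0", "/device:SYCL:0"]

announced_gpus = sum("type: GPU" in line for line in log_lines)
returned_gpus = sum(d.startswith("/device:SYCL") for d in returned_devices)

print(announced_gpus, returned_gpus)  # 8 announced, only 1 usable
```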

I confirm that all devices are functional and visible to OpenCL (they show up in clinfo) and are usable through a third-party package (ArrayFire). I also confirm that SYCL itself sees all of the devices; to verify this, I updated the 'custom-device-selector' example from the ComputeCpp SDK to the following code:

/***************************************************************************
 *
 *  Copyright (C) 2016 Codeplay Software Limited
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  For your convenience, a copy of the License has been included in this
 *  repository.
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *
 *  Codeplay's ComputeCpp SDK
 *
 *  custom-device-selector.cpp
 *
 *  Description:
 *    Sample code that shows how to write a custom device selector in SYCL.
 *
 **************************************************************************/

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;
using namespace std;

/* Classes can inherit from the device_selector class to allow users
 * to dictate the criteria for choosing a device from those that might be
 * present on a system. This example looks for a device with SPIR support
 * and prefers GPUs over CPUs. */
class custom_selector : public device_selector {
 public:
  custom_selector() : device_selector() {}

  /* The selection is performed via the () operator in the base
   * selector class. This method is called once per device in each
   * platform. Note that all platforms are evaluated whenever there is
   * a device selection. */
  int operator()(const device& device) const override {
    cout << device.get_info<cl::sycl::info::device::vendor>() << ": "
         << device.get_info<cl::sycl::info::device::name>() << std::endl;
    cout << '\t' << "max_work_group_size : "
         << device.get_info<cl::sycl::info::device::max_work_group_size>() << std::endl;
//  cout << '\t' << "max_work_item_sizes : "
//       << device.get_info<cl::sycl::info::device::max_work_item_sizes>() << std::endl;
    cout << '\t' << "max_compute_units   : "
         << device.get_info<cl::sycl::info::device::max_compute_units>() << std::endl;
    cout << '\t' << "local_mem_size      : "
         << device.get_info<cl::sycl::info::device::local_mem_size>() << std::endl;
    cout << '\t' << "max_mem_alloc_size  : "
         << device.get_info<cl::sycl::info::device::max_mem_alloc_size>() << std::endl;
    cout << '\t' << "profile             : "
         << device.get_info<cl::sycl::info::device::profile>() << std::endl;
    cout << "----------------------------------------------------------------------------------------------"
         << std::endl << std::endl << std::endl;

    /* We only give a valid score to devices that support SPIR. */
    if (device.has_extension(cl::sycl::string_class("cl_khr_spir"))) {
      if (device.get_info<info::device::device_type>() ==
          info::device_type::cpu) {
        return 50;
      }

      if (device.get_info<info::device::device_type>() ==
          info::device_type::gpu) {
        return 100;
      }
    }
    /* Devices with a negative score will never be chosen. */
    return -1;
  }
};

int main() {
  const int dataSize = 64;
  int ret = -1;
  float data[dataSize] = {0.f};

  range<1> dataRange(dataSize);
  buffer<float, 1> buf(data, dataRange);

  /* We create an object of custom_selector type and use it
   * like any other selector. */
  custom_selector selector;
  queue myQueue(selector);

  myQueue.submit([&](handler& cgh) {
    auto ptr = buf.get_access<access::mode::read_write>(cgh);

    cgh.parallel_for<class example_kernel>(dataRange, [=](item<1> item) {
      size_t idx = item.get_linear_id();
      ptr[item.get_linear_id()] = static_cast<float>(idx);
    });
  });

  /* A host accessor can be used to force an update from the device to the
   * host, allowing the data to be checked. */
  accessor<float, 1, access::mode::read_write, access::target::host_buffer>
      hostPtr(buf);

  if (hostPtr[10] == 10.0f) {
    ret = 0;
  }

  return ret;
}
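For reference, the selector's scoring rule above boils down to the following plain-Python restatement (the device dicts are hypothetical stand-ins for illustration, not the SYCL API):

```python
# Plain-Python restatement of the custom_selector scoring rule:
# SPIR-capable GPU > SPIR-capable CPU > everything else.
def score(device):
    """Return the selection score for a hypothetical device dict."""
    if "cl_khr_spir" in device["extensions"]:
        if device["type"] == "cpu":
            return 50
        if device["type"] == "gpu":
            return 100
    return -1  # devices with a negative score are never chosen

devices = [
    {"type": "gpu", "extensions": ["cl_khr_spir"]},  # scores 100
    {"type": "cpu", "extensions": ["cl_khr_spir"]},  # scores 50
    {"type": "gpu", "extensions": []},               # scores -1
]
best = max(devices, key=score)
print(best)  # the SPIR-capable GPU wins
```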
