GPUs on Grid5000: Difference between revisions
| Line 232: | Line 232: | ||
| The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here] and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...). | The toolkit provides the [https://developer.nvidia.com/cublas CUBLAS] library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available [http://docs.nvidia.com/cuda/cublas/index.html here] and several [http://docs.nvidia.com/cuda/cuda-samples/index.html#cudalibraries advanced examples] using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...). | ||
| The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically *offload* compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that call BLAS routines on the Host to a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be [ | The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides [http://docs.nvidia.com/cuda/nvblas/ NVBLAS], a library that automatically *offload* compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that call BLAS routines on the Host to a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be [https://man7.org/linux/man-pages/man8/ld.so.8.html forcibly linked] using the LD_PRELOAD environment variable. | ||
| To test NVBLAS, you can download and compile our matrix-matrix multiplication example: | To test NVBLAS, you can download and compile our matrix-matrix multiplication example: | ||
Revision as of 21:52, 28 June 2023
|   | Note | 
|---|---|
| This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team. | |
Introduction
This tutorial presents how to use GPU Accelerators. You will learn to reserve these resources, setup the environment and execute codes on the accelerators. Please note that this page is not about GPU programming and only focuses on the specificities of the Grid'5000 platform. In particular, Grid'5000 provides the unique capability to set up your own environment (OS, drivers, compilers...), which is especially useful for testing the latest version of the accelerator software stack (such as the NVIDIA CUDA libraries).
In this tutorial, we provide code examples that use the Level-3 BLAS function DGEMM to compute the product of the two matrices. BLAS libraries are available for a variety of computer architectures (including multicores and accelerators) and this code example is used on this tutorial as a toy benchmark to compare the performance of accelerators and/or available BLAS libraries.
For the purposes of this tutorial, it is assumed that you have a basic knowledge of Grid'5000. Therefore, you should read the Getting Started tutorial first to get familiar with the platform (connections to the platform, resource reservations) and its basic concepts (job scheduling, environment deployment). The Hardware page is useful for locating machines with hardware accelerators and provides details on accelerator models. Node availability may be found using Drawgantt (see Status).
Note that Intel Xeon Phi KNC (MICs) available in Nancy are no longer supported (documentation remains available)
Nvidia GPU on Grid'5000
Note that NVIDIA drivers (see nvidia-smi) and CUDA (nvcc --version) compilation tools are installed by default on nodes. 
Choosing a GPU
Have a look at per-site, detailed hardware pages (for instance, at Lyon), you will find here useful informations about GPUs:
- the card model name (see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units to know more about each model)
- the GPU memory size available for computations
- for NVidia GPU, their compute capability
- the hosting node characteristics (#cpu, qty of memory available, #gpus, reservable local disk availability, ...)
- the job access conditions (ie: default or production queue, max walltime partition for clusters in the production queues)
About NVidia and CUDA compatibility with older GPUs
Most of GPU available in Grid'5000 are supported by Nvidia driver and CUDA delivered in Grid'5000 environments. As of October 2021, there are two exceptions:
- K40m GPUs available in grimani cluster in Nancy requires the nvccoption---gpu-architecture=sm_35(35 for compute capability3.5) to be used with CUDA starting from version 11, which is the version shipped with our debian11 environment.
- M2075 GPUs (compute capability 2.0) of the orion cluster in Lyon is not supported by the driver shipped in our environments. GPUs in this cluster are no more usable from our environments and the gpu property used to select a GPU node using oarsub (see below) is disabled. Not that it is still possible for to build an environment with custom driver to use these cards.
See https://en.wikipedia.org/wiki/CUDA#GPUs_supported to know more about the relationship between Cuda versions and compute capability.
Reserving GPUs
Single GPU
If you only need a single GPU in the standard environment, reservation is as simple as:
In Nancy, you have to use the production queue for most of the GPU clusters, for instance:
If you require several GPUs for the same experiment (e.g. for inter-GPU communication or to distribute computation), you can reserve multiple GPUs of a single node:
|   | Note | 
|---|---|
| When you run  | |
To select a specific model of GPU, two possibilities:
use gpu model aliases, as describe in OAR Syntax simplification#GPUs, e.g.
use the "gpu_model" property, e.g.
The exact list of GPU models is available on the OAR properties page, and you can use Hardware page to have an overview of available GPUs on each site.
Reserving full nodes with GPUs
In some cases, you may want to reserve a complete node with all its GPUs. This allows you to customize the software environment with Sudo-g5k or even to deploy another operating system.
To make sure you obtain a node with a GPU, you can use the "gpu_count" property:
In Nancy, you have to use the production queue for most GPU clusters:
To select a specific model of GPU, you can also use the "gpu_model" property, e.g.
If you want to deploy an environment on the node, you should add the -t deploy option.
Note about AMD GPU
As of October 2021, AMD GPUs are available in a single Grid'5000 cluster, neowise, in Lyon. oarsub commands shown above could give you either NVidia or AMD GPUs. The gpu_model property may be used to filter between GPU vendors. For instance:
will filter out Radeon GPUs (=AMD GPUs). See below for more information about AMD GPUs.
GPU usage tutorial
In this section, we will give an example of GPU usage under Grid'5000.
Every steps of this tutorial must be performed on a Nvidia GPU node.
Run the CUDA Toolkit examples
In this part, we are going compile and execute CUDA examples provided by Nvidia using CUDA Toolkit available on the default (standart) environment.
First, we retrieve the version of CUDA installed on the node:
$ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2021 NVIDIA Corporation Built on Sun_Feb_14_21:12:58_PST_2021 Cuda compilation tools, release 11.2, V11.2.152 Build cuda_11.2.r11.2/compiler.29618528_0
Version is 11.2. We are going to download the corresponding CUDA samples.
cd /tmp git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git cd cuda-samples
You can compile all the examples at once by running make:
make -j8
The compilation of all the examples is over when "Finished building CUDA samples" is printed.
Each example is available from its own directory, under Samples root directory (it can also be compiled separately from there).
You can first try the Device Query example located in Samples/deviceQuery/. It enumerates the properties of the CUDA devices present in the system.
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery
Here is an example of the result on the chifflet cluster at Lille:
/tmp/cuda-samples/Samples/deviceQuery/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 2
Result = PASS
BLAS examples
We now run our BLAS example to illustrate GPU performance for dense matrix multiply.
The toolkit provides the CUBLAS library, which is a GPU-accelerated implementation of the BLAS. Documentation about CUBLAS is available here and several advanced examples using CUBLAS are also available in the toolkit distribution (see: simpleCUBLAS, batchCUBLAS, matrixMulCUBLAS, conjugateGradientPrecond...).
The regular CUBLAS API (as shown by the simpleCUBLAS example) operates on GPU-allocated arrays, but the toolkit also provides NVBLAS, a library that automatically *offload* compute-intensive BLAS3 routines (i.e. matrix-matrix operations) to the GPU. It turns any application that call BLAS routines on the Host to a GPU-accelerated program. In addition, there is no need to recompile the program as NVBLAS can be forcibly linked using the LD_PRELOAD environment variable.
To test NVBLAS, you can download and compile our matrix-matrix multiplication example:
You can first check the performance of the BLAS library on the CPU. For small matrix size (<5000), the provided example will compare the BLAS implementation to a naive jki-loop version of the matrix multiplication:
Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000) BLAS - Time elapsed: 1.724E+00 sec. J,K,I - Time elapsed: 7.233E+00 sec.
To offload the BLAS computation on the GPU, use:
[NVBLAS] Config parsed Multiplying Matrices: C(2000x2000) = A(2000x2000) x B(2000x2000) BLAS - Time elapsed: 1.249E-01 sec.
Depending on node hardware, GPU might perform better on larger problems:
Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 2.673E+01 sec.
[NVBLAS] Config parsed Multiplying Matrices: C(5000x5000) = A(5000x5000) x B(5000x5000) BLAS - Time elapsed: 1.718E+00 sec.
If you want to measure the time spent on data transfers to the GPU, have a look to the simpleCUBLAS (/tmp/cuda-samples/Samples/simpleCUBLAS) example and instrument the code with timers.
Custom CUDA version or Nvidia drivers
Here, we explain how to use latest CUDA version with "module", use Nvidia Docker images and install the NVIDIA drivers and compilers before validating the installation on the previous example set.
Older or newer CUDA version using modules
Different CUDA versions can be loaded using "module" command. You should first choose the CUDA toolkit version that you will load with module tool:
------------- /grid5000/spack/v1/share/spack/modules/linux-debian11-x86_64_v2 ---------------- cuda/11.4.0_gcc-10.4.0 cuda/11.6.2_gcc-10.4.0 cuda/11.7.1_gcc-10.4.0 (D)
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Tue_Mar__8_18:18:20_PST_2022 Cuda compilation tools, release 11.6, V11.6.124 Build cuda_11.6.r11.6/compiler.31057947_0
You should consult CUDA Toolkit and Compatible Driver Versions to ensure compatibility with a specific Cuda version and the Nvidia GPU driver (for instance, Cuda 11.x toolkit requires a driver version >= 450.80.02)
Copy and compile the sample examples
You now have everything installed. For instance, you can compile and run the toolkit examples (see #Compiling the CUDA Toolkit examples for more information).
You will need to override the CUDA path variable, and also load the matching compiler version from modules:
/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e/bin/nvcc
|   | node: | export CUDA_PATH=/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/cuda-11.6.2-smztrblcyoysrsnrua6jomspxdqxe73e | 
And then you can build and run the examples:
|   | node: | git clone --depth 1 --branch v11.6 https://github.com/NVIDIA/cuda-samples.git /tmp/cuda-samples | 
The newly created environment can be saved with tgz-g5k, to be reused later:
|   | Note | 
|---|---|
| Please note that with some old GPU you might encounter errors when running latest version of CUDA. It's the case with the orion for example | |
Nvidia-docker
A script to install nvidia-docker is available if you want to use Nvidia's images builded for Docker and GPU nodes. This provides an alternative way of making CUDA and Nvidia libraries available to the node. See Nvidia Docker page.
Custom Nvidia driver using deployment
A custom Nvidia driver may be installed on a node if needed. As root privileges are required, we will use kadepoy to deploy a debian11-x64-nfs environment on the GPU node you reserved.
This environment allows you to connect either as root (to be able to install new software) or using your normal Grid'5000 (including access to your home directory). It does not include any NVIDIA or CUDA software, but we are going to install them:
Once the deployment is terminated, you should be able to connect to the node as root:
You can then perform the NVIDIA driver installation:
(warnings about X.Org can safely be ignored)
On the node you can check which NVIDIA drivers are installed with the nvidia-smi tool:
Here is an example of the result on the graphique cluster:
root@graphique-4:~# nvidia-smi
Tue Jun 27 19:37:15 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 26%   28C    P0    46W / 180W |      0MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 28%   27C    P0    43W / 180W |      0MiB /  4043MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If you want to record your environment with the custom NVidia driver, see Advanced_Kadeploy#Create_a_new_environment_from_a_customized_environment
AMD GPU on Grid'5000
As of October 2021, Grid'5000 has one cluster with AMD GPU: neowise cluster in Lyon.
A neowise GPU may be reserved using:
A full neowise node may be reserved using:
The default environment on neowise include part of AMD's ROCm stack with AMD GPU driver and basic tools and libraries such as:
- rocm-smi: get information about GPUs
- hipcc: HIP compiler
- hipfy-perl: CUDA to HIP code converter
In addition, most libraries and development tools from ROCm and HIP (available at https://rocmdocs.amd.com/en/latest/Installation_Guide/Software-Stack-for-AMD-GPU.html) are available as modules. Deep Learning Frameworks pytorch and TensorFlow are also known to work.
