JupyterHub with GPU


Create a JupyterHub instance with GPU support enabled.

Setup

Link Lustre path to home directory

When working from JupyterHub, the default working directory is the home folder. However, it is recommended to put your data and code on Lustre. To make this easier, we can create a link to Lustre from our home directory:

ln -s /lustre/[path to your lustre folder] [reference name, for example lustre_folders]

To remove a link:

rm [reference name, for example lustre_folders]

Create a conda environment to use as a Jupyter kernel

conda create -y -n kernel_test python=3.10 ipykernel 
conda activate kernel_test
python -m ipykernel install --user --name kernel_test

NOTE: You can specify the Python version for your conda environment with python=3.10. Take care that the Python version is compatible with your required packages.
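
After selecting the new kernel in a notebook, a quick sanity check (run in a notebook cell) confirms the kernel is using the conda environment's Python:

import sys

# Should point into the kernel_test conda environment, not the system Python
print(sys.executable)
print(sys.version)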

Install required packages

For PyTorch, you can find installation instructions at https://pytorch.org/get-started/locally/ and for TensorFlow at https://www.tensorflow.org/install.

As an example, I use the following PyTorch installation (CUDA 11.8):

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
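
Once installed, a minimal check from the new kernel confirms the installation (note that torch.cuda.is_available() only returns True when the notebook is running on a GPU node):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # True only when a GPU is allocated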

Start a Jupyter notebook with GPU

Go to the JupyterHub start page and select:

  • Select a location for your server: on the cluster (default option)
  • Partition to use: gpu
  • Memory (in MB): desired memory
  • Number of CPUs: desired CPU count
  • Maximum execution time (hours:minutes:seconds): maximum amount of time the notebook is available
  • Extra options: --gres=gpu:1 (default when selecting the gpu partition; use --gres=gpu:x to request x GPUs)

Using multiple GPUs

  • Request multiple GPUs when starting JupyterHub via the extra options field: --gres=gpu:x, where x is the number of requested GPUs
  • Multiple GPUs should then be available to the notebook. Verify this with the GPU tests in the following section, or with the sketch below.
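
As a minimal sketch (assuming the PyTorch installation from the previous section), the following places a small tensor on every GPU visible to the notebook:

import torch

# Place a small tensor on each allocated GPU and run a trivial operation on it
for i in range(torch.cuda.device_count()):
    x = torch.ones(2, 2, device=f'cuda:{i}')
    print(f'cuda:{i}:', torch.cuda.get_device_name(i), '->', x.sum().item())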

Test GPU availability

PyTorch

import torch


def get_version():
    # Report the installed PyTorch version and the CUDA version it was built against
    print('>>>> torch.__version__')
    print(torch.__version__, '\n')

    print('>>>> torch.version.cuda')
    print(torch.version.cuda, '\n')


def check_all_cuda_devices():
    device_count = torch.cuda.device_count()
    for i in range(device_count):
        print('>>>> torch.cuda.device({})'.format(i))
        result = torch.cuda.device(i)
        print(result, '\n')

        print('>>>> torch.cuda.get_device_name({})'.format(i))
        result = torch.cuda.get_device_name(i)
        print(result, '\n')


def check_cuda():
    print('>>>> torch.cuda.is_available()')
    result = torch.cuda.is_available()
    print(result, '\n')

    print('>>>> torch.cuda.device_count()')
    result = torch.cuda.device_count()
    print(result, '\n')

    print('>>>> torch.cuda.current_device()')
    result = torch.cuda.current_device()
    print(result, '\n')

    print('>>>> torch.cuda.device(0)')
    result = torch.cuda.device(0)
    print(result, '\n')

    print('>>>> torch.cuda.get_device_name(0)')
    result = torch.cuda.get_device_name(0)
    print(result, '\n')

    check_all_cuda_devices()


def check_cuda_ops():
    print('>>>> torch.zeros(2, 3)')
    zeros = torch.zeros(2, 3)
    print(zeros, '\n')

    print('>>>> torch.zeros(2, 3).cuda()')
    cuda_zero = torch.zeros(2, 3).cuda()
    print(cuda_zero, '\n')

    print('>>>> torch.tensor([[1, 2, 3], [4, 5, 6]])')
    tensor_a = torch.tensor([[1, 2, 3], [4, 5, 6]]).cuda()
    print(tensor_a, '\n')

    print('>>>> tensor_a + cuda_zero')
    tensor_sum = tensor_a + cuda_zero  # avoid shadowing the built-in sum()
    print(tensor_sum, '\n')

    print('>>>> tensor_a * cuda_twos')
    # Cast both tensors to float so the element-wise and matrix products are computed in floating point
    tensor_a = tensor_a.to(torch.float)
    cuda_zero = cuda_zero.to(torch.float)
    cuda_twos = (cuda_zero + 1.0) * 2.0
    product = tensor_a * cuda_twos
    print(product, '\n')

    print('>>>> torch.matmul(tensor_a, cuda_twos.T)')
    mat_mul = torch.matmul(tensor_a, cuda_twos.T)
    print(mat_mul, '\n')

try:
    get_version()
except Exception as e:
    print('get_version() failed, exception message below:')
    print(e)

try:
    check_cuda()
except Exception as e:
    print('check_cuda() failed, exception message below:')
    print(e)

try:
    check_cuda_ops()
except Exception as e:
    print('check_cuda_ops() failed, exception message below:')
    print(e)

TensorFlow

import tensorflow as tf

hasGPUSupport = tf.test.is_built_with_cuda()
gpuList = tf.config.list_physical_devices('GPU')

print("Tensorflow Compiled with CUDA/GPU Support:", hasGPUSupport)
print("Tensorflow can access", len(gpuList), "GPU")
print("Accessible GPUs are:")
print(gpuList)

tf.debugging.set_log_device_placement(True)
# Place tensors on the GPU
with tf.device('/GPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run a matmul on the GPU (placement is logged)
c = tf.matmul(a, b)
print(c)
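
If you requested multiple GPUs, tf.distribute.MirroredStrategy gives a quick way to confirm TensorFlow can use all of them; a minimal sketch:

import tensorflow as tf

# MirroredStrategy replicates computation across all visible GPUs;
# the replica count should match the number of GPUs requested via --gres
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas in sync:', strategy.num_replicas_in_sync)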