
TensorFlow tutorial using GPU

Prerequisites

Before running this onboarding tutorial, you need:

  • a k8saas cluster deployed
info

To request and set up your own cluster, see the Getting Started section.

danger

Your cluster must have GPU support enabled. To enable it, ask the support team through the Thales postit portal.
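To check whether GPU support is already enabled, one option is to look for nodes that advertise an allocatable GPU resource. The sketch below assumes GPUs are exposed through the NVIDIA device plugin under the resource name `nvidia.com/gpu`. This is an assumption, and your k8saas cluster may use a different setup. The sketch parses the JSON printed by `kubectl get nodes -o json`. The function names are illustrative.

```python
import json
import subprocess

# Assumption: GPUs are advertised by the NVIDIA device plugin under this name.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count, keeping only nodes that have GPUs."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are strings in the API output, e.g. "1".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json` and return the parsed node list."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `gpu_nodes(fetch_nodes())` against your cluster and getting an empty dict suggests that no node currently exposes a GPU under that resource name.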

You also need to download the Job manifest, samples-tf-mnist-demo.yaml, which is applied in the next step.

Run a GPU-enabled workload

note

A Kubernetes Job object creates one or more Pods and will continue to retry execution of the Pods until a specified number of successful completions is reached.

In this tutorial, a Kubernetes Job trains a TensorFlow deep learning model on the MNIST dataset (a database of handwritten digits), using the cluster's GPU.
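The downloaded samples-tf-mnist-demo.yaml is not reproduced here, and you do not need the following sketch to follow the tutorial. It only gives a rough idea of what a GPU Job manifest typically contains. It builds a minimal Job as a Python dict and prints it as JSON, which `kubectl apply -f -` also accepts.

Several parts are assumptions, not the contents of the provided file:
- the image name is a hypothetical placeholder;
- the `nvidia.com/gpu` resource name depends on the cluster's GPU setup;
- the `gpu_job_manifest` helper is illustrative.

```python
import json

# Hypothetical placeholder image: use the one referenced in your own manifest.
IMAGE = "registry.example.com/tf-mnist-demo:gpu"


def gpu_job_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal batch/v1 Job whose single container requests `gpus` GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": 1,   # the Job is done after one successful Pod
            "backoffLimit": 6,  # Kubernetes' default retry limit, shown explicitly
            "template": {
                "spec": {
                    # Job Pods must use OnFailure or Never.
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {
                            # Assumption: GPUs are exposed as nvidia.com/gpu.
                            "limits": {"nvidia.com/gpu": gpus},
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_job_manifest("samples-tf-mnist-demo", IMAGE), indent=2))
```

Saved as a script (for example, the hypothetical make_job.py), its output could be piped to kubectl with `python make_job.py | kubectl apply -f -`. The `completions` and `backoffLimit` fields are what drive the retry behavior described in the note above.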

Deploy the job with the following command:

kubectl apply -f samples-tf-mnist-demo.yaml

View the status and output of the GPU-enabled workload

Use the following command to watch the job status. Press Ctrl+C to stop watching once COMPLETIONS shows 1/1:

kubectl get jobs samples-tf-mnist-demo --watch

# NAME                    COMPLETIONS   DURATION   AGE
# samples-tf-mnist-demo   0/1           3m29s      3m29s
# samples-tf-mnist-demo   1/1           3m10s      3m36s
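For a one-shot check instead of `--watch` (for example, in a script), you can read the Job's state from `kubectl get job samples-tf-mnist-demo -o json`. The sketch below classifies the Job from its `status.conditions`: a `Complete` or `Failed` condition with status `True` ends the Job, and anything else is treated as still running. The function names are illustrative.

```python
import json
import subprocess


def job_state(job: dict) -> str:
    """Return 'complete', 'failed' or 'running' based on a Job object's conditions."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "complete"
        if cond.get("type") == "Failed":
            return "failed"
    # No terminal condition yet: Pods are pending, running or being retried.
    return "running"


def fetch_job(name: str) -> dict:
    """Run `kubectl get job <name> -o json` and return the parsed Job."""
    out = subprocess.run(
        ["kubectl", "get", "job", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Against a live cluster, you would call `job_state(fetch_job("samples-tf-mnist-demo"))`. If you only need to block until the Job finishes, kubectl has this built in: `kubectl wait --for=condition=complete job/samples-tf-mnist-demo --timeout=10m`.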

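Then check the job's logs to confirm that it succeeded. The pod name suffix (smnr6 in the command below) is generated randomly, so run kubectl get pods --selector=job-name=samples-tf-mnist-demo to find the name of your pod: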

kubectl logs samples-tf-mnist-demo-smnr6

# 2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
# 2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
# name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
# pciBusID: 2fd7:00:00.0
# totalMemory: 11.17GiB freeMemory: 11.10GiB
# 2019-05-16 16:08:31.396886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 2fd7:00:00.0, compute capability: 3.7)
# 2019-05-16 16:08:36.076962: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
# Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
# Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
# Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
# Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
# Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
# Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
# Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
# Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
# Accuracy at step 0: 0.1081
# Accuracy at step 10: 0.7457
# Accuracy at step 20: 0.8233
# Accuracy at step 30: 0.8644
# Accuracy at step 40: 0.8848
# Accuracy at step 50: 0.8889
# Accuracy at step 60: 0.8898
# Accuracy at step 70: 0.8979
# Accuracy at step 80: 0.9087
# Accuracy at step 90: 0.9099
# Adding run metadata for 99
# Accuracy at step 100: 0.9125
# Accuracy at step 110: 0.9184
# Accuracy at step 120: 0.922
# Accuracy at step 130: 0.9161
# Accuracy at step 140: 0.9219
# Accuracy at step 150: 0.9151
# Accuracy at step 160: 0.9199
# Accuracy at step 170: 0.9305
# Accuracy at step 180: 0.9251
# Accuracy at step 190: 0.9258
# Adding run metadata for 199
# Accuracy at step 200: 0.9315
# Accuracy at step 210: 0.9361
# Accuracy at step 220: 0.9357
# Accuracy at step 230: 0.9392
# Accuracy at step 240: 0.9387
# Accuracy at step 250: 0.9401
# Accuracy at step 260: 0.9398
# Accuracy at step 270: 0.9407
# Accuracy at step 280: 0.9434
# Accuracy at step 290: 0.9447
# Adding run metadata for 299
# Accuracy at step 300: 0.9463
# Accuracy at step 310: 0.943
# Accuracy at step 320: 0.9439
# Accuracy at step 330: 0.943
# Accuracy at step 340: 0.9457
# Accuracy at step 350: 0.9497
# Accuracy at step 360: 0.9481
# Accuracy at step 370: 0.9466
# Accuracy at step 380: 0.9514
# Accuracy at step 390: 0.948
# Adding run metadata for 399
# Accuracy at step 400: 0.9469
# Accuracy at step 410: 0.9489
# Accuracy at step 420: 0.9529
# Accuracy at step 430: 0.9507
# Accuracy at step 440: 0.9504
# Accuracy at step 450: 0.951
# Accuracy at step 460: 0.9512
# Accuracy at step 470: 0.9539
# Accuracy at step 480: 0.9533
# Accuracy at step 490: 0.9494
# Adding run metadata for 499
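Rather than reading the whole log, you can scan it for the two lines that matter here:
- the line where TensorFlow creates a GPU device;
- the last reported accuracy.

The sketch below assumes the log format shown above, with lines such as `Creating TensorFlow device (/device:GPU:0)` and `Accuracy at step N: X`. Other TensorFlow versions or images may word these lines differently. The `summarize` helper is illustrative.

```python
import re
from typing import Optional, Tuple

# Patterns based on the sample log above; adjust them if your image logs differently.
GPU_LINE = re.compile(r"Creating TensorFlow device \(/device:GPU:\d+\)")
ACCURACY_LINE = re.compile(r"Accuracy at step (\d+): ([0-9.]+)")


def summarize(log_text: str) -> Tuple[bool, Optional[int], Optional[float]]:
    """Return (gpu_used, last_step, last_accuracy) parsed from the job's log text."""
    gpu_used = GPU_LINE.search(log_text) is not None
    matches = ACCURACY_LINE.findall(log_text)
    if not matches:
        return gpu_used, None, None
    step, accuracy = matches[-1]  # the last reported step wins
    return gpu_used, int(step), float(accuracy)
```

For example, you could save the logs with `kubectl logs job/samples-tf-mnist-demo > mnist.log` and call `summarize(open("mnist.log").read())`. On the output above, this gives `(True, 490, 0.9494)`.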
info

The logs show that TensorFlow detected the GPU (/device:GPU:0, a Tesla K80 in this run) and used it to train the model in a Kubernetes Job, reaching about 95% accuracy by the final steps.