Technology Reviews: Nvidia Tesla K80


A traditional high-performance computing (HPC) system uses CPUs working in parallel for computationally intensive tasks. Castor, the primary server of Chalawan, is this kind of system: it has 19 compute nodes with a total of 496 CPU cores, plus two Intel Xeon Phi cards (see node configuration). The Intel Xeon Phi cards serve as coprocessors that help speed up computational tasks. However, we have learned that using the coprocessor is challenging and that few applications support it. Users who are skilled in parallel programming may benefit greatly from the coprocessor, but they make up only a small portion of our users. Therefore, we looked for another system that uses graphics processing units (GPUs) as accelerators. GPUs have developed rapidly over many years and now serve more general-purpose computational tasks than the coprocessors do. They are used in mobile devices, embedded systems, personal computers, HPC systems, and so on. Moreover, plenty of applications support them.

NVIDIA Tesla K80 Specifications

In June 2018, A&A Neo Technology Co., Ltd lent us a Gigabyte G190-H44 server containing two Tesla K80 cards for a few months. Based on the Kepler architecture, the Tesla K80 was released by Nvidia in late 2014 as its top-performing GPU at the time. It has since been succeeded by three architecture generations: Maxwell, Pascal and Volta. The Tesla K80 combines two GK210 graphics processors on one card to increase performance. Each GPU has 12 GB of GDDR5 memory connected through a 384-bit memory interface. ECC (error-correcting code) is enabled on the memory by default, which reserves a few percent of the available memory, leaving about 22.5 GB in total for users. The Tesla K80 is a dual-slot card that connects to the motherboard through PCI Express 3.0 and has no display connector. More information is shown in the table below:

Specification        Tesla K80
GPUs                 2 x GK210
Stream Processors    2 x 2496 cores
Core Clock           560 MHz (can be boosted up to 875 MHz)
Memory Clock         5 GHz GDDR5
Memory Interface     2 x 384-bit
Memory Size          2 x 12 GB
Interface            PCI Express 3.0 x16
Thermal Solution     Passive
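The memory figures above can be checked with a little arithmetic. The following is a minimal sketch, not a vendor formula: it assumes the quoted 5 GHz memory clock is the effective GDDR5 data rate per pin, and it takes the 22.5 GB usable figure from the text as given. The function names are ours, chosen for illustration.

```python
# Figures quoted in the text and table above.
NUM_GPUS = 2             # two GK210 GPUs per Tesla K80 card
MEM_PER_GPU_GB = 12.0    # GDDR5 memory per GPU
USABLE_TOTAL_GB = 22.5   # approximate memory left for users with ECC on
BUS_WIDTH_BITS = 384     # memory interface width per GPU
DATA_RATE_HZ = 5e9       # assumption: 5 GHz is the effective data rate per pin


def ecc_overhead_fraction(raw_gb, usable_gb):
    """Fraction of the raw memory that is no longer available to users."""
    return (raw_gb - usable_gb) / raw_gb


def peak_bandwidth_gb_per_s(bus_width_bits, data_rate_hz):
    """Theoretical peak bandwidth: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_hz / 1e9


raw_total_gb = NUM_GPUS * MEM_PER_GPU_GB
overhead = ecc_overhead_fraction(raw_total_gb, USABLE_TOTAL_GB)
per_gpu_bandwidth = peak_bandwidth_gb_per_s(BUS_WIDTH_BITS, DATA_RATE_HZ)

print(f"Raw memory per card: {raw_total_gb:.1f} GB")
print(f"ECC overhead: {overhead:.2%}")
print(f"Peak bandwidth per GPU: {per_gpu_bandwidth:.0f} GB/s")
print(f"Peak bandwidth per card: {NUM_GPUS * per_gpu_bandwidth:.0f} GB/s")
```

Under these assumptions, ECC takes roughly 6% of the 24 GB on each card, which matches the "few percent" stated above.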

The Gigabyte G190-H44 is a 1U server that supports up to four GPU cards.

System Preparation

First, we installed CentOS 7.4 on two nearline SAS hard disks in a RAID0 configuration. After the system update completed, we rebooted the machine into runlevel 3. In this mode, we could simply install the NVIDIA driver and CUDA toolkit, and then we rebooted again so that the new kernel could be loaded. We ran the nvidia-smi command to make sure the driver installation had completed. Unfortunately, the output showed only two of the four GPUs. We suspected a problem with the redundancy configuration, so we opened up the server and moved one of the GPU cards from the front of the chassis to the back. After a reboot, the command finally showed the proper information.
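The nvidia-smi check described above can also be scripted, so that a missing GPU is caught automatically after each reboot. This is a minimal sketch rather than the procedure we actually followed: it relies on the `--query-gpu` and `--format=csv,noheader` options of nvidia-smi, and the helper names (`parse_gpu_list`, `list_gpus`, `has_expected_gpus`) are hypothetical.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU, without a header row.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_list(csv_text):
    """Turn the CSV text from nvidia-smi into a list of dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Run nvidia-smi and return the GPUs that the driver can see."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)


def has_expected_gpus(gpus, expected):
    """True when the driver reports exactly the expected number of GPUs."""
    return len(gpus) == expected
```

On this server, two Tesla K80 cards should appear as four GPUs, so `has_expected_gpus(list_gpus(), 4)` would have returned False before the card was moved.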

NVIDIA Tesla K80 GPU Accelerator

Benchmark with TensorFlow

We planned to test this GPU node with applications that our users already use or are going to use, such as Quantum Espresso (GPU edition), DSPSR, and TensorFlow. We then realised that we were not yet familiar with the GPU system and that there was still much to learn. Because of this, we could test only one application: TensorFlow. TensorFlow is an open-source machine learning framework by Google. It can be built to support MPI, OpenCL and CUDA. In this test, we tried to maximize performance by compiling TensorFlow with CUDA from source. The benchmark was carried out by running scripts that train widely accepted image classification models, once on the bare-metal server and once in a Docker container. Next, we compare our results with those from Google Compute Engine and Amazon Elastic Compute Cloud (Amazon EC2).


The ResNet-50, ResNet-152, VGG16 and AlexNet models were tested using synthetic data from tf.Variable. The tests were run on the Gigabyte G190-H44, once on the physical server and once in a Docker container. As expected, the results on the physical server (the grey graph) show the highest performance because it has the lowest computational overhead. The Docker image on our cluster (the blue graph) is slightly slower but still performs better than Google Compute Engine and Amazon EC2. However, the comparison is biased, since the baseline tests were made with an older version of TensorFlow (1.1.0rc2) while our tests used TensorFlow 1.8. In the next post, on testing the NVIDIA Tesla V100, we will show the performance gain from running TensorFlow 1.9 compared to TensorFlow 1.1.0rc2, along with tests of other software.

Details of Testing (NVIDIA® Tesla® K80)

Batch size and optimizer information

Batch size per GPU    64    64    512    64


  • Instance type: Gigabyte G190-H44
  • GPU: 2x NVIDIA® Tesla® K80
  • OS: CentOS 7.4 with tests run on bare-metal
  • CUDA / cuDNN: 9.1 / 7.1
  • TensorFlow version: 1.8
  • Build Command: bazel build -c opt --copt=-march="core-avx2" --config=cuda --config=mkl --copt="-DEIGEN_USE_VML" //tensorflow/tools/pip_package:build_pip_package /tmp/tensorflow_pkg
  • Disk: RAID0 nearline SAS 7200rpm (2 combined)
  • DataSet: ImageNet


Training: Tesla K80 on physical server
Speedup: Tesla K80 on physical server


  • Instance type: Gigabyte G190-H44
  • GPU: 2x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS with tests run via Docker
  • CUDA / cuDNN: 9.0 / 7.0
  • TensorFlow version: 1.8
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda
  • Disk: RAID0 nearline SAS 7200rpm (2 combined)
  • DataSet: ImageNet


Training: Tesla K80 on container server
Speedup: Tesla K80 on container server
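The training and speedup figures relate to each other through simple arithmetic. The sketch below assumes that training throughput is measured in images per second and that speedup is the throughput on N GPUs divided by the throughput on one GPU; the post does not state these definitions explicitly, and the function names are hypothetical. The numbers in the example are placeholders, not our measurements.

```python
def images_per_second(batch_per_gpu, num_gpus, step_seconds):
    """Throughput when every GPU processes one batch per training step."""
    return batch_per_gpu * num_gpus / step_seconds


def speedup(throughput_by_gpus):
    """Map {GPU count: images/sec} to {GPU count: speedup over one GPU}."""
    base = throughput_by_gpus[1]
    return {n: rate / base for n, rate in sorted(throughput_by_gpus.items())}


def scaling_efficiency(throughput_by_gpus):
    """Speedup divided by GPU count; 1.0 means perfect linear scaling."""
    return {n: s / n for n, s in speedup(throughput_by_gpus).items()}


# Placeholder throughputs (images/sec) for 1, 2 and 4 GPUs. NOT measured values.
example = {1: 100.0, 2: 200.0, 4: 300.0}
print(speedup(example))
print(scaling_efficiency(example))
```

An efficiency well below 1.0 at four GPUs would suggest that data movement or synchronisation between GPUs, rather than raw compute, is limiting the run.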