
Linux CUDA GPU Parallel Computing Benchmarks

Contents


General
Example Results Log
Installing Software
Compiling and Running Programs
Comparative Results
Burn-In/Reliability Tests

General

This exercise involved installing 32-bit and 64-bit CUDA software on eSATA and USB drives used for compiling programs and running them on various PCs.

CUDA, from nVidia, provides programming functions that use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second (GFLOPS), can be higher on a laptop graphics processor than on a typical dual core CPU. This applies to Single Instruction Multiple Data (SIMD) operation, where the same instructions are executed simultaneously on sections of data from a data array. For maximum speed, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance using varying data array sizes and increasing numbers of processing instructions per data access. The benchmarks use nVidia CUDA programming functions that execute only on nVidia graphics hardware with a compatible driver. There are five scenarios:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back
    to host RAM [Data in & out]


  • Update Data - Execute further instructions on data in graphics RAM, copy
    back to host RAM [Data out only]


  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave
    it there [Calculate only]


  • Extra Test 1 - Just graphics data, repeat loop in CUDA function [Calculate]


  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using
    Shared Memory [Shared Memory]

These are run at three different data sizes: the default 100,000 words repeated 2500 times, 1M words repeated 250 times and 10M words repeated 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
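
As an indication of how these scenarios map onto CUDA code, a minimal sketch follows. The names and constants are illustrative only and this is not the benchmark source - see cudaMFLOPS1.cu in the download for the real code.

 #########################################################################
 Illustrative Sketch

 #include <cstdio>
 #include <cstdlib>
 #include <cuda_runtime.h>

 // Each thread applies the add/multiply chain to one data element.
 // This version has 8 operations per element, as in the middle tests.
 __global__ void triad(float *x, int n, float a, float b, float c,
                       float d, float e, float f)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
 }

 int main()
 {
     const int n = 100000;                 // default data size in words
     const size_t bytes = n * sizeof(float);
     float *h_x = (float *)malloc(bytes);
     for (int i = 0; i < n; i++) h_x[i] = 0.999999f;

     float *d_x;
     cudaMalloc((void **)&d_x, bytes);

     int threads = 256;                    // as reported in the log below
     int blocks = (n + threads - 1) / threads;

     // Data in & out - copy data to graphics RAM, execute, copy back
     cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
     triad<<<blocks, threads>>>(d_x, n, 0.1f, 0.99f, 0.2f, 0.98f, 0.3f, 0.97f);
     cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

     // Calculate only - execute again on data left in graphics RAM
     triad<<<blocks, threads>>>(d_x, n, 0.1f, 0.99f, 0.2f, 0.98f, 0.3f, 0.97f);
     cudaThreadSynchronize();   // cudaDeviceSynchronize() in later toolkits

     printf("First result %0.13f\n", h_x[0]);
     cudaFree(d_x);
     free(h_x);
     return 0;
 }
 #########################################################################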

The main benchmark code is as used for Windows versions, as described in CUDA1.htm, CUDA2.htm and CUDA3 x64.htm, where further technical descriptions and comparative results are provided.

Four versions were produced via Ubuntu Linux, with 32-Bit and 64-Bit compilations using Single and Double Precision floating point numbers. The execution files and source code, along with compiling and running instructions, can be downloaded in linux_cuda_mflops.tar.gz. The benchmarks are simple execution files and do not need installing. They run in a Terminal window via the normal ./name command. Details are displayed as the tests run and performance results are saved in a .txt file. Details of other Linux benchmarks can be found in linux benchmarks.htm.

To Start


Example Results Log

Following is an example log file of the 64-Bit Single Precision version running on a 3 GHz AMD CPU with GeForce GTS 250 graphics. Some of the CUDA programming code is rather strange, so it was felt necessary to check that all array elements had been used, as reflected in the last two columns. The data checking also led to including parameters to use the programs for burn-in/reliability tests (see later). Note that the maximum speed shown here is nearly 172 GFLOPS.
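
Continuing the illustrative sketch above (again, not the actual benchmark source), a check of this general form confirms, after copying results back to host RAM, that every element holds the same value as the first:

 #########################################################################
 Illustrative Result Check

 // Compare every element with the first, then report the first value
 // and Yes/No for the All Same column.
 int same = 1;
 for (int i = 1; i < n; i++)
     if (h_x[i] != h_x[0]) { same = 0; break; }
 printf(" %0.13f %s\n", h_x[0], same ? "Yes" : "No");
 #########################################################################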

 #####################################################
 Assembler CPUID and RDTSC 
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42 
 AMD Phenom(tm) II X4 945 Processor 
 Measured - Minimum 3000 MHz, Maximum 3000 MHz 
 Linux Functions 
 get_nprocs() - CPUs 4, Configured CPUs 4 
 get_phys_pages() and size - RAM Size 7.81 GB, Page Size 4096 Bytes 
 uname() - Linux, roy-AMD4, 2.6.35-22-generic 
 #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010, x86_64 
 #####################################################
 Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Wed Dec 29 15:35:35 2010
 CUDA devices found 
 Device 0: GeForce GTS 250 with 16 Processors 128 cores 
 Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512
 Using 256 Threads
 Test            4 Byte   Ops   Repeat   Seconds   MFLOPS   First           All
                 Words    /Wd   Passes                      Results         Same
 Data in & out 100000 2 2500 1.035893 483 0.9295383095741 Yes
 Data out only 100000 2 2500 0.514445 972 0.9295383095741 Yes
 Calculate only 100000 2 2500 0.082464 6063 0.9295383095741 Yes
 Data in & out 1000000 2 250 0.706176 708 0.9925497770309 Yes
 Data out only 1000000 2 250 0.380928 1313 0.9925497770309 Yes
 Calculate only 1000000 2 250 0.051266 9753 0.9925497770309 Yes
 Data in & out 10000000 2 25 0.639933 781 0.9992496371269 Yes
 Data out only 10000000 2 25 0.339051 1475 0.9992496371269 Yes
 Calculate only 10000000 2 25 0.041672 11999 0.9992496371269 Yes
 Data in & out 100000 8 2500 1.013196 1974 0.9569796919823 Yes
 Data out only 100000 8 2500 0.490317 4079 0.9569796919823 Yes
 Calculate only 100000 8 2500 0.088028 22720 0.9569796919823 Yes
 Data in & out 1000000 8 250 0.666709 3000 0.9955092668533 Yes
 Data out only 1000000 8 250 0.351320 5693 0.9955092668533 Yes
 Calculate only 1000000 8 250 0.052704 37948 0.9955092668533 Yes
 Data in & out 10000000 8 25 0.620265 3224 0.9995486140251 Yes
 Data out only 10000000 8 25 0.335467 5962 0.9995486140251 Yes
 Calculate only 10000000 8 25 0.044453 44992 0.9995486140251 Yes
 Data in & out 100000 32 2500 1.057142 7568 0.8900792598724 Yes
 Data out only 100000 32 2500 0.531691 15046 0.8900792598724 Yes
 Calculate only 100000 32 2500 0.128706 62157 0.8900792598724 Yes
 Data in & out 1000000 32 250 0.688714 11616 0.9880728721619 Yes
 Data out only 1000000 32 250 0.375411 21310 0.9880728721619 Yes
 Calculate only 1000000 32 250 0.075172 106423 0.9880728721619 Yes
 Data in & out 10000000 32 25 0.644074 12421 0.9987990260124 Yes
 Data out only 10000000 32 25 0.357000 22409 0.9987990260124 Yes
 Calculate only 10000000 32 25 0.062001 129029 0.9987990260124 Yes
 Extra tests - loop in main CUDA Function
 Calculate 10000000 2 25 0.050288 9943 0.9992496371269 Yes
 Shared Memory 10000000 2 25 0.009206 54313 0.9992496371269 Yes
 Calculate 10000000 8 25 0.049608 40316 0.9995486140251 Yes
 Shared Memory 10000000 8 25 0.017254 115916 0.9995486140251 Yes
 Calculate 10000000 32 25 0.050531 158320 0.9987990260124 Yes
 Shared Memory 10000000 32 25 0.046626 171580 0.9987990260124 Yes

To Start


Installing Software

There are uncertainties when installing CUDA. In my case, I am using Ubuntu 10.10, but the only current choice is CUDA Toolkit 3.2 for Ubuntu 10.04 Linux. This Tutorial provides detailed information on how to do it. To stand the best chance of success, the first step is to download the compatible nVidia graphics driver - Developer Drivers for Linux (260.19.26), with 32-bit and 64-bit versions, from nVidia's Linux Developer Zone. Further down the page are links to download 32-bit and 64-bit versions of the CUDA Toolkit for Ubuntu Linux 10.04 and the GPU Computing SDK code samples. It is advisable to install the latter to show that the software runs properly on existing hardware. This also requires the installation of an OpenGL driver, as indicated in the Tutorial.

Before installing the new graphics driver, the system might need reconfiguring to use the basic graphics driver. The Tutorial also provides details of a script file to blacklist various functions, particularly nouveau. In my case, this did not work, but the installer took care of it. The Tutorial commands shown to install the driver required amending for 260.19.26. In my case, with the 64-bit version, on rebooting and unlike the basic driver, the nVidia software did not detect the correct monitor pixel settings. Resetting the default meant that initial settings were incorrect when the appropriate USB drive was used on another PC.

The sample program source code is in the /C/src directory and all the samples are compiled with a single make command, with execution files saved in /C/bin/linux/release.

To Start


Compiling and Running Programs

Each sample program has its own makefile. The easiest way to compile a new program is to create a new folder in /src to contain the source code and associated files, then copy and modify a makefile from another directory in /src. An example makefile is shown below, the .o entries being standard object files to obtain PC configuration details. Later, the includes required for the nvcc compile command were determined, also shown below. In this case, the execution file appears in the same directory as the source files, and this can be in a more convenient location.

Compiling the benchmarks under Windows required different nvcc path directives for producing 32-bit and 64-bit execution files, and separate run time DLL library files had to be included for redistribution, although these did not need path settings. Using Linux, the same simple nvcc format can be used for both bit size compilations, but separate library files (libcudart.so.3) are needed for redistribution. With this in the execution file directory, the library is made accessible using the export command shown below.

Both the 32-bit and 64-bit benchmarks could be compiled and run using the nVidia 260.19.26 graphics driver with GeForce 8600 GT and GeForce GTS 250 graphics cards on different PCs. The 32-bit and 64-bit nVidia graphics driver versions 260.19.06, recommended by Ubuntu System - Administration - Additional Drivers, were installed on other USB drives and these ran the benchmarks successfully on the same PCs.

Although Windows versions of the benchmarks ran successfully wherever tried, the Linux varieties sometimes failed with the error message "cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable", particularly on a laptop. This occurred randomly using 32-bit and 64-bit Ubuntu and both drivers. In this state, CUDA can produce device statistics but will not run most of the provided sample programs. It is as though the hardware is stuck in a bad state, but this can usually be overcome by powering off/on and rebooting. Googling shows that this type of error is quite common, including under Windows.
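
cudaSafeCall() comes from the SDK's cutil helper headers, which wrap Runtime API calls and print the error string on failure. A minimal equivalent check, shown as a sketch rather than the SDK code, is:

 #########################################################################
 Illustrative Runtime API Error Check

 // Requires <cstdio>, <cstdlib> and cuda_runtime.h. Wrap any Runtime
 // API call and stop with a readable message if it fails.
 #define CUDA_CHECK(call)                                          \
     do {                                                          \
         cudaError_t err = (call);                                 \
         if (err != cudaSuccess) {                                 \
             fprintf(stderr, "Runtime API error : %s\n",           \
                     cudaGetErrorString(err));                     \
             exit(1);                                              \
         }                                                         \
     } while (0)

 // Example: CUDA_CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
 #########################################################################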

 #########################################################################
 Makefile
 # Add source files here
 EXECUTABLE	 := cudamflops
 # Cuda source files (compiled with cudacc)
 CUFILES		 := cudaMFLOPS1.cu
 # CUDA dependency files
 CU_DEPS		 :=
 # C/C++ source files (compiled with gcc/c++)
 CCFILES		 := cpuida64.o cpuidc64.o
 # Rules and Targets
 include ../../common/common.mk
 #########################################################################
 Compile and Link Command
 nvcc cudaMFLOPS1.cu -I ~/NVIDIA_GPU_Computing_SDK/C/common/inc 
 -I ~/NVIDIA_GPU_Computing_SDK/shared/inc
 cpuida64.o cpuidc64.o -o cudamflops
 #########################################################################
 Set Library File Path
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/

To Start


Comparative Results

The GTS 250 is on a 16x PCI-E bus whose maximum speed is 4 GB/second, or 1 G 32-bit words per second. The Data Out Only tests execute 2, 8 or 32 single precision floating point operations per word. Dividing the faster MFLOPS speeds by these numbers indicates a data transfer speed of around 700 MW or 2.8 GB per second, the bus speed clearly being the limiting factor. With data in and out, speed is reduced by half, as would be expected. Maximum data size is 40 MB, which can easily be accommodated in the available graphics RAM, but the speed of the latter will affect the Calculate and Shared Memory tests.
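
For example, taking the fastest Data Out Only result from the log above (22409 MFLOPS at 32 operations per word):

 22409 MFLOPS / 32 operations per word = 700 million words per second
 700 MW/second x 4 bytes per word = 2.8 GB/second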

Video RAM throughput is 70.4 GB/second, or 17.6 G words per second. With data in and out, 8.8 Gw/second can be assumed. Multiplying this by 2, 8 and 32 operations indicates maximum speeds of 17600, 70400 and 281600 MFLOPS. Actual results of Calculate tests are up to 70% of these speeds. Shared Memory tests use somewhat faster cache-like storage. The maximum speed specification of the GTS 250 is 1.836 GHz x 128 processors x 3 (linked multiply, add and multiply), or 705 GFLOPS, but this would be difficult to sustain with memory access.
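
The Shared Memory variant keeps the repeat loop's working value in this fast on-chip storage rather than in video RAM. Below is a hypothetical sketch of that pattern, again not the benchmark source:

 #########################################################################
 Illustrative Shared Memory Kernel

 // Stage each thread's word in on-chip shared memory, repeat the
 // arithmetic there, then write the result back to video RAM once.
 __global__ void triadShared(float *x, int n, int repeats,
                             float a, float b)
 {
     __shared__ float s[256];              // one word per thread
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
     {
         s[threadIdx.x] = x[i];
         for (int r = 0; r < repeats; r++) // 2 operations per pass
             s[threadIdx.x] = (s[threadIdx.x] + a) * b;
         x[i] = s[threadIdx.x];
     }
 }
 #########################################################################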

Results below are for GeForce 8600 GT graphics on a Core 2 Duo based PC and GeForce GTS 250 on a 3.0 GHz Quad Core Phenom based system. The 64b and 32b keys refer to benchmarks compiled to run on 64-bit or 32-bit versions of the Operating System. The first and fifth columns are for tests compiled for Windows using CUDA Toolkit 3.1 and the following set is for Linux using Toolkit 3.2. Some Linux speeds are faster, but this might be due to the toolkit version. The other Single Precision (SP) and Double Precision (DP) results are typical of 64-bit and 32-bit compilations where, as the tests mainly depend on the speed of the graphics processors, performance is similar when using the same precision. Compilations for 32-bit tests produced the same speeds when run via 32-Bit and 64-Bit versions of Ubuntu, and the alternative graphics driver made no difference. Of particular note, DP speed (shown below) is much slower than that for SP calculations.

Later results are for faster cards running at a maximum of up to 790 GFLOPS. Results added in 2014 are for a mid range GeForce GTX 650, with a 3.7 GHz Core i7, via Windows 8.1 and Ubuntu 14.04. A maximum of 412 GFLOPS was demonstrated, making it more than twice as fast as the more expensive GTS 250 from three years earlier. The i7 Asus P9X79 LE motherboard has PCI Express 3.0 x16 which, along with faster RAM and CPU GHz, produces the fastest speeds so far where data in and out, or out only, are involved. Earlier systems probably had PCIe 1 with a maximum bandwidth of 4 GB/s, or PCIe 2 at 8 GB/s, compared with 15.74 GB/s for PCIe 3.


 Speed In MFLOPS
                              GTS250  GTS250  8600    8600    GTS250  GTS250  8600    8600
                              Win     Linux   Linux   Linux   Win     Linux   Linux   Linux
 Test          100K Words x   3.1 SP  3.2 SP  3.2 SP  3.2 SP  3.1 DP  3.2 DP  3.2 DP  3.2 DP
               Ops x Passes   64b     64b     64b     32b     64b     64b     64b     32b

 Data in & out 1x2x2500 347 496 266 265 201 238 116 115
 Data out only 1x2x2500 751 979 462 458 352 395 180 179
 Calculate only 1x2x2500 2990 6030 2797 2739 960 1155 496 493
 Data in & out 10x2x250 605 714 390 388 255 297 155 154
 Data out only 10x2x250 1118 1312 632 629 393 463 215 215
 Calculate only 10x2x250 9989 9809 3529 3546 1109 1125 457 455
 Data in & out 100x2x25 680 796 445 441 255 309 165 163
 Data out only 100x2x25 1248 1469 694 691 407 483 225 223
 Calculate only 100x2x25 12881 11935 3943 3913 1127 1147 463 476
 Data in & out 1x8x2500 1331 1906 1062 1053 792 999 460 458
 Data out only 1x8x2500 2955 4086 1827 1811 1380 1649 715 711
 Calculate only 1x8x2500 11685 22809 10174 9928 3892 4588 1962 1956
 Data in & out 10x8x250 2428 3075 1547 1537 1057 1264 616 614
 Data out only 10x8x250 4562 5834 2499 2480 1692 2037 849 847
 Calculate only 10x8x250 38792 38811 13109 13056 4429 4517 1795 1790
 Data in & out 100x8x25 2856 3174 1764 1750 1075 1241 649 644
 Data out only 100x8x25 5144 5901 2726 2722 1758 1939 872 873
 Calculate only 100x8x25 51550 49304 14481 14588 4562 4591 1791 1786
 Data in & out 1x32x2500 5895 7332 3902 3857 3306 3971 1823 1815
 Data out only 1x32x2500 10687 15111 6356 6245 5496 6499 2818 2801
 Calculate only 1x32x2500 38843 62060 22228 21261 15087 17780 7544 7446
 Data in & out 10x32x250 9152 11828 5586 5553 4040 5063 2439 2429
 Data out only 10x32x250 16849 21985 8505 8448 6770 8184 3338 3331
 Calculate only 10x32x250 108303 104855 27363 27226 18091 18220 6892 6911
 Data in & out 100x32x25 10792 12274 6293 6243 4451 4990 2548 2534
 Data out only 100x32x25 19033 22096 9215 9170 7102 7783 3421 3411
 Calculate only 100x32x25 135130 117655 29034 29177 18495 18640 6892 6930
 Extra tests - loop in main CUDA Function
 Calculate 100x2x25 10044 10021 3825 3825 965 943 443 445
 Shared Memory 100x2x25 54062 52088 10710 10717 37286 37414 8940 8947
 Calculate 100x8x25 40262 40233 15144 15199 3761 3862 1772 1777
 Shared Memory 100x8x25 119569 117125 23384 23591 106938 107308 21113 21117
 Calculate 100x32x25 158195 158537 31317 31309 15079 15430 7149 7226
 Shared Memory 100x32x25 172911 171721 34033 34108 163780 164046 32243 32139

                              Corei7  Corei7  Phenom  Corei7  Corei7
                              2.8GHz  2.8GHz  3.0GHz  3.7GHz  3.7GHz
                              GTX580  GTX580  GTX570  GTX650  GTX650
                              Win     Win     Linux   Win     Linux
 Test          100K Words x   3.1 DP  3.1 SP  3.2 SP  3.2 SP  3.2 SP
               Ops x Passes   64b     64b     64b     64b     64b

 Data in & out 1x2x2500 299 511 403 459 597
 Data out only 1x2x2500 557 936 1010 1059 1283
 Calculate only 1x2x2500 4875 4084 13882 3449 5834
 Data in & out 10x2x250 455 832 654 893 1133
 Data out only 10x2x250 922 1704 1245 1790 2183
 Calculate only 10x2x250 14072 18791 29634 8806 9666
 Data in & out 100x2x25 505 991 709 980 1355
 Data out only 100x2x25 973 1934 1269 1852 2485
 Calculate only 100x2x25 18085 31348 34996 10530 10411
 Data in & out 1x8x2500 1162 1939 1949 2375 2823
 Data out only 1x8x2500 2178 3511 4068 4151 5152
 Calculate only 1x8x2500 16481 14109 45930 13056 21679
 Data in & out 10x8x250 1823 3305 2708 3545 4178
 Data out only 10x8x250 3707 6784 5502 7107 8651
 Calculate only 10x8x250 53651 79136 113472 34014 37138
 Data in & out 100x8x25 2059 3970 2839 3896 5396
 Data out only 100x8x25 4011 7762 4938 7283 9882
 Calculate only 100x8x25 70715 122580 138639 40905 40599
 Data in & out 1x32x2500 4330 7181 7835 9183 11034
 Data out only 1x32x2500 7771 12760 15013 15769 19628
 Calculate only 1x32x2500 35705 37085 109243 43975 70679
 Data in & out 10x32x250 6913 13191 11043 14006 16069
 Data out only 10x32x250 13692 26278 20691 27684 30597
 Calculate only 10x32x250 93702 212808 375859 120972 133042
 Data in & out 100x32x25 7896 15775 10925 15499 21283
 Data out only 100x32x25 14766 30816 20591 28906 38528
 Calculate only 100x32x25 112501 414020 510582 147100 146204
 Extra tests - loop in main
 Calculate 100x2x25 50860 80702 88987 26876 27613
 Shared Memory 100x2x25 80755 142312 160615 77049 64308
 Calculate 100x8x25 103214 262176 289222 81484 79671
 Shared Memory 100x8x25 110153 386225 426749 181190 229241
 Calculate 100x32x25 121398 585688 650986 216570 219797
 Shared Memory 100x32x25 121878 709577 790930 400966 412070



To Start


Burn-In/Reliability Tests

The program has run time parameters to vary the number of threads, words and repeat passes. Details are provided in CUDA1.htm, which also covers the other parameters available to run a reliability/burn-in test. These are the running time in minutes and the seconds between logged results, default 15 seconds. Calculate is the default routine, but the Shared Memory test can be used with an added parameter of FC. Results are checked for consistent values and performance is measured.

Below are the results of a 10 minute test, where GPU temperature and fan speed are also shown, as provided by nVidia X Server Statistics for Thermal Settings via System, Preferences, Monitors. CPU utilisation was also noted from System Monitor and, surprisingly, showed the equivalent of two of the four Phenom cores running flat out. On running this test, it is advisable to set power saving options to "Never", and five second reporting would be more appropriate to minimise the time that the display is frozen.

See also linux burn-in apps.htm.


 Command: ./cudamflops32SP Mins 10, FC

 Linux CUDA 3.2 x86 32 Bits SP MFLOPS Benchmark 1.4 Thu Jan 6 18:36:23 2011
 CUDA devices found 
 Device 0: GeForce GTS 250 with 16 Processors 128 cores 
 Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512
 Using 256 Threads
 Shared Memory Reliability Test 10 minutes, report every 15 seconds
 Repeat CUDA 788 times at 1.43 seconds. Repeat former 10 times
 Tests - 10000000 4 Byte Words, 32 Operations Per Word, 7880 Repeat Passes
 Results of all calculations should be - 0.7124574184417725
 Test   Seconds   MFLOPS   Errors       First Value   Word
 1 14.326 176011 None found
 2 14.326 176020 None found
 3 14.326 176020 None found
 4 14.326 176019 None found
 5 14.326 176017 None found
 To
 36 14.326 176018 None found
 37 14.326 176018 None found
 38 14.326 176018 None found
 39 14.326 176014 None found
 40 14.326 176016 None found
 Minutes 0 0.5 1 1.5 2 3 4 5 6 7 8 9 10
 Temperature C 48 60 67 70 72 74 74 74 74 74 74 75 74
 Fan Speed % 35 35 35 35 40 42 45 45 46 46 45 47 47


To Start


Roy Longbottom, January 2015





The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection








