Atom CPU Hyperthreading Benchmark - Roy Longbottom's PC benchmark Collection


Atom CPU Hyperthreading Benchmarks

Introduction

Benchmarks were run on a Netbook that has a single core Intel Atom processor, with 24 KB L1 data cache, 512 KB L2 cache and 533 MHz single channel DDR2 RAM. This CPU is designed for low power consumption, having 16 stage pipelines (longer than those of Core CPUs) and in-order instruction issue (compared with out-of-order on other modern CPUs). Two integer and two floating point arithmetic-logic units are provided (compared with 3 or 4 on the latest mainstream processors). As could be expected from these limitations, performance at a given CPU MHz is lower than that of the Core processor line. On the other hand, the Atom has Hyperthreading, where two threads can use different parts of the hardware at the same time to enhance performance.

CPUSpeed.htm provides a comparative summary of my single CPU processor benchmarks in terms of %MIPS/MHz and %MFLOPS/MHz, with separate figures for CPU/L1 Cache, L2 Cache and RAM. CPU/L1 cache results show that the 1600 MHz Atom runs at the approximate average equivalent speed of a single Core 2 processor at 1200 MHz for Integer MIPS calculations, 550 MHz for i387 MFLOPS, 750 MHz for SSE MFLOPS and 350 MHz for SSE2 64 bit MFLOPS. L2 results are slightly worse, but performance via RAM can be similar to that of a Core 2 system using the same single channel RAM.
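
The equivalent clock speeds quoted above can be turned into rough per-MHz percentages. The following is only an illustrative sketch: the function name is made up for this example, and the figures are simply the ones given in the paragraph, not values read from CPUSpeed.htm.

```python
# Equivalent Core 2 clock speeds quoted in the text for the 1600 MHz Atom
# (CPU/L1 cache tests only).
ATOM_MHZ = 1600
equivalent_core2_mhz = {
    "Integer MIPS": 1200,
    "i387 MFLOPS": 550,
    "SSE MFLOPS": 750,
    "SSE2 64 bit MFLOPS": 350,
}


def per_mhz_percent(core2_mhz, atom_mhz=ATOM_MHZ):
    """Atom speed per MHz as a percentage of Core 2 speed per MHz."""
    return 100.0 * core2_mhz / atom_mhz


for test, mhz in equivalent_core2_mhz.items():
    print(f"{test:20s} {per_mhz_percent(mhz):5.1f}%")
```

On these figures the Atom delivers about three quarters of the Core 2 integer speed per MHz, but well under half for floating point.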

For 32 bit and 64 bit versions of the following multi-threading programs, the benchmarks and source code can be found in DualCore.zip and NewSource.zip. For the fifth benchmark, both are in OpenMPMFLOPS.zip.

 Hardware Information
 CPU GenuineIntel, Features Code BFE9FBFF, Model Code 000106C2
 Intel(R) Atom(TM) CPU N270 @ 1.60GHz Measured 1596 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows Information
 Intel processor architecture, 2 CPUs 
 Windows NT Version 5.1, build 2600, Service Pack 3
 Memory 1015 MB, Free 680 MB
 User Virtual Space 2048 MB, Free 2043 MB

The benchmarks produce text log files, in addition to on screen progress and results. All logs include the above system details. Note that a single processor with Hyperthreading (HT) is identified as having two CPUs.

The first four benchmarks have been modified to use 1, 2, 4, 6 and 8 threads, particularly for the Quad Core i7, which has Hyperthreading and appears to Windows as having 8 CPUs. For further detail and results see Quad Core 8 Thread.htm.


CPUIDMP

CPUIDMP executes three passes of simple additions to registers attempting to demonstrate maximum CPU speeds. Firstly an integer and an SSE floating point test are run separately. They are then run as two threads of equal priority, where both should run at full speed with 2 CPUs. The benchmark has a third section using four threads with two SSE tests and two integer tests. Results are available in WhatCPU Results.htm.

Both types of instruction appear to nearly fully utilise the pipelines (for example, producing two integer additions per CPU clock cycle), so there is not much room for improvement with HT. Summing the percentages in the multi-thread results indicates a 10% to 12% improvement in throughput.

 
 CPU ID and MP Speed Test 32 bit Version 1.0 Sun Jul 11 13:31:53 2010

 Assembled with Microsoft ml.exe Version 6.15.8803

 Speed adding to registers   Pass 1  Pass 2  Pass 3  Average  Percent

 Separate Tests
 32 bit SSE MFLOPS             5042    5066    5066     5058
 32 bit Integer MIPS           3123    3148    3128     3133
 Two Threads Equal Priority
 32 bit SSE MFLOPS             2995    2998    2994     2996      59%
 32 bit Integer MIPS           1585    1588    1586     1586      51%
 Four Threads, First Normal Priority, Others Normal - 1
 32 bit SSE MFLOPS             3040    3002    3074     3039      60%
 32 bit Integer MIPS            561     608     579      583      19%
 32 bit SSE MFLOPS             1024     751     615      797      16%
 32 bit Integer MIPS            442     571     542      518      17%
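
As a rough check on the HT throughput gain, each thread's average speed can be expressed as a percentage of the corresponding separate test and the percentages summed. This is a sketch of that arithmetic only, using the averages from the log above; the variable and function names are chosen for this example.

```python
# Average speeds copied from the log above.
separate = {"sse": 5058, "int": 3133}
two_threads = [("sse", 2996), ("int", 1586)]
four_threads = [("sse", 3039), ("int", 583), ("sse", 797), ("int", 518)]


def combined_percent(threads, baseline=separate):
    """Sum of each thread's speed as a percentage of its separate test."""
    return sum(100.0 * speed / baseline[kind] for kind, speed in threads)


for name, threads in (("Two threads", two_threads), ("Four threads", four_threads)):
    total = combined_percent(threads)
    print(f"{name}: {total:.1f}% of one thread, gain {total - 100.0:.1f}%")
```

Anything above 100% is throughput that HT has added. Summing the rounded percentages printed in the log gives slightly different totals from the unrounded calculation.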


Whets32MP

The Whetstone Benchmark has various routines that execute floating point and integer instructions. The benchmark is run in the main thread and another copy in a low priority second thread which should mainly run at the same speed with two CPUs. When run on a single processor, the second thread receives little or no CPU time. Further results can be found in Whetstone Results.htm.

The second results shown were produced with CPU Affinity flags set to use one processor, where the second thread made no contribution. With HT, the two threads lead to a near doubling of MFLOPS and a 33% improvement in VAX MIPS.

The third results are from running the benchmark on one CPU of a 1830 MHz Core 2 Duo. As can be seen, HT on the Atom leads to faster floating point calculations for these particular tests.

 
 Whetstone Single Precision MP SSE Benchmark Sun Jul 11 13:41:18 2010

                   MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
                    Gmean   MIPS             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

 Total HT             688   6298   1526    704    729    634   27.5   19.8    791   1539   1641
 Thread 1                                  363    357    311   13.9   10.1    396    771   1240
 Thread 2                                  341    372    323   13.6   9.68    396    769    401
 Total No HT          346   4711    820    365    361    314   15.0   10.4    427   1158   1693
 Total one Core 2     539  10964   1790    630    633    393   42.8   21.9   1458   1403   5153
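
The comparisons drawn from this log can be reproduced with a few lines of arithmetic. The sketch below assumes that the Gmean column is the geometric mean of the three MFLOPS columns, which is consistent with the figures shown; the dictionary layout and names are only for illustration.

```python
import math

# Figures copied from the log above: (MFLOPS 1, MFLOPS 2, MFLOPS 3) and Vax MIPS.
results = {
    "Atom HT": {"mflops": (704, 729, 634), "vax_mips": 6298},
    "Atom No HT": {"mflops": (365, 361, 314), "vax_mips": 4711},
    "One Core 2": {"mflops": (630, 633, 393), "vax_mips": 10964},
}


def gmean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.prod(values) ** (1.0 / len(values))


for name, r in results.items():
    print(f"{name:12s} Gmean MFLOPS {gmean(r['mflops']):6.1f}")

ht, no_ht = results["Atom HT"], results["Atom No HT"]
mflops_ratio = gmean(ht["mflops"]) / gmean(no_ht["mflops"])
vax_ratio = ht["vax_mips"] / no_ht["vax_mips"]
print(f"HT / No HT: MFLOPS x{mflops_ratio:.2f}, Vax MIPS x{vax_ratio:.2f}")
```

The MFLOPS ratio comes out just under two and the VAX MIPS ratio at about one and a third, in line with the description above.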


BusMP

BusMP starts by reading integer words with 128 byte (32 words) address increments, to indicate memory or cache bus burst reading speeds, then reduces the increment to finally read all words sequentially. The last test reads 128 bits at a time, as four 32 bit SSE2 integers. Speed is measured using data in caches and in RAM. Results are in BusSpd2K Results.htm.

These results show that performance can be much slower using two threads. In this case each thread has its own full copy of the data, so the two threads together need twice the space. With each requiring 24 KB, the combined 48 KB exceeds the 24 KB L1 cache and data is read from L2 cache instead of L1; at 384 KB each, the combined 768 KB exceeds the 512 KB L2 cache and data is read from RAM instead of from L2 cache. There is a small gain using L1 cache and nearly 60% via L2 but, as might be expected, speed from RAM is slightly slower. The worst case is half speed with 24 KB data. When two threads are run with Affinity set to use one CPU (No HT), the total speed of the two threads is the same as that for one thread.

 
 MP Bus Speed Test 32 bit Version 1.2 Sun Jul 11 13:23:16 2010

 Part 1 - Single Thread MBytes/Second

  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2
       6     4453     5223     5331     5633     5665     5566    23245
      24     3439     3649     3415     4424     5046     5265    14800
      96      462      394      735     1360     2380     3525     5504
     384      431      386      712     1351     2280     3455     5364
     768      126      223      462      936     1777     3115     3747
    1536      115      220      442      866     1712     3004     3538
   16380      102      207      409      828     1621     2988     3272
  131070      103      204      414      814     1644     2945     3310

 Part 2 - Two Threads Total MBytes/Second
                                                                        Average
  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  % of 1
       6     5065     5679     5802     5959     5965     5913    23804    107%
      24      696      758     1350     2301     3644     4562     9151     50%
      96      645      744     1307     2243     3642     4540     8957    159%
     384      111      230      477      962     1962     3628     3744     69%
     768      102      209      419      833     1690     3340     3328 ]
    1536      102      208      415      815     1645     2794     3069 ]
   16380       97      204      415      826     1673     3308     3259 ]
  131070      101      208      418      829     1672     3311     3294 ]   97%

 For 32 bit MIPS divide MB/Second by 4. SSE2 divide by 16 for 128 bit MIPS
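
The "Average % of 1" column appears to be the mean, across the seven tests, of the two-thread speed divided by the single-thread speed. The sketch below checks that assumption against the first four data sizes, using figures copied from the log; the names are chosen for this example.

```python
# MBytes/second per test (Inc32wds ... 128bSSE2), keyed by data size in KB.
one_thread = {
    6: (4453, 5223, 5331, 5633, 5665, 5566, 23245),
    24: (3439, 3649, 3415, 4424, 5046, 5265, 14800),
    96: (462, 394, 735, 1360, 2380, 3525, 5504),
    384: (431, 386, 712, 1351, 2280, 3455, 5364),
}
two_threads = {
    6: (5065, 5679, 5802, 5959, 5965, 5913, 23804),
    24: (696, 758, 1350, 2301, 3644, 4562, 9151),
    96: (645, 744, 1307, 2243, 3642, 4540, 8957),
    384: (111, 230, 477, 962, 1962, 3628, 3744),
}


def average_percent(kbytes):
    """Mean of the per-test two-thread / one-thread speed ratios, as a percentage."""
    ratios = [t2 / t1 for t1, t2 in zip(one_thread[kbytes], two_threads[kbytes])]
    return 100.0 * sum(ratios) / len(ratios)


for kb in one_thread:
    print(f"{kb:4d} KB  {average_percent(kb):5.1f}% of one thread")
```

The 24 KB and 384 KB rows are the sizes where the two copies of the data no longer fit in L1 and L2 cache respectively, which is where the percentages drop well below 100%.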


RandMP32

The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM. Results are in RandMem Results.htm. With two threads, each has its own code and both use the same data, but the second thread starts at the half way point.

Below the logged results, the total speeds of the two threads are shown as percentages of the single thread speeds. Random access is particularly slow in terms of MB/second, as not all of the data transferred is used when burst reading is involved. The greater than two times improvement on some random read/write tests might be due to achieving a higher hit rate on cached data.

The last percentages are for a Core 2 Duo CPU, where speeds via L1 cache can be much slower with two threads, reading and writing from/to two L1 caches. This can be put down to caches being flushed to maintain data coherency when sharing the same data array. This effect is also apparent on larger data sizes, to some extent, on dual core CPUs that do not use shared L2 caches.

 RandMP Write/Read Test 32 bit Version 1.1 Sun Jul 11 13:36:02 2010

             ------------------ MBytes Per Second At --------------------
                6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD      3718    3797    2802    2803    2040    2048    2030    2032
 Serial RW      1946    2252    1902    1902    1415    1186    1120    1112
 Random RD      3412    3869     802     489     126      77      55      45
 Random RW      1840    2265     822     516     185     110      80      68
 2 Threads
 Serial RD1     2444    2823    2248    2283    1618    1612    1593    1584
 Serial RD2     2425    2783    2204    2261    1592    1573    1575    1563
 Serial RW1     1823    2141    1722    1712    1328    1082     851     723
 Serial RW2     1762    2090    1706    1696    1209    1022     712     576
 Random RD1     2631    2858     633     408     120      72      52      42
 Random RD2     2602    2819     619     401     119      71      52      41
 Random RW1     1797    2087     657     424     205     137      66      80
 Random RW2     1790    2078     649     414     202     137      64      80

 End of test Sun Jul 11 13:36:54 2010
 For approximate speed in MIPS divide MBytes/Second by 3.2

                6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 2 Thread % of 1
 Serial RD      131%    148%    159%    162%    157%    156%    156%    155%
 Serial RW      184%    188%    180%    179%    179%    177%    140%    117%
 Random RD      153%    147%    156%    165%    190%    186%    189%    184%
 Random RW      195%    184%    159%    162%    220%    249%    163%    235%
 Example Dual Core CPU
 Serial RD      205%    196%    191%    193%    191%    189%    184%    184%
 Serial RW       47%     46%    181%    181%    182%    184%    132%    148%
 Random RD      200%    197%    166%    163%    162%    164%    143%    195%
 Random RW       17%     17%     80%    142%    148%    158%    169%    201%
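
The "2 Thread % of 1" rows can be reproduced by adding the speeds of the two threads and dividing by the single thread speed. The sketch below does this for the Serial RD and Random RW rows, using figures from the log; the half-up rounding helper is an assumption made here to match the printed percentages.

```python
# MBytes/second at 6 KB, 24 KB, 96 KB, 384 KB, 768 KB, 1536 KB, 12 MB, 96 MB.
one_thread = {
    "Serial RD": (3718, 3797, 2802, 2803, 2040, 2048, 2030, 2032),
    "Random RW": (1840, 2265, 822, 516, 185, 110, 80, 68),
}
two_threads = {
    "Serial RD": ((2444, 2823, 2248, 2283, 1618, 1612, 1593, 1584),
                  (2425, 2783, 2204, 2261, 1592, 1573, 1575, 1563)),
    "Random RW": ((1797, 2087, 657, 424, 205, 137, 66, 80),
                  (1790, 2078, 649, 414, 202, 137, 64, 80)),
}


def round_half_up(value):
    """Round a positive number to the nearest integer, with .5 rounding up."""
    return int(value + 0.5)


def percent_of_one(test):
    """Total speed of both threads as a percentage of one thread, per data size."""
    first, second = two_threads[test]
    return [round_half_up(100.0 * (a + b) / single)
            for a, b, single in zip(first, second, one_thread[test])]


for test in one_thread:
    print(test, percent_of_one(test))
```

Values above 200% for Random RW are the cases discussed above, where the two threads together more than double the single thread speed.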


OpenMPMFLOPS

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. Results are in OpenMP MFLOPS.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Three data sizes are used, 0.1M, 1M and 10M words or 0.4M, 4M and 40M Bytes. In the case of the Atom, virtually all calculations will involve accessing RAM.
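
To make the operation count concrete, the expression quoted above contains three additions inside the brackets, three multiplications, one subtraction and one further addition, giving 8 floating point operations per data word. The sketch below is only an illustration in Python with made-up constants; the benchmark itself is compiled C/C++ with OpenMP, and its actual constants are not given here.

```python
def eight_ops_per_word(x, a, b, c, d, e, f):
    """Apply x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f to each word.

    Per word: 3 additions in brackets + 3 multiplications
    + 1 subtraction + 1 addition = 8 operations.
    """
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


# Hypothetical constants, chosen only so the result is easy to check by hand.
data = [1.0, 2.0]
print(eight_ops_per_word(data, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
```

In the real benchmark each loop iteration is independent of the others, which is what allows the work to be divided between threads automatically.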

The first results shown below were produced with CPU Affinity flags set to use one processor (No HT), with the second ones when using Hyperthreading. With HT, speeds are in the range 167% to 187% of those without. Although these gains are excellent, performance relative to a single CPU of a Core 2 Duo is not very good, with the latter having much larger caches and better arithmetic pipeline arrangements. For example, all Atom HT results are much slower than one CPU of a 1830 MHz laptop Core 2 Duo, the latter being two to more than three times faster.

 32 Bit OpenMP MFLOPS Benchmark 1 Sun Jul 11 13:53:33 2010 - No HT
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86

 Test             4 Byte  Ops/  Repeat    Seconds  MFLOPS     First  All
                   Words  Word  Passes                      Results  Same
 Data in & out    100000     2    2500   2.895631     173  0.929475  Yes
 Data in & out   1000000     2     250   2.688429     186  0.992543  Yes
 Data in & out  10000000     2      25   2.654217     188  0.999249  Yes
 Data in & out    100000     8    2500   7.871326     254  0.957164  Yes
 Data in & out   1000000     8     250   7.621398     262  0.995525  Yes
 Data in & out  10000000     8      25   7.593257     263  0.999550  Yes
 Data in & out    100000    32    2500  21.799048     367  0.890377  Yes
 Data in & out   1000000    32     250  21.626982     370  0.988102  Yes
 Data in & out  10000000    32      25  21.598040     370  0.998799  Yes
 ******************************************************************************
 32 Bit OpenMP MFLOPS Benchmark 1 Sun Jul 11 13:52:14 2010 - With HT
 Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86

 Test             4 Byte  Ops/  Repeat    Seconds  MFLOPS     First  All
                   Words  Word  Passes                      Results  Same
 Data in & out    100000     2    2500   1.546320     323  0.929475  Yes
 Data in & out   1000000     2     250   1.608478     311  0.992543  Yes
 Data in & out  10000000     2      25   1.512073     331  0.999249  Yes
 Data in & out    100000     8    2500   4.345824     460  0.957164  Yes
 Data in & out   1000000     8     250   4.519152     443  0.995525  Yes
 Data in & out  10000000     8      25   4.209331     475  0.999550  Yes
 Data in & out    100000    32    2500  11.787812     679  0.890377  Yes
 Data in & out   1000000    32     250  11.788433     679  0.988102  Yes
 Data in & out  10000000    32      25  11.808370     677  0.998799  Yes
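
The MFLOPS column is consistent with words x operations per word x passes, divided by the elapsed seconds. The sketch below applies that formula to the logged timings and derives the HT speed ratios; the formula is inferred from the log columns rather than taken from the benchmark source, and the names are chosen for this example.

```python
# (words, ops per word, passes, seconds without HT, seconds with HT), from the logs.
runs = [
    (100000, 2, 2500, 2.895631, 1.546320),
    (1000000, 2, 250, 2.688429, 1.608478),
    (10000000, 2, 25, 2.654217, 1.512073),
    (100000, 8, 2500, 7.871326, 4.345824),
    (1000000, 8, 250, 7.621398, 4.519152),
    (10000000, 8, 25, 7.593257, 4.209331),
    (100000, 32, 2500, 21.799048, 11.787812),
    (1000000, 32, 250, 21.626982, 11.788433),
    (10000000, 32, 25, 21.598040, 11.808370),
]


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second."""
    return words * ops_per_word * passes / seconds / 1e6


ht_ratios = []
for words, ops, passes, secs_no_ht, secs_ht in runs:
    no_ht = mflops(words, ops, passes, secs_no_ht)
    ht = mflops(words, ops, passes, secs_ht)
    ht_ratios.append(ht / no_ht)
    print(f"{words:9d} words {ops:2d} ops  No HT {no_ht:4.0f}  HT {ht:4.0f}  x{ht / no_ht:.2f}")

print(f"HT speed ratio from x{min(ht_ratios):.2f} to x{max(ht_ratios):.2f}")
```

The smallest ratio is for the 1M word test at 2 operations per word, and the largest for the 0.1M word test at 2 operations per word.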


Other Benchmarks

Disk Speed - The 5400 RPM hard disk in the Netbook operates at a reasonable speed. DiskGraf Results.htm (jAtoLap) shows 34 MB/second writing and 47 MB/second reading. BMPSpeed Results.htm (AtomM) shows image writing speeds of up to 39 MB/second, aided by caching, and reading/formatting at up to 36 MB/second.

Intel 945GSE Graphics - Taking average BMPSpeed edit, rotate, save and load times of around 1 second as demonstrating suitability for editing images, the Netbook appears to be adequate for images of up to 32 MB, as good as many Pentium 4 based PCs (see BMPSpeed Results.htm, 32 MB). Image scrolling speed, at this size, is also very fast at 1.9 milliseconds per full screen window or 323 M Pixels/second.

Results for system Atom 1, at 32 bit colour settings, show that the Netbook is as good as many Pentium 4 systems when running Windows and DirectDraw 2D benchmarks. As might be expected, the Netbook is not very fast on 3D applications, but it will run some of them. Atom 1, 32 bit results for Direct3D show that maximum Frames Per Second (FPS) are similar to those of a laptop with Intel X3100 graphics (at the same screen pixel size), but can be faster on tests that are more dependent on CPU speed. The PC can also run DirectX 9 applications, with PixelShader 2, but not Vertex Shader 2, again possibly faster than X3100. The same arguments apply to OpenGL applications. It should be noted that performance is generally better than PCs of the Pentium II era.


Roy Longbottom August 2010

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection
