Raspberry Pi 4B 32 Bit Benchmarks


Summary

Previously, I ran my 32 bit and 64 bit benchmarks on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available in Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and stress tests.htm. The early opportunity to run the programs on the new model arose from my accepting a request to become a volunteer consultant, exercising the system prior to launch. This report contains brief reminders of the benchmarks, with 32 bit results on the new Raspberry Pi 4 using the Raspbian Buster Operating System. Existing benchmarks were used to provide comparisons with the old 3B+ model. The benchmarks were also recompiled using gcc 8, which came with Buster, to provide further comparisons. The benchmarks and results are summarised as follows.

Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks. Compared with a Pi 4B/Pi 3B+ CPU MHz ratio of 1.07, the overall performance gains for these four programs increased to around 1.8, 2.0, 4.0 and 2.8 times, with some further improvements between 1.05 and 1.26 from gcc 8 compilations.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These include eight different measurements of FFTs, at 11 increasing sizes, with average Pi 4B speed gains of 3.26 times. BusSpeed was intended to identify maximum reading speeds, where there was not much difference from L1 cache, some gain via L2 cache and 80% from RAM, increasing by a further 25% using the gcc 8 compilation. MemSpeed and NeonSpeed carry out floating point and integer calculations, providing Pi 4B speed gains at all levels, best with double precision floating point calculations at greater than five times.

Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. The first are for Whetstone, Dhrystone and Linpack benchmarks, providing similar Pi 4B gains as the single core versions, with only Whetstones providing effective four core performance.

Various multithreaded and OpenMP cache/RAM benchmarks were run, these mainly demonstrating the sort of code that is good and bad for efficient MP utilisation. Most demonstrated appropriate single core Pi 4B performance gains, but with some other relationships totally confusing.

Finally, a number of benchmarks attempt to measure maximum MFLOPS floating point speed, using the same series of calculations, with variants covering single and double precision (SP and DP), vector intrinsic functions and OpenMP. Best DP performance was 10.4 GFLOPS with SP at 19.9 GFLOPS. Highest Pi 4B/Pi 3B+ gains were 6.69 times DP and 5.15 times SP. The gcc 8 compilations provided some improvement in speed.

Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided, along with one using JavaDraw procedures. Test functions of the former were more than twice as fast on the Pi 4B as on the 3B+, with similar gains via JavaDraw on the more demanding tests and on many of the 25 OpenGL test routines. Initially Oracle Java 8 was used, but later tests were via OpenJDK 11.

Drive, LAN and WiFi Benchmarks - Variations of the same program are provided to benchmark internal and USB drives or LAN and WiFi connections, measuring performance using large files, small files and random access. Considering large files, the Pi 4B performance improvements shown were up to four times for LAN and over five times for USB 3, with similar scores using WiFi.

Stress Tests - These have also been run and will be covered in a later report. Default mode provides useful benchmarking information, as shown below. Pi 4B/Pi 3B+ performance ratios are shown to be up to 4.23 for cache based data and 2.09 using RAM.


Introduction

The Raspberry Pi 4B uses a quad core ARM A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is 3200-LPDDR4 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet.

I have run my benchmarks on the new system, where more descriptions and earlier results can be found in Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and stress tests.htm. The early opportunity to run the programs was due to my acceptance of the request for me to become a volunteer consultant, exercising the system prior to launch.

The programs and source codes used are available for downloading in Raspberry-Pi-4-Benchmarks.tar.gz.

My most recent benchmarks were compiled for the Raspberry Pi 2, using gcc 4.8. I tried others later, but they did not seem to make much difference. I thought that using a Cortex A72 might, so I have compiled the programs using gcc 8. The first step was to change the functions used to identify the hardware, where the existing procedures replicate information for each core (even four lots were too much). I noted that the lscpu command now provides adequate detail, so I use this now. The Raspbian release is also provided. RPi 3B+ and RPi 4B details are as follows:

Pi 3B+
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Model: 4
Model name: ARMv7 Processor rev 4 (v7l)
CPU max MHz: 1400.0000
CPU min MHz: 600.0000
BogoMIPS: 89.60
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
 idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2018-04-18
Pi 4B
 
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 270.00
Flags: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
 idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13
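
As a rough illustration of using lscpu output for system identification, the "Key: value" lines it prints can be split into a dictionary. This is only a sketch in Python, not the code used in the benchmarks (which are C programs); the function names parse_lscpu and read_lscpu are hypothetical, introduced here for the example.

```python
import subprocess

def parse_lscpu(text):
    """Split 'Key: value' lines, as printed by lscpu, into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info

def read_lscpu():
    """Run lscpu (Linux only) and return the parsed fields."""
    out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True)
    return parse_lscpu(out.stdout)

# A few lines copied from the Pi 4B listing above
sample = """Architecture: armv7l
CPU(s): 4
Model name: Cortex-A72
CPU max MHz: 1500.0000
"""
fields = parse_lscpu(sample)
print(fields["Model name"], fields["CPU max MHz"])
```

Continuation lines without a colon, such as the second line of Flags, are simply skipped by this sketch.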

Benchmark Results

The following provide benchmark results with limited comments on Raspberry Pi 4B performance gains over Pi 3B+ and relative Pi 4B relationships between older ARM V7 and gcc 8 compilations. For the first few, ancient benchmarks, ARM V6 compilations are also compared.


Whetstone Benchmark - whetstonePiA6, whetstonePiA7, whetstonePiC8

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three can be over optimised, but the time does not affect the overall rating much.

In this case, the overall MWIPS ratios provide valid comparisons, the Pi 4B being between 1.76 and 1.87 times faster than the 3B+. The gcc 8 compilation provided no real improvement.


 System     MHz  MWIPS -----MFLOPS------ -------------MOPS--------------
                           1     2     3   COS   EXP  FIXPT    IF  EQUAL
 ARM V6
 Pi 3B+    1400   1094   391   407   348  21.7  12.3   1740  2084   1391
 Pi 4B     1500   2048   520   473   389  53.8  27.1   2497  2245   2246
 4B/3B+    1.07   1.87  1.33  1.16  1.12  2.47  2.20   1.44  1.08   1.61

 ARM V7
 Pi 3B+    1400   1060   391   383   298  21.7  12.3   1740  2083   1392
 Pi 4B     1500   1884   516   478   310  54.7  27.1   2498  2247    999
 4B/3B+    1.07   1.78  1.32  1.25  1.04  2.52  2.21   1.44  1.08   0.72

 gcc 8
 Pi 3B+    1400   1063   393   373   300  21.8  12.3   1748  2097   1398
 Pi 4B     1500   1883   522   471   313  54.9  26.4   2496  3178    998
 4B/3B+    1.07   1.76  1.33  1.26  1.05  2.51  2.09   1.43  1.52   0.71

 gcc 8/V7
 Pi 4B     1.00   1.00  1.01  0.99  1.01  1.00  0.97   1.00  1.41   1.00
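
The 4B/3B+ rows are simple quotients of the two result lines. A minimal Python sketch of that arithmetic follows, using MWIPS and MHz figures copied from the table above; the helper name ratio is hypothetical, and rounding to two decimals is an assumption about how the table entries were produced.

```python
def ratio(new, old):
    """Performance ratio, rounded to two decimals as in the tables."""
    return round(new / old, 2)

# MWIPS values copied from the Whetstone table above: (Pi 3B+, Pi 4B)
mwips = {"ARM V6": (1094, 2048), "ARM V7": (1060, 1884)}

for build, (pi3, pi4) in mwips.items():
    print(build, "MWIPS 4B/3B+", ratio(pi4, pi3))

# CPU clock ratio used as the baseline for all comparisons
print("MHz 4B/3B+", ratio(1500, 1400))
```

A ratio well above the 1.07 clock ratio indicates a gain from the processor design rather than from clock speed alone.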


Dhrystone Benchmark - dhrystonePiA6, dhrystonePiA7, dhrystonePiC8

This appears to be the most popular ARM benchmark and often subject to over optimisation. So you can’t compare results from different compilers. Ignoring this, results in VAX MIPS aka DMIPS and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.


                                                     Best
                  ------ Compiler ------            DMIPS
 System     MHz   ARM V6  ARM V7   gcc 8   G8/V7     /MHz

 Pi 3B+    1400     2520    2825    2838    1.00     2.03
 Pi 4B     1500     5077    5366    5646    1.05     3.76
 4B/3B+    1.07     2.01    1.90    1.99             1.86


Linpack 100 Benchmark MFLOPS - linpackPiA6, linpackPiSP, linpackPiA7, linpackPiA7SP, linpackPiC8, linpackPiC8SP

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.


                         ARM V6          ARM V7           gcc 8      gcc8/ARMV7
 System     MHz      DP      SP      DP      SP      DP      SP      DP      SP

 Pi 3B+    1400   206.0   220.2   210.5   225.2   224.8   227.3    1.00    1.01
 Pi 4B     1500   764.7   880.6   760.2   921.6   957.1  1068.8    1.04    1.12
 4B/3B+    1.07    3.71    4.00    3.61    4.09    4.26    4.70


Livermore Loops Benchmark MFLOPS - liverloopsPiA7, liverloopsPiC8

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is geometric mean, where Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on Geomean results, Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so via the gcc 8 compilation, where the gcc8/V7 performance ratio identified is 1.31.

 MFLOPS for 24 loops
 
 Pi 3B+
 225 266 465 394 147 196 411 449 408 207 155 87
 100 125 263 258 359 335 236 248 133 93 339 199
 Pi 4B
 746 964 988 943 212 538 1169 1800 1032 469 214 186
 159 335 778 623 732 1034 320 350 489 360 749 187
 Pi 3B+ gcc 8
 330 262 459 407 231 198 538 542 462 247 174 198
 122 123 281 240 394 325 275 294 213 94 354 198
 Pi 4B gcc 8
 1480 1017 974 930 383 657 1624 1861 1664 617 498 741
 221 320 803 640 737 1003 451 378 1047 411 763 187

Comparisons
 System     MHz  Maximum  Average  Geomean  Harmean  Minimum

 ARM V7
 Pi 3B+    1400    464.8    246.7    220.1    193.9     78.3
 Pi 4B     1500   1800.2    635.1    519.0    416.1    155.3
 4B/3B+    1.07     3.87     2.57     2.36     2.15     1.98

 gcc 8
 Pi 3B+    1400    541.7    283.4    257.4    231.5     92.7
 Pi 4B     1500   1860.8    800.4    679.0    564.1    179.5
 4B/3B+    1.07     3.40     2.80     2.61     2.41     1.90

 g8/V7     1.00     1.03     1.26     1.31     1.36     1.16
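
For readers unfamiliar with the three averages, the following Python sketch shows how arithmetic, geometric and harmonic means are derived from a set of kernel speeds. It uses the rounded Pi 4B ARM V7 values listed under "MFLOPS for 24 loops", so it will come close to, but not exactly reproduce, the Comparisons figures produced by the benchmark itself. The function name summarise is hypothetical.

```python
import math

def summarise(mflops):
    """Return (maximum, average, geomean, harmean, minimum) of MFLOPS values."""
    n = len(mflops)
    average = sum(mflops) / n
    geomean = math.exp(sum(math.log(v) for v in mflops) / n)
    harmean = n / sum(1.0 / v for v in mflops)
    return max(mflops), average, geomean, harmean, min(mflops)

# Rounded Pi 4B ARM V7 kernel speeds from the list above
pi4b = [746, 964, 988, 943, 212, 538, 1169, 1800, 1032, 469, 214, 186,
        159, 335, 778, 623, 732, 1034, 320, 350, 489, 360, 749, 187]

mx, avg, geo, har, mn = summarise(pi4b)
print(f"max {mx} avg {avg:.1f} geomean {geo:.1f} harmean {har:.1f} min {mn}")
```

The geometric mean is less influenced by a few very fast kernels than the arithmetic average, which is why it always falls between the harmonic mean and the average.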


Fast Fourier Transforms Benchmarks - fft1-RPi2, fft3c-Rpi2, fft1PiC8, fft3cPiC8

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first is the original C program. The second is an optimised version, originally using my x86 assembly code but translated back into C, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made using both single and double precision data, calculating FFT sizes between 1K and 1024K.

Following are average running times from the three passes, then RPi 4B performance gains (fewer milliseconds), where most of those for the optimised version were greater than 3 times, as were many from the original benchmark. Most gcc 8 running times on the Pi 4B were slightly faster than those produced by the older version.


 Time in milliseconds
 Raspberry Pi 3B+ FFT 1 Raspberry Pi 3B+ FFT 3 
 ARM V7 gcc 8 ARM V7 gcc 8 
 Size
 K SP DP SP DP SP DP SP DP

 1 0.14 0.14 0.16 0.17 0.18 0.14 0.15 0.14
 2 0.31 0.36 0.35 0.48 0.39 0.32 0.33 0.32
 4 0.78 0.92 0.91 1.32 1.05 0.77 0.78 0.75
 8 1.92 2.17 3.02 3.36 2.14 1.76 1.84 1.76
 16 4.67 5.28 5.09 5.99 4.71 5.46 4.27 4.89
 32 10.95 20.57 12.31 20.62 10.71 15.03 9.55 13.65
 64 34.54 128.96 37.33 130.93 28.94 36.78 26.09 33.23
 128 246.04 308.67 254.23 320.44 70.03 84.44 64.74 76.98
 256 586.84 638.88 620.49 734.14 157.29 196.35 145.14 180.66
 512 1232.41 1374.18 1235.39 1447.85 363.61 434.28 336.57 405.09
 1024 2759.71 2993.38 2779.37 3094.66 806.78 975.33 736.46 912.78

 Size Raspberry Pi 4B FFT 1 Raspberry Pi 4B FFT 3 
 K 
 1 0.04 0.04 0.04 0.04 0.06 0.05 0.05 0.04
 2 0.08 0.12 0.08 0.13 0.13 0.11 0.10 0.10
 4 0.32 0.37 0.29 0.34 0.27 0.24 0.24 0.23
 8 0.77 0.97 0.79 0.82 0.58 0.55 0.57 0.51
 16 1.69 2.01 1.65 1.85 1.49 1.35 1.32 1.19
 32 4.37 4.89 3.76 4.71 2.96 3.63 2.69 3.30
 64 9.12 26.55 8.82 30.64 7.46 10.75 6.60 9.47
 128 55.52 160.11 58.54 132.41 17.93 26.03 16.92 23.85
 256 305.92 423.06 275.44 373.12 41.16 55.06 37.61 55.97
 512 833.10 854.88 780.89 751.27 86.93 120.53 81.54 128.13
 1024 1617.49 1875.52 1578.70 1812.20 190.28 266.60 186.45 288.27

 Size RPi 4B Gains (>1.0 4B running time is less) 
 K 
 1 3.45 3.46 4.02 3.94 3.06 2.66 2.88 3.45
 2 3.79 3.14 4.27 3.84 3.10 2.93 3.28 3.29
 4 2.46 2.50 3.19 3.84 3.86 3.23 3.24 3.22
 8 2.51 2.24 3.82 4.12 3.67 3.18 3.21 3.44
 16 2.76 2.62 3.08 3.23 3.17 4.06 3.25 4.10
 32 2.51 4.21 3.27 4.38 3.62 4.14 3.55 4.13
 64 3.79 4.86 4.23 4.27 3.88 3.42 3.95 3.51
 128 4.43 1.93 4.34 2.42 3.91 3.24 3.83 3.23
 256 1.92 1.51 2.25 1.97 3.82 3.57 3.86 3.23
 512 1.48 1.61 1.58 1.93 4.18 3.60 4.13 3.16
 1024 1.71 1.60 1.76 1.71 4.24 3.66 3.95 3.17


BusSpeed Benchmark - busspeedPiA7, busspeedPiC8

This is a read only benchmark with data from caches and RAM. The program first reads one word in every 32, skipping the words in between, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds.

The speed via these increments can vary considerably, so comparisons are provided for the Read All column. Both the Pi 4B hardware and the gcc 8 compilation contribute to performance gains of the new system, particularly the highest ratio of 2.81, which reflects the impact of the larger L2 cache.
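
The access pattern behind the Inc32 to Read All columns can be sketched as below. This is an illustrative Python loop only, assuming each test sums the words it visits and that MB/second counts only the words actually read; the real benchmark is compiled C and the name read_speed is hypothetical. Interpreter overhead dominates here, so the speeds printed say nothing about bus or cache performance.

```python
import array
import time

def read_speed(words, inc, passes=1):
    """Sum every inc-th 4 byte word; return (MB/second, checksum)."""
    total = 0
    start = time.perf_counter()
    for _ in range(passes):
        for i in range(0, len(words), inc):
            total += words[i]
    secs = time.perf_counter() - start
    words_read = passes * len(range(0, len(words), inc))
    mbytes = words_read * 4 / 1e6
    return (mbytes / secs if secs > 0 else float("inf")), total

kbytes = 64
# array type "i" is a signed int, 4 bytes on typical platforms (an assumption)
words = array.array("i", range(kbytes * 1024 // 4))
for inc in (32, 16, 8, 4, 2, 1):
    speed, check = read_speed(words, inc)
    print(f"Inc{inc:<2} {speed:8.1f} MB/s  checksum {check}")
```

With a large increment, each word read comes from a different burst, so the reported speed falls well below the Read All figure once data no longer fits in cache.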


Pi 3B+ ARM V7 
 BusSpeed vfpv4 32b V1 Fri Apr 12 21:39:00 2019
 
 Reading Speed 4 Byte Words in MBytes/Second
 Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
 KBytes Words Words Words Words Words All

 16 3885 4365 4755 5013 5078 5118
 32 1688 1765 2513 3489 4279 4737
 64 716 720 1315 2268 3399 4147
 128 665 668 1206 2137 3281 4085
 256 632 635 1160 2053 3195 4032
 512 268 277 550 1058 1925 3088
 1024 140 153 296 581 1115 2199
 4096 120 131 257 498 1001 1777
 16384 126 132 256 496 991 1677
 65536 128 132 256 491 991 1950
 
 Pi 4B ARM V7 
 
 Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read 
 KBytes Words Words Words Words Words All Gain
  
 16 3836 4049 4467 5885 4641 5858 1.14
 32 761 1473 2594 3216 3960 4780 1.01
 64 409 801 1684 2422 3745 3940 0.95
 128 406 803 1202 1914 3037 5377 1.32
 256 415 700 1165 2481 4789 5137 1.27
 512 392 760 1243 2455 3764 4264 1.38
 1024 230 256 623 1061 2455 3501 1.59
 4096 197 214 454 938 1852 3195 1.80
 16384 138 215 445 897 1724 3210 1.91
 65536 174 215 398 744 1655 3130 1.61

Pi 3B+ gcc 8 
 BusSpeed vfpv4 32b gcc 8 Wed May 15 09:51:20 2019
 
 Reading Speed 4 Byte Words in MBytes/Second
 Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
 KBytes Words Words Words Words Words All

 16 3833 4346 4729 5002 5046 5069
 32 2435 2532 3152 4860 4949 4999
 64 696 705 1313 2213 3278 3983
 128 651 662 1227 2077 3207 3950
 256 620 630 1183 2007 3152 3925
 512 481 503 955 1641 2618 3318
 1024 133 145 286 506 1012 1694
 4096 117 130 249 453 915 1476
 16384 124 129 247 455 910 1415
 65536 124 108 251 453 905 1445
 
 
 Pi 4B gcc 8 
 Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read Pi 4B gcc 8
 KBytes Words Words Words Words Words All Gain Gain

 16 4880 5075 5612 5852 5877 5864 1.16 1.00
 32 846 1138 2153 3229 4908 5300 0.99 1.11
 64 746 1019 2035 3027 4910 5360 1.50 1.36
 128 728 983 1952 2908 4888 5389 1.52 1.00
 256 683 934 1901 2794 4874 5431 1.55 1.06
 512 656 900 1760 2625 4585 5259 1.75 1.23
 1024 301 410 870 1356 2846 4238 2.81 1.21
 4096 233 248 531 996 2151 4045 2.35 1.27
 16384 236 258 511 891 2143 4011 2.35 1.25
 65536 237 257 508 881 2172 4015 2.40 1.28


MemSpeed Benchmark MB/Second - memspeedPiA7, memspeedPiC8

This includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 3B+ and 4B results from running the original and gcc 8 recompiled versions, plus full Pi4B/3B+ and old/gcc 8 comparisons.

Using the original ARM V7 versions, the Pi 4B is indicated as faster on all test functions, the best case being double precision calculations using cached data, at between three and six times faster. Similar gains are also shown in the gcc 8 comparisons. The gcc 8/V7 compiler comparisons then show gains with floating point, but the old compiler produced some faster speeds using integers. Maximum MFLOPS performance is shown for the calculations in the first two columns, rising from 237 DP and 532 SP on the 3B+ to 1485 DP and 2740 SP on the Pi 4B using gcc 8, improvements of 6.27 times DP and 5.15 times SP.
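
Several of the Max MFLOPS lines in the tables that follow appear consistent with a simple conversion from the fastest MB/S figure in the x[m]=x[m]+s*y[m] columns, if it is assumed that MB/S counts bytes read from both arrays and that each element involves two floating point operations. That relationship is inferred from the printed figures, not taken from the benchmark source, and the function name below is hypothetical. A Python sketch:

```python
def mflops_from_mbps(mb_per_sec, bytes_per_value):
    """x[m] = x[m] + s * y[m]: 2 operations per element, 2 arrays read."""
    # Assumption: MB/S covers both x and y, so one element costs
    # 2 * bytes_per_value bytes of reading.
    elements_per_sec = mb_per_sec / (2 * bytes_per_value)  # millions per second
    return 2 * elements_per_sec

print(f"{mflops_from_mbps(1900, 8):.1f}")  # Pi 3B+ ARM V7, Dble, 16 KB
print(f"{mflops_from_mbps(2129, 4):.1f}")  # Pi 3B+ ARM V7, Sngl, 16 KB
print(f"{mflops_from_mbps(8459, 8):.1f}")  # Pi 4B ARM V7, Dble, 8 KB
```

Under these assumptions the double precision MFLOPS figure is one eighth of the MB/S value, and the single precision figure one quarter.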

 Pi 3B+ ARM V7 
Pi 3B+ Memory Reading Speed Test vfpv4 32 Bit Version 1 by Roy Longbottom
 Start of test Fri Apr 12 21:39:51 2019 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m] 
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 8 1896 2125 4046 2784 2624 4448 3165 3694 3693
 16 1900 2129 4058 2791 2627 4462 3181 3711 3711
 32 1821 2000 3664 2602 2426 3965 3187 3719 3717
 64 1807 1974 3625 2567 2369 3923 3057 3615 3599
 128 1792 1959 3620 2545 2364 3906 3079 3544 3544
 256 1738 1914 3472 2468 2291 3719 3064 3545 3553
 512 1380 1493 2199 1769 1715 2331 2192 2522 2383
 1024 1003 1138 1319 1250 1219 1298 1487 1324 1324
 2048 925 1001 1104 1065 1049 1103 1093 1032 1035
 4096 901 972 1073 1037 1005 1081 1002 968 973
 8192 852 948 1076 1041 1021 1080 1009 977 975
Max MFLOPS 237 532 
 Pi 4B ARM V7 
 
 8 8459 4766 13344 8303 4768 15553 7806 9926 9927
 16 7142 3918 8649 7103 4094 9309 7899 10086 10056
 32 7969 4490 10339 7941 4532 11627 7758 10070 10048
 64 8126 4602 9909 8114 4617 11069 7425 8021 8070
 128 8302 4651 9623 8311 4657 10836 7374 8049 7934
 256 8319 4663 9627 8360 4666 10768 7530 7922 7925
 512 8088 4629 9453 8239 4650 10696 5023 7904 7949
 1024 3581 3113 3618 3577 3150 3675 5358 2431 1560
 2048 1338 1808 1780 1811 1832 1773 2131 950 956
 4096 1881 1880 1852 1879 1664 1336 1988 984 1054
 8192 1890 1901 1884 1729 1319 1367 2252 1018 1021
Max MFLOPS 1057 1192 
Pi 4B/3B+

 8 4.46 2.24 3.30 2.98 1.82 3.50 2.47 2.69 2.69
 16 3.76 1.84 2.13 2.54 1.56 2.09 2.48 2.72 2.71
 32 4.38 2.25 2.82 3.05 1.87 2.93 2.43 2.71 2.70
 64 4.50 2.33 2.73 3.16 1.95 2.82 2.43 2.22 2.24
 128 4.63 2.37 2.66 3.27 1.97 2.77 2.39 2.27 2.24
 256 4.79 2.44 2.77 3.39 2.04 2.90 2.46 2.23 2.23
 512 5.86 3.10 4.30 4.66 2.71 4.59 2.29 3.13 3.34
 1024 3.57 2.74 2.74 2.86 2.58 2.83 3.60 1.84 1.18
 2048 1.45 1.81 1.61 1.70 1.75 1.61 1.95 0.92 0.92
 4096 2.09 1.93 1.73 1.81 1.66 1.24 1.98 1.02 1.08
 8192 2.22 2.01 1.75 1.66 1.29 1.27 2.23 1.04 1.05

Pi 3B+ gcc 8 

 8 2024 3191 1931 2973 4464 2077 3415 4426 4426
 16 2031 3194 1933 2977 4470 2078 3430 4451 4451
 32 1972 3111 1902 2842 4291 2059 3433 4455 4451
 64 1932 3042 1875 2752 4121 2008 3240 4223 4223
 128 1972 3083 1888 2825 4163 2012 3281 4272 4276
 256 1980 3089 1888 2851 4177 2013 3312 4244 4239
 512 1750 2778 1739 2460 3711 1846 3106 4029 4096
 1024 979 1862 1390 1213 2230 1463 1463 1225 1220
 2048 979 1858 1379 1137 2111 1442 859 828 828
 4096 975 1809 1363 1136 2091 1428 944 924 920
 8192 976 1788 1364 1139 2053 1409 802 792 733
Max MFLOPS 254 799 
 
Pi 4B gcc 8 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m] 
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 8 11768 9844 3841 11787 9934 4351 10309 7816 7804
 16 11880 9880 3822 11886 10043 4363 10484 7902 7892
 32 9539 8528 3678 9517 8661 4098 10564 7948 7945
 64 9952 9310 3733 9997 9470 4160 8452 7717 7732
 128 9947 9591 3757 9990 9757 4178 8205 7680 7753
 256 10015 9604 3758 10030 9781 4186 8120 7734 7707
 512 9073 9300 3751 9472 9526 4175 7995 7709 7602
 1024 2681 5303 3594 2664 4965 3760 4828 3592 3569
 2048 1671 3488 3242 1757 3635 3540 2882 1036 1023
 4096 1777 3700 3283 1827 3627 3555 2433 1052 1054
 8192 1931 3805 3420 1933 3815 3629 2465 980 971
 
Max MFLOPS 1485 2740 
Pi 4B/3B+

 8 5.81 3.08 1.99 3.96 2.23 2.09 3.02 1.77 1.76
 16 5.85 3.09 1.98 3.99 2.25 2.10 3.06 1.78 1.77
 32 4.84 2.74 1.93 3.35 2.02 1.99 3.08 1.78 1.78
 64 5.15 3.06 1.99 3.63 2.30 2.07 2.61 1.83 1.83
 128 5.04 3.11 1.99 3.54 2.34 2.08 2.50 1.80 1.81
 256 5.06 3.11 1.99 3.52 2.34 2.08 2.45 1.82 1.82
 512 5.18 3.35 2.16 3.85 2.57 2.26 2.57 1.91 1.86
 1024 2.74 2.85 2.59 2.20 2.23 2.57 3.30 2.93 2.93
 2048 1.71 1.88 2.35 1.55 1.72 2.45 3.36 1.25 1.24
 4096 1.82 2.05 2.41 1.61 1.73 2.49 2.58 1.14 1.15
 8192 1.98 2.13 2.51 1.70 1.86 2.58 3.07 1.24 1.32

4B gcc 8 gains

 8 1.39 2.07 0.29 1.42 2.08 0.28 1.32 0.79 0.79
 16 1.66 2.52 0.44 1.67 2.45 0.47 1.33 0.78 0.78
 32 1.20 1.90 0.36 1.20 1.91 0.35 1.36 0.79 0.79
 64 1.22 2.02 0.38 1.23 2.05 0.38 1.14 0.96 0.96
 128 1.20 2.06 0.39 1.20 2.10 0.39 1.11 0.95 0.98
 256 1.20 2.06 0.39 1.20 2.10 0.39 1.08 0.98 0.97
 512 1.12 2.01 0.40 1.15 2.05 0.39 1.59 0.98 0.96
 1024 0.75 1.70 0.99 0.74 1.58 1.02 0.90 1.48 2.29
 2048 1.25 1.93 1.82 0.97 1.98 2.00 1.35 1.09 1.07
 4096 0.94 1.97 1.77 0.97 2.18 2.66 1.22 1.07 1.00
 8192 1.02 2.00 1.82 1.12 2.89 2.65 1.09 0.96 0.95


NeonSpeed Benchmark MB/Second - NeonSpeed, NeonSpeedC8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The Norm functions are as generated by the compiler, using NEON directives, and the Neon functions use intrinsic functions. Of late, both methods produce similar performance, at up to 3000 million operations per second, across the board. Pi 4B/3B+ comparisons are also included below, showing the best gains in the L2 cache area. Pi 4B gcc 8 gains and losses are also provided, with the main loss on normal integer calculations from cached data.


 Pi 3B+ 
 NEON Speed Test V 1.0 Fri Apr 12 22:11:38 2019 
 Vector Reading Speed in MBytes/Second 
 Memory Float v=v+s*v Int v=v+v+s Neon v=v+v 
 KBytes Norm Neon Norm Neon Float Int

 16 3170 4669 4037 4930 5220 5545
 32 3119 4531 3952 4780 5071 5374
 64 2845 3920 3558 4075 4235 4438
 128 2873 3954 3626 4095 4227 4484
 256 2917 4027 3705 4184 4313 4563
 512 2271 2923 2777 3000 3075 3127
 1024 1181 1209 1221 1201 1163 1198
 4096 1062 1077 1071 1050 1073 1076
 16384 1087 1115 1111 1043 1094 1086
 65536 1125 1144 1139 851 1126 1110
 
 Pi 4B 
 
 16 9677 10072 8905 9358 9776 10473
 32 10149 10330 9364 9539 9988 10543
 64 10948 11708 10466 10568 11318 11994
 128 10484 11232 10410 10104 11200 11792
 256 10509 11369 10428 10264 11273 11842
 512 10406 11066 10134 10054 11075 11467
 1024 3069 3202 3159 3166 3204 3203
 4096 1721 1910 1908 1882 1903 1900
 16384 2023 2009 2008 1965 2032 2013
 65536 2073 2074 2074 2073 2068 2064

 Pi 4B/3B+ Comparisons 
 Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
 KBytes Norm Neon Norm Neon Float Int

 16 3.05 2.16 2.21 1.90 1.87 1.89
 512 4.58 3.79 3.65 3.35 3.60 3.67
 1024 2.60 2.65 2.59 2.64 2.75 2.67
 16384 1.86 1.80 1.81 1.88 1.86 1.85

 Pi 3B+ gcc 8 
 NEON Speed Test gcc 8 Wed May 15 09:57:18 2019 
 Vector Reading Speed in MBytes/Second 
 Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
 KBytes Norm Neon Norm Neon Float Int

 16 3289 5377 2010 5076 5731 5732
 32 3280 5341 1995 5043 5706 5706
 64 3115 4547 1923 4348 4771 4771
 128 3145 4683 1927 4482 4886 4888
 256 3146 4698 1926 4500 4906 4908
 512 2666 3762 1779 3527 3903 3915
 1024 1879 1228 1395 1225 1238 1238
 4096 1792 1151 1373 1144 1164 1162
 16384 1698 1167 1353 1119 1167 1170
 65536 1229 1157 1328 874 1165 1166

 Pi 4B gcc 8 

 16 9884 12882 3910 12773 13090 15133
 32 9904 13061 3916 13002 13162 15239
 64 9029 11526 3450 10704 11708 12084
 128 9242 11784 3391 11016 11816 12179
 256 9283 11890 3396 11215 11929 12284
 512 9043 10680 3413 10211 10925 11241
 1024 5818 3310 3507 3288 3239 2902
 4096 4060 1994 3497 1991 2009 2011
 16384 4030 2063 3445 2068 2072 2067
 65536 3936 2109 3391 1858 2122 2121
 

 Pi 4B/3B+ Comparisons 
 Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
 KBytes Norm Neon Norm Neon Float Int

 16 3.01 2.40 1.95 2.52 2.28 2.64
 512 3.39 2.84 1.92 2.90 2.80 2.87
 1024 3.10 2.70 2.51 2.68 2.62 2.34
 16384 2.37 1.77 2.55 1.85 1.78 1.77

 4B gcc 8 gains and losses 

 16 1.02 1.28 0.44 1.36 1.34 1.44
 512 0.87 0.97 0.34 1.02 0.99 0.98
 16384 1.99 1.03 1.72 1.05 1.02 1.03


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled "C" code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPiA7, MP-WHETSPC8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish, with Mutex functions used to avoid updating conflicts by only allowing one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, so the simpler instructions are used. Overall seconds indicates MP efficiency.

Based on the 4 thread MWIPS rating, both compilations indicate the same Pi 4B performance improvement, but there are variations on the individual test functions.
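
The timing arrangement described above (private working variables, a mutex around shared results, elapsed time taken when the last thread finishes) can be sketched in Python as follows. The names run_threads and work_units are hypothetical, and the arithmetic loop is a stand-in, not one of the Whetstone test functions. Note that standard Python threads do not run such loops in parallel, so this shows the structure only, not the near linear scaling seen in the results.

```python
import threading
import time

def run_threads(n_threads, work_units=20000):
    """Run the same loop in n_threads threads; time until the last finishes."""
    lock = threading.Lock()          # mutex guarding the shared results
    shared = {"finished": 0, "checksum": 0.0}

    def worker():
        x = 1.0                      # private (dedicated) variable
        for _ in range(work_units):
            x = (x + 1.0 / x) * 0.5 + 0.25
        with lock:                   # one thread at a time updates common data
            shared["finished"] += 1
            shared["checksum"] += x

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # all joins return only after the slowest thread
    return time.perf_counter() - start, shared["finished"]

for n in (1, 2, 4, 8):
    secs, done = run_threads(n)
    print(f"{n}T  {done} threads finished in {secs:.3f} s")
```

In the benchmark results, near constant Overall Seconds from 1T to 4T, doubling at 8T, is what indicates that four cores are being used effectively.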


 Pi 3B+ ARM V7 
 MP-Whetstone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:48:42 2019
 Using 1, 2, 4 and 8 Threads 
 MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
 1 2 3 MOPS MOPS MOPS MOPS MOPS

 1T 1116.9 582.4 603.6 299.7 21.7 13.3 6969.0 1364.0 1398.5
 2T 2226.5 1167.8 1181.0 593.5 43.4 26.4 12545.8 2789.0 2794.1
 4T 4436.8 2354.9 2387.3 1190.1 86.3 52.5 27429.4 5539.7 5546.8
 8T 4614.6 3174.1 3140.6 1250.0 88.1 54.7 36555.2 6409.9 6051.1
 Overall Seconds 4.99 1T, 5.02 2T, 5.10 4T, 10.20 8T 

 Pi 4B ARM V7 

 1T 2059.3 672.8 680.1 310.6 55.6 33.1 7461.6 2244.6 995.2
 2T 4117.1 1341.7 1390.7 624.2 110.7 65.9 14887.3 4466.5 1986.2
 4T 7910.0 2652.0 2722.2 1180.0 208.5 132.6 29291.2 8952.4 3832.3
 8T 8651.6 3057.1 2971.1 1268.3 233.2 149.6 38367.5 11922.5 3941.7
 Overall Seconds 4.99 1T, 5.01 2T, 5.29 4T, 10.71 8T 

 Pi 3B+ gcc 8 
 MP-Whetstone Benchmark Linux/ARM gcc 8 Fri Jun 14 14:25:28 2019 
 
 MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
 1 2 3 MOPS MOPS MOPS MOPS MOPS

 1T 1057.5 390.9 392.6 298.1 21.0 12.3 5227.8 1363.1 1399.4
 2T 2121.8 777.4 778.5 598.3 42.3 24.6 10185.9 2769.0 2762.9
 4T 4225.9 1509.6 1532.2 1192.3 84.7 48.8 19273.0 5326.5 5552.9
 8T 4419.6 1914.9 2041.9 1260.8 86.0 51.3 27645.3 7213.5 6031.5
 Overall Seconds 4.98 1T, 5.00 2T, 5.11 4T, 10.09 8T 

 Pi 4B gcc 8 
 
 1T 1889.5 538.7 537.6 311.4 56.3 26.1 7450.5 2243.2 659.9
 2T 3782.7 1065.5 1071.2 627.1 112.3 52.0 14525.7 4460.9 1327.3
 4T 7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5 8944.2 2660.8
 8T 8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4
 Overall Seconds 4.99 1T, 5.00 2T, 5.03 4T, 10.06 8T 

 4 Thread 4B/3B+ Performance ratios 

 V7 1.78 1.13 1.14 0.99 2.42 2.53 1.07 1.62 0.69
 gcc8 1.79 1.39 1.40 1.05 2.66 2.13 1.53 1.68 0.48


MP-Dhrystone Benchmark - MP-DHRYPiA7, MP-DHRYPiC8

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, as reflected in the results. The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+ for both compilations, and with gcc 8 code being slightly faster.


 Pi 3B+ ARM V7 
 MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:57:46 2019
 Using 1, 2, 4 and 8 Threads 

 Threads 1 2 4 8
 Seconds 0.85 0.96 1.36 2.71
 Dhrystones per Second 4733611 8295393 11750518 11789451
 VAX MIPS rating 2694 4721 6688 6710

 Pi 4B ARM V7 
 
 Seconds 0.82 1.59 2.70 5.04
 Dhrystones per Second 9731507 10082787 11833655 12706636
 VAX MIPS rating 5539 5739 6735 7232

 Pi 3B+ gcc 8 

 Threads 1 2 4 8
 Seconds 0.79 0.92 1.23 2.46
 Dhrystones per Second 5035879 8678942 13020489 13028455
 VAX MIPS rating 2866 4940 7411 7415

 Pi 4B gcc 8 
 
 Threads 1 2 4 8
 Seconds 0.79 1.21 2.62 4.88
 Dhrystones per Second 10126308 13262168 12230188 13106002
 VAX MIPS rating 5763 7548 6961 7459
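
The VAX MIPS rating lines follow from the Dhrystones per Second lines by dividing by 1757, the score of the VAX 11/780 reference machine. A small Python sketch of that conversion, using single thread ARM V7 figures from the tables above; the function names are hypothetical.

```python
VAX_DHRYSTONES_PER_SECOND = 1757  # VAX 11/780 reference score, rated as 1 MIPS

def vax_mips(dhrystones_per_second):
    """Convert Dhrystones per second to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dhrystones_per_second, mhz):
    """DMIPS normalised by CPU clock speed."""
    return vax_mips(dhrystones_per_second) / mhz

print(round(vax_mips(4733611)))                 # Pi 3B+, 1 thread
print(round(vax_mips(9731507)))                 # Pi 4B, 1 thread
print(round(dmips_per_mhz(9731507, 1500), 2))   # Pi 4B at 1500 MHz
```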


MP SP NEON Linpack Benchmark - linpackNeonMP, linpackNeonMPC8

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Still, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.

Single thread performance was slowest when accessing the largest data arrays (highest N value), where speeds were also more constant across the four sets of results. The best Pi 4B improvements were at N = 100, at around three times.

The programs produce the sumchecks, as shown below, with the four sets of calculations producing identical numeric results (as they should).
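
To illustrate why threading the inner routine can cost more than it saves, here is a minimal Python sketch of an axpy loop (y = y + a * x) split across freshly started threads on every call. It is not the benchmark's NEON code; the names axpy_range and axpy_threaded are hypothetical. It also shows the kind of check the sumchecks provide: the threaded and unthreaded versions must give identical numeric results.

```python
import threading

def axpy_range(a, x, y, lo, hi):
    """y[i] = y[i] + a * x[i] for lo <= i < hi."""
    for i in range(lo, hi):
        y[i] += a * x[i]

def axpy_threaded(a, x, y, n_threads):
    """Split one axpy call across n_threads newly started threads."""
    n = len(x)
    bounds = [n * t // n_threads for t in range(n_threads + 1)]
    threads = [
        threading.Thread(target=axpy_range,
                         args=(a, x, y, bounds[t], bounds[t + 1]))
        for t in range(n_threads)
    ]
    for t in threads:      # thread start/stop cost is paid on every call
        t.start()
    for t in threads:
        t.join()

n = 100
x = [float(i) for i in range(n)]
y_single = [1.0] * n
y_multi = [1.0] * n
axpy_range(2.0, x, y_single, 0, n)
axpy_threaded(2.0, x, y_multi, 4)
print("identical results:", y_single == y_multi)
```

With N = 100 each call handles only a few hundred operations, so the fixed cost of starting and joining threads is large relative to the useful work.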


 Pi 3B+ ARM V7 
 
 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Wed Apr 24 23:03:08 2019
 MFLOPS 0 to 4 Threads, N 100, 500, 1000 
 Threads None 1 2 4 

 N 100 627.07 66.31 64.79 64.14 
 N 500 465.16 293.95 292.37 293.76 
 N 1000 346.63 311.81 309.19 311.76 

 Pi 4B ARM V7 

 N 100 1921.53 108.66 101.88 102.46 
 N 500 1548.81 530.23 714.37 733.09 
 N 1000 399.94 378.11 364.78 398.21 

 Pi 3B+ gcc 8 

 N 100 638.49 66.92 66.23 66.14 
 N 500 471.71 304.69 297.05 305.51 
 N 1000 356.13 317.22 316.88 316.33 

 Pi 4B gcc 8 

 N 100 2007.38 112.55 107.85 106.98 
 N 500 1332.24 686.10 686.11 689.02 
 N 1000 402.61 435.26 432.21 432.01 

 Sumchecks 
 N 100 500 1000

 NR 2.17 5.42 9.50
 RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
 MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04


MP BusSpeed (read only) Benchmark - MP-BusSpeedPiA7, MP-BusSpd2PiC8

Each thread accesses all of the data in separate sections covering caches and RAM, starting at different points, with this V7A v2 version. See single processor BusSpeed details regarding burst reading that can indicate significant differences. RdAll is the main area for comparison, where MP reading RAM is thought to indicate maximum performance.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. These are subject to multiprocessing peculiarities, but Pi 4B/Pi 3B+ performance gains were indicated as being around 2.5, using L1 cache data, and twice as fast, via L2 cache and RAM, with the gcc 8 produced version little different from the earlier compilations.

 Pi 3B+ ARM V7 
 MP-BusSpd ARM V7A v2 Wed Apr 24 22:58:50 2019 
 MB/Second Reading Data, 1, 2, 4 and 8 Threads 
 Staggered starting addresses to avoid caching 
 KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 

 12.3 1T 3470 4390 4408 4760 5138 4926 
 2T 6272 7807 8321 9131 9780 9599 
 4T 9867 13732 15514 17568 19512 18209 
 8T 7385 10918 12320 14591 17357 16462 
122.9 1T 662 648 1253 2129 3291 4475 
 2T 1044 1032 2003 3611 6135 8931 
 4T 1068 1085 2180 4354 8409 16053 
 8T 1057 1078 2124 4247 8227 15070 
12288 1T 125 131 252 494 1009 1996 
 2T 195 136 272 501 1088 2121 
 4T 126 135 263 515 1017 1922 
 8T 114 136 305 545 994 2076 

 Pi 4B ARM V7 
 Pi 4B/3B+

 12.3 1T 5263 5637 5809 5894 5936 13445 2.73
 2T 9412 10020 10567 11454 11604 24980 2.60
 4T 16282 15577 16418 21222 20000 45530 2.50
 8T 11600 13285 16070 18579 20593 36837 
122.9 1T 739 956 1888 3153 5008 9527 2.13
 2T 629 1158 1568 5058 9509 16489 1.85
 4T 600 1093 2134 4527 8732 16816 1.05
 8T 593 1104 2121 4382 8629 17158 
12288 1T 238 258 518 1005 2001 4029 2.02
 2T 278 228 453 1690 1826 3628 1.71
 4T 269 257 740 1019 1790 4145 2.16
 8T 233 292 532 926 2186 3581 

 Pi 3B+ gcc 8 
 MP-BusSpd ARM V7A gcc 8 Wed May 15 10:06:27 2019 
 
 KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 

 12.3 1T 3555 4451 4382 4788 5124 5205 
 2T 6515 8132 8332 9016 9793 10100 
 4T 10667 14186 15956 17529 19228 16522 
 8T 7463 10987 13299 14948 17756 16781 
122.9 1T 681 683 1211 2133 3280 4713 
 2T 1049 1057 2009 3848 6155 9293 
 4T 1049 1085 2191 4360 7921 16268 
 8T 1072 1092 2180 4303 8156 15722 
12288 1T 125 131 256 495 1005 1970 
 2T 135 133 273 505 1100 2110 
 4T 116 130 243 511 1009 2059 
 8T 126 138 260 532 1061 2017 
 Pi 4B gcc 8 
 Pi 4B/3B+

 12.3 1T 5310 5616 5801 5898 5940 13425 2.54
 2T 9393 10008 11293 11293 11368 24932 2.47
 4T 15781 15015 17606 19034 22279 40736 2.47
 8T 8465 9599 14580 18465 20034 36831 
122.9 1T 664 930 1861 3191 5017 10281 2.18
 2T 564 726 1523 5376 9387 18985 2.04
 4T 486 919 1886 4289 8337 16979 1.04
 8T 487 912 1854 4275 8271 16826 
12288 1T 225 258 514 1010 1992 3975 2.02
 2T 202 421 450 1765 3307 7396 3.51
 4T 261 288 825 1332 1772 5014 2.44
 8T 218 273 496 1041 2571 4021
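
The Pi 4B/3B+ column in the ARM V7 table above can be reproduced by dividing RdAll speeds at matching thread counts. The following Python sketch does this for the 12.3 KB (L1 cache) rows. The dictionary and function names are illustrative only and are not part of the benchmark.

```python
# RdAll MB/second copied from the ARM V7 tables above, 12.3 KB data size.
pi3b_plus_rdall = {"1T": 4926, "2T": 9599, "4T": 18209}
pi4b_rdall = {"1T": 13445, "2T": 24980, "4T": 45530}

def speed_ratios(new, old):
    """Return new/old speed ratios, rounded to two decimals, per thread count."""
    return {threads: round(new[threads] / old[threads], 2) for threads in old}

ratios = speed_ratios(pi4b_rdall, pi3b_plus_rdall)
for threads, ratio in ratios.items():
    print(f"{threads}: Pi 4B/3B+ = {ratio:.2f}")
```

The 8 thread rows are left out, as in the table, since only 1, 2 and 4 threads are compared.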


MP RandMem Benchmark - MP-RandMemPiA7, MP-RandMemPiC8

This benchmark potentially reads and writes all data, in sections covering caches and RAM, with each thread starting at a different address. Random access can select any address after that. Serial reading speed is normally similar to BusSpeed RdAll. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

Besides the full results, Pi 4B/3B+ performance ratios for the four thread results are shown below. The Pi 3B+ appears to be faster when reading data from the shared L2 cache, with 4 threads only. Otherwise, the average performance of the new processor was indicated as 80% faster.

 Pi 3B+ ARM V7 
 MP-RandMem Linux/ARM V7A v1.0 Wed Apr 24 22:54:55 2019
 KB SerRD SerRDWR RndRD RndRDWR

 12.3 1T 3419 4333 3420 4422
 2T 6531 4397 6515 4397
 4T 12814 4308 12896 4303
 8T 12922 4289 12561 4244
122.9 1T 3133 3959 800 1041
 2T 5992 3959 1469 1040
 4T 11584 3913 2322 1025
 8T 11417 3895 2288 1028
12288 1T 2034 795 48 62
 2T 2176 799 93 63
 4T 3183 790 128 63
 8T 2008 788 130 62

 Pi 4B ARM V7 

 12.3 1T 5860 7905 5927 7657
 2T 11747 7908 11182 7746
 4T 21416 7626 17382 7731
 8T 20649 7528 20431 7378
122.9 1T 5479 7269 1826 1923
 2T 10355 6964 1667 1920
 4T 9808 7177 1715 1908
 8T 11677 7058 1697 1919
12288 1T 3438 1271 179 152
 2T 4176 1204 213 167
 4T 4227 1117 337 161
 8T 3479 1093 287 168

 Pi 4B/3B+ 

 12.3 4T 1.67 1.77 1.35 1.80
122.9 4T 0.85 1.83 0.74 1.86
12288 4T 1.33 1.41 2.63 2.56

 Pi 3B+ gcc 8 

 12.3 1T 4362 4386 4363 4386
 2T 8222 4308 8132 4311
 4T 16391 4268 16396 4286
 8T 16297 4244 15510 4228
122.9 1T 3643 3879 925 1025
 2T 7008 3873 1692 1040
 4T 12553 3877 2373 1038
 8T 12000 3881 2330 1043
12288 1T 1848 833 67 62
 2T 2183 829 119 63
 4T 3672 825 135 63
 8T 2608 826 136 63
 
 Pi 4B gcc 8 

 12.3 1T 5950 7903 5945 7896
 2T 11849 7923 11887 7917
 4T 23404 7785 23395 7761
 8T 21903 7669 23104 7655
122.9 1T 5670 7309 2002 1924
 2T 10682 7285 1648 1923
 4T 9944 7266 1813 1927
 8T 9896 7216 1812 1919
12288 1T 3904 1075 179 164
 2T 7317 1055 215 164
 4T 3398 1063 343 165
 8T 4156 1062 350 165

 Pi 4B/3B+ gcc 8 

 12.3 4T 1.43 1.82 1.43 1.81
122.9 4T 0.79 1.87 0.76 1.86
12288 4T 0.93 1.29 2.54 2.62
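
One reading of the "80% faster" average is the mean of the four thread ratios in the first (ARM V7) Pi 4B/3B+ table above, leaving out the two L2 cache reading entries where the Pi 3B+ was faster. The Python sketch below checks that reading. The choice of which entries to exclude is an assumption made for this illustration.

```python
# Four thread Pi 4B/3B+ ratios from the ARM V7 comparison table above.
# Column order: SerRD, SerRDWR, RndRD, RndRDWR.
ratios = {
    "12.3 KB": [1.67, 1.77, 1.35, 1.80],
    "122.9 KB": [0.85, 1.83, 0.74, 1.86],
    "12288 KB": [1.33, 1.41, 2.63, 2.56],
}

# Assumption: "otherwise" excludes the read-only columns at 122.9 KB (L2 cache),
# the only entries where the Pi 3B+ was faster.
l2_read_columns = {0, 2}
others = [
    value
    for size, row in ratios.items()
    for column, value in enumerate(row)
    if not (size == "122.9 KB" and column in l2_read_columns)
]

mean_gain = sum(others) / len(others)
print(f"{len(others)} ratios, mean Pi 4B/3B+ gain = {mean_gain:.2f}")
```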


MP-MFLOPS Benchmarks - MP-MFLOPSPiA7, MP-MFLOPSDP, MP-NeonMFLOPS,
MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
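
To illustrate why the sumcheck should not depend on the number of threads, the Python sketch below applies the visible part of the expression (eight floating point operations) to a small array, processed as 1, 2, 4 and 8 segments. The constants, the array size and the sumcheck (a simple mean) are hypothetical, the segments are processed one after another rather than by real threads, and the terms elided by "-- more" are not reproduced.

```python
# Hypothetical constants; the benchmark's real values are not given in the text.
A, B, C, D, E, F = 0.1, 0.9, 0.2, 0.8, 0.3, 0.7

def update_segment(x, start, end):
    """Apply the visible part of the expression to x[start:end] in place."""
    for i in range(start, end):
        x[i] = (x[i] + A) * B - (x[i] + C) * D + (x[i] + E) * F

def run(words, threads):
    """Process the data in `threads` equal segments and return a simple sumcheck."""
    x = [0.5] * words
    segment = words // threads
    for t in range(threads):          # sequential stand-in for real threads
        update_segment(x, t * segment, (t + 1) * segment)
    return sum(x) / words             # illustrative sumcheck only

checks = {threads: run(1024, threads) for threads in (1, 2, 4, 8)}
print(checks)
```

Each word is updated exactly once however the data is divided, so the four sumchecks are identical.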

Note the across the board Pi 4B performance gains on all programs, with maximum speeds of 17.2 GFLOPS for single precision calculations and 10.4 GFLOPS using double precision.


 Single Precision Version 
 Pi 3B+ ARM V7 
 MP-MFLOPS Linux/ARM V7A v1.0 Wed Apr 24 23:08:19 2019
 2 Ops/Word 32 Ops/Word 
 KB 12.8 128 12800 12.8 128 12800
 MFLOPS 
 1T 214 212 189 813 812 797
 2T 403 427 354 1613 1587 1573
 4T 717 811 372 3044 3027 2982
 8T 756 777 388 3005 3101 3064

 Pi 4B ARM V7 

 1T 987 993 606 2816 2794 2804
 2T 1823 1837 567 5610 5541 5497
 4T 2119 3349 647 9884 10702 9081
 8T 3136 3783 609 10230 10504 9240
Max 
4B/3B+ 4.15 4.66 1.67 3.36 3.45 3.02

 Pi 3B+ gcc 8 

 1T 214 212 189 799 784 781
 2T 417 417 365 1568 1583 1540
 4T 754 683 385 3026 3017 2919
 8T 738 761 401 3053 2997 2866

 Pi 4B gcc 8 

 1T 1224 1257 520 2814 2800 2803
 2T 2485 2257 525 5608 5575 5576
 4T 4119 3243 534 11018 10645 8358
 8T 4131 4618 541 9941 10339 8165
Max 
4B/3B+ 5.48 6.07 1.35 3.61 3.53 2.86
 ###################################################

 NEON Intrinsic Functions Version 
 Pi 3B+ ARM V7 
 MP-MFLOPS NEON Intrinsics v1.0 Wed Apr 24 22:41:38 2019
 2 Ops/Word 32 Ops/Word 
 KB 12.8 128 12800 12.8 128 12800
 MFLOPS 
 1T 692 685 393 2052 2017 1887
 2T 1126 1358 403 4096 3924 3697
 4T 2434 2030 405 7848 7740 5547
 8T 2363 2095 407 7584 7609 6097

 Pi 4B ARM V7 

 1T 2491 2399 615 4325 4285 4261
 2T 5629 5520 591 8602 8463 8308
 4T 10580 5594 553 16991 16493 9124
 8T 7047 10785 513 14325 16219 8867
Max 
4B/3B+ 4.35 5.15 1.36 2.17 2.13 1.50
 

 Pi 3B+ gcc 8 
 2 Ops/Word 32 Ops/Word 
 KB 12.8 128 12800 12.8 128 12800
 MFLOPS 
 1T 691 684 407 1910 1874 1828
 2T 1214 1306 410 3746 3747 3392
 4T 1943 2568 410 7403 7435 5913
 8T 2093 2233 411 7217 7087 6044

 Pi 4B gcc 8 

 1T 2797 2870 641 4422 4454 4405
 2T 3217 5601 569 8587 8800 8377
 4T 7902 9864 611 17061 17215 9704
 8T 7070 10562 603 15531 16203 9516
Max 
4B/3B+ 3.78 4.13 1.49 2.30 2.32 1.61
 ###################################################

 Double Precision Version 
 Pi 3B+ ARM V7 
 MP-MFLOPS Double Precision v1.0 Sat Jun 15 12:07:33 2019
 2 Ops/Word 32 Ops/Word 
 KB 12.8 128 12800 12.8 128 12800
 MFLOPS 
 1T 209 206 166 782 797 747
 2T 415 416 198 1566 1590 1462
 4T 663 801 198 3125 3122 2770
 8T 746 729 199 3061 2909 2745

 Pi 4B ARM V7 

 1T 1187 1220 309 2682 2714 2701
 2T 2420 2416 282 5379 5415 4780
 4T 4665 2381 317 10256 10336 5242
 8T 4385 3114 310 9721 10340 5131
Max 
4B/3B+ 6.25 3.89 1.59 3.28 3.31 1.89

 Pi 3B+ gcc 8 

 1T 214 213 168 798 797 776
 2T 409 416 194 1567 1590 1466
 4T 694 675 195 3122 3120 2751
 8T 698 797 198 3055 3005 2779

 Pi 4B gcc 8 

 1T 1203 1211 315 2675 2719 2674
 2T 2291 2441 293 5406 5421 4907
 4T 4673 2501 309 10313 10393 5256
 8T 4394 3550 265 8782 10110 5197
Max 
4B/3B+ 6.69 4.45 1.56 3.30 3.33 1.89

 Sumchecks 

 SP 76406 97075 99969 66015 95363 99951
 NEON 76406 97075 99969 66014 95363 99951
 DP 76384 97072 99969 66065 95370 99951
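
The "Max 4B/3B+" rows in the results above appear to divide the best Pi 4B speed in a column by the best Pi 3B+ speed in the same column, whichever thread count produced it. The Python sketch below reproduces the first column of the single precision ARM V7 results on that assumption. The function name is illustrative.

```python
# MFLOPS at 12.8 KB, 2 Ops/Word, single precision ARM V7 compilation,
# for 1, 2, 4 and 8 threads (copied from the tables above).
pi3b_plus = [214, 403, 717, 756]
pi4b = [987, 1823, 2119, 3136]

def max_ratio(new, old):
    """Best speed of the new system divided by best speed of the old one."""
    return max(new) / max(old)

gain = max_ratio(pi4b, pi3b_plus)
print(f"Max 4B/3B+ = {gain:.2f}")
```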


OpenMP-MFLOPS - OpenMP-MFLOPS, notOpenMP-MFLOPS, OpenMP-MFLOPSC8,
OpenMP-MFLOPSDPC8, notOpenMP-MFLOPSC8, notOpenMP-MFLOPSDPC8

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks and, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. With gcc 8, additional versions have been produced, using double precision floating point. The general format and standard parameters are as follows.

The final data values are checked for consistency. Different compilers or different CPUs could involve using alternative instructions or rounding effects, with variable accuracy. Then, OpenMP sumchecks could be expected to be the same as those from NotOpenMP single core values. However, this is not always the case. The double precision gcc 8 benchmarks appear to be consistent, but only single precision sumchecks are provided.

This benchmark was a compilation of code used for desktop PCs, starting at 100 KB, then 1 MB and 10 MB.


 OpenMP MFLOPS Benchmark 1 Wed Apr 24 22:51:10 2019 
 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
 Words Word Passes Results Same

 Data in & out 100000 2 2500 0.281575 1776 0.929538 Yes
 Data in & out 1000000 2 250 1.265817 395 0.992550 Yes
 Data in & out 10000000 2 25 1.222289 409 0.999250 Yes
 Data in & out 100000 8 2500 0.376635 5310 0.957126 Yes
 Data in & out 1000000 8 250 1.305504 1532 0.995524 Yes
 Data in & out 10000000 8 25 1.267736 1578 0.999550 Yes
 Data in & out 100000 32 2500 3.285631 2435 0.890232 Yes
 Data in & out 1000000 32 250 3.351830 2387 0.988068 Yes
 Data in & out 10000000 32 25 3.329400 2403 0.998785 Yes
 End of test Wed Apr 24 22:51:26 2019 
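
The MFLOPS column in the log above follows from the other columns: words times operations per word times repeat passes, divided by the elapsed seconds. The Python sketch below checks this against the three 100,000 word rows. The function name is illustrative and not taken from the benchmark source.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test."""
    return words * ops_per_word * passes / seconds / 1e6

# (words, ops/word, passes, seconds, reported MFLOPS) from the log above.
rows = [
    (100000, 2, 2500, 0.281575, 1776),
    (100000, 8, 2500, 0.376635, 5310),
    (100000, 32, 2500, 3.285631, 2435),
]

for words, ops, passes, seconds, reported in rows:
    calculated = mflops(words, ops, passes, seconds)
    print(f"{ops:2d} ops/word: calculated {calculated:7.1f}, reported {reported}")
```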

SumChecks
V7A OMP 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890232 0.988068 0.998785

V7A Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890268 0.988078 0.998806

gcc 8 OMP 3B+, Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890282 0.988096 0.998806

gcc 8 4B OMP
0.098043 0.810084 0.922891 0.144870 0.922568 0.918226 0.401577 0.935064 0.916277

gcc 8 DP OMP 3B+ 4B, Not 3B+ 4B
0.929474 0.992543 0.999249 0.957164 0.995525 0.999550 0.890377 0.988101 0.998799

MFLOPS Performance and Comparisons

The first comparisons below identify OpenMP 4 core performance gains. For 2 and 8 operations per word read and written, real gains can only be seen with the 100 KB data size. With CPU speed limitations at 32 operations per word, single core MFLOPS is shown to be constant at all data sizes, but the highest OpenMP speeds only occur using the 100 KB data size.

The other comparisons identify Pi 4B performance gains over the Pi 3B+, where those for single core use are mostly better than those via OpenMP. The highest OpenMP improvement was 4.5 times, via gcc 8 and double precision operation. Maximum demonstrated Pi 4B speeds were 19.9 GFLOPS single precision and 9.3 GFLOPS double precision.


 V7A Compiler 
 Pi 3B+ Pi 4B Pi4 Gains 
 KB+Ops 4 1 4 core 4 1 4 core 4 1 
 /Word Cores Core Gain Cores Core Gain Cores Core

 100- 2 1776 831 2.14 4716 2850 1.65 2.66 3.43
 1000- 2 395 391 1.01 556 429 1.30 1.41 1.10
10000- 2 409 409 1.00 544 632 0.86 1.33 1.55
 100- 8 5310 2009 2.64 7981 5191 1.54 1.50 2.58
 1000- 8 1532 1445 1.06 2389 2082 1.15 1.56 1.44
10000- 8 1578 1478 1.07 2199 2003 1.10 1.39 1.36
 100-32 2435 1855 1.31 8147 5449 1.50 3.35 2.94
 1000-32 2387 1733 1.38 7951 5385 1.48 3.33 3.11
10000-32 2403 1736 1.38 8030 5379 1.49 3.34 3.10
 

 gcc 8 Compiler 
 Pi 3B+ Pi 4B Pi4 Gains 
 KB+Ops 4 1 4 core 4 1 4 core 4 1 
 /Word Cores Core Gain Cores Core Gain Cores Core

 100- 2 2139 778 2.75 5100 2270 2.25 2.38 2.92
 1000- 2 398 403 0.99 617 632 0.98 1.55 1.57
10000- 2 412 415 0.99 542 631 0.86 1.32 1.52
 100- 8 7348 1919 3.83 13805 5511 2.50 1.88 2.87
 1000- 8 1597 1448 1.10 2168 2217 0.98 1.36 1.53
10000- 8 1635 1444 1.13 2178 2542 0.86 1.33 1.76
 100-32 8497 2023 4.20 19921 5341 3.73 2.34 2.64
 1000-32 5997 1903 3.15 8556 5267 1.62 1.43 2.77
10000-32 6057 1914 3.16 8731 5276 1.65 1.44 2.76

 gcc 8 Double Precision 
 Pi 3B+ Pi 4B Pi4 Gains 
 KB+Ops 4 1 4 core 4 1 4 core 4 1 
 /Word Cores Core Gain Cores Core Gain Cores Core

 100- 2 711 203 3.50 3200 977 3.28 4.50 4.81
 1000- 2 193 168 1.15 274 295 0.93 1.42 1.76
10000- 2 199 172 1.16 273 307 0.89 1.37 1.78
 100- 8 1898 503 3.77 6771 2440 2.78 3.57 4.85
 1000- 8 730 434 1.68 1102 1072 1.03 1.51 2.47
10000- 8 755 435 1.74 1108 1255 0.88 1.47 2.89
 100-32 3072 793 3.87 9229 2725 3.39 3.00 3.44
 1000-32 2695 765 3.52 4256 2674 1.59 1.58 3.50
10000-32 2719 765 3.55 4469 2677 1.67 1.64 3.50


Floating Point Assembly Code

The latest floating point performance improvements, via gcc 8, are due to better use of NEON instructions. If I have read this report correctly, double precision ARM NEON SIMD is not supported on V7 CPUs, only Single Instruction Single Data (SISD), where fused multiply and add instructions can produce two results per clock cycle, or a maximum of 3 GFLOPS per core on Pi 4, or 12 GFLOPS overall.

For the routines in my MP-MFLOPS programs that include 32 double precision floating point operations per data word read, disassembly indicates that the following instructions are used, with 64 bit d registers. Maximum measured speed was just over 10 GFLOPS.

.L18: 
 vldr.64 d17, [r1] 
 vadd.f64 d16, d17, d4 
 vadd.f64 d18, d17, d0 
 vadd.f64 d25, d17, d15
 vadd.f64 d24, d17, d11
 vmul.f64 d16, d16, d5 
 vadd.f64 d23, d17, d31
 vadd.f64 d22, d17, d27
 vadd.f64 d21, d17, d2 
 vadd.f64 d20, d17, d6 
 vadd.f64 d19, d17, d13
 vfma.f64 d16, d18, d1 
 vadd.f64 d18, d17, d9 
 vadd.f64 d17, d17, d29
 vfma.f64 d16, d25, d14
 vfma.f64 d16, d24, d10
 vfma.f64 d16, d23, d30
 vfma.f64 d16, d22, d28
 vfms.f64 d16, d21, d3 
 vfms.f64 d16, d20, d7 
 vfms.f64 d16, d19, d12
 vfms.f64 d16, d18, d8 
 vfms.f64 d16, d17, d26
 vstmia.64 r1!, {d16} 
 cmp r0, r1 
 bne .L18
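
As a cross-check on the 32 operations per data word, the floating point instructions in the loop above can be counted. The Python sketch below uses the usual convention that an add or multiply is one operation and a fused multiply and add (or subtract) is two. The counts are taken from the listing; the counting convention is an assumption about how the benchmark defines an operation.

```python
# Counts of floating point instructions in loop .L18 above.
fp_instruction_counts = {"vadd.f64": 11, "vmul.f64": 1, "vfma.f64": 5, "vfms.f64": 5}

# Operations per instruction: fused multiply-add/subtract count as two.
ops_per_instruction = {"vadd.f64": 1, "vmul.f64": 1, "vfma.f64": 2, "vfms.f64": 2}

fp_instructions = sum(fp_instruction_counts.values())
flops_per_word = sum(
    count * ops_per_instruction[name] for name, count in fp_instruction_counts.items()
)
print(f"{fp_instructions} FP instructions, {flops_per_word} operations per word")
```

On this count, 22 floating point instructions deliver the 32 operations, with one load and one store per word.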

It is not clear (to me) what the maximum speed is for single precision calculations. These appear to compile to full SIMD operation, using quad word registers. With fused multiply and add, that could amount to 8 results per clock cycle, with 12 GFLOPS from one Pi 4 core and 48 GFLOPS overall. Maximum obtained was around 20 GFLOPS.
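
The peak figures quoted in this section follow from clock speed times results per clock cycle times number of cores. The Python sketch below repeats that arithmetic and compares it with the measured maxima quoted in the text. The 1.5 GHz clock is an assumption (the standard Pi 4B speed, not stated in this section), and the results per clock values are the ones suggested above, not confirmed hardware limits.

```python
CLOCK_GHZ = 1.5   # assumed Pi 4B CPU clock, not stated in this section
CORES = 4

def peak_gflops(results_per_clock, cores=CORES, clock_ghz=CLOCK_GHZ):
    """Theoretical peak GFLOPS for a given number of results per clock cycle."""
    return results_per_clock * clock_ghz * cores

dp_peak = peak_gflops(2)   # scalar fused multiply and add: 2 results per clock
sp_peak = peak_gflops(8)   # 4-wide SIMD fused multiply and add: 8 results per clock

# Measured maxima quoted in the text: 10.4 GFLOPS DP (MP-MFLOPS), 19.9 SP (OpenMP).
for name, peak, measured in (("DP", dp_peak, 10.4), ("SP", sp_peak, 19.9)):
    print(f"{name}: peak {peak:.0f} GFLOPS, measured {measured} "
          f"({100 * measured / peak:.0f}% of peak)")
```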


OpenMP-MemSpeed - OpenMP-MemSpeed2, NotOpenMP-MemSpeed2,
OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, with effects on Pi 3B+ and Pi 4B not the same. Detailed comparisons of these results are rather meaningless.


 Pi 3B+ ARM V7 
 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom 
 Start of test Wed Apr 24 22:45:07 2019 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m] 
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 4 6432 3483 1646 10276 5514 1770 18468 9721 1534
 8 7041 3603 1651 11747 5783 1788 19068 10085 1538
 16 7023 3606 1557 11694 5839 1672 19316 9528 1469
 32 6983 3600 1525 11413 5915 1656 19385 9532 1442
 64 6283 3554 1584 10861 5751 1621 14307 9466 1443
 128 6828 3578 1580 11074 5828 1659 10791 8935 1490
 256 5384 3365 1521 11216 5166 1687 9806 8148 1519
 512 5371 3253 1511 8917 4858 1412 7752 4363 1365
 1024 3084 2643 1066 3772 3504 1314 1450 1403 1136
 2048 3345 2087 1086 4148 3589 1471 1052 1063 1139
 4096 915 2648 894 4143 2456 1655 984 987 1190
 8192 3644 2504 1124 4183 3530 1496 903 909 1074
 16384 963 2050 922 3867 3154 1478 752 849 1156
 32768 3889 2467 1179 3562 3328 1667 838 833 1150
 65536 3902 2009 1109 3843 1437 1596 917 917 927
 131072 986 667 819 1145 904 820 858 865 584
 
 Not OMP 
 8 1860 2972 4449 2787 4039 4449 3168 3164 3170
 256 1810 2791 4137 2655 3860 4135 3126 3065 3066
 65536 960 1121 1109 1100 1120 1115 901 793 844

 Pi 4B ARM V7 
 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 4 7732 8092 1266 7627 8431 1616 31436 15892 889
 8 7546 8158 1284 7925 8537 1597 30383 16635 884
 16 7695 8198 1261 7854 8549 1598 27037 15644 896
 32 7773 7808 1255 8036 7727 1612 29621 16928 897
 64 9728 9094 1233 9355 9028 1602 16855 13297 867
 128 11296 10068 1002 11342 10813 1686 13594 15106 794
 256 13987 11677 1231 15357 13496 1732 12707 10415 878
 512 17763 8841 1170 10023 13404 1529 12655 9137 693
 1024 6070 6553 1262 10196 10069 1455 5405 5027 670
 2048 3858 6609 1343 6440 6643 1657 2234 2324 877
 4096 6055 6743 989 6608 6568 1664 2114 2369 777
 8192 1669 2047 1126 7071 6894 1581 2532 2569 857
 16384 1974 1953 1385 6748 4399 1763 2643 1845 753
 32768 1594 3482 1115 7680 7494 1814 1739 1908 1147
 65536 2630 7446 1320 1632 1826 1651 2061 2920 904
 131072 1438 1540 1249 1714 1694 1244 1760 2011 856

 Not OMP 
 8 8602 11536 13324 8607 11756 13378 7826 7689 7670
 256 8319 9856 10030 8338 8984 9308 5800 7510 7535
 65536 1373 1725 2071 2059 2072 2044 2170 912 900
 
 
 Pi 3B+ gcc 8 
 
 Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m] 
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 4 7065 3661 8370 10058 5245 9260 18199 9342 9242
 8 7350 3854 9338 11747 5786 10201 19108 9663 9412
 16 7444 3955 9543 11918 5961 10696 19339 9854 9831
 32 7198 3953 9537 9783 5908 10683 19075 9958 9971
 64 6848 3901 9057 11146 5168 9187 10408 9399 9440
 128 7655 3916 9113 11204 5785 10073 10315 9185 9191
 256 7044 3921 9154 11263 5785 10114 9601 9002 9019
 512 6662 3579 7738 9326 5206 7931 8313 7911 7903
 1024 4050 2892 4167 3997 3674 4318 1437 1422 1435
 2048 3996 2879 4134 4038 3624 4325 1042 1012 999
 4096 3909 2803 4078 3981 3591 4223 1047 988 1044
 8192 3880 2871 3805 4196 3555 4117 935 948 940
 16384 1366 2193 3757 4058 3178 3895 902 894 843
 32768 2202 2138 3428 3577 3335 3559 871 793 893
 65536 1180 1119 1696 1447 1178 1721 853 874 868
 131072 1016 688 1096 1133 893 1141 844 1141 1080

Not OMP 
 8 2020 1878 2056 2959 2018 2068 3398 4406 4406
 256 1973 1833 1990 2845 1966 1993 3306 4215 4215
 65536 1016 1248 1287 1130 1302 1301 1005 928 915

 Pi 4B gcc 8 
 Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom 
 Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m] 
 KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
 Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S

 4 8097 8322 8641 8020 8436 8384 39701 19701 19712
 8 7814 8555 8756 8321 8548 8526 39042 19984 19996
 16 8149 7738 7742 8303 7779 8192 37995 19883 19984
 32 8969 8769 8799 9040 8759 8743 37737 20133 20130
 64 7617 7457 7437 7575 7380 7422 17770 15332 14248
 128 11221 10936 11003 11105 11011 10986 13650 13910 13881
 256 17883 18144 18036 17691 18094 17844 13073 12465 12535
 512 18001 18468 19675 17075 18221 19264 13511 13895 12008
 1024 9532 10590 9772 11842 11282 11277 7173 9473 9496
 2048 7095 7025 6866 7117 7043 6946 2914 3475 3468
 4096 7244 6927 7036 5951 7054 6531 2582 3130 3122
 8192 4578 7173 7025 6322 7078 7182 2504 3127 3115
 16384 5470 7043 7067 7103 7052 7020 2557 3093 3088
 32768 7359 7817 7766 7158 7078 7757 2618 3066 3094
 65536 7810 7268 7266 3824 7478 5164 2486 3016 2931
 131072 2460 2655 7224 7513 7308 7339 2540 2944 2940

 Not OMP 
 8 11775 3895 4342 11787 4325 4354 10334 7806 7816
 256 10032 3699 4223 9978 4289 4185 7105 7612 7621
 65536 2099 2587 3033 2103 3021 3001 2585 1105 1101


I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but includes an extra test with caching allowed. For further details and downloads see the usual Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and stress tests.htm.


LanSpeed Benchmarks - WiFi - LanSpeed

Following are Raspberry Pi 3B+ and Pi 4B results using what I believe were both 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in the LAN/WiFi section of Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and stress tests.htm. Performance of the two systems was similar at both frequencies.


 ******************** Pi 3B+ 2.4 GHz ********************

 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 5.71 6.07 5.96 5.69 5.46 4.76
 16 6.14 6.38 6.47 6.14 6.15 5.91
 Random Read Write
 From MB 4 8 16 4 8 16
 msecs 2.94 3.081 3.185 3.04 2.89 3.7
 200 Files Write Read Delete
 File KB 4 8 16 4 8 16 secs
 MB/sec 0.16 0.57 0.96 0.36 0.63 1.17
 ms/file 25.3 14.31 17.1 11.46 13.04 14.06 2.138

 ********************* Pi 3B+ 5 GHz *********************

 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 
 8 12.82 14.52 14.00 10.98 11.09 8.94
 16 11.60 12.91 4.48 9.16 8.19 7.69
 200 Files Write Read Delete
 File KB 4 8 16 4 8 16 secs
 MB/sec 0.41 0.76 1.46 0.41 0.74 1.46
 ms/file 9.96 10.83 11.19 10.11 11.02 11.23 1.990
 Random similar to 2.4 GHz

 ********************* Pi 4B 2.4 GHz ********************

 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 6.35 6.33 6.38 7.05 6.98 7.10
 16 6.70 6.82 6.76 7.19 6.53 7.22
 Random Read Write
 From MB 4 8 16 4 8 16
 msecs 2.691 2.875 3.048 3.13 2.93 2.84
 200 Files Write Read Delete
 File KB 4 8 16 4 8 16 secs
 MB/sec 0.34 0.44 1.04 0.37 0.37 1.26
 ms/file 12.14 18.59 15.7 11.1 22.2 12.99 2.153

 ********************** Pi 4B 5 GHz *********************

 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 11.90 12.96 13.16 10.11 9.55 9.66
 16 11.50 13.93 14.13 9.91 8.88 9.92
 200 Files Write Read Delete
 File KB 4 8 16 4 8 16 secs
 MB/sec 0.13 0.46 0.91 0.25 0.55 1.02
 ms/file 30.85 17.83 18.10 16.62 14.93 16.01 3.361
 Random similar to 2.4 GHz


LanSpeed Benchmark - (1G bits per second Ethernet on Pi 4B) - LanSpeed

There can be significant variability in performance with these small samples. For the large files, the default sizes were increased to produce more stable speeds. In this case, 1 Gbps was clearly demonstrated using the Pi 4B, at around three times faster than the Pi 3B+. Random access was mainly slightly faster via the Pi 4B and, with the small files, it was perhaps 25% faster on writing and 50% faster on reading.


 ************************ Pi 3B+ ************************
 
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 31.17 31.62 31.61 13.5 26.19 26.38
 16 31.62 31.89 31.76 26.7 26.94 27.01
 Random Read Write
 From MB 4 8 16 4 8 16
 msecs 0.007 1.09 0.688 1.16 1.04 1.08
 200 Files Write Read Delete
 File KB 4 8 16 4 8 16 secs
 MB/sec 1.15 2.26 4.18 1.73 3.18 5.66
 ms/file 3.57 3.62 3.92 2.36 2.58 2.89 0.511

 Larger Files
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 32 31.99 31.61 32.13 21.39 27.09 26.87
 64 32.33 32.37 32.35 26.94 26.98 26.7

 ************************ Pi 4B ************************

 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 67.82 12.97 90.19 99.84 93.49 96.83
 16 92.25 92.66 92.96 103.9 105.28 91.17
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.007 0.01 0.04 1.01 0.85 0.91
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.47 2.8 5.14 2.47 4.71 8.61
ms/file 2.78 2.92 3.19 1.66 1.74 1.9 0.256

 Larger Files
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 32 78.2 34.46 80.71 84.94 87.11 84.97
 64 88.18 87.52 87.03 111.34 109.58 107.28
 128 98.84 99.24 96.58 110.99 110.57 87.43
 256 106.75 105.43 106.4 85.78 108.99 106.29
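
To relate the MBytes/second figures above to the 1 Gbps Ethernet rating, multiply by eight bits per byte. The Python sketch below does this for one large file reading result from each system. It assumes that the benchmark's MB means 10**6 bytes (an assumption, the text does not say) and ignores network protocol overheads.

```python
def mbytes_to_gbits(mbytes_per_second):
    """Convert MBytes/second to Gbits/second, assuming 1 MB = 10**6 bytes."""
    return mbytes_per_second * 8 / 1000

# 64 MB Read1 speeds on each system, from the Larger Files tables above.
pi3b_plus_read = 26.94
pi4b_read = 111.34

print(f"Pi 3B+: {mbytes_to_gbits(pi3b_plus_read):.2f} Gbps")
print(f"Pi 4B:  {mbytes_to_gbits(pi4b_read):.2f} Gbps")
```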


USB Benchmarks - DriveSpeed

Following are DriveSpeed results on Pi 3B+ and 4B, using the same high speed USB 3 stick (SanDisk Extreme with write/read ratings of 110/190 MB/s and 16 KB sectors). Other sticks would probably provide different comparative performance.

On the largest files shown, Pi 4B performance gains were 2.2 times on writing and 5.3 times on reading. Unlike LanSpeed, DriveSpeed uses direct I/O, leading to an extra entry for cached files, where reading is mainly influenced by RAM speeds. Results can be too variable to provide meaningful comparisons.

Random access speeds were quite similar. On small files, relative reading speed was indicated as five times faster on the Pi 4B, but the 3B+ appeared to be nearly 30 times faster on writing.

For the Pi 4B, additional large file results are included for a Patriot Rage 2 USB 3 stick, rated as reading at up to 400 MB/second, with nearly 300 MB/second demonstrated using a Windows version of DriveSpeed. In this case, it appeared to be slightly slower than the first one on reading, but faster on writing, at 80 MB/second. This second drive also obtained those painfully slow speeds on writing small files.

 ********************* Pi 3B+ USB 2 ********************

 DriveSpeed RasPi 1.1 Wed Apr 24 22:09:09 2019
 /media/pi/REMIX_OS/
 Total MB 9017, Free MB 7486, Used MB 1531
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 27.71 27.35 27.13 30.72 30.9 31.31
 16 27.21 27.54 23.69 29.89 31.34 31.27
Cached
 8 52.24 59.57 46.88 333.08 741.57 780.68
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.403 0.403 0.404 0.74 0.85 0.59
200 File Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.10 2.12 3.82 6.04 9.17 14.01
ms/file 3.71 3.86 4.28 0.68 0.89 1.17 0.123
 MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
 1000 27.25 27.25 27.19 31.23 31.27 31.27
 2000 27.30 27.07 27.32 31.32 31.26 31.26

 ********************* Pi 4B USB 3 *********************

 DriveSpeed RasPi 1.1 Fri Apr 26 17:21:56 2019
 /media/pi/REMIXOSSYS//
 Total MB 5108, Free MB 3982, Used MB 1126
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 33.28 32.27 32.28 161.34 162.25 163.85
 16 39.85 41.95 43.02 164.07 165.53 165.84
Cached
 8 33.32 34.96 34.96 593.94 582.25 589.22
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.383 0.372 0.371 0.77 0.83 0.63
200 File Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.04 0.07 0.15 20.64 41.04 70.01
ms/file 110.04 109.97 110.01 0.20 0.20 0.23 0.089
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 500 56.36 58.13 55.25 166.31 165.46 165.43
 1000 59.56 61.46 60.54 161.69 165.97 166.49
 /media/pi/PATRIOT/
 Total MB 120832, Free MB 120832, Used MB 0
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 1000 80.87 80.23 81.92 131.41 130.72 130.39
 2000 83.67 81.82 82.14 130.85 131.29 131.36
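
In the 200 files results, the MB/sec and ms/file rows are two views of the same measurement. The Python sketch below converts one to the other for the small file writing tests. It assumes file sizes in units of 1024 bytes and MB meaning 10**6 bytes, which fits the reported figures to within rounding. This relation is inferred from the numbers, not taken from the benchmark source.

```python
def small_file_mb_per_sec(file_kb, ms_per_file):
    """MB/second for one small file, assuming 1 KB = 1024 bytes, 1 MB = 10**6 bytes."""
    return file_kb * 1024 / (ms_per_file / 1000) / 1e6

# (file KB, ms/file, reported MB/sec) for small file writing, from the tables above.
pi3b_plus_writes = [(4, 3.71, 1.10), (8, 3.86, 2.12), (16, 4.28, 3.82)]
pi4b_writes = [(4, 110.04, 0.04), (8, 109.97, 0.07), (16, 110.01, 0.15)]

for label, rows in (("Pi 3B+", pi3b_plus_writes), ("Pi 4B", pi4b_writes)):
    for file_kb, ms, reported in rows:
        calculated = small_file_mb_per_sec(file_kb, ms)
        print(f"{label} {file_kb:2d} KB: calculated {calculated:.2f}, reported {reported:.2f}")
```

The constant 110 ms per file on the Pi 4B, whatever the file size, is what produces the very low MB/sec writing figures.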


Pi 4B Main Drive Benchmark - DriveSpeed

This shows DriveSpeed measured performance on the main drive which, in this case, was nowhere near USB 3 speeds.

 
 DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019
 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB 14845, Free MB 8198, Used MB 6646
 MBytes/Second
 MB Write1 Write2 Write3 Read1 Read2 Read3
 8 16.41 11.21 12.27 39.81 40.10 40.39
 16 11.79 21.10 34.05 40.18 40.19 40.33
Cached
 8 137.47 156.43 285.59 580.73 598.66 587.97
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.371 0.371 0.363 1.28 1.53 1.30
200 File Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 3.49 6.41 8.26 7.67 11.68 17.51
ms/file 1.17 1.28 1.98 0.53 0.70 0.94 0.014


Java Whetstone Benchmark - whetstc.class

The Java benchmarks were run after installing Oracle Java 8, then OpenJDK11 later.

Pi 4B performance was nearly as good as the compiled C version. However, there can be wide variations involving new Java versions. Here, the Pi 3B+ overall MWIPS rating was particularly slow, entirely due to the time taken by the sin,cos and exp,sqrt tests. Other than these, Pi 4B gains ranged from almost none on assignments to just over three times on the if then else and N6 floating point tests.


 ************************ Pi 3B+ ************************

 Whetstone Benchmark Java Version, May 14 2019, 15:02:11
 1 Pass
 Test Result MFLOPS MOPS millisecs
 N1 floating point -1.124750137 215.20 0.0892
 N2 floating point -1.131330490 208.76 0.6438
 N3 if then else 1.000000000 103.58 0.9992
 N4 fixed point 12.000000000 538.09 0.5854
 N5 sin,cos etc. 0.499110103 7.04 11.8100
 N6 floating point 0.999999821 106.22 5.0780
 N7 assignments 3.000000000 322.85 0.5724
 N8 exp,sqrt etc. 0.751108646 1.38 26.9200
 MWIPS 214.14 46.6980
 Operating System Linux, Arch. arm, Version 4.14.70-v7+
 Java Vendor Oracle Corporation, Version 1.8.0_212
 

 ************************ Pi 4B ************************

 Whetstone Benchmark Java Version, May 14 2019, 14:16:44
 1 Pass
 Test Result MFLOPS MOPS millisecs
 N1 floating point -1.124750137 503.94 0.0381
 N2 floating point -1.131330490 488.37 0.2752
 N3 if then else 1.000000000 332.80 0.3110
 N4 fixed point 12.000000000 881.37 0.3574
 N5 sin,cos etc. 0.499110132 42.92 1.9384
 N6 floating point 0.999999821 345.77 1.5600
 N7 assignments 3.000000000 332.97 0.5550
 N8 exp,sqrt etc. 0.825148463 25.00 1.4880
 MWIPS 1533.01 6.5231
 Operating System Linux, Arch. arm, Version 4.19.29-v7l+
 Java Vendor Oracle Corporation, Version 1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

 Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20
 1 Pass
 Test Result MFLOPS MOPS millisecs
 N1 floating point -1.124750137 524.02 0.0366
 N2 floating point -1.131330490 494.12 0.2720
 N3 if then else 1.000000000 289.92 0.3570
 N4 fixed point 12.000000000 1092.99 0.2882
 N5 sin,cos etc. 0.499110132 59.86 1.3900
 N6 floating point 0.999999821 345.95 1.5592
 N7 assignments 3.000000000 331.54 0.5574
 N8 exp,sqrt etc. 0.825148463 25.41 1.4640
 MWIPS 1687.92 5.9244
 Operating System Linux, Arch. arm, Version 4.19.37-v7l+
 Java Vendor BellSoft, Version 11.0.2-BellSoft
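
The effect of the sin,cos and exp,sqrt tests on the overall rating can be seen from the millisecond column. The Python sketch below adds up the times for the two Oracle Java 8 runs above and works out the share taken by tests N5 and N8. The reported MWIPS values are also consistent with 10000 divided by the total milliseconds. That relation is inferred from the figures shown and is an assumption, not taken from the benchmark source.

```python
# Milliseconds per test N1..N8 from the Oracle Java 8 results above.
pi3b_plus_ms = [0.0892, 0.6438, 0.9992, 0.5854, 11.8100, 5.0780, 0.5724, 26.9200]
pi4b_ms = [0.0381, 0.2752, 0.3110, 0.3574, 1.9384, 1.5600, 0.5550, 1.4880]

def function_share(ms):
    """Fraction of total time spent in N5 (sin,cos) and N8 (exp,sqrt)."""
    return (ms[4] + ms[7]) / sum(ms)

def mwips_from_ms(ms):
    """Assumed relation inferred from the reported figures: 10000 / total ms."""
    return 10000 / sum(ms)

for label, ms in (("Pi 3B+", pi3b_plus_ms), ("Pi 4B", pi4b_ms)):
    print(f"{label}: N5+N8 share {100 * function_share(ms):.0f}%, "
          f"MWIPS {mwips_from_ms(ms):.2f}")
```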


JavaDraw Benchmark - JavaDrawPi.class

The benchmark uses small to rather excessive numbers of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. In order for this to run at maximum speed, it was necessary to disable the experimental GL driver.

Pi 4B performance gains were best on the most complex test function.

A later version was produced and run via OpenJDK11.


 ************************ Pi 3B+ ************************

 Java Drawing Benchmark, May 14 2019, 15:32:06
 Produced by javac 1.6.0_27
 Test Frames FPS
 Display PNG Bitmap Twice Pass 1 566 56.55
 Display PNG Bitmap Twice Pass 2 651 65.00
 Plus 2 SweepGradient Circles 665 66.45
 Plus 200 Random Small Circles 660 65.93
 Plus 320 Long Lines 442 44.16
 Plus 4000 Random Small Circles 334 33.30
 Total Elapsed Time 60.1 seconds
 Operating System Linux, Arch. arm, Version 4.14.70-v7+
 Java Vendor Oracle Corporation, Version 1.8.0_212

 ************************ Pi 4B ************************

 Java Drawing Benchmark, May 14 2019, 14:33:58
 Produced by javac 1.7.0_02
 Test Frames FPS
 Display PNG Bitmap Twice Pass 1 791 79.05
 Display PNG Bitmap Twice Pass 2 932 93.11
 Plus 2 SweepGradient Circles 1152 115.17
 Plus 200 Random Small Circles 1200 119.98
 Plus 320 Long Lines 784 78.31
 Plus 4000 Random Small Circles 621 62.03
 Total Elapsed Time 60.1 seconds
 Operating System Linux, Arch. arm, Version 4.19.29-v7l+
 Java Vendor Oracle Corporation, Version 1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

 Java Drawing Benchmark, May 15 2019, 18:55:41
 Produced by OpenJDK 11 javac
 Test                             Frames     FPS
 Display PNG Bitmap Twice Pass 1     877   87.65
 Display PNG Bitmap Twice Pass 2    1042  104.18
 Plus 2 SweepGradient Circles       1015  101.47
 Plus 200 Random Small Circles       779   77.85
 Plus 320 Long Lines                 336   33.52
 Plus 4000 Random Small Circles       83    8.25
 Total Elapsed Time 60.1 seconds
 Operating System Linux, Arch. arm, Version 4.19.37-v7l+
 Java Vendor BellSoft, Version 11.0.2-BellSoft
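
The observation that Pi 4B gains were best on the most complex test can be checked from the FPS columns of the first two logs. The following is a minimal sketch in Python, with the FPS values copied from the Pi 3B+ and Pi 4B logs above (the Oracle Java runs). The names `TESTS`, `pi3b_fps`, `pi4b_fps` and `fps_gains` are illustrative only and are not part of the benchmark.

```python
# FPS values copied from the Pi 3B+ and Pi 4B JavaDraw logs (Oracle Java runs).
TESTS = [
    "Display PNG Bitmap Twice Pass 1",
    "Display PNG Bitmap Twice Pass 2",
    "Plus 2 SweepGradient Circles",
    "Plus 200 Random Small Circles",
    "Plus 320 Long Lines",
    "Plus 4000 Random Small Circles",
]
pi3b_fps = [56.55, 65.00, 66.45, 65.93, 44.16, 33.30]
pi4b_fps = [79.05, 93.11, 115.17, 119.98, 78.31, 62.03]


def fps_gains(old, new):
    """Return the new/old FPS ratio for each test, in order."""
    return [n / o for o, n in zip(old, new)]


gains = fps_gains(pi3b_fps, pi4b_fps)
for name, gain in zip(TESTS, gains):
    print(f"{name:<32} {gain:5.2f}")

# Name of the test with the largest Pi 4B / Pi 3B+ ratio.
best = TESTS[gains.index(max(gains))]
print("Largest gain:", best)
```

On these figures the ratio rises from about 1.4 on the first bitmap pass to about 1.86 on the final test.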

OpenGL GLUT Benchmark - videogl32

In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark or, with selected functions, as a stress test of any duration.

After installing freeglut3, the benchmark ran as before. It measures graphics speed in Frames Per Second (FPS) via six tests, ranging from simple to more complex. The first four tests portray moving up and down a tunnel containing various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file:

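 # vblank_mode=0 asks Mesa not to synchronise to the display refresh, so FPS is not capped
 export vblank_mode=0
 # Three runs at fixed window sizes, then one with no size specified
 ./videogl32 Width 320, Height 240, NoEnd
 ./videogl32 Width 640, Height 480, NoHeading, NoEnd
 ./videogl32 Width 1024, Height 768, NoHeading, NoEnd
 ./videogl32 NoHeading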

Following are results from the Pi 3B+ and Pi 4B. The early tests depend on graphics speed, with the later ones becoming CPU speed dependent.


 ************************ Pi 3B+ ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Apr 12 22:21:35 2019
 Running Time Approximately 5 Seconds Each Test
 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
   Pixels         Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS
   320   240    343.8    208.3     88.4     56.6     24.3     15.5
   640   480    243.0    170.3     82.8     54.5     24.2     15.5
  1024   768    110.6    101.2     63.6     47.8     24.1     15.4
  1920  1080     49.5     47.3     36.8     32.9     23.4     14.9

 ************************ Pi 4B ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May 2 19:01:05 2019
 Running Time Approximately 5 Seconds Each Test
 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
   Pixels         Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS
   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
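
The split between graphics-limited and CPU-limited tests can be seen by dividing the Pi 4B figures by the Pi 3B+ figures. The following Python sketch uses the FPS values copied from the two tables above. The names `PI3B`, `PI4B` and `gains` are illustrative assumptions, not part of videogl32.

```python
# FPS values copied from the Pi 3B+ and Pi 4B tables, keyed by window size.
# Column order: coloured few, coloured all, textured few, textured all,
# wireframe kitchen, textured kitchen.
PI3B = {
    (320, 240): (343.8, 208.3, 88.4, 56.6, 24.3, 15.5),
    (640, 480): (243.0, 170.3, 82.8, 54.5, 24.2, 15.5),
    (1024, 768): (110.6, 101.2, 63.6, 47.8, 24.1, 15.4),
    (1920, 1080): (49.5, 47.3, 36.8, 32.9, 23.4, 14.9),
}
PI4B = {
    (320, 240): (766.7, 371.4, 230.6, 130.2, 32.5, 22.7),
    (640, 480): (427.3, 276.5, 206.0, 121.8, 31.7, 22.2),
    (1024, 768): (193.1, 178.8, 150.5, 110.4, 31.9, 21.5),
    (1920, 1080): (81.4, 79.4, 74.6, 68.3, 30.8, 20.0),
}


def gains(old, new):
    """Return new/old FPS ratios for every window size and test column."""
    return {
        size: tuple(n / o for o, n in zip(old[size], new[size]))
        for size in old
    }


ratios = gains(PI3B, PI4B)
for (wide, high), row in ratios.items():
    print(f"{wide:4d} x {high:4d}  " + " ".join(f"{r:5.2f}" for r in row))
```

On these figures the four tunnel tests gain between about 1.6 and 2.6 times, while the two kitchen tests gain between about 1.3 and 1.5 times.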

Stress Tests - MP-IntStress, MP-FPUStress, MP-FPUStressDP

A series of stress tests has also been run on the Raspberry Pi 4B, and these will be covered in a later report. The programs have command line parameters for running time, data size, number of threads, log number and complexity of calculations. In default mode, combinations of these are used to indicate relative performance, providing useful benchmarks. Following are Pi 3B+ and Pi 4B results.


 ************************ Pi 3B+ ************************
 MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:09:22 2019
 Benchmark 1, 2, 4, 8, 16 and 32 Threads
                      MB/second
                    KB      KB      MB      Same   All
   Secs  Thrds      16     160      16  Sumcheck Tests

    9.4      1    3497    3284    1813  00000000   Yes
    6.3      2    6994    6505    2123  FFFFFFFF   Yes
    5.6      4   13839   12528    1882  5A5A5A5A   Yes
    5.6      8   13723   13780    1872  AAAAAAAA   Yes
    5.6     16   13734   14049    1857  CCCCCCCC   Yes
    5.6     32   13499   13881    1879  0F0F0F0F   Yes

 ************************ Pi 4B ************************
 MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:39:57 2019

    4.9      1    5956    5754    3977  00000000   Yes
    3.6      2   11861   11429    3763  FFFFFFFF   Yes
    3.1      4   22998   21799    3464  5A5A5A5A   Yes
    3.1      8   22695   21128    3490  AAAAAAAA   Yes
    3.1     16   22835   23491    3485  CCCCCCCC   Yes
    3.0     32   22593   23485    3591  0F0F0F0F   Yes
 Average Gains Caches 1.68, RAM 1.91
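
The report does not state how the average gains were calculated. As an assumption, the following Python sketch takes the arithmetic mean of the Pi 4B / Pi 3B+ ratios, treating the 16 KB and 160 KB columns as cache results and the 16 MB column as RAM. On the integer figures above this reproduces the quoted values. The names `PI3B`, `PI4B`, `mean` and `average_gains` are illustrative only.

```python
# MB/second values copied from the MP-Integer-Test logs above.
# Columns: 16 KB, 160 KB (assumed cache based) and 16 MB (assumed RAM based).
PI3B = [
    (3497, 3284, 1813),
    (6994, 6505, 2123),
    (13839, 12528, 1882),
    (13723, 13780, 1872),
    (13734, 14049, 1857),
    (13499, 13881, 1879),
]
PI4B = [
    (5956, 5754, 3977),
    (11861, 11429, 3763),
    (22998, 21799, 3464),
    (22695, 21128, 3490),
    (22835, 23491, 3485),
    (22593, 23485, 3591),
]


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def average_gains(old_rows, new_rows):
    """Mean new/old ratio over the two cache columns and over the RAM column."""
    pairs = list(zip(old_rows, new_rows))
    cache = [new[col] / old[col] for old, new in pairs for col in (0, 1)]
    ram = [new[2] / old[2] for old, new in pairs]
    return mean(cache), mean(ram)


cache_gain, ram_gain = average_gains(PI3B, PI4B)
print(f"Average Gains Caches {cache_gain:.2f}, RAM {ram_gain:.2f}")
```

The same function can be applied to the MFLOPS columns of the floating point logs, with two cache columns and one RAM column per row.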

 ************************ Pi 3B+ ************************
 MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:10:28 2019
 Benchmark 1, 2, 4 and 8 Threads
                          MFLOPS            Numeric Results
              Ops/     KB     KB     MB      KB      KB      MB
   Secs Thrd  Word   12.8    128   12.8    12.8     128    12.8

    3.1   T1     2    857    849    414   40392   76406   99700
    5.5   T2     2   1661   1678    411   40392   76406   99700
    7.4   T4     2   3086   3336    413   40392   76406   99700
    9.4   T8     2   3194   3168    414   40392   76406   99700
   13.8   T1     8   1942   1935   1495   54756   85091   99820
   16.7   T2     8   3756   3824   1659   54756   85091   99820
   19.0   T4     8   7209   7528   1643   54756   85091   99820
   21.3   T8     8   6978   7341   1657   54756   85091   99820
   36.8   T1    32   2019   2050   1915   35296   66020   99519
   44.6   T2    32   4078   4031   3757   35296   66020   99519
   48.9   T4    32   7927   7910   6095   35296   66020   99519
   53.1   T8    32   7919   8141   6336   35296   66020   99519

 ************************ Pi 4B ************************
 MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:23:49 2019

    1.6   T1     2   2134   2607    656   40392   76406   99700
    2.9   T2     2   5048   5156    621   40392   76406   99700
    4.0   T4     2   7536   9939    681   40392   76406   99700
    5.2   T8     2   7934   9839    639   40392   76406   99700
    7.2   T1     8   5535   5420   2569   54756   85091   99820
    8.7   T2     8  10757  10732   2454   54756   85091   99820
   10.1   T4     8  18108  20703   2444   54756   85091   99820
   11.5   T8     8  19236  20286   2245   54756   85091   99820
   17.4   T1    32   5309   5270   5262   35296   66020   99519
   20.4   T2    32  10551  10528   9753   35296   66020   99519
   22.4   T4    32  20120  20886  11064   35296   66020   99519
   24.5   T8    32  19415  20464   9929   35296   66020   99519
 Average Gains Caches 2.72, RAM 1.75

 ************************ Pi 3B+ ************************
 MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:11:41 2019
 Double Precision Benchmark 1, 2, 4 and 8 Threads
                          MFLOPS            Numeric Results
              Ops/     KB     KB     MB      KB      KB      MB
   Secs Thrd  Word   12.8    128   12.8    12.8     128    12.8

    9.7   T1     2    215    213    173   40395   76384   99700
   15.9   T2     2    420    426    206   40395   76384   99700
   20.6   T4     2    819    830    205   40395   76384   99700
   25.3   T8     2    807    823    205   40395   76384   99700
   41.4   T1     8    508    502    437   54805   85108   99820
   49.8   T2     8   1002   1008    778   54805   85108   99820
   55.8   T4     8   1985   1955    768   54805   85108   99820
   61.6   T8     8   1974   1958    817   54805   85108   99820
  100.5   T1    32    799    794    775   35159   66065   99521
  120.1   T2    32   1595   1588   1533   35159   66065   99521
  130.5   T4    32   3115   3087   2731   35159   66065   99521
  140.7   T8    32   3154   3126   2821   35159   66065   99521

 ************************ Pi 4B ************************
 MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:26:37 2019
 Double Precision Benchmark 1, 2, 4 and 8 Threads

    3.4   T1     2    921    998    326   40395   76384   99700
    6.1   T2     2   1968   1995    308   40395   76384   99700
    8.4   T4     2   3465   3925    342   40395   76384   99700
   10.9   T8     2   3646   3702    301   40395   76384   99700
   15.1   T1     8   2377   2446   1283   54805   85108   99820
   18.1   T2     8   4916   4860   1326   54805   85108   99820
   20.5   T4     8   9202   9510   1391   54805   85108   99820
   23.1   T8     8   9090   9006   1298   54805   85108   99820
   34.5   T1    32   2695   2725   2707   35159   66065   99521
   40.3   T2    32   5416   5441   5121   35159   66065   99521
   44.1   T4    32  10666  10831   5275   35159   66065   99521
   48.3   T8    32  10427  10602   4832   35159   66065   99521
 Average Gains Caches 4.23, RAM 2.09
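
The effect of thread count can also be read from the MFLOPS columns. The following Python sketch divides each row by the single thread result, using the Pi 4B double precision figures at 32 operations per word copied from the log above. The names `PI4B_DP_OPS32` and `speedup_over_one_thread` are illustrative and not part of the stress test programs.

```python
# MFLOPS copied from the Pi 4B double precision log, 32 operations per word.
# Keys are thread counts; columns are 12.8 KB, 128 KB and 12.8 MB data sizes.
PI4B_DP_OPS32 = {
    1: (2695, 2725, 2707),
    2: (5416, 5441, 5121),
    4: (10666, 10831, 5275),
    8: (10427, 10602, 4832),
}


def speedup_over_one_thread(mflops_by_threads):
    """Return MFLOPS relative to the one-thread row, per data size."""
    base = mflops_by_threads[1]
    return {
        threads: tuple(value / ref for value, ref in zip(row, base))
        for threads, row in mflops_by_threads.items()
    }


scaling = speedup_over_one_thread(PI4B_DP_OPS32)
for threads, row in scaling.items():
    print(f"T{threads}: " + "  ".join(f"{s:4.2f}" for s in row))
```

With four threads, the two smaller data sizes come out close to four times the single thread speed, while the 12.8 MB column roughly doubles, which suggests that memory speed rather than core count is the limit there.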
