- geerlingguy
- Posts: 585
- Joined: Sun Feb 15, 2015 3:43 am
Re: NUMA Testing
That's in line with the other memory-specific tests posted in the thread. Have you tried any other, more realistic benchmarks, or any of your own software, to see if there are speedups with those usage patterns?
The question is not whether something should be done on a Raspberry Pi, it is whether it can be done on a Raspberry Pi.
Re: NUMA Testing
The calculation
8575.85 / 12319.23 = 0.696135229
indicates that NUMA is about 30 percent slower. This is surprising, since multi-core bandwidth-constrained code is exactly what the patch is supposed to optimise.
Do different batches of Raspberry Pi use different memory chips that behave differently?
Oh, wait. Try running the test with four cores.
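The arithmetic, spelled out (using the two bandwidth figures quoted above):

```python
# The slowdown implied by the two bandwidth figures quoted above.
numa_on = 8575.85    # MiB/sec with the fake NUMA patch
numa_off = 12319.23  # MiB/sec without
ratio = numa_on / numa_off
print(f"{ratio:.3f}x baseline, i.e. {(1 - ratio) * 100:.1f}% slower")
# → 0.696x baseline, i.e. 30.4% slower
```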
Re: NUMA Testing
Let me start by describing what I have here: two Raspberry Pi 5s (8 GB), both from the same batch (Pi store in Cambridge, UK, before people started receiving them at home). One has meaningful workloads running in the background; the other is completely blank. Both exhibit the same results, barring statistical noise.
And you are right, I neglected the number of threads. The results are still weird, but here they come:
No SDRAM_BANKLOW=1. 4 threads, increasing the total memory size tenfold to 100G, 1G blocks:
No SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
With SDRAM_BANKLOW=1. 4 threads, total memory size 100G, 1G blocks:
With SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
I also tested with 2 threads. I can certainly do this more systematically (running multiple experiments across multiple workloads). I am not so concerned about the overall performance improvement of the Pi 5 as I am with understanding the memory impact here:
I have my own interpretation of this data, but would love to hear your thoughts. As for realistically measuring this without synthetic benchmarks, I will refrain from commenting on the likes of Geekbench. Here's something tangible: how much time does it take for the Pi (full of services) to boot?
I wouldn't read too much into the 3% uplift on the kernel (or the 0.4% in userspace). In my world this is statistical noise.
No SDRAM_BANKLOW=1. 4 threads, increasing the total memory size tenfold to 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=4 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 72 ( 6.96 per second)
73728.00 MiB transferred (7131.56 MiB/sec)
General statistics:
total time: 10.3370s
total number of events: 72
Latency (ms):
min: 408.45
avg: 571.50
max: 607.71
95th percentile: 601.29
sum: 41148.34
Threads fairness:
events (avg/stddev): 18.0000/1.00
execution time (avg/stddev): 10.2871/0.05

No SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=1 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 100 ( 12.00 per second)
102400.00 MiB transferred (12283.80 MiB/sec)
General statistics:
total time: 8.3350s
total number of events: 100
Latency (ms):
min: 82.68
avg: 83.35
max: 85.03
95th percentile: 84.47
sum: 8334.53
Threads fairness:
events (avg/stddev): 100.0000/0.00
execution time (avg/stddev): 8.3345/0.00

With SDRAM_BANKLOW=1. 4 threads, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=4 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 86 ( 8.39 per second)
88064.00 MiB transferred (8593.18 MiB/sec)
General statistics:
total time: 10.2470s
total number of events: 86
Latency (ms):
min: 265.19
avg: 472.53
max: 719.95
95th percentile: 612.21
sum: 40637.52
Threads fairness:
events (avg/stddev): 21.5000/0.50
execution time (avg/stddev): 10.1594/0.06
With SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=1 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 84 ( 8.38 per second)
86016.00 MiB transferred (8581.84 MiB/sec)
General statistics:
total time: 10.0216s
total number of events: 84
Latency (ms):
min: 113.32
avg: 119.30
max: 122.29
95th percentile: 118.92
sum: 10021.14
Threads fairness:
events (avg/stddev): 84.0000/0.00
execution time (avg/stddev): 10.0211/0.00
Summary (write bandwidth, MiB/sec):
Code: Select all
| NUMA / Threads | 1 | 2 | 4 |
|----------------|-----------|-----------|-----------|
| Off | 12,283.80 | 9,162.35 | 7,131.56 |
| On | 8,581.84 | 9,208.26 | 8,593.18 |
| Gain/Loss | -30.14% | 0.50% | 20.50% |
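As a sanity check, the Gain/Loss row follows directly from the raw MiB/sec figures in the table above:

```python
# Recompute the Gain/Loss row of the summary table from the raw bandwidth figures.
numa_off = {1: 12283.80, 2: 9162.35, 4: 7131.56}  # MiB/sec, NUMA off
numa_on = {1: 8581.84, 2: 9208.26, 4: 8593.18}    # MiB/sec, NUMA on

for threads in sorted(numa_off):
    change = (numa_on[threads] / numa_off[threads] - 1) * 100
    print(f"{threads} thread(s): {change:+.2f}%")
```

which reproduces the -30.14% / +0.50% / +20.50% row.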
NUMA enabled:
Code: Select all
Startup finished in 3.482s (kernel) + 10.074s (userspace) = 13.556s
multi-user.target reached after 10.052s in userspace.
NUMA disabled:
Code: Select all
Startup finished in 3.593s (kernel) + 10.134s (userspace) = 13.728s
multi-user.target reached after 10.099s in userspace
- geerlingguy
- Posts: 585
- Joined: Sun Feb 15, 2015 3:43 am
Re: NUMA Testing
I've been running a full suite of tests, from ollama/LLMs to pts linux-kernel-recompile, 4K and 1080p x264 transcoding, HPL, and more, and all the tests show a significant boost (between 12 and 30%) when running with the new settings.
So far I've yet to see any testing result in a regression, except for memory bandwidth tests (I used tinymembench). Would be interesting to see exactly why that's the case!
Re: NUMA Testing
I like your table. My interpretation is that the NUMA allocator is working as expected: better when all cores are active, but worse for single-threaded code.

bytter wrote: ↑Sat Dec 07, 2024 12:24 pm
I am not so concerned about the overall performance improvement of the Pi 5, as I am trying to understand the memory impact here. I have my own interpretation of this data, but would love to hear your thoughts.
Code: Select all
| NUMA / Threads | 1 | 2 | 4 |
|----------------|-----------|-----------|-----------|
| Off | 12,283.80 | 9,162.35 | 7,131.56 |
| On | 8,581.84 | 9,208.26 | 8,593.18 |
| Gain/Loss | -30.14% | 0.50% | 20.50% |
Re: NUMA Testing
What do all of those tests share in common (in fact, what are the typical uses for non-faked NUMA)? Could it be due to improved memory locality, thread affinity optimisation and inherently better multi-threaded scaling?

geerlingguy wrote: ↑Sat Dec 07, 2024 1:20 pm
I've been running a full suite of tests from ollama/LLMs, to pts linux-kernel-recompile, 4K and 1080p x264 transcoding, HPL, and more and all the tests show a significant boost (between 12-30%) when running with the new settings.
So far I've yet to see any testing result in a regression, except for memory bandwidth tests (I used tinymembench). Would be interesting to see exactly why that's the case!
Here's another hypothesis: look at my second column (2 threads). How do you explain that one? Could it be because there's low memory contention, and the overheads of fake NUMA balance out?
Sorry, I'm not trying to be dense here. My hypothesis is that there is some kind of workload that _benefits_ from this, but not all do. All of those applications are fair benchmarks; I am curious, though, about the memory access patterns they exhibit, to understand why memory bandwidth tests paint a different picture.
Nice YouTube channel, btw ;-)
Re: NUMA Testing
Is this going to be added to raspi-config (and Desktop Config) so I don't have to keep remembering what to edit? :oops:
Re: NUMA Testing
Is this way safe enough?
Re: NUMA Testing
The "latest" release of the bootloader EEPROM now has it enabled by default, so if you're happy using that, there's no need to remember anything.
https://github.com/raspberrypi/rpi-eepr ... e-notes.md . The -2711 release for the Pi 4 is the same.
Re: NUMA Testing
NUMA is a way of representing, to the OS, the non-uniformity of CPU<->RAM performance exhibited by more complex CPU/memory controller/RAM topologies. The aim is to increase the likelihood that the memory locations accessed by a particular process are as "near" as possible to the CPU core that process is executing on, reducing memory latency and increasing bandwidth.
The simplest example is a host with dual CPUs, each of which has associated memory controller/RAM, with an interconnect between the two CPUs.
Performance is better (and energy consumption potentially lower) if processes mostly access RAM physically attached to the CPU they are allocated to, rather than having to traverse the inter-CPU interconnect, which adds latency, costs power, and may be a bandwidth bottleneck in some scenarios.
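A toy model of that effect (the latency numbers below are made up for illustration, not Pi 5 or any real hardware measurements): effective latency is simply the access-probability-weighted mix of near and far memory.

```python
# Toy model of NUMA locality: effective latency is the probability-weighted
# mix of near (local node) and far (remote node) memory access times.
def effective_latency_ns(local_ns: float, remote_ns: float, local_fraction: float) -> float:
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# Made-up numbers for a two-socket box: 80 ns local, 140 ns over the interconnect.
for frac in (1.0, 0.9, 0.5):
    print(f"{frac:.0%} local accesses -> {effective_latency_ns(80, 140, frac):.0f} ns average")
```

Even a modest fraction of remote accesses moves the average noticeably, which is why the scheduler tries to keep a process's pages on its own node.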
Re: NUMA Testing
Did some tests with Yamagi Quake 2 (OpenGL ES3) and vkQuake3 (Vulkan) on my Pi 5 8GB. Fully updated installation of Pi OS:
These are some pretty big regressions (and the first I've seen with NUMA enabled, except for the synthetic write benchmark mentioned earlier). The only change between these tests is that I removed the "SDRAM_BANKLOW=1" row from the EEPROM config for the NUMA off tests.
Code: Select all
Linux raspberrypi 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux
Code: Select all
BOOTLOADER: up to date
CURRENT: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
LATEST: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
Code: Select all
                                      NUMA off   NUMA on
Yamagi Quake 2 (1080p, GLES3, 16xAF)     102.0      95.0
vkQuake3 (1080p, Vulkan)                 168.5     148.0
EDIT: Just an observation: At least for the Quake 2 results, the results seemed more consistent with NUMA on. I got exactly the same result on the first run as on the fifth and the run-to-run variance was very low. With NUMA off the result tended to increase with the first few runs and the difference between the first and fifth run was ~3 %.
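Run-to-run variance like that is easy to quantify. A sketch with illustrative FPS values (shaped like the behaviour described above, i.e. stable with NUMA on and a ~3% warm-up drift with NUMA off; they are not the actual measurements):

```python
import statistics

# Quantifying run-to-run variance. The FPS values are illustrative only,
# not the actual benchmark measurements.
runs_numa_on = [95.0, 95.0, 95.0, 95.0, 95.0]
runs_numa_off = [99.0, 100.5, 101.5, 102.0, 102.0]

for label, runs in (("NUMA on ", runs_numa_on), ("NUMA off", runs_numa_off)):
    spread_pct = (max(runs) - min(runs)) / min(runs) * 100
    print(f"{label}: mean={statistics.mean(runs):6.2f}  "
          f"stdev={statistics.stdev(runs):.2f}  spread={spread_pct:.1f}%")
```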
Re: NUMA Testing
That's that sorted then :-)

xeny wrote: ↑Sat Dec 07, 2024 9:17 pm
The "latest" release of the bootloader EEPROM now have it enabled by default, so if you're happy using that, no need to remember anything.
https://github.com/raspberrypi/rpi-eepr ... e-notes.md . -2711 for the Pi 4 is the same.
Well, at least when "latest" becomes default on GitHub, so it then becomes the latest in rpi-eeprom, or something like that.
I never did get the apparent mismatch in naming with these.
I think it used to be worse some years back.
- Attachments
- Screenshot 2024-12-08 145628.jpg
- DanielLi64
- Posts: 4
- Joined: Sat Nov 30, 2024 10:02 pm
Re: NUMA Testing
Setting SDRAM_BANKLOW=1 is not ideal, depending on what you're using your Pi for. It would be very interesting to see your Quake benchmarks with SDRAM_BANKLOW=2 and SDRAM_BANKLOW=3 (even faster writes).

Mikael wrote: ↑Sun Dec 08, 2024 10:08 am
Did some tests with Yamagi Quake 2 (OpenGL ES3) and vkQuake3 (Vulkan) on my Pi 5 8GB. Fully updated installation of Pi OS:
Code: Select all
Linux raspberrypi 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux
Code: Select all
BOOTLOADER: up to date
CURRENT: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
LATEST: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
These are some pretty big regressions (and the first I've seen with NUMA enabled, except for the synthetic write benchmark mentioned earlier). The only change between these tests is that I removed the "SDRAM_BANKLOW=1" row from the EEPROM config for the NUMA off tests.
Code: Select all
                                      NUMA off   NUMA on
Yamagi Quake 2 (1080p, GLES3, 16xAF)     102.0      95.0
vkQuake3 (1080p, Vulkan)                 168.5     148.0
EDIT: Just an observation: At least for the Quake 2 results, the results seemed more consistent with NUMA on. I got exactly the same result on the first run as on the fifth and the run-to-run variance was very low. With NUMA off the result tended to increase with the first few runs and the difference between the first and fifth run was ~3 %.
Some numbers:
tinymembench v0.4.10
--- SDRAM_BANKLOW=1 ---
standard memcpy : 5805.6 MB/s (0.6%)
standard memset : 9981.5 MB/s (1.0%)
--- SDRAM_BANKLOW=2 ---
standard memcpy : 6560.5 MB/s (0.7%)
standard memset : 16119.0 MB/s (1.8%)
glmark2 2023.01
--- SDRAM_BANKLOW=1 ---
glmark2 Score: 1990
--- SDRAM_BANKLOW=2 ---
glmark2 Score: 2270
Geekbench 6
--- SDRAM_BANKLOW=1 ---
https://browser.geekbench.com/v6/cpu/9313128
Single=1108
Multi=2402
--- SDRAM_BANKLOW=2 ---
https://browser.geekbench.com/v6/cpu/9312651
Single = 1102
Multi = 2345
And the lot:
=======================================================
SDRAM_BANKLOW=2
=======================================================
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 6288.8 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6270.4 MB/s (0.3%)
C copy backwards (64 byte blocks) : 6271.8 MB/s (0.3%)
C copy : 6598.0 MB/s (0.7%)
C copy prefetched (32 bytes step) : 6617.8 MB/s (0.7%)
C copy prefetched (64 bytes step) : 6615.7 MB/s (0.7%)
C 2-pass copy : 5818.3 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 6642.0 MB/s (0.9%)
C 2-pass copy prefetched (64 bytes step) : 6650.3 MB/s (1.4%)
C fill : 16067.0 MB/s (1.6%)
C fill (shuffle within 16 byte blocks) : 16068.3 MB/s (1.5%)
C fill (shuffle within 32 byte blocks) : 16157.3 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 16138.8 MB/s (1.6%)
NEON 64x2 COPY : 6537.3 MB/s (0.5%)
NEON 64x2x4 COPY : 6544.9 MB/s (0.5%)
NEON 64x1x4_x2 COPY : 6548.1 MB/s (0.6%)
NEON 64x2 COPY prefetch x2 : 6291.5 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 6453.7 MB/s (0.5%)
NEON 64x2 COPY prefetch x1 : 6206.8 MB/s (0.4%)
NEON 64x2x4 COPY prefetch x1 : 6451.2 MB/s (0.5%)
---
standard memcpy : 6560.5 MB/s (0.7%)
standard memset : 16119.0 MB/s (1.8%)
---
NEON LDP/STP copy : 6551.7 MB/s (0.6%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 6588.2 MB/s (0.5%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 6587.2 MB/s (0.7%)
NEON LDP/STP copy pldl1keep (32 bytes step) : 6578.2 MB/s (0.7%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 6578.3 MB/s (0.6%)
NEON LD1/ST1 copy : 6545.9 MB/s (0.7%)
NEON STP fill : 16130.8 MB/s (1.9%)
NEON STNP fill : 16088.6 MB/s (1.7%)
ARM LDP/STP copy : 6544.1 MB/s (0.8%)
ARM STP fill : 16077.2 MB/s (1.8%)
ARM STNP fill : 16101.3 MB/s (1.9%)
==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================
NEON LDP/STP copy (from framebuffer) : 1814.6 MB/s (0.4%)
NEON LDP/STP 2-pass copy (from framebuffer) : 1628.9 MB/s (0.7%)
NEON LD1/ST1 copy (from framebuffer) : 1823.5 MB/s (0.4%)
NEON LD1/ST1 2-pass copy (from framebuffer) : 1633.5 MB/s (0.6%)
ARM LDP/STP copy (from framebuffer) : 1820.1 MB/s (0.4%)
ARM LDP/STP 2-pass copy (from framebuffer) : 1631.2 MB/s (0.6%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 0.8 ns / 1.1 ns
262144 : 1.2 ns / 1.5 ns
524288 : 2.1 ns / 2.7 ns
1048576 : 6.5 ns / 8.7 ns
2097152 : 10.1 ns / 12.5 ns
4194304 : 45.0 ns / 67.9 ns
8388608 : 71.2 ns / 95.2 ns
16777216 : 84.5 ns / 104.2 ns
33554432 : 92.8 ns / 108.7 ns
67108864 : 97.5 ns / 111.4 ns
=======================================================
glmark2 2023.01
=======================================================
OpenGL Information
GL_VENDOR: Broadcom
GL_RENDERER: V3D 7.1
GL_VERSION: 3.1 Mesa 23.2.1-1~bpo12+rpt3
Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
Surface Size: 800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 2821 FrameTime: 0.355 ms
[build] use-vbo=true: FPS: 3790 FrameTime: 0.264 ms
[texture] texture-filter=nearest: FPS: 3002 FrameTime: 0.333 ms
[texture] texture-filter=linear: FPS: 2980 FrameTime: 0.336 ms
[texture] texture-filter=mipmap: FPS: 3096 FrameTime: 0.323 ms
[shading] shading=gouraud: FPS: 3121 FrameTime: 0.320 ms
[shading] shading=blinn-phong-inf: FPS: 2961 FrameTime: 0.338 ms
[shading] shading=phong: FPS: 2589 FrameTime: 0.386 ms
[shading] shading=cel: FPS: 2534 FrameTime: 0.395 ms
[bump] bump-render=high-poly: FPS: 1626 FrameTime: 0.615 ms
[bump] bump-render=normals: FPS: 3479 FrameTime: 0.288 ms
[bump] bump-render=height: FPS: 3266 FrameTime: 0.306 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1438 FrameTime: 0.696 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 624 FrameTime: 1.603 ms
[pulsar] light=false:quads=5:texture=false: FPS: 3648 FrameTime: 0.274 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 328 FrameTime: 3.055 ms
[desktop] effect=shadow:windows=4: FPS: 1196 FrameTime: 0.836 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 585 FrameTime: 1.712 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 524 FrameTime: 1.909 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 627 FrameTime: 1.595 ms
[ideas] speed=duration: FPS: 2734 FrameTime: 0.366 ms
[jellyfish] <default>: FPS: 1588 FrameTime: 0.630 ms
[terrain] <default>: FPS: 91 FrameTime: 11.076 ms
[shadow] <default>: FPS: 175 FrameTime: 5.728 ms
[refract] <default>: FPS: 85 FrameTime: 11.845 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 4118 FrameTime: 0.243 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2971 FrameTime: 0.337 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 4011 FrameTime: 0.249 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 3487 FrameTime: 0.287 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 2452 FrameTime: 0.408 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 3378 FrameTime: 0.296 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 3347 FrameTime: 0.299 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 2287 FrameTime: 0.437 ms
=======================================================
glmark2 Score: 2270
=======================================================
Geekbench 6
https://browser.geekbench.com/v6/cpu/9312651
Single = 1102
Multi = 2345
=======================================================
SDRAM_BANKLOW=1
=======================================================
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 5953.7 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6040.8 MB/s (0.9%)
C copy backwards (64 byte blocks) : 5993.7 MB/s (0.3%)
C copy : 5844.0 MB/s (0.4%)
C copy prefetched (32 bytes step) : 5845.7 MB/s (0.4%)
C copy prefetched (64 bytes step) : 5838.4 MB/s (0.5%)
C 2-pass copy : 5779.0 MB/s (0.4%)
C 2-pass copy prefetched (32 bytes step) : 5835.9 MB/s (0.8%)
C 2-pass copy prefetched (64 bytes step) : 5839.4 MB/s (0.8%)
C fill : 9959.1 MB/s (0.9%)
C fill (shuffle within 16 byte blocks) : 9982.9 MB/s (1.0%)
C fill (shuffle within 32 byte blocks) : 9977.9 MB/s (0.9%)
C fill (shuffle within 64 byte blocks) : 9965.7 MB/s (0.9%)
NEON 64x2 COPY : 5765.3 MB/s (0.4%)
NEON 64x2x4 COPY : 5793.1 MB/s (0.5%)
NEON 64x1x4_x2 COPY : 5787.9 MB/s (0.6%)
NEON 64x2 COPY prefetch x2 : 5676.1 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 5782.0 MB/s (0.5%)
NEON 64x2 COPY prefetch x1 : 5641.1 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 5781.3 MB/s (0.5%)
---
standard memcpy : 5805.6 MB/s (0.6%)
standard memset : 9981.5 MB/s (1.0%)
---
NEON LDP/STP copy : 5788.9 MB/s (0.6%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 5907.8 MB/s (0.5%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 5907.5 MB/s (0.5%)
NEON LDP/STP copy pldl1keep (32 bytes step) : 5897.8 MB/s (0.5%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 5898.5 MB/s (0.5%)
NEON LD1/ST1 copy : 5789.9 MB/s (0.5%)
NEON STP fill : 9978.4 MB/s (1.1%)
NEON STNP fill : 9983.6 MB/s (0.9%)
ARM LDP/STP copy : 5779.9 MB/s (0.5%)
ARM STP fill : 9972.8 MB/s (1.0%)
ARM STNP fill : 9953.1 MB/s (1.0%)
==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================
NEON LDP/STP copy (from framebuffer) : 1796.6 MB/s (0.5%)
NEON LDP/STP 2-pass copy (from framebuffer) : 1614.1 MB/s (0.6%)
NEON LD1/ST1 copy (from framebuffer) : 1807.0 MB/s (0.6%)
NEON LD1/ST1 2-pass copy (from framebuffer) : 1625.4 MB/s (0.7%)
ARM LDP/STP copy (from framebuffer) : 1801.5 MB/s (0.5%)
ARM LDP/STP 2-pass copy (from framebuffer) : 1621.6 MB/s (0.7%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 0.8 ns / 1.1 ns
262144 : 1.2 ns / 1.5 ns
524288 : 1.5 ns / 1.7 ns
1048576 : 6.5 ns / 8.7 ns
2097152 : 9.5 ns / 11.3 ns
4194304 : 45.0 ns / 67.9 ns
8388608 : 71.4 ns / 95.2 ns
16777216 : 84.6 ns / 104.2 ns
33554432 : 92.9 ns / 108.7 ns
67108864 : 97.6 ns / 111.4 ns
=======================================================
glmark2 2023.01
=======================================================
OpenGL Information
GL_VENDOR: Broadcom
GL_RENDERER: V3D 7.1
GL_VERSION: 3.1 Mesa 23.2.1-1~bpo12+rpt3
Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
Surface Size: 800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 2443 FrameTime: 0.409 ms
[build] use-vbo=true: FPS: 3296 FrameTime: 0.303 ms
[texture] texture-filter=nearest: FPS: 2627 FrameTime: 0.381 ms
[texture] texture-filter=linear: FPS: 2594 FrameTime: 0.386 ms
[texture] texture-filter=mipmap: FPS: 2695 FrameTime: 0.371 ms
[shading] shading=gouraud: FPS: 2748 FrameTime: 0.364 ms
[shading] shading=blinn-phong-inf: FPS: 2614 FrameTime: 0.383 ms
[shading] shading=phong: FPS: 2269 FrameTime: 0.441 ms
[shading] shading=cel: FPS: 2237 FrameTime: 0.447 ms
[bump] bump-render=high-poly: FPS: 1402 FrameTime: 0.714 ms
[bump] bump-render=normals: FPS: 3034 FrameTime: 0.330 ms
[bump] bump-render=height: FPS: 2832 FrameTime: 0.353 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1260 FrameTime: 0.794 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 551 FrameTime: 1.817 ms
[pulsar] light=false:quads=5:texture=false: FPS: 3196 FrameTime: 0.313 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 288 FrameTime: 3.483 ms
[desktop] effect=shadow:windows=4: FPS: 1038 FrameTime: 0.964 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 481 FrameTime: 2.079 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 498 FrameTime: 2.010 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 533 FrameTime: 1.878 ms
[ideas] speed=duration: FPS: 2439 FrameTime: 0.410 ms
[jellyfish] <default>: FPS: 1413 FrameTime: 0.708 ms
[terrain] <default>: FPS: 81 FrameTime: 12.423 ms
[shadow] <default>: FPS: 156 FrameTime: 6.443 ms
[refract] <default>: FPS: 78 FrameTime: 12.884 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 3569 FrameTime: 0.280 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2610 FrameTime: 0.383 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 3529 FrameTime: 0.283 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 3068 FrameTime: 0.326 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 2157 FrameTime: 0.464 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 2973 FrameTime: 0.336 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 2987 FrameTime: 0.335 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 2027 FrameTime: 0.494 ms
=======================================================
glmark2 Score: 1990
=======================================================
https://browser.geekbench.com/v6/cpu/9313128
Single=1108
Multi=2402
Re: NUMA Testing
There seems to be an issue with both the official 6.12 kernel (via rpi-update next) and my self-compiled 6.13. With 6.6, I could see the usual 8 memory regions; with 6.12/6.13 I get this instead:
Code: Select all
$ dmesg | grep NUMA
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x00000001ffffffff]
[ 0.000000] NUMA: Initialized distance table, cnt=8
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
[ 0.294843] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.432681] pci_bus 0001:00: Unknown NUMA node; performance will be reduced

And yes, I do have SDRAM_BANKLOW=1, giving me numa_policy=interleave numa=fake=8 on my Pi5.
Thanks for your help!
P.S.: Official kernel 6.6.62 gives me
Code: Select all
$ dmesg | grep NUMA
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: NODE_DATA [mem 0x3fbfd2c0-0x3fbfffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x7fffd2c0-0x7fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xbfffd2c0-0xbfffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xffffd2c0-0xffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x13fffd2c0-0x13fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x17fffd2c0-0x17fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x1bfffd2c0-0x1bfffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x1ffb892c0-0x1ffb8bfff]
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
[ 0.229644] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.381417] pci_bus 0001:00: Unknown NUMA node; performance will be reduced
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: NUMA Testing
The implementation of fake NUMA on 6.12 (and later) is different, hence the difference in dmesg output.
The important line is:
Code: Select all
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
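Checking for that line is easy to script. A sketch that just looks for it in dmesg output (the sample string below is the line quoted above; feed the function the output of `dmesg` on a real system):

```python
import re

# Detect the line that confirms the fake-NUMA interleave policy is active.
def fake_numa_interleave_active(dmesg_text: str) -> bool:
    return re.search(
        r"mempolicy: NUMA default policy overridden to 'interleave:", dmesg_text
    ) is not None

sample = "[    0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'"
print(fake_numa_interleave_active(sample))  # True
```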
Re: NUMA Testing
I think that a trip into raspi-config, option 6 (Bootloader Version), and specifying "latest" there will result in rpi-eeprom installing the latest version.
Re: NUMA Testing
Here's a more systematic analysis of the memory bandwidth under different parameters:
- bandwidth-1-16M-RPI5_numa_corrected_threads-1.png
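For anyone wanting to reproduce a sweep like this, here is a sketch. It assumes sysbench is installed; the regex matches the "(NNNN.NN MiB/sec)" figure that appears in the sysbench output pasted earlier in the thread, and run_sweep() is only illustrative.

```python
import re
import subprocess

# Sweep sysbench memory-write bandwidth over thread counts.
BW_RE = re.compile(r"\(([\d.]+) MiB/sec\)")

def parse_bandwidth(output: str) -> float:
    match = BW_RE.search(output)
    if match is None:
        raise ValueError("no MiB/sec figure in sysbench output")
    return float(match.group(1))

def run_sweep(threads=(1, 2, 4)):
    for t in threads:
        out = subprocess.run(
            ["sysbench", "memory", "--memory-block-size=1G",
             "--memory-total-size=100G", f"--threads={t}",
             "--memory-oper=write", "run"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"{t} threads: {parse_bandwidth(out):.2f} MiB/sec")

# Parsing demo against a line from the thread (run_sweep() needs a real Pi):
print(parse_bandwidth("73728.00 MiB transferred (7131.56 MiB/sec)"))  # 7131.56
```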
Re: NUMA Testing
Active cooling. I am assuming it's adequate, though I would need to control for thermal throttling.
Re: NUMA Testing
For me the data point that looks surprising is the 4M random-access write. That's the only one where 2 cores seemed to experience a noticeably greater performance increase with NUMA than 4 cores.
Have you set the CPU governor to performance?
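A quick way to check the governor (a sketch; the sysfs path below is the usual cpufreq location, which may be absent on some kernels, hence the None fallback):

```python
from pathlib import Path

# Read the current cpufreq governor. The path is parameterised so it can be
# pointed at any CPU's cpufreq node; returns None where cpufreq is absent.
def read_governor(path: str = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"):
    p = Path(path)
    return p.read_text().strip() if p.exists() else None

print(read_governor())
```

Setting it to performance is typically `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`, or `cpupower frequency-set -g performance` where cpupower is installed.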
Re: NUMA Testing
I'm now using the fake NUMA patch, having updated to the latest EEPROM firmware and kernel. I'm seeing these two messages in dmesg (after the 'interleave 0-7' one, which confirms fake NUMA has done its thing):
What do these mean - is there actually a performance hit, or is it just assumed there will be?
Code: Select all
[ 0.268246] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.399843] pci_bus 0001:00: Unknown NUMA node; performance will be reduced