- geerlingguy
- Posts: 585
- Joined: Sun Feb 15, 2015 3:43 am
Re: NUMA Testing
That's in line with the other memory-specific tests posted in the thread. Have you tried any other, more realistic benchmarks, or any of your own software, to see if there are speedups with those usage patterns?
The question is not whether something should be done on a Raspberry Pi, it is whether it can be done on a Raspberry Pi.
Re: NUMA Testing
The calculation
8575.85 / 12319.23 = 0.696135229
indicates that NUMA is about 30 percent slower. This is surprising, since multi-core bandwidth-constrained code is exactly what the patch is supposed to optimise.
Do different batches of Raspberry Pi use different memory chips that behave differently?
Oh, wait. Try running the test with four cores.
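The arithmetic, spelled out (using the two bandwidth figures quoted above):

```python
# The slowdown implied by the two bandwidth figures quoted above.
numa_on = 8575.85    # MiB/sec with the fake NUMA patch
numa_off = 12319.23  # MiB/sec without
ratio = numa_on / numa_off
print(f"{ratio:.3f}x baseline, i.e. {(1 - ratio) * 100:.1f}% slower")
# → 0.696x baseline, i.e. 30.4% slower
```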
Re: NUMA Testing
Let me start by describing what I have here: two Raspberry Pi 5s (8 GB), both from the same batch (Pi store in Cambridge, UK, before people started receiving them at home). One has meaningful workloads running in the background; the other is completely blank. Both exhibit the same results, barring statistical noise.
And you are right, I neglected the number of threads. The results are still weird, but here they come:
No SDRAM_BANKLOW=1. 4 threads, increasing the total memory size tenfold to 100G, 1G blocks:
No SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
With SDRAM_BANKLOW=1. 4 threads, total memory size 100G, 1G blocks:
With SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
I also tested with 2 threads. I can certainly do this more systematically (running multiple experiments across multiple workloads). I am not so concerned about the overall performance improvement of the Pi 5 as I am with understanding the memory impact here:
I have my own interpretation of this data, but would love to hear your thoughts. As for realistically measuring this without synthetic benchmarks, I will refrain from commenting on the likes of Geekbench. Here's something tangible: how much time does it take for the Pi (full of services) to boot?
I wouldn't read too much into the 3% uplift on the kernel (or the 0.4% in userspace). In my world this is statistical noise.
No SDRAM_BANKLOW=1. 4 threads, increasing the total memory size tenfold to 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=4 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 72 ( 6.96 per second)
73728.00 MiB transferred (7131.56 MiB/sec)
General statistics:
total time: 10.3370s
total number of events: 72
Latency (ms):
min: 408.45
avg: 571.50
max: 607.71
95th percentile: 601.29
sum: 41148.34
Threads fairness:
events (avg/stddev): 18.0000/1.00
execution time (avg/stddev): 10.2871/0.05

No SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=1 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 100 ( 12.00 per second)
102400.00 MiB transferred (12283.80 MiB/sec)
General statistics:
total time: 8.3350s
total number of events: 100
Latency (ms):
min: 82.68
avg: 83.35
max: 85.03
95th percentile: 84.47
sum: 8334.53
Threads fairness:
events (avg/stddev): 100.0000/0.00
execution time (avg/stddev): 8.3345/0.00

With SDRAM_BANKLOW=1. 4 threads, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=4 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 86 ( 8.39 per second)
88064.00 MiB transferred (8593.18 MiB/sec)
General statistics:
total time: 10.2470s
total number of events: 86
Latency (ms):
min: 265.19
avg: 472.53
max: 719.95
95th percentile: 612.21
sum: 40637.52
Threads fairness:
events (avg/stddev): 21.5000/0.50
execution time (avg/stddev): 10.1594/0.06
With SDRAM_BANKLOW=1. 1 thread, total memory size 100G, 1G blocks:
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=100G --threads=1 --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 84 ( 8.38 per second)
86016.00 MiB transferred (8581.84 MiB/sec)
General statistics:
total time: 10.0216s
total number of events: 84
Latency (ms):
min: 113.32
avg: 119.30
max: 122.29
95th percentile: 118.92
sum: 10021.14
Threads fairness:
events (avg/stddev): 84.0000/0.00
execution time (avg/stddev): 10.0211/0.00
Summary (write bandwidth, MiB/sec):
Code: Select all
| NUMA / Threads | 1 | 2 | 4 |
|----------------|-----------|-----------|-----------|
| Off | 12,283.80 | 9,162.35 | 7,131.56 |
| On | 8,581.84 | 9,208.26 | 8,593.18 |
| Gain/Loss | -30.14% | 0.50% | 20.50% |
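As a sanity check, the Gain/Loss row follows directly from the raw MiB/sec figures in the table above:

```python
# Recompute the Gain/Loss row of the summary table from the raw bandwidth figures.
numa_off = {1: 12283.80, 2: 9162.35, 4: 7131.56}  # MiB/sec, NUMA off
numa_on = {1: 8581.84, 2: 9208.26, 4: 8593.18}    # MiB/sec, NUMA on

for threads in sorted(numa_off):
    change = (numa_on[threads] / numa_off[threads] - 1) * 100
    print(f"{threads} thread(s): {change:+.2f}%")
```

which reproduces the -30.14% / +0.50% / +20.50% row.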
NUMA enabled:
Code: Select all
Startup finished in 3.482s (kernel) + 10.074s (userspace) = 13.556s
multi-user.target reached after 10.052s in userspace.
NUMA disabled:
Code: Select all
Startup finished in 3.593s (kernel) + 10.134s (userspace) = 13.728s
multi-user.target reached after 10.099s in userspace
- geerlingguy
- Posts: 585
- Joined: Sun Feb 15, 2015 3:43 am
Re: NUMA Testing
I've been running a full suite of tests, from ollama/LLMs to pts linux-kernel-recompile, 4K and 1080p x264 transcoding, HPL, and more, and all the tests show a significant boost (between 12 and 30%) when running with the new settings.
So far I've yet to see any testing result in a regression, except for memory bandwidth tests (I used tinymembench). Would be interesting to see exactly why that's the case!
Re: NUMA Testing
I like your table. My interpretation is that the NUMA allocator is working as expected: better when all cores are active, but worse for single-threaded code.

bytter wrote: ↑Sat Dec 07, 2024 12:24 pm
I am not so concerned about the overall performance improvement of the Pi 5, as I am trying to understand the memory impact here. I have my own interpretation of this data, but would love to hear your thoughts.
Code: Select all
| NUMA / Threads | 1 | 2 | 4 |
|----------------|-----------|-----------|-----------|
| Off | 12,283.80 | 9,162.35 | 7,131.56 |
| On | 8,581.84 | 9,208.26 | 8,593.18 |
| Gain/Loss | -30.14% | 0.50% | 20.50% |
Re: NUMA Testing
What do all of those tests share in common (in fact, what are the typical uses for non-faked NUMA)? Could it be due to improved memory locality, thread affinity optimisation and inherently better multi-threaded scaling?

geerlingguy wrote: ↑Sat Dec 07, 2024 1:20 pm
I've been running a full suite of tests from ollama/LLMs, to pts linux-kernel-recompile, 4K and 1080p x264 transcoding, HPL, and more and all the tests show a significant boost (between 12-30%) when running with the new settings.
So far I've yet to see any testing result in a regression, except for memory bandwidth tests (I used tinymembench). Would be interesting to see exactly why that's the case!
Here's another hypothesis: look at my second column (2 threads). How do you explain that one? Could it be because there's low memory contention, and the overheads of fake NUMA balance out?
Sorry, I'm not trying to be dense here. My hypothesis is that there is some kind of workload that _benefits_ from this, but not all do. All of those applications are fair benchmarks; I am curious, though, about the memory access patterns they exhibit, to understand why memory bandwidth tests paint a different picture.
Nice YouTube channel, btw ;-)
Re: NUMA Testing
Is this going to be added to raspi-config (and Desktop Config) so I don't have to keep remembering what to edit? :oops:
Re: NUMA Testing
Is this way safe enough?
Re: NUMA Testing
The "latest" release of the bootloader EEPROM now has it enabled by default, so if you're happy using that, there's no need to remember anything.
https://github.com/raspberrypi/rpi-eepr ... e-notes.md . The -2711 release for the Pi 4 is the same.
Re: NUMA Testing
NUMA is a way of representing, to the OS, the non-uniformity of CPU<->RAM performance exhibited by more complex CPU/memory controller/RAM topologies. The aim is to increase the likelihood that the memory locations accessed by a particular process are as "near" as possible to the CPU core that process is executing on, reducing memory latency and increasing bandwidth.
The simplest example is a host with dual CPUs, each of which has associated memory controller/RAM, with an interconnect between the two CPUs.
Performance is better (and energy consumption potentially lower) if processes mostly access RAM physically attached to the CPU they are allocated to, rather than having to traverse the inter-CPU interconnect, which adds latency, costs power, and may be a bandwidth bottleneck in some scenarios.
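A toy model of that effect (the latency numbers below are made up for illustration, not Pi 5 or any real hardware measurements): effective latency is simply the access-probability-weighted mix of near and far memory.

```python
# Toy model of NUMA locality: effective latency is the probability-weighted
# mix of near (local node) and far (remote node) memory access times.
def effective_latency_ns(local_ns: float, remote_ns: float, local_fraction: float) -> float:
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# Made-up numbers for a two-socket box: 80 ns local, 140 ns over the interconnect.
for frac in (1.0, 0.9, 0.5):
    print(f"{frac:.0%} local accesses -> {effective_latency_ns(80, 140, frac):.0f} ns average")
```

Even a modest fraction of remote accesses moves the average noticeably, which is why the scheduler tries to keep a process's pages on its own node.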
Re: NUMA Testing
Did some tests with Yamagi Quake 2 (OpenGL ES3) and vkQuake3 (Vulkan) on my Pi 5 8GB. Fully updated installation of Pi OS:
These are some pretty big regressions (and the first I've seen with NUMA enabled, except for the synthetic write benchmark mentioned earlier). The only change between these tests is that I removed the "SDRAM_BANKLOW=1" row from the EEPROM config for the NUMA off tests.
Code: Select all
Linux raspberrypi 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux
Code: Select all
BOOTLOADER: up to date
CURRENT: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
LATEST: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
Code: Select all
                                      NUMA off   NUMA on
Yamagi Quake 2 (1080p, GLES3, 16xAF)     102.0      95.0
vkQuake3 (1080p, Vulkan)                 168.5     148.0
EDIT: Just an observation: At least for the Quake 2 results, the results seemed more consistent with NUMA on. I got exactly the same result on the first run as on the fifth and the run-to-run variance was very low. With NUMA off the result tended to increase with the first few runs and the difference between the first and fifth run was ~3 %.
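Run-to-run variance like that is easy to quantify. A sketch with illustrative FPS values (shaped like the behaviour described above, i.e. stable with NUMA on and a ~3% warm-up drift with NUMA off; they are not the actual measurements):

```python
import statistics

# Quantifying run-to-run variance. The FPS values are illustrative only,
# not the actual benchmark measurements.
runs_numa_on = [95.0, 95.0, 95.0, 95.0, 95.0]
runs_numa_off = [99.0, 100.5, 101.5, 102.0, 102.0]

for label, runs in (("NUMA on ", runs_numa_on), ("NUMA off", runs_numa_off)):
    spread_pct = (max(runs) - min(runs)) / min(runs) * 100
    print(f"{label}: mean={statistics.mean(runs):6.2f}  "
          f"stdev={statistics.stdev(runs):.2f}  spread={spread_pct:.1f}%")
```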
Re: NUMA Testing
That's that sorted then :-)

xeny wrote: ↑Sat Dec 07, 2024 9:17 pm
The "latest" release of the bootloader EEPROM now have it enabled by default, so if you're happy using that, no need to remember anything.
https://github.com/raspberrypi/rpi-eepr ... e-notes.md . -2711 for the Pi 4 is the same.
Well, at least when "latest" becomes default on GitHub, so it then becomes the latest in rpi-eeprom, or something like that.
I never did get the apparent mismatch in naming with these.
I think it used to be worse some years back.
- Attachments
- Screenshot 2024-12-08 145628.jpg
- DanielLi64
- Posts: 4
- Joined: Sat Nov 30, 2024 10:02 pm
Re: NUMA Testing
Setting SDRAM_BANKLOW=1 is not ideal, depending on what you're using your Pi for. It would be very interesting to see your Quake benchmarks with SDRAM_BANKLOW=2 and SDRAM_BANKLOW=3 (even faster writes).

Mikael wrote: ↑Sun Dec 08, 2024 10:08 am
Did some tests with Yamagi Quake 2 (OpenGL ES3) and vkQuake3 (Vulkan) on my Pi 5 8GB. Fully updated installation of Pi OS:
Code: Select all
Linux raspberrypi 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux
Code: Select all
BOOTLOADER: up to date
CURRENT: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
LATEST: Tue 12 Nov 16:10:44 UTC 2024 (1731427844)
These are some pretty big regressions (and the first I've seen with NUMA enabled, except for the synthetic write benchmark mentioned earlier). The only change between these tests is that I removed the "SDRAM_BANKLOW=1" row from the EEPROM config for the NUMA off tests.
Code: Select all
                                      NUMA off   NUMA on
Yamagi Quake 2 (1080p, GLES3, 16xAF)     102.0      95.0
vkQuake3 (1080p, Vulkan)                 168.5     148.0
EDIT: Just an observation: At least for the Quake 2 results, the results seemed more consistent with NUMA on. I got exactly the same result on the first run as on the fifth and the run-to-run variance was very low. With NUMA off the result tended to increase with the first few runs and the difference between the first and fifth run was ~3 %.
Some numbers:
tinymembench v0.4.10
--- SDRAM_BANKLOW=1 ---
standard memcpy : 5805.6 MB/s (0.6%)
standard memset : 9981.5 MB/s (1.0%)
--- SDRAM_BANKLOW=2 ---
standard memcpy : 6560.5 MB/s (0.7%)
standard memset : 16119.0 MB/s (1.8%)
glmark2 2023.01
--- SDRAM_BANKLOW=1 ---
glmark2 Score: 1990
--- SDRAM_BANKLOW=2 ---
glmark2 Score: 2270
Geekbench 6
--- SDRAM_BANKLOW=1 ---
https://browser.geekbench.com/v6/cpu/9313128
Single=1108
Multi=2402
--- SDRAM_BANKLOW=2 ---
https://browser.geekbench.com/v6/cpu/9312651
Single = 1102
Multi = 2345
And the lot:
=======================================================
SDRAM_BANKLOW=2
=======================================================
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 6288.8 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6270.4 MB/s (0.3%)
C copy backwards (64 byte blocks) : 6271.8 MB/s (0.3%)
C copy : 6598.0 MB/s (0.7%)
C copy prefetched (32 bytes step) : 6617.8 MB/s (0.7%)
C copy prefetched (64 bytes step) : 6615.7 MB/s (0.7%)
C 2-pass copy : 5818.3 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 6642.0 MB/s (0.9%)
C 2-pass copy prefetched (64 bytes step) : 6650.3 MB/s (1.4%)
C fill : 16067.0 MB/s (1.6%)
C fill (shuffle within 16 byte blocks) : 16068.3 MB/s (1.5%)
C fill (shuffle within 32 byte blocks) : 16157.3 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 16138.8 MB/s (1.6%)
NEON 64x2 COPY : 6537.3 MB/s (0.5%)
NEON 64x2x4 COPY : 6544.9 MB/s (0.5%)
NEON 64x1x4_x2 COPY : 6548.1 MB/s (0.6%)
NEON 64x2 COPY prefetch x2 : 6291.5 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 6453.7 MB/s (0.5%)
NEON 64x2 COPY prefetch x1 : 6206.8 MB/s (0.4%)
NEON 64x2x4 COPY prefetch x1 : 6451.2 MB/s (0.5%)
---
standard memcpy : 6560.5 MB/s (0.7%)
standard memset : 16119.0 MB/s (1.8%)
---
NEON LDP/STP copy : 6551.7 MB/s (0.6%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 6588.2 MB/s (0.5%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 6587.2 MB/s (0.7%)
NEON LDP/STP copy pldl1keep (32 bytes step) : 6578.2 MB/s (0.7%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 6578.3 MB/s (0.6%)
NEON LD1/ST1 copy : 6545.9 MB/s (0.7%)
NEON STP fill : 16130.8 MB/s (1.9%)
NEON STNP fill : 16088.6 MB/s (1.7%)
ARM LDP/STP copy : 6544.1 MB/s (0.8%)
ARM STP fill : 16077.2 MB/s (1.8%)
ARM STNP fill : 16101.3 MB/s (1.9%)
==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================
NEON LDP/STP copy (from framebuffer) : 1814.6 MB/s (0.4%)
NEON LDP/STP 2-pass copy (from framebuffer) : 1628.9 MB/s (0.7%)
NEON LD1/ST1 copy (from framebuffer) : 1823.5 MB/s (0.4%)
NEON LD1/ST1 2-pass copy (from framebuffer) : 1633.5 MB/s (0.6%)
ARM LDP/STP copy (from framebuffer) : 1820.1 MB/s (0.4%)
ARM LDP/STP 2-pass copy (from framebuffer) : 1631.2 MB/s (0.6%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 0.8 ns / 1.1 ns
262144 : 1.2 ns / 1.5 ns
524288 : 2.1 ns / 2.7 ns
1048576 : 6.5 ns / 8.7 ns
2097152 : 10.1 ns / 12.5 ns
4194304 : 45.0 ns / 67.9 ns
8388608 : 71.2 ns / 95.2 ns
16777216 : 84.5 ns / 104.2 ns
33554432 : 92.8 ns / 108.7 ns
67108864 : 97.5 ns / 111.4 ns
=======================================================
glmark2 2023.01
=======================================================
OpenGL Information
GL_VENDOR: Broadcom
GL_RENDERER: V3D 7.1
GL_VERSION: 3.1 Mesa 23.2.1-1~bpo12+rpt3
Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
Surface Size: 800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 2821 FrameTime: 0.355 ms
[build] use-vbo=true: FPS: 3790 FrameTime: 0.264 ms
[texture] texture-filter=nearest: FPS: 3002 FrameTime: 0.333 ms
[texture] texture-filter=linear: FPS: 2980 FrameTime: 0.336 ms
[texture] texture-filter=mipmap: FPS: 3096 FrameTime: 0.323 ms
[shading] shading=gouraud: FPS: 3121 FrameTime: 0.320 ms
[shading] shading=blinn-phong-inf: FPS: 2961 FrameTime: 0.338 ms
[shading] shading=phong: FPS: 2589 FrameTime: 0.386 ms
[shading] shading=cel: FPS: 2534 FrameTime: 0.395 ms
[bump] bump-render=high-poly: FPS: 1626 FrameTime: 0.615 ms
[bump] bump-render=normals: FPS: 3479 FrameTime: 0.288 ms
[bump] bump-render=height: FPS: 3266 FrameTime: 0.306 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1438 FrameTime: 0.696 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 624 FrameTime: 1.603 ms
[pulsar] light=false:quads=5:texture=false: FPS: 3648 FrameTime: 0.274 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 328 FrameTime: 3.055 ms
[desktop] effect=shadow:windows=4: FPS: 1196 FrameTime: 0.836 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 585 FrameTime: 1.712 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 524 FrameTime: 1.909 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 627 FrameTime: 1.595 ms
[ideas] speed=duration: FPS: 2734 FrameTime: 0.366 ms
[jellyfish] <default>: FPS: 1588 FrameTime: 0.630 ms
[terrain] <default>: FPS: 91 FrameTime: 11.076 ms
[shadow] <default>: FPS: 175 FrameTime: 5.728 ms
[refract] <default>: FPS: 85 FrameTime: 11.845 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 4118 FrameTime: 0.243 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2971 FrameTime: 0.337 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 4011 FrameTime: 0.249 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 3487 FrameTime: 0.287 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 2452 FrameTime: 0.408 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 3378 FrameTime: 0.296 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 3347 FrameTime: 0.299 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 2287 FrameTime: 0.437 ms
=======================================================
glmark2 Score: 2270
=======================================================
Geekbench 6
https://browser.geekbench.com/v6/cpu/9312651
Single = 1102
Multi = 2345
=======================================================
SDRAM_BANKLOW=1
=======================================================
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 5953.7 MB/s (0.6%)
C copy backwards (32 byte blocks) : 6040.8 MB/s (0.9%)
C copy backwards (64 byte blocks) : 5993.7 MB/s (0.3%)
C copy : 5844.0 MB/s (0.4%)
C copy prefetched (32 bytes step) : 5845.7 MB/s (0.4%)
C copy prefetched (64 bytes step) : 5838.4 MB/s (0.5%)
C 2-pass copy : 5779.0 MB/s (0.4%)
C 2-pass copy prefetched (32 bytes step) : 5835.9 MB/s (0.8%)
C 2-pass copy prefetched (64 bytes step) : 5839.4 MB/s (0.8%)
C fill : 9959.1 MB/s (0.9%)
C fill (shuffle within 16 byte blocks) : 9982.9 MB/s (1.0%)
C fill (shuffle within 32 byte blocks) : 9977.9 MB/s (0.9%)
C fill (shuffle within 64 byte blocks) : 9965.7 MB/s (0.9%)
NEON 64x2 COPY : 5765.3 MB/s (0.4%)
NEON 64x2x4 COPY : 5793.1 MB/s (0.5%)
NEON 64x1x4_x2 COPY : 5787.9 MB/s (0.6%)
NEON 64x2 COPY prefetch x2 : 5676.1 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 5782.0 MB/s (0.5%)
NEON 64x2 COPY prefetch x1 : 5641.1 MB/s (0.5%)
NEON 64x2x4 COPY prefetch x1 : 5781.3 MB/s (0.5%)
---
standard memcpy : 5805.6 MB/s (0.6%)
standard memset : 9981.5 MB/s (1.0%)
---
NEON LDP/STP copy : 5788.9 MB/s (0.6%)
NEON LDP/STP copy pldl2strm (32 bytes step) : 5907.8 MB/s (0.5%)
NEON LDP/STP copy pldl2strm (64 bytes step) : 5907.5 MB/s (0.5%)
NEON LDP/STP copy pldl1keep (32 bytes step) : 5897.8 MB/s (0.5%)
NEON LDP/STP copy pldl1keep (64 bytes step) : 5898.5 MB/s (0.5%)
NEON LD1/ST1 copy : 5789.9 MB/s (0.5%)
NEON STP fill : 9978.4 MB/s (1.1%)
NEON STNP fill : 9983.6 MB/s (0.9%)
ARM LDP/STP copy : 5779.9 MB/s (0.5%)
ARM STP fill : 9972.8 MB/s (1.0%)
ARM STNP fill : 9953.1 MB/s (1.0%)
==========================================================================
== Framebuffer read tests. ==
== ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled. ==
== Writes to such framebuffers are quite fast, but reads are much ==
== slower and very sensitive to the alignment and the selection of ==
== CPU instructions which are used for accessing memory. ==
== ==
== Many x86 systems allocate the framebuffer in the GPU memory, ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover, ==
== PCI-E is asymmetric and handles reads a lot worse than writes. ==
== ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall ==
== performance improvement. For example, the xf86-video-fbturbo DDX ==
== uses this trick. ==
==========================================================================
NEON LDP/STP copy (from framebuffer) : 1796.6 MB/s (0.5%)
NEON LDP/STP 2-pass copy (from framebuffer) : 1614.1 MB/s (0.6%)
NEON LD1/ST1 copy (from framebuffer) : 1807.0 MB/s (0.6%)
NEON LD1/ST1 2-pass copy (from framebuffer) : 1625.4 MB/s (0.7%)
ARM LDP/STP copy (from framebuffer) : 1801.5 MB/s (0.5%)
ARM LDP/STP 2-pass copy (from framebuffer) : 1621.6 MB/s (0.7%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 0.8 ns / 1.1 ns
262144 : 1.2 ns / 1.5 ns
524288 : 1.5 ns / 1.7 ns
1048576 : 6.5 ns / 8.7 ns
2097152 : 9.5 ns / 11.3 ns
4194304 : 45.0 ns / 67.9 ns
8388608 : 71.4 ns / 95.2 ns
16777216 : 84.6 ns / 104.2 ns
33554432 : 92.9 ns / 108.7 ns
67108864 : 97.6 ns / 111.4 ns
=======================================================
glmark2 2023.01
=======================================================
OpenGL Information
GL_VENDOR: Broadcom
GL_RENDERER: V3D 7.1
GL_VERSION: 3.1 Mesa 23.2.1-1~bpo12+rpt3
Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
Surface Size: 800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 2443 FrameTime: 0.409 ms
[build] use-vbo=true: FPS: 3296 FrameTime: 0.303 ms
[texture] texture-filter=nearest: FPS: 2627 FrameTime: 0.381 ms
[texture] texture-filter=linear: FPS: 2594 FrameTime: 0.386 ms
[texture] texture-filter=mipmap: FPS: 2695 FrameTime: 0.371 ms
[shading] shading=gouraud: FPS: 2748 FrameTime: 0.364 ms
[shading] shading=blinn-phong-inf: FPS: 2614 FrameTime: 0.383 ms
[shading] shading=phong: FPS: 2269 FrameTime: 0.441 ms
[shading] shading=cel: FPS: 2237 FrameTime: 0.447 ms
[bump] bump-render=high-poly: FPS: 1402 FrameTime: 0.714 ms
[bump] bump-render=normals: FPS: 3034 FrameTime: 0.330 ms
[bump] bump-render=height: FPS: 2832 FrameTime: 0.353 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1260 FrameTime: 0.794 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 551 FrameTime: 1.817 ms
[pulsar] light=false:quads=5:texture=false: FPS: 3196 FrameTime: 0.313 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 288 FrameTime: 3.483 ms
[desktop] effect=shadow:windows=4: FPS: 1038 FrameTime: 0.964 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 481 FrameTime: 2.079 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 498 FrameTime: 2.010 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 533 FrameTime: 1.878 ms
[ideas] speed=duration: FPS: 2439 FrameTime: 0.410 ms
[jellyfish] <default>: FPS: 1413 FrameTime: 0.708 ms
[terrain] <default>: FPS: 81 FrameTime: 12.423 ms
[shadow] <default>: FPS: 156 FrameTime: 6.443 ms
[refract] <default>: FPS: 78 FrameTime: 12.884 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 3569 FrameTime: 0.280 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2610 FrameTime: 0.383 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 3529 FrameTime: 0.283 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 3068 FrameTime: 0.326 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 2157 FrameTime: 0.464 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 2973 FrameTime: 0.336 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 2987 FrameTime: 0.335 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 2027 FrameTime: 0.494 ms
=======================================================
glmark2 Score: 1990
=======================================================
https://browser.geekbench.com/v6/cpu/9313128
Single=1108
Multi=2402
Re: NUMA Testing
There seems to be an issue with both the official 6.12 kernel (via rpi-update next) and my self-compiled 6.13. With 6.6, I could see the usual 8 memory regions; with 6.12/6.13 I get this instead:
Code: Select all
$ dmesg | grep NUMA
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x00000001ffffffff]
[ 0.000000] NUMA: Initialized distance table, cnt=8
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
[ 0.294843] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.432681] pci_bus 0001:00: Unknown NUMA node; performance will be reduced

And yes, I do have SDRAM_BANKLOW=1, giving me numa_policy=interleave numa=fake=8 on my Pi5.
Thanks for your help!
P.S.: Official kernel 6.6.62 gives me
Code: Select all
$ dmesg | grep NUMA
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: NODE_DATA [mem 0x3fbfd2c0-0x3fbfffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x7fffd2c0-0x7fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xbfffd2c0-0xbfffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xffffd2c0-0xffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x13fffd2c0-0x13fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x17fffd2c0-0x17fffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x1bfffd2c0-0x1bfffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x1ffb892c0-0x1ffb8bfff]
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
[ 0.229644] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.381417] pci_bus 0001:00: Unknown NUMA node; performance will be reduced
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: NUMA Testing
The implementation of fake NUMA on 6.12 (and later) is different, hence the difference in dmesg output.
The important line is:
Code: Select all
[ 0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'
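Checking for that line is easy to script. A sketch that just looks for it in dmesg output (the sample string below is the line quoted above; feed the function the output of `dmesg` on a real system):

```python
import re

# Detect the line that confirms the fake-NUMA interleave policy is active.
def fake_numa_interleave_active(dmesg_text: str) -> bool:
    return re.search(
        r"mempolicy: NUMA default policy overridden to 'interleave:", dmesg_text
    ) is not None

sample = "[    0.000000] mempolicy: NUMA default policy overridden to 'interleave:0-7'"
print(fake_numa_interleave_active(sample))  # True
```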
Re: NUMA Testing
I think that a trip into raspi-config, option 6 (Bootloader Version), and specifying "latest" there will result in rpi-eeprom installing the latest version.
Re: NUMA Testing
Here's a more systematic analysis of the memory bandwidth under different parameters:
- bandwidth-1-16M-RPI5_numa_corrected_threads-1.png
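For anyone wanting to reproduce a sweep like this, here is a sketch. It assumes sysbench is installed; the regex matches the "(NNNN.NN MiB/sec)" figure that appears in the sysbench output pasted earlier in the thread, and run_sweep() is only illustrative.

```python
import re
import subprocess

# Sweep sysbench memory-write bandwidth over thread counts.
BW_RE = re.compile(r"\(([\d.]+) MiB/sec\)")

def parse_bandwidth(output: str) -> float:
    match = BW_RE.search(output)
    if match is None:
        raise ValueError("no MiB/sec figure in sysbench output")
    return float(match.group(1))

def run_sweep(threads=(1, 2, 4)):
    for t in threads:
        out = subprocess.run(
            ["sysbench", "memory", "--memory-block-size=1G",
             "--memory-total-size=100G", f"--threads={t}",
             "--memory-oper=write", "run"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"{t} threads: {parse_bandwidth(out):.2f} MiB/sec")

# Parsing demo against a line from the thread (run_sweep() needs a real Pi):
print(parse_bandwidth("73728.00 MiB transferred (7131.56 MiB/sec)"))  # 7131.56
```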
Re: NUMA Testing
Active cooling. I am assuming it's adequate, though I would need to control for thermal throttling.
Re: NUMA Testing
For me the data point that looks surprising is the 4M random-access write. That's the only one where 2 cores seemed to experience a noticeably greater performance increase with NUMA than 4 cores.
Have you set the CPU governor to performance?
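A quick way to check the governor (a sketch; the sysfs path below is the usual cpufreq location, which may be absent on some kernels, hence the None fallback):

```python
from pathlib import Path

# Read the current cpufreq governor. The path is parameterised so it can be
# pointed at any CPU's cpufreq node; returns None where cpufreq is absent.
def read_governor(path: str = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"):
    p = Path(path)
    return p.read_text().strip() if p.exists() else None

print(read_governor())
```

Setting it to performance is typically `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`, or `cpupower frequency-set -g performance` where cpupower is installed.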
Re: NUMA Testing
I'm now using the fake NUMA patch, having updated to the latest EEPROM firmware and kernel. I'm seeing these two messages in dmesg (after the 'interleave 0-7' one, which confirms fake NUMA has done its thing):
What do these mean - is there actually a performance hit, or is it just assumed there will be?
Code: Select all
[ 0.268246] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 0.399843] pci_bus 0001:00: Unknown NUMA node; performance will be reduced