143 posts
dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

NUMA Testing

Tue Oct 22, 2024 5:36 pm

There has been a fair amount of work going on over the last year to improve sdram related performance on Pi4 and Pi5.
A lot of the testing and results discovered are described in this epic thread.

There were observations that 8GB Pis were (for specific tests) slower than 4GB.
This is largely due to self refresh of sdram costing some of your theoretical bandwidth, and the standard (JEDEC) timings require more time to be spent in refresh for larger sdram sizes.

While investigating this, I discovered that we can do better. The sdram refresh interval currently uses the default data sheet settings.
You can actually monitor the temperature of the sdram, and it reports whether refreshing at half or a quarter of the rate is permitted.
That allows the overhead due to refresh to be reduced to a half or a quarter, which does improve benchmark results.

We got in contact with Micron, and there was good news. They have said they actually test their 8GB sdram with the 4GB refresh rate timing (rather than the slower JEDEC timings), so it would be safe to run the 8GB parts with 4GB timing.

I've also been iterating over dozens of sdram and arm (largely cache and prefetch) related settings.
Typically I capture benchmark scores each night with tweaked settings, and any positive results get added to the list for further testing.
If results are positive across a range of tests (e.g. geekbench, jetstream, linux kernel compile, memcpy/memset) we push it out (typically bootloader or firmware).
There have been a number of small boosts over the last year. Typically 1% here and there, but they add up.

There is one other significant issue. When multiple arm cores (and hardware blocks, like display, camera, pcie etc) access sdram, they compete.
From here:
This means that if a page in a particular bank is already open, you cannot open another page within the same bank without first closing the currently open page. Once a page is open, it can be accessed multiple times without the need to reopen it for subsequent operations.
In the worst case, two sdram clients accessing different pages of the same bank may cause repeated closing and opening of the pages, harming the usable bandwidth you can achieve.
Which pages belong to a bank depends on the physical address lines of the buffers used. We have found that it is pretty common for buffers to be allocated in pathologically bad ways.

NUMA allows us to have more control over this. We can split our sdram into, say, 8 NUMA regions, and configure the kernel to interleave the allocations between the regions.
In addition the sdram controller allows the address bits used for segmenting the banks to be reconfigured (which we've exposed through eeprom config SDRAM_BANKLOW).

Doing this can result in significant performance boosts for tasks that are constrained by sdram bandwidth (typically multi-core tasks).

[edit: apt now contains the NUMA-supporting bootloader/firmware/kernel, so rpi-update is not required]

TLDR: If you want to test the latest experimental improvements, then run:

Code: Select all

sudo apt update && sudo apt full-upgrade
then edit your bootloader config:

Code: Select all

sudo rpi-eeprom-config -e
adding this line for pi5:

Code: Select all

SDRAM_BANKLOW=1
or this line for pi4:

Code: Select all

SDRAM_BANKLOW=3
and reboot. You should find "/proc/cmdline" contains the line "numa=fake=<n>".
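
One way to check this from a shell is sketched below. The helper name numa_fake_value is made up for illustration (it is not part of any Raspberry Pi tool); it only parses a kernel command line string and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print <n> from the first numa=fake=<n> token in a
# kernel command line, or print nothing if the option is absent.
numa_fake_value() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^numa=fake=\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Usage: inspect the command line the running kernel was booted with.
n=$(numa_fake_value "$(cat /proc/cmdline 2>/dev/null)")
if [ -n "$n" ]; then
    echo "numa=fake is set: $n nodes"
else
    echo "numa=fake is not present"
fi
```

If it reports that the option is not present, the bootloader, firmware and kernel versions are the first things to check.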

Your Pi may be faster. Run some tests and report any improvements, or if you see any regressions.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Tue Oct 22, 2024 5:39 pm

Pi4 has iommus on the arm cores and v3d. Pi5 also has iommus on display, camera, and hevc.
That allows the more aggressive banklow setting on Pi5 which gives greater performance benefits.

Pi5 also has faster sdram, better access to sdram (i.e. wider/faster internal buses), so generally the improvements with NUMA are greater.

But I get about 7% improvement on multicore geekbench on a Pi4 with two numa regions.
With banklow=1 it rises to about 10%, but you may find issues with display, camera and hevc.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Tue Oct 22, 2024 5:41 pm

On Pi5, reverting bootloader/firmware/kernel to start of year, I get:
807 / 1648 (https://browser.geekbench.com/v6/cpu/7748942)

With this test firmware and NUMA enabled I get:

902 / 2184 (https://browser.geekbench.com/v6/cpu/7749627)

For an 11.8% single core and 32.5% multi core improvement.

(not all of this gain is from NUMA as there have been numerous other improvements, but the NUMA part is significant).
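
For anyone comparing their own scores, the percentages are just the relative change between the two runs. A small sketch (the pct_gain name is made up here; awk is used for the floating point division):

```shell
#!/bin/sh
# Hypothetical helper: relative gain of a new score over an old one, in percent.
pct_gain() {
    # $1 = old score, $2 = new score
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

pct_gain 807 902     # single core scores from the two runs above
pct_gain 1648 2184   # multi core scores from the two runs above
```

Run on the Geekbench scores above, this prints 11.8 and 32.5.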

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Wed Oct 23, 2024 3:25 pm

These are significant improvements! That's more than what we get when we upgrade from one generation of iPhone to the next :D

I have applied the eeprom update and kernel 6.6.58 on my Pi 4 and Pi 5, with 4 numa regions in both cases. The good news is that it works just fine for now; I'll let you know if I experience any crashiness.

What is the rationale for choosing the number of regions? The more the merrier limited by how much memory a single process might want to allocate?

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Wed Oct 23, 2024 3:40 pm

fguerraz wrote:
Wed Oct 23, 2024 3:25 pm
What is the rationale for choosing the number of regions? The more the merrier limited by how much memory a single process might want to allocate?
It depends if sdram is single or dual rank, and what we set BANKLOW to.

We basically want log2 of the number of numa nodes to match the number of high bank bits, so that a contiguous allocation from a kernel using numa will span the banks.
The number of high bank bits is (3 - BANKLOW), with an extra one if using dual rank.

I think currently, 1GB and 2GB sdram parts are single rank (so get 4 numa regions with this scheme). 4GB and 8GB are dual rank (so get 8).
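
As a rough sketch of that rule (one reading of the description above, not code taken from the bootloader), the node count can be worked out like this. The function name and the 0/1 dual-rank argument are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: nodes = 2^(high bank bits), where
# high bank bits = (3 - BANKLOW), plus 1 if the sdram is dual rank.
numa_nodes() {
    banklow=$1      # value of SDRAM_BANKLOW
    dual_rank=$2    # 1 for dual rank sdram, 0 for single rank
    high_bits=$(( 3 - banklow + dual_rank ))
    echo $(( 1 << high_bits ))
}

numa_nodes 1 0   # BANKLOW=1, single rank
numa_nodes 1 1   # BANKLOW=1, dual rank
numa_nodes 3 1   # BANKLOW=3, dual rank
```

With BANKLOW=1 this gives 4 and 8, matching the numbers above; with BANKLOW=3 and dual rank it gives 2.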

The bootloader will add an "optimal" numa=fake=<n> option if "numa_policy=interleave" is present and "numa=fake=<n>" is absent, as it knows the hardware configuration of the sdram.

You are free to specify numa=fake=<n> yourself in cmdline.txt, and if you find workloads that run better with your setting, compared to bootloader's, then that would be interesting information.
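
If you do want to try your own value, the option just needs to end up on the single line in cmdline.txt. A sketch of building that line follows; the set_numa_fake name is hypothetical, and it only prints the new line rather than editing any file.

```shell
#!/bin/sh
# Hypothetical helper: return a kernel command line with numa=fake=<n> set.
set_numa_fake() {
    # $1 = existing one-line kernel command line, $2 = desired node count.
    # Drops any existing numa=fake=<n> token and appends the new one.
    printf '%s\n' "$1" | awk -v n="$2" '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i !~ /^numa=fake=/)
                out = out (out == "" ? "" : " ") $i
        print out (out == "" ? "" : " ") "numa=fake=" n
    }'
}

# Example with a made-up command line; substitute the contents of your own file.
set_numa_fake "console=tty1 rootfstype=ext4 rootwait" 4
```

The printed line can then be written back to cmdline.txt by hand, keeping everything on one line.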

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Wed Oct 23, 2024 6:22 pm

dom wrote:
Wed Oct 23, 2024 3:40 pm
The bootloader will add an "optimal" numa=fake=<n> option if "numa_policy=interleave" is present and "numa=fake=<n>" is absent, as it knows the hardware configuration of the sdram.
Hmm, that's not what the commit message says:
The key setting numa=fake=<n> is not set here, so we will boot with a single
numa region and behaviour should be pretty much unchanged from before this PR.

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Wed Oct 23, 2024 6:44 pm

Tested with firmware 2024/10/21, no numa config if not specified on the kernel command line.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Wed Oct 23, 2024 7:05 pm

To be enabled, you need the updated kernel, the updated bootloader and the bootloader config setting SDRAM_BANKLOW=1 (for a Pi5).

Have you done that? Show me:

Code: Select all

vcgencmd bootloader_version
uname -a
rpi-eeprom-config
cat /proc/cmdline

Mikael
Posts: 127
Joined: Wed Feb 11, 2015 12:35 pm

Re: NUMA Testing

Wed Oct 23, 2024 8:03 pm

Ran some tests and things seem to be working well overall. Geekbench 6 is indeed faster than ever before. So is Google Octane v2. Passmark performs extremely well too, except for the "Memory Write" sub-test, which regresses from ~11500 MB/s to 8000-something. Sysbench memory read performance is also better than ever (very nice latency, both average and max), while the write test corroborates Passmark's result, showing a severe regression in write performance.

Below is a fresh result on my 8GB board. My oldest result on this same board is 11786 MB/s and 90 ms avg and 129 ms max.

Code: Select all

sysbench memory --memory-block-size=1G --memory-total-size=20G --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
 block size: 1048576KiB
 total size: 20480MiB
 operation: write
 scope: global
Initializing worker threads...
Threads started!
Total operations: 20 ( 8.08 per second)
20480.00 MiB transferred (8270.73 MiB/sec)
General statistics:
 total time: 2.4750s
 total number of events: 20
Latency (ms):
 min: 122.58
 avg: 123.74
 max: 131.71
 95th percentile: 123.28
 sum: 2474.78
Threads fairness:
 events (avg/stddev): 20.0000/0.00
 execution time (avg/stddev): 2.4748/0.00

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Wed Oct 23, 2024 9:03 pm

dom wrote:
Wed Oct 23, 2024 7:05 pm

Have you done that? Show me:

Code: Select all

vcgencmd bootloader_version
uname -a
rpi-eeprom-config
cat /proc/cmdline

Code: Select all


root@pi4:~# vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729696108
capabilities 0x0000007f
root@pi4:~# uname -a
Linux pi4 6.6.58-v8-rpios #29 SMP Wed Oct 23 14:45:05 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@pi4:~# rpi-eeprom-config
[all]
BOOT_UART=0
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
SDRAM_BANKLOW=3
root@pi4:~# cat /proc/cmdline
coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0 numa_policy=interleave snd_bcm2835.enable_headphones=1 snd_bcm2835.enable_hdmi=1 snd_bcm2835.enable_hdmi=0 smsc95xx.macaddr=E4:5F:01:6F:56:10 vc_mem.mem_base=0x3eb00000 vc_mem.mem_size=0x3ff00000 console=ttyS0,115200 multipath=off dwc_otg.lpm_enable=0 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait fixrtc cpufreq.default_governor=performance cgroup_enable=memory numa=fake=8
Without numa=fake=8, numa doesn’t get enabled (easy to check with numactl)
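
For reference, "numactl --hardware" lists the nodes the kernel sees. If numactl is not installed, counting the node directories in sysfs should give the same number. A sketch, where the count_numa_nodes name and the optional directory argument are just for illustration:

```shell
#!/bin/sh
# Hypothetical helper: count nodeN directories under the NUMA sysfs location.
count_numa_nodes() {
    # $1 optionally overrides the sysfs location (handy for testing).
    dir=${1:-/sys/devices/system/node}
    count=0
    for d in "$dir"/node[0-9]*; do
        if [ -d "$d" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Prints 0 if the directory does not exist at all.
count_numa_nodes
```

A result of 0 or 1 suggests fake NUMA is not active; with numa=fake=8 it should print 8.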

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Wed Oct 23, 2024 9:08 pm

Mikael wrote:
Wed Oct 23, 2024 8:03 pm
Sysbench memory read performance is also better than ever (very nice latency, both average and max), while the write test corroborates Passmark's result, showing a severe regression in write performance.

Interesting. I’ll try to reproduce write performance tomorrow.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Wed Oct 23, 2024 10:08 pm

fguerraz wrote:
Wed Oct 23, 2024 9:03 pm
Without numa=fake=8, numa doesn’t get enabled (easy to check with numactl)
This is pi4? Have you updated firmware (start4.elf)?
vcgencmd version

It should insert numa=fake=2 automatically.

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Thu Oct 24, 2024 7:26 am

dom wrote:
Wed Oct 23, 2024 10:08 pm
This is pi4? Have you updated firmware (start4.elf)?
vcgencmd version

It should insert numa=fake=2 automatically.
Yes it's a pi4, with up to date firmware & start4.elf

Code: Select all

08a3a34b599be818b000610b14d076f0683acaceec94f48449210c1fe35038a2 firmware/start4.elf

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Thu Oct 24, 2024 11:53 am

fguerraz wrote:
Thu Oct 24, 2024 7:26 am
dom wrote:
Wed Oct 23, 2024 10:08 pm
This is pi4? Have you updated firmware (start4.elf)?
vcgencmd version

It should insert numa=fake=2 automatically.
Yes it's a pi4, with up to date firmware & start4.elf

Code: Select all

08a3a34b599be818b000610b14d076f0683acaceec94f48449210c1fe35038a2 firmware/start4.elf
"vcgencmd version" is a more definitive way to confirm the firmware you are running (but that sha256sum does match latest).

To confirm, are you saying with no changes made by yourself to cmdline.txt, you are not seeing "numa=fake=2" added automatically and you have to add it manually? That is not what I see on Pi4:

Code: Select all

pi@pios:~ $ vcgencmd version
Oct 17 2024 11:34:50 
Copyright (c) 2012 Broadcom
version b580e2acde306434ab07a913745a21451643ff55 (clean) (release) (start)
pi@pios:~ $ vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729770198
capabilities 0x0000007f
pi@pios:~ $ uname -a
Linux pios 6.6.58-v8+ #1809 SMP PREEMPT Wed Oct 23 11:53:53 BST 2024 aarch64 GNU/Linux
pi@pios:~ $ rpi-eeprom-config
BOOT_UART=0
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
SDRAM_BANKLOW=3
pi@pios:~ $ cat /boot/firmware/cmdline.txt 
console=serial0,115200 console=tty1 root=PARTUUID=21779776-02 rootfstype=ext4 fsck.repair=yes rootwait quiet splash plymouth.ignore-serial-consoles cfg80211.ieee80211_regdom=GB
pi@pios:~ $ cat /proc/cmdline
coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0 numa_policy=interleave snd_bcm2835.enable_hdmi=0 numa=fake=2 smsc95xx.macaddr=DC:A6:32:AD:2A:38 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000 console=ttyAMA0,115200 console=tty1 root=PARTUUID=21779776-02 rootfstype=ext4 fsck.repair=yes rootwait quiet splash plymouth.ignore-serial-consoles cfg80211.ieee80211_regdom=GB
Note "numa=fake=2" is in /proc/cmdline but not /boot/firmware/cmdline.txt.

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Thu Oct 24, 2024 12:03 pm

dom wrote:
Thu Oct 24, 2024 11:53 am
To confirm, are you saying with no changes made by yourself to cmdline.txt, you are not seeing "numa=fake=2" added automatically and you have to add it manually?
Note "numa=fake=2" is in /proc/cmdline but not /boot/firmware/cmdline.txt.
Yes, this is what I'm saying.

Code: Select all

$ vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729696108
capabilities 0x0000007f
$ vcgencmd version
Feb 29 2024 12:24:53 
Copyright (c) 2012 Broadcom
version f4e2138c2adc8f3a92a3a65939e458f11d7298ba (clean) (release) (start)
vcgencmd version is different, but I don't see why it would be involved (and it has not received any interesting changes recently anyways).

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Thu Oct 24, 2024 12:18 pm

"vcgencmd version" shows the version running of the start4.elf firmware.
That is the code that adds "numa=fake=2", and your version is out of date.
The instructions (rpi-update) should have updated that. Did you not run that?

fguerraz
Posts: 14
Joined: Mon Jun 03, 2024 9:15 am

Re: NUMA Testing

Thu Oct 24, 2024 12:31 pm

dom wrote:
Thu Oct 24, 2024 12:18 pm
"vcgencmd version" shows the version running of the start4.elf firmware.
That is the code that adds "numa=fake=2", and your version is out of date.
The instructions (rpi-update) should have updated that. Did you not run that?
I never said I was running on rpios :D only that I'm running the rpios kernel with the latest firmware. I'll update vcgencmd and report back (although I'm still confused what role it plays in the boot sequence).

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Thu Oct 24, 2024 12:50 pm

fguerraz wrote:
Thu Oct 24, 2024 12:31 pm
I never said I was running on rpios :D only that I'm running the rpios kernel with the latest firmware. I'll update vcgencmd and report back (although I'm still confused what role it plays in the boot sequence).
As I said:
dom wrote:
Thu Oct 24, 2024 12:18 pm
"vcgencmd version" shows the version running of the start4.elf firmware.
Don't update vcgencmd. Update start4.elf (correctly - because you are not running the version you think you are).

timrowledge
Posts: 1544
Joined: Mon Oct 29, 2012 8:12 pm

Re: NUMA Testing

Thu Oct 24, 2024 6:32 pm

Trying this out on a Pi 5 w/NVME and had no problems at all installing.

For Squeak Smalltalk benchmarks I see small but useful improvements; we use versions of the Benchmark Shootout suite
  • nbody
  • binary trees
  • chameneos redux
  • thread ring
Comparing a NUMA'd Pi 5 to a 'plain' Pi 5
  • nbody: 5.157 -> 5.095 = 1.2%
  • binary trees: 3.398 -> 3.096 = 8.9%
  • chameneos redux: 7.274 -> 5.239 = 28%
  • thread ring: 8.347 -> 7.783 = 6.7%
Not bad for some fiddling with ram timings.
Making Smalltalk on ARM since 1986; making your Scratch better since 2012

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Thu Oct 24, 2024 7:10 pm

dom wrote:
Wed Oct 23, 2024 9:08 pm
Mikael wrote:
Wed Oct 23, 2024 8:03 pm
Sysbench memory read performance is also better than ever (very nice latency, both average and max), while the write test corroborates Passmark's result, showing a severe regression in write performance.
Interesting. I’ll try to reproduce write performance tomorrow.
Yes I can reproduce.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Thu Oct 24, 2024 7:11 pm

timrowledge wrote:
Thu Oct 24, 2024 6:32 pm
For Squeak Smalltalk benchmarks I see small but useful improvements; we use versions of the Benchmark Shootout suite
  • nbody
  • binary trees
  • chameneos redux
  • thread ring
Comparing a NUMA'd Pi 5 to a 'plain' Pi 5
  • nbody: 5.157 -> 5.095 = 1.2%
  • binary trees: 3.398 -> 3.096 = 8.9%
  • chameneos redux: 7.274 -> 5.239 = 28%
  • thread ring: 8.347 -> 7.783 = 6.7%
Not bad for some fiddling with ram timings.
Good to hear.

ejolson
Posts: 13865
Joined: Tue Mar 18, 2014 11:47 am

Re: NUMA Testing

Thu Oct 24, 2024 7:34 pm

dom wrote:
Thu Oct 24, 2024 7:11 pm
timrowledge wrote:
Thu Oct 24, 2024 6:32 pm
For Squeak Smalltalk benchmarks I see small but useful improvements; we use versions of the Benchmark Shootout suite
  • nbody
  • binary trees
  • chameneos redux
  • thread ring
Comparing a NUMA'd Pi 5 to a 'plain' Pi 5
  • nbody: 5.157 -> 5.095 = 1.2%
  • binary trees: 3.398 -> 3.096 = 8.9%
  • chameneos redux: 7.274 -> 5.239 = 28%
  • thread ring: 8.347 -> 7.783 = 6.7%
Not bad for some fiddling with ram timings.
Good to hear.
I thought fake NUMA only changed the order in which pages were allocated to a process. Do the memory timings change as well?

Is it possible to isolate the timing changes from the changes in the memory allocator?

mby
Posts: 116
Joined: Sat Dec 15, 2018 3:05 pm

Re: NUMA Testing

Fri Oct 25, 2024 9:12 am

Thank you, @dom, awesome!

Could you apply the NUMA commits to 6.12 as well, please? – Thank you!

Best regards,
Michael

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 8472
Joined: Wed Aug 17, 2011 7:41 pm

Re: NUMA Testing

Fri Oct 25, 2024 9:32 am

ejolson wrote:
Thu Oct 24, 2024 7:34 pm
I thought fake NUMA only changed the order in which pages were allocated to a process. Do the memory timings change as well?

Is it possible to isolate the timing changes from the changes in the memory allocator?
There are no explicit memory timing changes in latest rpi-update version. There have been a few over the last few months.
The NUMA changes may improve the observed average latency, as the chance of hitting an open sdram page is higher.

xeny
Posts: 50
Joined: Thu May 16, 2024 10:36 am

Re: NUMA Testing

Fri Oct 25, 2024 9:48 am

Am I right in thinking that in "typical" workloads, reads significantly outnumber writes (I dimly remember a ratio of 3:1)? If that is the case, worse write performance as a trade for better read performance would be an overall win.
