- dom
- Raspberry Pi Engineer & Forum Moderator
Raspberry Pi Engineer & Forum Moderator - Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
NUMA Testing
There has been a fair amount of work going on over the last year to improve sdram related performance on Pi4 and Pi5.
A lot of the testing and results discovered are described in this epic thread.
There were observations that 8GB Pi's were (for specific tests) slower than 4GB.
This is largely due to self refresh of sdram costing some of your theoretical bandwidth, and the standard (jedec) timings require more time to be spent in refresh for larger sdram sizes.
While investigating this, I discovered that we can do better. The sdram refresh interval is currently using the default data sheet settings.
You can actually monitor the temperature of the sdram and it reports if refresh at half or quarter the rate can be done.
That allows the overhead due to refresh to be reduced to a half or a quarter, which does improve benchmark results.
We got in contact with Micron, and there was good news. They have said they actually test their 8GB sdram with the 4GB refresh rate timing (rather than the slower jedec timings), and so it would be safe to run the 8GB parts with 4GB timing.
I've also been iterating over dozens of sdram and arm (largely cache and prefetch) related settings.
Typically each night I capture benchmark scores with tweaked settings, and any positive results get added to the list for further testing.
If results are positive across a range of tests (e.g. geekbench, jetstream, linux kernel compile, memcpy/memset) we push it out (typically in the bootloader or firmware).
There have been a number of small boosts over the last year. Typically 1% here and there, but they add up.
There is one other significant issue. When multiple arm cores (and hardware blocks, like display, camera, pcie etc) access sdram, they compete.
From here:
This means that if a page in a particular bank is already open, you cannot open another page within the same bank without first closing the currently open page. Once a page is open, it can be accessed multiple times without the need to reopen it for subsequent operations.
In the worst case, two sdram clients accessing different pages of the same bank may cause repeated closing and opening of the pages, harming the usable bandwidth you can achieve.
Which pages belong to a bank depends on the physical address lines of the buffers used. We have found that it is pretty common for buffers to be allocated in pathologically bad ways.
NUMA allows us to have more control over this. We can split our sdram into, say, 8 NUMA regions, and configure the kernel to interleave the allocations between the regions.
In addition, the sdram controller allows the address bits used for segmenting the banks to be reconfigured (which we've exposed through the eeprom config setting SDRAM_BANKLOW).
Doing this can result in significant performance boosts for sdram bandwidth constrained (typically multi-core) tasks.
[edit: apt now contains numa supporting bootloader/firmware/kernel, so rpi-update is not required]
TLDR: If you want to test the latest experimental improvements, then run:
Code: Select all
sudo apt update && sudo apt full-upgrade
then edit your bootloader config:
Code: Select all
sudo rpi-eeprom-config -e
adding this line for pi5:
Code: Select all
SDRAM_BANKLOW=1
or this line for pi4:
Code: Select all
SDRAM_BANKLOW=3
and reboot. You should find "/proc/cmdline" contains "numa=fake=<n>".
Your Pi may be faster. Run some tests and report any improvements, or if you see any regressions.
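The bank conflict described in this post can be illustrated with a toy model. This is only a sketch, not a description of the real sdram controller: the geometry (8 banks, 4 KiB pages), the address-to-bank mapping and all function names are assumptions invented for the example. It counts how often a page has to be opened when two clients take turns reading their own buffer.

```python
# Toy model: each bank can hold exactly one open page (row) at a time.
# ROW_BYTES, NUM_BANKS and the address mapping are hypothetical values
# chosen for illustration; real hardware maps addresses differently.
ROW_BYTES = 4096
NUM_BANKS = 8


def bank_of(addr):
    """Bank selected by the address bits just above the page offset."""
    return (addr // ROW_BYTES) % NUM_BANKS


def row_of(addr):
    """Page (row) index within the bank."""
    return addr // (ROW_BYTES * NUM_BANKS)


def count_row_activations(accesses):
    """Count page opens needed to serve the accesses in order."""
    open_row = {}  # bank -> page currently open in that bank
    activations = 0
    for addr in accesses:
        bank, row = bank_of(addr), row_of(addr)
        if open_row.get(bank) != row:
            # Another page (or none) is open: close it and open this one.
            open_row[bank] = row
            activations += 1
    return activations


def interleaved(base_a, base_b, n=64, stride=64):
    """Two clients alternate, each walking sequentially through its buffer."""
    out = []
    for i in range(n):
        out.append(base_a + i * stride)
        out.append(base_b + i * stride)
    return out


# Buffer B lands in the same bank as buffer A, but in a different page.
same_bank = count_row_activations(interleaved(0, ROW_BYTES * NUM_BANKS))
# Buffer B lands in the neighbouring bank instead.
other_bank = count_row_activations(
    interleaved(0, ROW_BYTES * NUM_BANKS + ROW_BYTES)
)
print(same_bank, other_bank)
```

In this model the unlucky placement opens a page on every single access, while the other placement opens just one page per buffer. Spreading allocations across regions that map to different banks is meant to make the second situation more likely.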
- dom
Re: NUMA Testing
Pi4 has iommus on the arm cores and v3d. Pi5 also has iommus on display, camera, and hevc.
That allows the more aggressive banklow setting on Pi5 which gives greater performance benefits.
Pi5 also has faster sdram, better access to sdram (i.e. wider/faster internal buses), so generally the improvements with NUMA are greater.
But I get about 7% improvement on multicore geekbench on a Pi4 with two numa regions.
With banklow=1 it rises to about 10%, but you may find issues with display, camera and hevc.
- dom
Re: NUMA Testing
On Pi5, reverting bootloader/firmware/kernel to start of year, I get:
807 / 1648 (https://browser.geekbench.com/v6/cpu/7748942)
With this test firmware and NUMA enabled I get:
902 / 2184 (https://browser.geekbench.com/v6/cpu/7749627)
For an 11.8% single core and 32.5% multi core improvement.
(not all of this gain is from NUMA as there have been numerous other improvements, but the NUMA part is significant).
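For anyone comparing their own runs, the percentages quoted above follow directly from the two Geekbench results. A minimal sketch (the helper name `percent_gain` is just an illustrative choice):

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Geekbench 6 scores quoted in this post (single core / multi core).
single = percent_gain(807, 902)
multi = percent_gain(1648, 2184)
print(f"single core: {single:.1f}%  multi core: {multi:.1f}%")
```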
Re: NUMA Testing
These are significant improvements! That's more than what we get when we upgrade from one generation of iPhone to the next :D
I have applied the eeprom update and kernel 6.6.58 on my Pi 4 and 5 with 4 numa regions in both cases. The good news is that it works just fine for now; I'll let you know if I experience any crashiness.
What is the rationale for choosing the number of regions? The more the merrier, limited by how much memory a single process might want to allocate?
- dom
Re: NUMA Testing
It depends on whether the sdram is single or dual rank, and what we set BANKLOW to.
We basically want the log2 of the number of numa nodes to match the number of high bank bits, so a contiguous allocation from the kernel using numa will span the banks.
The number of high bank bits is (3-BANKLOW), with an extra one if using dual rank.
I think currently, 1GB and 2GB sdram parts are single rank (so get 4 numa regions with this scheme). 4GB and 8GB are dual rank (so get 8).
The bootloader will add an "optimal" numa=fake=<n> option if "numa_policy=interleave" is present and "numa=fake=<n>" is absent, as it knows the hardware configuration of the sdram.
You are free to specify numa=fake=<n> yourself in cmdline.txt, and if you find workloads that run better with your setting, compared to the bootloader's, then that would be interesting information.
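The rule of thumb above can be written down as a small calculation. This is a sketch of the arithmetic as described in this post, not the bootloader's actual code; the function name and the argument check are assumptions.

```python
def suggested_fake_numa_nodes(banklow, dual_rank):
    """Node count whose log2 equals the number of high bank bits.

    High bank bits = (3 - BANKLOW), plus one extra bit for dual rank sdram,
    as described in the post. The range check is an assumption added here.
    """
    high_bank_bits = (3 - banklow) + (1 if dual_rank else 0)
    if high_bank_bits < 0:
        raise ValueError("BANKLOW too large for this scheme")
    return 1 << high_bank_bits


# BANKLOW=1 (the Pi5 setting from the first post):
print(suggested_fake_numa_nodes(1, dual_rank=False))  # single rank -> 4
print(suggested_fake_numa_nodes(1, dual_rank=True))   # dual rank -> 8
```

With BANKLOW=3 the same formula gives 2 nodes for a dual rank part and 1 for a single rank part.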
Re: NUMA Testing
Hmm, that's not what the commit message says:
The key setting numa=fake=<n> is not set here, so we will boot with a single
numa region and behaviour should be pretty much unchanged from before this PR.
Re: NUMA Testing
Tested with firmware 2024/10/21, no numa config if not specified on the kernel command line.
- dom
Re: NUMA Testing
To be enabled, you need the updated kernel, the updated bootloader and the bootloader config setting SDRAM_BANKLOW=1 (for a Pi5).
Have you done that? Show me:
Code: Select all
vcgencmd bootloader_version
uname -a
rpi-eeprom-config
cat /proc/cmdline
Re: NUMA Testing
Ran some tests and things seem to be working well overall. Geekbench 6 is indeed faster than ever before. So is Google Octane v2. Passmark performs extremely well too, except for the "Memory Write" sub test, which regresses from ~11500 MB/s to 8000-something. Sysbench memory read performance is also better than ever (very nice latency, both average and max), while the write test corroborates Passmark's result, showing a severe regression in write performance.
Below is a fresh result on my 8GB board. My oldest result on this same board is 11786 MB/s and 90 ms avg and 129 ms max.
Code: Select all
sysbench memory --memory-block-size=1G --memory-total-size=20G --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1048576KiB
total size: 20480MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 20 ( 8.08 per second)
20480.00 MiB transferred (8270.73 MiB/sec)
General statistics:
total time: 2.4750s
total number of events: 20
Latency (ms):
min: 122.58
avg: 123.74
max: 131.71
95th percentile: 123.28
sum: 2474.78
Threads fairness:
events (avg/stddev): 20.0000/0.00
execution time (avg/stddev): 2.4748/0.00
Re: NUMA Testing
dom wrote: ↑Wed Oct 23, 2024 7:05 pm
Have you done that? Show me:
Code: Select all
vcgencmd bootloader_version
uname -a
rpi-eeprom-config
cat /proc/cmdline
Code: Select all
root@pi4:~# vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729696108
capabilities 0x0000007f
root@pi4:~# uname -a
Linux pi4 6.6.58-v8-rpios #29 SMP Wed Oct 23 14:45:05 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
root@pi4:~# rpi-eeprom-config
[all]
BOOT_UART=0
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
SDRAM_BANKLOW=3
root@pi4:~# cat /proc/cmdline
coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0 numa_policy=interleave snd_bcm2835.enable_headphones=1 snd_bcm2835.enable_hdmi=1 snd_bcm2835.enable_hdmi=0 smsc95xx.macaddr=E4:5F:01:6F:56:10 vc_mem.mem_base=0x3eb00000 vc_mem.mem_size=0x3ff00000 console=ttyS0,115200 multipath=off dwc_otg.lpm_enable=0 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait fixrtc cpufreq.default_governor=performance cgroup_enable=memory numa=fake=8
- dom
Re: NUMA Testing
Interesting. I’ll try to reproduce write performance tomorrow.
- dom
Re: NUMA Testing
fguerraz wrote: ↑Wed Oct 23, 2024 9:03 pm
Without numa=fake=8, numa doesn’t get enabled (easy to check with numactl)
This is pi4? Have you updated firmware (start4.elf)?
vcgencmd version
It should insert numa=fake=2 automatically.
Re: NUMA Testing
Yes it's a pi4, with up to date firmware & start4.elf
Code: Select all
08a3a34b599be818b000610b14d076f0683acaceec94f48449210c1fe35038a2 firmware/start4.elf
- dom
Re: NUMA Testing
fguerraz wrote: ↑Thu Oct 24, 2024 7:26 am
Yes it's a pi4, with up to date firmware & start4.elf
Code: Select all
08a3a34b599be818b000610b14d076f0683acaceec94f48449210c1fe35038a2 firmware/start4.elf
"vcgencmd version" is a more definitive way to confirm the firmware you are running (but that sha256sum does match latest).
To confirm, are you saying with no changes made by yourself to cmdline.txt, you are not seeing "numa=fake=2" added automatically and you have to add it manually? That is not what I see on Pi4:
Code: Select all
pi@pios:~ $ vcgencmd version
Oct 17 2024 11:34:50
Copyright (c) 2012 Broadcom
version b580e2acde306434ab07a913745a21451643ff55 (clean) (release) (start)
pi@pios:~ $ vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729770198
capabilities 0x0000007f
pi@pios:~ $ uname -a
Linux pios 6.6.58-v8+ #1809 SMP PREEMPT Wed Oct 23 11:53:53 BST 2024 aarch64 GNU/Linux
pi@pios:~ $ rpi-eeprom-config
BOOT_UART=0
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0
SDRAM_BANKLOW=3
pi@pios:~ $ cat /boot/firmware/cmdline.txt
console=serial0,115200 console=tty1 root=PARTUUID=21779776-02 rootfstype=ext4 fsck.repair=yes rootwait quiet splash plymouth.ignore-serial-consoles cfg80211.ieee80211_regdom=GB
pi@pios:~ $ cat /proc/cmdline
coherent_pool=1M 8250.nr_uarts=1 snd_bcm2835.enable_headphones=0 numa_policy=interleave snd_bcm2835.enable_hdmi=0 numa=fake=2 smsc95xx.macaddr=DC:A6:32:AD:2A:38 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000 console=ttyAMA0,115200 console=tty1 root=PARTUUID=21779776-02 rootfstype=ext4 fsck.repair=yes rootwait quiet splash plymouth.ignore-serial-consoles cfg80211.ieee80211_regdom=GB
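Checking long command lines like the one above by eye is error prone, so here is a small sketch that pulls out the two options discussed in this thread. The function name and the returned dictionary layout are illustrative assumptions, and only the plain integer form numa=fake=<n> is handled.

```python
def numa_settings(cmdline):
    """Extract numa_policy and numa=fake=<n> from a kernel command line."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # A repeated option overwrites the earlier one in this simple parser.
        params[key] = value if sep else None

    fake_nodes = None
    numa_value = params.get("numa")  # "numa=fake=2" is stored as "fake=2"
    if numa_value and numa_value.startswith("fake="):
        count = numa_value[len("fake="):]
        if count.isdigit():
            fake_nodes = int(count)

    return {"policy": params.get("numa_policy"), "fake_nodes": fake_nodes}


# Shortened version of the /proc/cmdline output shown above.
sample = (
    "coherent_pool=1M numa_policy=interleave snd_bcm2835.enable_hdmi=0 "
    "numa=fake=2 console=tty1 rootwait quiet"
)
print(numa_settings(sample))
```

On a running system the input would come from reading /proc/cmdline, for example with `open("/proc/cmdline").read()`.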
Re: NUMA Testing
Yes, this is what I'm saying.
Code: Select all
$ vcgencmd bootloader_version
2024/10/21 15:24:54
version 951e1cc9d8b1d81c0ca1783a0634605616970bc3 (release)
timestamp 1729520694
update-time 1729696108
capabilities 0x0000007f
$ vcgencmd version
Feb 29 2024 12:24:53
Copyright (c) 2012 Broadcom
version f4e2138c2adc8f3a92a3a65939e458f11d7298ba (clean) (release) (start)
- dom
Re: NUMA Testing
"vcgencmd version" shows the version running of the start4.elf firmware.
That is the code that adds "numa=fake=2", and your version is out of date.
The instructions (rpi-update) should have updated that. Did you not run that?
Re: NUMA Testing
I never said I was running on rpios :D only that I'm running the rpios kernel with the latest firmware. I'll update vcgencmd and report back (although I'm still confused about what role it plays in the boot sequence).
- dom
Re: NUMA Testing
As I said:
Don't update vcgencmd. Update start4.elf (correctly - because you are not running the version you think you are).
- timrowledge
- Posts: 1544
- Joined: Mon Oct 29, 2012 8:12 pm
Re: NUMA Testing
Trying this out on a Pi 5 w/NVME and had no problems at all installing.
For Squeak Smalltalk benchmarks I see small but useful improvements; we use versions of the Benchmark Shootout suite
- nbody
- binary trees
- chameneos redux
- thread ring
Comparing a NUMA'd Pi 5 to a 'plain' Pi 5
- nbody: 5.157 -> 5.095 = 1.2%
- binary trees: 3.398 -> 3.096 = 8.9%
- chameneos redux: 7.274 -> 5.239 = 28%
- thread ring: 8.347 -> 7.783 = 6.7%
Making Smalltalk on ARM since 1986; making your Scratch better since 2012
- dom
Re: NUMA Testing
timrowledge wrote: ↑Thu Oct 24, 2024 6:32 pm
For Squeak Smalltalk benchmarks I see small but useful improvements; we use versions of the Benchmark Shootout suite
- nbody
- binary trees
- chameneos redux
- thread ring
Comparing a NUMA'd Pi 5 to a 'plain' Pi 5
- nbody: 5.157 -> 5.095 = 1.2%
- binary trees: 3.398 -> 3.096 = 8.9%
- chameneos redux: 7.274 -> 5.239 = 28%
- thread ring: 8.347 -> 7.783 = 6.7%
Good to hear.
Not bad for some fiddling with ram timings.
Re: NUMA Testing
dom wrote: ↑Thu Oct 24, 2024 7:11 pm
Good to hear.
Not bad for some fiddling with ram timings.
I thought fake NUMA only changed the order in which pages were allocated to a process. Do the memory timings change as well?
Is it possible to isolate the timing changes from the changes in the memory allocator?
Re: NUMA Testing
Thank you, @dom, awesome!
Could you apply the NUMA commits to 6.12 as well, please? – Thank you!
Best regards,
Michael
- dom
Re: NUMA Testing
There are no explicit memory timing changes in latest rpi-update version. There have been a few over the last few months.
The NUMA changes may improve the observed average latency, as the chance of hitting an open sdram page is higher.
Re: NUMA Testing
Am I right in thinking that in "typical" workloads, reads significantly outnumber writes (I dimly remember a ratio of 3:1)? If that is the case, worse write performance as a trade for better read performance would be an overall win.