Performance of rotating video by 90/270 degree on Pi3
Hey.
I'm currently trying to get rid of the old non-KMS method of displaying video content on Pi3s so I can unify playback across all 64bit capable models. This works well for videos rotated by 0 or 180 degrees. Rotation by 90/270 degrees is more of a problem: I'm still not able to match the performance of the now deprecated mmal/dispmanx method, which allowed rotating content with 1.5 bytes/pixel of bandwidth for H264 YUV420 videos. I'm not sure if it's possible to achieve this somehow. What I've tried so far with a 1080x1920@25 H264 video on a 1920x1080 display:
- Import the decoded frame into a GL texture and render that using two triangles and a minimal fragment shader. This results in around 47fps. I kind of expected that to be slower, as the YUV420 texture is first rendered into an ARGB8888 (or XRGB8888, makes no noticeable difference) GL surface, which is then placed via DRM on the screen. So it's 4 bytes per pixel.
Code: Select all
// needed for samplerExternalOES; GLES2 fragment shaders also need a default float precision
#extension GL_OES_EGL_image_external : require
precision mediump float;
uniform samplerExternalOES Texture;
varying vec2 TexCoord;
void main() {
    gl_FragColor = texture2D(Texture, TexCoord);
}
- Transpose the video frame using the writeback connector and its new DRM_MODE_TRANSPOSE feature. The resulting framebuffer is then placed via DRM on the display. This seems to be somehow worse, with only 35fps, and the transpose alone takes around 25ms. The transpose target can only be ABGR8888, so I don't really expect that to be fast, as it's essentially also 4 bytes/pixel now.
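To put the 1.5 bytes/pixel and 4 bytes/pixel figures side by side, here is a rough sketch of the arithmetic for the 1080x1920@25 test video. It counts a single pass (one read or one write) over each frame and ignores pitch padding, so the numbers are a lower bound rather than a measurement. Python is used only to keep the sketch short, and the function names are made up for illustration.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Bytes touched by one pass (one read or one write) over a full frame."""
    return int(width * height * bytes_per_pixel)


def bandwidth_mb_per_s(width, height, bytes_per_pixel, fps):
    """Bandwidth of one pass per frame at the given frame rate, in MB/s (10^6 bytes)."""
    return frame_bytes(width, height, bytes_per_pixel) * fps / 1e6


WIDTH, HEIGHT, FPS = 1080, 1920, 25  # the test video from the post

yuv420_frame = frame_bytes(WIDTH, HEIGHT, 1.5)  # planar YUV420
rgba_frame = frame_bytes(WIDTH, HEIGHT, 4)      # ARGB8888 / ABGR8888

if __name__ == "__main__":
    print("YUV420 frame:", yuv420_frame, "bytes")
    print("RGBA frame:  ", rgba_frame, "bytes")
    print("ratio:       ", round(rgba_frame / yuv420_frame, 2))
    print("YUV420 pass: ", bandwidth_mb_per_s(WIDTH, HEIGHT, 1.5, FPS), "MB/s")
    print("RGBA pass:   ", bandwidth_mb_per_s(WIDTH, HEIGHT, 4, FPS), "MB/s")
```

Every RGB pass moves roughly 2.7 times the data of a YUV420 pass, which fits both RGB-based approaches falling short of the old path.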
info-beamer hosted - A user and programmer friendly digital signage platform for the Pi: https://info-beamer.com/hosted
Re: Performance of rotating video by 90/270 degree on Pi3
So there's no way to get that level of performance using DRM at the moment, correct? Would this be something to potentially add? I would see a few ways this might be implemented:
- At the decoder level, I would expect there's a mechanism to signal dynamic properties to those V4L2 devices (like exposure for cameras?). A rotation property would then result in the buffers being rotated before they are delivered to userspace. Not sure how the rest of ffmpeg would handle sudden resolution changes though.
- There seems to be a deinterlacer filter using a bcm2835-codec-isp? Maybe a rotation filter could be added kernel side and then a similar ffmpeg filter could use that interface? Seems like a lot of work.
- Without touching ffmpeg at all: The writeback connector could get a special case that allows transpose if and only if 1) there's only a single plane covering the complete CRTC using a DRM_FORMAT_YUV420 framebuffer and 2) the target WRITEBACK_FB_ID is also a DRM_FORMAT_YUV420 framebuffer.
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
6by9 is better placed to answer, but my guess is 1 and 2 are not going to happen.
Anything that requires allocating new video buffers is not going to happen in the kernel.
Supporting transposed DRM_FORMAT_YUV420 in writeback connector may be an option.
Re: Performance of rotating video by 90/270 degree on Pi3
I'm trying to understand how that might potentially work. The txp output target seems to always be just a single linear RGB or RGBA framebuffer. From your response I gather there's an unused mode that outputs linear 8bit (from linear 8bit input) somewhere? If that's the case, targeting an output YUV420 framebuffer would imply that this has to be done in three passes, one for each of the input's planes into the corresponding plane of the output framebuffer. This seems like it really breaks the abstraction currently implemented in txp.c unless there's a way to queue up three individual runs somehow.
Another horrible alternative would be for me to dig deep into reverse engineered code targeting the HVS and try to implement this from userspace. But I'm not sure if that would even be possible without massively tripping over what the kernel is doing.
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
Sorry I was misremembering. I thought the transposer supported a one byte per pixel mode and could do yuv in three passes, and that was what the old openmax/mmal components used when transpose was requested.
But I've checked the spec and the destination pixel format must be RGB (16-bit, 24-bit and 32-bit are supported).
And I've checked the code, and the video_render does a software (on VideoCore side) transpose (on each of the Y, U and V planes).
So, I think both transposer and GL, if they are required to write RGB, are likely to harm performance due to the increased sdram bandwidth.
ARM neon code may do an okay job, by reading and writing the YUV directly. Transpose is a little awkward to do efficiently as you ideally want to store tiles of data, so you can both read and write efficiently. I think with neon you have enough space for a pair of 16x16 tiles.
I also imagine it's possible to use the 3d hardware to do this and that feels like the best option.
You want a way that doesn't use standard RGB textures and framebuffers, but directly reads and writes the YUV planes.
I'm afraid my GL knowledge isn't sufficient, other than to say the hardware can handle what you want, and I suspect there is a way of persuading GL to use already allocated (likely dmabuf) buffers for input and output and a simple shader could handle the transpose.
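The tile idea mentioned above can be sketched in a few lines. This is only a plain Python model of a tile-by-tile transpose of one 8-bit plane, not NEON code: it shows the access pattern (read a small block row by row, write it out column by column) under the assumption of tightly packed planes without pitch padding. The function name and the default tile size are illustrative.

```python
def transpose_tiled(src, width, height, tile=16):
    """Transpose a tightly packed width x height 8-bit plane tile by tile.

    Returns a new bytearray holding the height x width result, i.e.
    dst[x * height + y] == src[y * width + x].
    """
    if len(src) != width * height:
        raise ValueError("plane size does not match width * height")
    dst = bytearray(width * height)
    for ty in range(0, height, tile):          # walk the tiles top to bottom
        for tx in range(0, width, tile):       # and left to right
            for y in range(ty, min(ty + tile, height)):
                # one contiguous read per source row of the tile
                row = src[y * width + tx : y * width + min(tx + tile, width)]
                for i, value in enumerate(row):
                    dst[(tx + i) * height + y] = value
    return dst


if __name__ == "__main__":
    plane = bytes([1, 2, 3,
                   4, 5, 6])                   # 3 wide, 2 high
    print(list(transpose_tiled(plane, 3, 2)))  # 2 wide, 3 high
```

Keeping both the reads and the writes inside one small tile is what lets a real implementation use wide, sequential memory accesses on both sides.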
Re: Performance of rotating video by 90/270 degree on Pi3
Wow. Interesting. Thanks for looking into the old "video_render" code. I certainly didn't expect that. Almost seems like a waste of a perfectly good compute to not use that :)
I'll see if I can manage to do rotation in software. Might be worth a try.
As for GL rotation: Maybe I could import each video plane as its own 8bit texture, import a corresponding target 8bit texture and use it as a GL framebuffer target to transpose into. Then reimport the resulting 3 planes as a new video frame. Hm.
Thanks for the insights.
- cleverca22
- Posts: 9593
- Joined: Sat Aug 18, 2012 2:33 pm
Re: Performance of rotating video by 90/270 degree on Pi3
as far as i know, the v3d core on the pi0-pi3 doesnt support linear textures, it must be in T-Format
and all conversion of textures from linear to TF is done in cpu
i think the pi4 added a hw conversion block
so any kind of video being fed into opengl, is going to have a major cpu overhead
i'm not sure how things like wayland gl composition can even work on the pi0-pi3 without the same overheads
Re: Performance of rotating video by 90/270 degree on Pi3
cleverca22 wrote: ↑Wed Feb 19, 2025 7:18 am
> as far as i know, the v3d core on the pi0-pi3 doesnt support linear textures, it must be in T-Format
> and all conversion of textures from linear to TF is done in cpu
Thanks! Interesting. In hindsight that makes sense: The expensive part then isn't the rotation itself (I can easily rotate the texture in GL by any degree), but the import into GL. So that route is going nowhere then. I'll see if I can transpose the video frame prior to the DRM framebuffer import.
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
We have used the 3d hardware for acceleration jobs in the past on Pi3.
We have a yadif style deinterlace, and parts of the hevc decoder written in qpu assembly.
These directly read and write memory (through a dma type mechanism) and work directly on the YUV buffers (no texture unit involved).
So the hardware is capable of doing what we want, it's just down to whether GL exposes what is needed to do this simply.
Re: Performance of rotating video by 90/270 degree on Pi3
dom wrote: ↑Wed Feb 19, 2025 10:45 am
> We have used the 3d hardware for acceleration jobs in the past on Pi3.
> We have a yadif style deinterlace, and parts of the hevc decoder written in qpu assembly.
> These directly read and write (through a dma type mechanism) memory and work directly on the YUV buffers directly (no texture unit involved).
I've seen this code back in 2019 when initially trying to support HEVC on the Pi4. That's some black magic :shock:. It seems it's no longer in the current 5.1.6 branch, right?
I managed to write a NEON 8x8 transpose code and it does transpose a 1920x1080 byte buffer in 6ms, so it's close but should be sufficient. I also managed to mmap the decoded planes from the hardware H264 decoder, transpose them into another DMA buffer and import that into DRM. The video is transposed on the screen (yay!), but for some reason the transpose now takes 35ms, so it's ~5x slower. My guess is that this is due to the video decoder placing those buffers into uncached memory (like in this post?). Only way out of that might be to force the H264 decoder to allocate buffers on the ARM side somehow? EDIT: Just discovered the "dmabuf_alloc" "cma" decoder setting. Hm
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
Yeah, it doesn't play nicely with arm side 3d driver, so it's deprecated (but I suspect could work if you weren't using kms).
Just using it as an example that 3d hardware can support YUV buffers.
There is a way of submitting qpu jobs that does play nicely with arm side 3d. e.g. here.
VPU jobs should also be possible (this was a scheme also used by pi3 hevc decoder).
The VPU is quite good at software transpose as it has a 64x64 byte register file, so can load horizontally, then store vertically, with both using decent width sdram accesses.
> I managed to write a NEON 8x8 transpose code and it does transpose a 1920x1080 byte buffer in 6ms, so it's close but should be sufficient. I also managed to mmap the decoded planes from the hardware H264 decoder, transpose them into another DMA buffer and import that into DRM. The video is transposed on the screen (yay!), but for some reason the transpose now takes 35ms, so it's ~5x slower. My guess is that this is due to the video decoder placing those buffers into uncached memory (like in this post?). Only way out of that might be to force the H264 decoder to allocate buffers on the ARM side somehow?
Yes, uncached accesses from the arm will be slow (especially with narrower accesses). Transposing 16x16 (if possible) would help.
I believe cacheable buffers from H264 are possible (as we had the same issue in Kodi when doing software deinterlace of hardware decoded video) but I can't immediately tell you what determines that.
Re: Performance of rotating video by 90/270 degree on Pi3
I'm still utterly confused about how I'm supposed to write code for that. Maybe another day. Thanks for the link to an example!
dom wrote: ↑Wed Feb 19, 2025 12:38 pm
> Yes, uncached accesses from the arm will be slow (especially with narrower accesses). Transposing 16x16 (if possible) would help.
> I believe cacheable buffers from H264 are possible (as we had the same issue in Kodi when doing software deinterlace of hardware decoded video) but I can't immediately tell you what determines that.
Figured it out. Setting the decoder option "dmabuf_alloc" to "cma" made it fast. So now the transpose takes ~12ms or 80% CPU, but the result is neatly transposed and the combined output is back at 60fps for a 24fps FullHD video. Better than before (~40fps via GL rotation) and I'm sure there's still room for improvement. Of course it would still be better to offload this to hardware somehow.
Re: Performance of rotating video by 90/270 degree on Pi3
And it works. I can now dynamically switch between rotation via GL or by transposing the three YUV420 planes prior to importing into DRM. Right now I'm using an 8x8 NEON transposer at the core. It improves framerates quite a bit at the cost of CPU.
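For reference, the geometry of "transposing the three YUV420 planes" can be written down as a small sketch. It assumes planar YUV420 with chroma subsampled by two in both directions and ignores the pitch and height alignment that real decoder buffers usually add, so it is an illustration rather than the layout libavcodec actually hands out. The `Plane` type and function names are hypothetical.

```python
from typing import NamedTuple


class Plane(NamedTuple):
    name: str
    width: int
    height: int


def yuv420_planes(width, height):
    """Planes of a planar YUV420 frame: full size luma, half size chroma."""
    chroma_w, chroma_h = (width + 1) // 2, (height + 1) // 2
    return [
        Plane("Y", width, height),
        Plane("U", chroma_w, chroma_h),
        Plane("V", chroma_w, chroma_h),
    ]


def transposed(planes):
    """Each plane is transposed on its own, so width and height simply swap."""
    return [Plane(p.name, p.height, p.width) for p in planes]


if __name__ == "__main__":
    for before, after in zip(yuv420_planes(1920, 1080), transposed(yuv420_planes(1920, 1080))):
        print(before, "->", after)
```

Because the chroma planes stay half size in both directions after the swap, the result is again a valid YUV420 frame and can be imported into DRM as such.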
About that QPU method: I found this assembler. Guess I might take a closer look at some point.
- cleverca22
- Posts: 9593
- Joined: Sat Aug 18, 2012 2:33 pm
Re: Performance of rotating video by 90/270 degree on Pi3
i can also offer some code for this
Code: Select all
v8ld H(0++,0), (r0+=r1) REP64
it will then increment the source address by r1, and repeat on row=1 of the matrix, for a total of 64 loads
Code: Select all
v8st V(0,0++), (r0+=r1) REP64
but in this raw configuration, its writing to a 16x64 part of the matrix, then reading a 64x16 part
so youll either need to do 4 load/store sets to fill out the entire 64x64 and save it, or switch to 32bit load/store (but i'm fuzzy on how byte order impacts that when transposing), or just eat the cost and do only REP16, so its always 16x16 transfers
the above ASM can then be compiled with the binutils from https://github.com/itszor/vc4-toolchain
and then you can use vcpoke as a starter example: https://github.com/ali1234/vcpoke/blob/master/main.c
`mem_alloc()` is just one way of doing it, that allocates from the firmware heap, but anything where you can access it in linux and get the phys addr (such as CMA) works
lines 21-34 is just copying the binary of that asm into the allocated region (you could also just open a .bin file, and load that)
and then execute_code() asks the VPU firmware to run a chunk of assembly at an arbitrary address
execute_code() works on the entire pi0-pi4 range, and the firmware will grab a mutex on the vector registers before executing it, so you dont have any clashes there
pi5 has sadly removed the mailbox function needed for execute_code()
the only things you need to be aware of after that, is that the VPU is in a different addr space, youll want to use the C alias for most addressing, and deal with the arm caches before/after accessing things
and since your asm is loaded to a random address, it needs to be fully PIC, or you need to apply relocations when loading it
in my testing, i have gotten the VPU to basically saturate the sdram bandwidth, something over 95% of the theoretical bandwidth
feel free to ask more questions if you want to explore that more and get stuck
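As a rough illustration of what execute_code() hands to the firmware, here is a sketch that only builds the mailbox property buffer and does not send it. The tag value 0x00030010 is the one that shows up in kernel log messages for this request. The word layout follows the mailbox.c helper shipped with the Raspberry Pi userland examples as far as I recall it, so treat it as an assumption and compare against vcpoke before relying on it. The function name is made up.

```python
import struct

TAG_EXECUTE_CODE = 0x00030010  # mailbox property tag for "execute code"


def build_execute_code_message(code_addr, regs):
    """Build the property buffer for one execute_code request (not sent here).

    code_addr and regs are VPU bus addresses or plain values. At most six
    registers (r0-r5) fit into the message; missing ones are passed as 0.
    """
    if len(regs) > 6:
        raise ValueError("only r0-r5 can be passed")
    regs = list(regs) + [0] * (6 - len(regs))
    words = [
        0,                 # total buffer size in bytes, patched below
        0x00000000,        # request code: process request
        TAG_EXECUTE_CODE,
        28,                # value buffer size: function pointer + r0..r5
        28,                # request data size
        code_addr,
        *regs,
        0x00000000,        # end tag
    ]
    words[0] = len(words) * 4
    return struct.pack("<%dI" % len(words), *words)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    msg = build_execute_code_message(0xC0001000, [0xC0002000])
    print(len(msg), msg.hex())
```

The fixed 28 byte value buffer is why anything beyond six arguments has to travel through memory, for example a struct whose address is passed in r0.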
Re: Performance of rotating video by 90/270 degree on Pi3
Thank you for the amazing post. This is certainly enough to think I might actually do this. I got the toolchain running and:
Code: Select all
a.out: file format elf32-vc4
Disassembly of section .text:
00000000 <_start>:
0: 06 f8 38 00 80 03 v8ld H(0++,0),(r0+=r1) REP64
6: 40 f8 00 00
a: 86 f8 04 e0 80 03 v8st V(0,0++),(r0+=r1) REP64
10: e0 13 00 00

So that compiles, I can use objdump and objcopy to get the machine code. Neat.
I looked into vcpoke, it runs (and the address 0x7e207080 seems to point at a string containing "pixv"?). One step that I'm missing is how to get the physical address of the memory referenced by the dma_buf file descriptor returned through libavcodec. They are allocated here using the DMA_HEAP_IOCTL_ALLOC ioctl and I don't see any obvious way to retrieve the physical address, which I then likely have to pass in as one of the six registers so the code knows what to transpose. What I found is VC_SM_CMA_CMD_IMPORT_DMABUF, which maybe returns the physical address? I'll see if that works out.
- cleverca22
- Posts: 9593
- Joined: Sat Aug 18, 2012 2:33 pm
Re: Performance of rotating video by 90/270 degree on Pi3
yep, thats pixel valve 1, on the pi0-pi3 family, it controls the video timing params for DSI1 or SMI(unused)
ive not worked with dma_buf handles much yet
all of my dma has either been done from the kernel side, or by asking the firmware to allocate things within the old gpu_mem heap
a modern dma example
a decade old dma example
and looking at them, they are nearly identical! heh
- 6by9
- Raspberry Pi Engineer & Forum Moderator
- Posts: 18477
- Joined: Wed Dec 04, 2013 11:27 am
Re: Performance of rotating video by 90/270 degree on Pi3
Sorry, been on holiday for a few days.
The ISP can transpose Bayer images only.
The only other dedicated hardware block with a transpose is the transposer, accessible via DRM's writeback connector. The transposer can only write out as 16, 24, or 32bpp RGB. With https://github.com/raspberrypi/linux/pull/6312 you gained the ability to transpose the output.
cleverca22 wrote: ↑Thu Feb 20, 2025 12:08 am
> execute_code() works on the entire pi0-pi4 range, and the firmware will grab a mutex on the vector registers before executing it, so you dont have any clashes there
> pi5 has sadly removed the mailbox function needed for execute_code()
Pi5 has scaled back the VPU significantly.
Getting hold of VPU addresses is trickier. vc_sm_cma is scaling back what is available to userspace, particularly as it gets upstreamed.
Implementing a V4L2 rotation M2M device using the VPU for any 8bpc symmetrical (ie not YUV422) colour format probably wouldn't be too difficult. As Dom says, the VPU can load the VRF in nice efficient bursts, and save out with rotation with equally efficient bursts. I haven't messed with the firmware for a while, and other priorities may mean it takes a while to get implemented.
Software Engineer at Raspberry Pi Ltd. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.
Re: Performance of rotating video by 90/270 degree on Pi3
Holy moly. It works. Thanks again for the great pointers on how to actually even execute a single NOP instruction. Now I'm transposing a three plane 1920x1080 24fps YUV420 video and CPU is at 38% (compared to NEON's 80%) while the DRM HDMI output framerate is perfectly at 60fps. This is awesome.
The VC_SM_CMA_IOCTL_MEM_IMPORT_DMABUF ioctl allowed me to import already existing dma_bufs in a way that I can grab their physical address in the .dma_addr return value. From there it was trial and error to get increasingly more complex assembly code running to see what's possible. One issue I ran into is the lack of registers. There seem to be plenty, but it looks like I'm restricted to r0 to r5 when submitting code like that? At least I always get the following kernel error when I use anything else:
Code: Select all
[ 752.910448] raspberrypi-firmware soc:firmware: Request 0x00030010 returned status 0x80000001

This made it a bit more complicated as I now have to pass in the pointer to a control struct located alongside the code. This is a bit unfortunate as it means I'll have to synchronize access to avoid multiple callers stepping over each other. I would prefer a completely register based method, but I (for now) couldn't figure out how to store additional values. I'm not sure there's a stack allocated for me?
Right now I'm using 16x16 transpose as video frames seem to be aligned like that. I guess a potential next step would be to improve the loop to make use of larger transpose operations if possible. Or maybe loop unroll a bit.
Here's my code:
(mod edit for language!)
Code: Select all
;
; info-beamer.com 16x16 transpose using VPU
; -----------------------------------------
; Assembler doc:
; https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual

; byte offsets into the control struct whose address is passed in r0
.equ SRC_ADDR, 0
.equ DST_ADDR, 4
.equ TILE_WIDTH, 8
.equ TILE_HEIGHT, 12
.equ SRC_PITCH, 16
.equ DST_PITCH, 20
.equ ROW, 24

    mov r5, r0                      ; r5 = control struct
next_row:
    ld r4, (r5 + ROW)               ; r4 = row
    ld r0, (r5 + SRC_ADDR)
    ld r1, (r5 + SRC_PITCH)
    mul r1, r4
    addscale r0, r1 << 4            ; r0 = src read
    ld r1, (r5 + DST_ADDR)
    addscale r1, r4 << 4            ; r1 = dst write
    add r4, 1
    st r4, (r5 + ROW)               ; increment and save row
    ld r2, (r5 + TILE_WIDTH)        ; r2 = tiles left in this row
    ld r3, (r5 + SRC_PITCH)
    ld r4, (r5 + DST_PITCH)
next_col:
    v8ld H(0++,0),(r0+=r3) REP16    ; load 16 source rows of 16 bytes
    v8st V(0,0++),(r1+=r4) REP16    ; store the 16 VRF columns as output rows
    add r0, 16                      ; next tile to the right in src
    addscale r1, r4 << 4            ; 16 rows further down in dst
    sub r2, 1
    cmp r2, 0
    bne next_col
    ld r4, (r5 + ROW)               ; r4 = row
    ld r3, (r5 + TILE_HEIGHT)
    cmp r3, r4
    bne next_row
    rts
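To make the control flow of the listing easier to follow, here is a plain Python model of it. This is my reading of the assembly, not something verified on hardware: it assumes the seven control fields are little-endian 32-bit words at the offsets given by the .equ lines, treats memory as one flat bytearray, and uses illustrative helper names.

```python
import struct

# SRC_ADDR, DST_ADDR, TILE_WIDTH, TILE_HEIGHT, SRC_PITCH, DST_PITCH, ROW
CTRL_FORMAT = "<7I"
TILE = 16


def pack_ctrl(src_addr, dst_addr, tiles_x, tiles_y, src_pitch, dst_pitch):
    """Control struct as the listing reads it; ROW starts at 0."""
    return struct.pack(CTRL_FORMAT, src_addr, dst_addr, tiles_x, tiles_y,
                       src_pitch, dst_pitch, 0)


def model_transpose(mem, ctrl):
    """Run the loop structure of the listing on a bytearray; returns final ROW."""
    src, dst, tiles_x, tiles_y, src_pitch, dst_pitch, row = struct.unpack(CTRL_FORMAT, ctrl)
    if tiles_x < 1 or tiles_y < 1:
        raise ValueError("both loops are do-while style and need at least one tile")
    while True:                                   # next_row
        r0 = src + row * src_pitch * TILE         # first source tile of this tile row
        r1 = dst + row * TILE                     # matching column block in the output
        row += 1
        for _ in range(tiles_x):                  # next_col
            # v8ld H(0++,0): 16 rows of 16 bytes into the register file
            tile = [mem[r0 + y * src_pitch : r0 + y * src_pitch + TILE] for y in range(TILE)]
            # v8st V(0,0++): each register file column becomes one output row
            for x in range(TILE):
                start = r1 + x * dst_pitch
                mem[start : start + TILE] = bytes(tile[y][x] for y in range(TILE))
            r0 += TILE
            r1 += dst_pitch * TILE
        if row == tiles_y:
            return row


if __name__ == "__main__":
    width, height = 32, 48                        # tiny plane, multiples of 16
    mem = bytearray(2 * width * height)
    mem[: width * height] = bytes(i % 251 for i in range(width * height))
    ctrl = pack_ctrl(0, width * height, width // TILE, height // TILE, width, height)
    print("rows processed:", model_transpose(mem, ctrl))
```

One thing the model makes visible: the assembly stores the incremented ROW back into the control struct, so ROW presumably has to be reset to 0 before the struct is reused for the next plane.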
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
I believe VC has a 256-bit width path to memory, so 16-bit loads/stores are optimal.
With a 16-bit load, the ls-byte is loaded to H(0,0) and ms-byte to H(0,16), so you need to shuffle before writing back.
I think you want the vinterl/vinterh pair for getting the bytes into natural order in VRF after the v16ld.
Then the veven/vodd pair for getting the bytes back into a format suitable for a v16st.
- cleverca22
- Posts: 9593
- Joined: Sat Aug 18, 2012 2:33 pm
Re: Performance of rotating video by 90/270 degree on Pi3
the calling convention on the VPU is that r0-r5 can be clobbered by functions, and those also hold the first 6 arguments (i believe, this is from memory)
if you want more arguments, they have to go on the stack, but the execute_code api doesnt offer that
so the other way, is to put a struct full of config flags into another dma_buf, and then pass over a pointer to that struct
i get most of my asm examples and test vpu things via https://github.com/librerpi/lk-overlay/
that project is mainly about replacing the VPU firmware entirely, so you could just add a proper transpose service yourself
however, i dont have hdmi or hardware video decode working, so you would be set back several steps if you wanted to use it
if you want more registers, here is how my interrupt handler saves them
Code: Select all
800002a4: a9 02 stm r6-r15,(--sp)
800002a6: c7 02 stm r16-r23,(--sp)
...
800002ba: 47 02 ldm r16-r23,(sp++)
800002bc: 29 02 ldm r6-r15,(sp++)
then you can use those regs for whatever computation you want
ive also done some benchmarking
Code: Select all
v32ld HY(0++,0),(r1+=r2) REPx, 11 cycle startup (for L1 hit), plus 2*x, given that (r2%64)==0
i also did some wonky tests, the alignment of r1 doesnt matter, so you can read mis-aligned 32bit ints all day
HOWEVER!, the stride must keep the same mis-alignment
i think there is a wide shifter, that will swap bytes around to fix the alignment as you load, over multiple bus transfers
and flushing that and re-configuring it, causes a serious performance penalty
your function gets called within the mailbox thread of the VPU firmware, while a mutex for the vector-regs is held
i think timer interrupts can interrupt your work and schedule other threads, and it will eventually get back to you
and yes, there will be a stack, just dont trash it, because there are other things using it, this api is basically just calling a function pointer in the middle of another program
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
cleverca22 wrote: ↑Thu Feb 20, 2025 6:25 pm
> this is doing a `uint32_t[16]` load, and repeating X times, and the best possible speed (vpu L1 cache hit), is 11 + (x*2) clock cycles
But with a 1920x1080 image transpose, you won't be getting any cache hits.
And in this case you should be using the 0xCxxxxxxx alias to bypass the cache anyway (a cache miss is slower than a bypass cache address).
For alignment you want the start address and stride to be naturally aligned to the vector size (e.g. 16-byte for v8ld, and 32-bytes for v16ld).
v32ld has the same alignment requirements as v16ld (it goes through axi as a two-beat burst of 32 bytes).
That's why I recommend v16ld - you'll get basically the same speed from sdram, and will have enough register space to handle what you've read.
Misaligned will likely be twice as slow (yes, it does two naturally aligned reads and throws half the data away).
Re: Performance of rotating video by 90/270 degree on Pi3
dom wrote: ↑Thu Feb 20, 2025 2:58 pm
> I believe VC has a 256-bit width path to memory, so 16-bit loads/stores are optimal.
> With a 16-bit load, the ls-byte is loaded to H(0,0) and ms-byte to H(0,16), so you need to shuffle before writing back.
> I think you want the vinterl/vinterh pair for getting the bytes into natural order in VRF after the v16ld.
> Then the veven/vodd pair for getting the bytes back into a format suitable for a v16st.
Thanks. I'll see if it's worth getting that done too. This all is quite a deep rabbit hole to get lost in while optimizing stuff. My brain is melting, but the instruction set is quite fun. Right now I'm dynamically using both 16x16 and 16x64 transposers. A transpose now takes between 10 and 14ms, so it's fast enough to do 1080p60.
dom wrote:
> And in this case you should be using the 0xCxxxxxxx alias to bypass the cache anyway (a cache miss is slower than a bypass cache address).
I've seen this mentioned a few times. Is the physical memory aliased and some addresses don't use the cache? But that's only possible on the Pi3 with its 1GB memory, correct? From looking at the addresses, right now it seems the physical addresses I get are in the 0xeXXXXXXX range. What's with those? None of the vc_sm_cma_ioctl_import_dmabuf.cached settings seems to make a difference.
6by9 wrote:
> Pi5 has scaled back the VPU significantly. Getting hold of VPU addresses is trickier. vc_sm_cma is scaling back what is available to userspace, particularly as it gets upstreamed.
That's fine for me. It's only on the Pi3 that I struggle to get the performance back to the old firmware levels. With this VPU transpose, I think I'm basically there, and on the Pi4 and 5 the GPU is fast enough to do the rotation for me.
6by9 wrote:
> Implementing a V4L2 rotation M2M device using the VPU for any 8bpc symmetrical (ie not YUV422) colour format probably wouldn't be too difficult. As Dom says, the VPU can load the VRF in nice efficient bursts, and save out with rotation with equally efficient bursts. I haven't messed with the firmware for a while, and other priorities may mean it takes a while to get implemented.
Might be interesting to have an official alternative that helps with implementing a 90/270 degree rotation for DRM planes. For now I no longer immediately need that, so I'm good :)
I also noticed that setting the scaling governor to 'performance' is a lot better than 'ondemand'. I guess the VPU shares a clock somewhere and is faster if the ARM side is clocked higher?
Last edited by dividuum on Fri Feb 21, 2025 11:22 pm, edited 1 time in total.
- dom
- Raspberry Pi Engineer & Forum Moderator
- Posts: 8472
- Joined: Wed Aug 17, 2011 7:41 pm
Re: Performance of rotating video by 90/270 degree on Pi3
dividuum wrote (Fri Feb 21, 2025 7:27 pm): I've seen this mentioned a few times. Is the physical memory aliased and some addresses don't use the cache? But that's only possible on the Pi3 with its 1GB memory, correct? From looking at the addresses, right now it seems the physical addresses I get are in the 0xeXXXXXXX range. What's with those?
VideoCore has a 30-bit (1GB) address bus, and uses the top two bits to set caching modes.
Basically 00=cached. 11=uncached. We refer to 11 as the "C" alias (i.e. address |= 0xc0000000 to make it uncached).
0xeXXXXXXX is just "C" alias of 0x2XXXXXXX - i.e. your buffer is above 512M.
Which is the correct alias for what you are doing.
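The aliasing can be sketched in a few lines. This is only an illustration: the helper names `to_uncached_alias` and `split_bus_address` are made up for this example and are not part of any firmware or kernel API, and it assumes a 32-bit bus address whose top two bits select the cache mode as described above.

```python
ALIAS_MASK = 0xC0000000   # top two bits of the 32-bit bus address ("C" alias)
OFFSET_MASK = 0x3FFFFFFF  # lower 30 bits: offset within the 1GB window


def to_uncached_alias(bus_addr):
    """Return the uncached ("C" alias) view of a VideoCore bus address."""
    return (bus_addr & OFFSET_MASK) | ALIAS_MASK


def split_bus_address(bus_addr):
    """Split a bus address into (alias bits, 30-bit offset)."""
    return (bus_addr >> 30) & 0x3, bus_addr & OFFSET_MASK


# A buffer at offset 0x2E000000 (above 512M) shows up as 0xEE000000.
print(hex(to_uncached_alias(0x2E000000)))  # 0xee000000
```

Applying `to_uncached_alias` to an address that already carries the "C" alias leaves it unchanged, so it is safe to call on whatever address you are handed.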
Note: you can change your dmabufs back to uncached (on arm side) now. It will save you some invalidate/writeback calls into the kernel, but as the data shouldn't be dirty, that won't be expensive.
dividuum wrote: I also noticed that setting the scaling governor to 'performance' is a lot better than 'ondemand'. I guess the VPU shared a clock somewhere and is faster if the ARM side is clocked higher?
Yes, core_freq determines how fast the VPU is.
It goes up when the cpufreq governor says we're busy, but when the ARM is idle and the VPU is busy, the cpufreq governor doesn't know.
"core_freq_min=400" in config.txt is a workaround that avoids forcing the arm clock high.
If you wanted to be nicer, you could use a mailbox call to request core=400 before you start your job and set it back to 250 after.
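A rough sketch of what such a mailbox call could look like from userspace, going through /dev/vcio. Nothing here is taken from this thread: the tag value (0x00038002, "set clock rate"), the clock id (4 for the core clock) and the ioctl number are assumptions based on the publicly documented firmware mailbox property interface, and the function names are made up for this example. Whether the firmware keeps the requested rate once the governor changes state is something you would have to verify on your own setup.

```python
import fcntl
import struct

# Assumed values from the firmware mailbox property interface documentation.
TAG_SET_CLOCK_RATE = 0x00038002
CLOCK_ID_CORE = 4


def ioctl_mbox_property():
    """Equivalent of _IOWR(100, 0, char *); the size field is the pointer width."""
    ptr_size = struct.calcsize("P")
    return (3 << 30) | (ptr_size << 16) | (100 << 8) | 0


def build_set_core_clock(hz, skip_turbo=0):
    """Build the property buffer for a single 'set clock rate' tag."""
    words = [
        0,                   # total buffer size in bytes, filled in below
        0,                   # request code (0 = process request)
        TAG_SET_CLOCK_RATE,  # tag id
        12,                  # value buffer size in bytes
        0,                   # request indicator
        CLOCK_ID_CORE,       # clock id
        hz,                  # rate in Hz
        skip_turbo,          # skip setting turbo
        0,                   # end tag
    ]
    words[0] = 4 * len(words)
    return bytearray(struct.pack("<%dI" % len(words), *words))


def set_core_clock(hz):
    """Send the request; returns (success flag, rate reported by the firmware)."""
    buf = build_set_core_clock(hz)
    with open("/dev/vcio", "rb", buffering=0) as dev:
        fcntl.ioctl(dev, ioctl_mbox_property(), buf, True)
    (code,) = struct.unpack_from("<I", buf, 4)
    (rate,) = struct.unpack_from("<I", buf, 24)
    return code == 0x80000000, rate
```

Used as suggested above, that would be `set_core_clock(400_000_000)` before starting the job and `set_core_clock(250_000_000)` afterwards. Note the rate is given in Hz, and the process needs permission to open /dev/vcio.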
- cleverca22
- Posts: 9593
- Joined: Sat Aug 18, 2012 2:33 pm
Re: Performance of rotating video by 90/270 degree on Pi3
there are also examples of this in the unicam driver
https://github.com/raspberrypi/linux/bl ... 2612-L2616
for similar reasons, the arm is idle when doing camera stuff, so the cpufreq thinks a slow clock is good enough
but then the slow core clock causes issues with the camera interface, and this chunk of code boosts it up
it also affects the pi4
as with the pi0-pi3, the VPU only has a 30bit addr space, with the top 2 bits controlling the cache flags
so it can only access the lower 1gig of ram
everything beyond that is only accessible to the arm and a few 64bit capable bus masters (v3d, one dma channel, and other things)
it's just far less of a problem on the pi0-pi3, because they physically can't have more than 1gig, so there is no "high mem" to deal with
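as a tiny illustration of that limit, here is a check for whether a buffer lies entirely inside the range the VPU can reach. the name `vpu_can_reach` is made up for this example, and it assumes the physical address is the plain offset into RAM, without any alias bits set.

```python
VPU_WINDOW = 1 << 30  # 30-bit address space = lower 1gig of ram


def vpu_can_reach(phys_addr, size):
    """True if [phys_addr, phys_addr + size) lies fully within the lower 1gig."""
    return phys_addr >= 0 and size > 0 and phys_addr + size <= VPU_WINDOW


# one 1080x1920 YUV420 frame (1.5 bytes/pixel)
frame_bytes = 1080 * 1920 * 3 // 2
print(vpu_can_reach(0x2E000000, frame_bytes))  # True
print(vpu_can_reach(0x40000000, frame_bytes))  # False
```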
i believe that's what i noticed when doing mis-aligned reads where the stride was aligned
it wasn't so much throwing out half the data as keeping it for the next beat, so the overall performance seemed the same
but if the stride is also mis-aligned, the extra half isn't aligned right, so it has to throw it out and restart everything, just wrecking performance