-
Notifications
You must be signed in to change notification settings - Fork 129
af_xdp: Allow running onload as non-root #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
zhuyifei1999
commented
Mar 13, 2025
I looked more at the security aspect of this. There are two concerns I can think of
- AF_XDP rings gives non-root full access to a TX/RX ring (especially in zerocopy mode), allowing the user to masquarade as any sender, essentially CAP_NET_RAW. However, in sfc mode onload already provides CAP_NET_RAW on various code paths via oo_cplane_empower_cap_net_raw, so I don't think this is breaking an existing security boundary.
- But somewhat more concerning is af_xdp.c's use of sys_call_area, especially for the XDP_MMAP_OFFSETS syscall. The syscall provides an memory offset, and if unchecked, gives path to a OOB memory access if userspace manages to race writing to that area before it is unmapped. I thought of two ways around this:
- Invoke the syscall though a "usermode driver" embedded to the kernel module. This incurs somewhat of an overhead with all the IPCs and actually starting the process.
- Embed the offsets in the module with a couple
offsetof()calls. The offsets are literally a bunch of struct offsets. Considering a module compile for one kernel isn't supposed to work on a different kernel anyways, there's not much reason onload absolutely have to use the getsockopt syscall to obtain the offsets.
Wdyt?
zhuyifei1999
commented
Apr 22, 2025
Embed the offsets in the module with a couple offsetof() calls.
It looks like struct xdp_rxtx_ring is in net/xdp/xsk_queue.h which is not exported. Grabbing the offset from UMH or BTF is complex, so rather than that, I'll grab it in af_xdp_nic_init_hardware which is trusted to be root.
ligallag-amd
commented
Apr 24, 2025
AF_XDP rings gives non-root full access to a TX/RX ring (especially in zerocopy mode), allowing the user to masquarade as any sender, essentially CAP_NET_RAW. However, in sfc mode onload already provides CAP_NET_RAW on various code paths via oo_cplane_empower_cap_net_raw, so I don't think this is breaking an existing security boundary.
This seems fine to me.
But somewhat more concerning is af_xdp.c's use of sys_call_area, especially for the XDP_MMAP_OFFSETS syscall. The syscall provides an memory offset, and if unchecked, gives path to a OOB memory access if userspace manages to race writing to that area before it is unmapped. I thought of two ways around this:
I don't really understand the issue here - is the problem that an onloaded userspace app can make this syscall while privileges are breifly raised for the task. Why change in privileges cause this issue? Anyway, it sounds like you've found a solution to this and will update the PR accordingly.
There's a TODO saying how it fails when the process does not have CAP_NET_RAW. This is true, so to workaround that, swap to a temporary global-root cred from `prepare_kernel_cred(&init_task)`. While in theory one could create a less-privileged cred via `prepare_creds()` and individually raise the required capability bits, just like in oo_cplane_empower_cap_net_raw, unfortunately xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires CAP_IPC_LOCK in the init user ns, and escalating to init user ns while keeping capability bits sounds meaningless to me. The netns of the process is backed by the nsproxy and not the cred so this should not affect netns. The security concern of XDP_MMAP_OFFSETS being modified by untrusted userspace is addressed in the follow-up patch Signed-off-by: YiFei Zhu <zhuyifei@google.com>
...dware getsockopt writes to untrusted user memory, and if this is performed during stack initialization in af_xdp_init by an adversarial user, it's possible for them to race the memory being unmapped causing potential OOB in the kernel. On the other hand, af_xdp_nic_init_hardware may only be called by root. The getsockopt may be performed earlier here. A temporary socket is created to perform the getsockopt. Since the mmap offsets are just struct offsets, the offsets will not change during runtime. Signed-off-by: YiFei Zhu <zhuyifei@google.com>
c8c5e35 to
8bc06a6
Compare
zhuyifei1999
commented
Apr 24, 2025
I don't really understand the issue here - is the problem that an onloaded userspace app can make this syscall while privileges are breifly raised for the task. Why change in privileges cause this issue?
Looking at the code, specifically these two comments:
Lines 47 to 58 in 62861db
Lines 618 to 625 in 62861db
The assumption is that non-root isn't allowed to create af_xdp onload stacks, so userspace racing and overwriting mmap_offsets before it's unmapped isn't too bad, "because they are already root". However, if non-root is allowed to create af_xdp onload stacks, like what this patch does, it's much worse because now unprivileged users might break the kernel.
Anyway, it sounds like you've found a solution to this and will update the PR accordingly.
Yep, just pushed. Sorry for the delay.
@ligallag-amd
ligallag-amd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me - I like what you've done in the second commit. Could you please explain the following for me? I'm fairly unfamiliar with this area. Why does having the capability for CAP_IPC_LOCK mean we'd may as well raise the task capability further than strictly necessary?
"unfortunately xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires CAP_IPC_LOCK in the init user ns, and escalating to init user ns while keeping capability bits sounds meaningless to me."
Also, may I ask what sort of testing you've done?
zhuyifei1999
commented
Apr 28, 2025
Why does having the capability for CAP_IPC_LOCK mean we'd may as well raise the task capability further than strictly necessary?
The difference between the CAP_NET_RAW check and CAP_IPC_LOCK is the namespace in which the check is performed against.
The CAP_NET_RAW checks against the user namespace owning the network namespace: https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/net/xdp/xsk.c#L1704-L1711
While the CAP_IPC_LOCK checks against the initial user namespace (initial user namespace being the namespace that's outside any containers, i.e. "global root", root of the machine rather than root of a container): https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/net/xdp/xdp_umem.c#L128-L133
The check here is not completely fatal, but rather constrains the umem size to the RLIMIT_MEMLOCK (max locked memory) rlimit. On a Rocky Linux 9 machine this was tested on, the max locked memory by default is hard limited to 8MiB:
$ ulimit -aH | grep locked
max locked memory (kbytes, -l) 8192
.. which was insufficient.
So if onload wants to reliably run in af_xdp mode across distros, it might want to raise the rlimit, but to do that, it requires another capability, CAP_SYS_RESOURCE, in the initial user namespace, https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/kernel/sys.c#L1492-L1498
... which makes it redundant to just having CAP_IPC_LOCK to ignore the rlimit rather than having CAP_SYS_RESOURCE and raising the limit.
Now why checking against initial user namespace is so problematic? It's because of the interaction between capabilities and user namespaces. To quote the man page https://man7.org/linux/man-pages/man7/user_namespaces.7.html
On the other hand, that process has no capabilities in the parent (in the case of clone(2)) or previous (in the case of unshare(2)) and setns(2)) user namespace, even if the new namespace is created or joined by the root user (i.e., a process with user ID 0 in the root namespace).
This means that if onload, running in a container with userns, were to set CAP_IPC_LOCK on itself, it would not work, because the initial user namespace is a parent and it has no capabilities in a parent user namespace. So if we'd want only strictly necessary capabilities, it would still mean, somehow jump to the initial user namespace, setting CAP_IPC_LOCK on itself. But then, to quote the man page again:
If a process has a capability in a user namespace, then it has that capability in all child (and further removed descendant) namespaces as well.
It means it gains that capability across all namespaces (because the initial user ns is ancestor of everything else). And not to mention, there's no existing implementation of "jumping to initial user namespace with only certain capability bits set" in the kernel. I would want to empty the effective capability bits before applying bits like CAP_NET_RAW and CAP_IPC_LOCK. I'd need to implement "temporary switching user namespace but keep everything else", which... the closest thing in the kernel would be the setns syscall:
... which, is not some exported kernel functions where I can say "ok ignore permissions and just do it" (normally you're not allowed return to an ancestor user namespace), but rather full of implementation details that I'm definitely uncomfortable messing around in, especially in an out-of-tree module.
Considering that the alternative is complex and would already give more permissions than strictly necessary (i.e. CAP_NET_RAW CAP_IPC_LOCK across all user namespaces), I just don't think it makes sense to pursue that direction instead of just using prepare_kernel_cred / override_creds. I could go and lower unneeded capability bits from the cred returned by prepare_kernel_cred, but tbh I don't see the value in that, given that nothing outside this function should ever be able to utilize that cred. It's functionally the same as "a root daemon created that fd and passed to this process with SCM_RIGHTS".
Also, may I ask what sort of testing you've done?
This was tested a while back without the second patch so I don't fully remember, but IIRC we tested latency and WODA. We observed back packet losses to sshd (non-onload-accelerated), which I found to be related to flow steering not being fully functional and the data path of onload -> kernel was failing due to inject_kernel_gid restrictions. It was workarounded with echo -1 > /sys/module/onload/parameters/inject_kernel_gid. IIRC after that everything worked as expected.
With the second patch I only did basic functionality testing with netcat.
ligallag-amd
commented
Apr 28, 2025
Thanks for your detailed explanation YiFei - I've merged this internally and it should sync to our public repo.
There's a TODO saying how it fails when the process does not have CAP_NET_RAW. This is true, so to workaround that, swap to a temporary global-root cred from
prepare_kernel_cred(&init_task).While in theory one could create a less-privileged cred via
prepare_creds()and individually raise the required capability bits, just like in oo_cplane_empower_cap_net_raw, unfortunately xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires CAP_IPC_LOCK in the init user ns, and escalating to init user ns while keeping capability bits sounds meaningless to me.The netns of the process is backed by the nsproxy and not the cred so this should not affect netns. Though I'm not sure about the security implications of this, considering onload is able to run rootless in non-AF_XDP mode already.