Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

af_xdp: Allow running onload as non-root #265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zhuyifei1999 wants to merge 2 commits into Xilinx-CNS:master from zhuyifei1999:rootless-afxdp

Conversation

@zhuyifei1999
Copy link
Contributor

@zhuyifei1999 zhuyifei1999 commented Feb 19, 2025

There's a TODO saying how it fails when the process does not have CAP_NET_RAW. This is true, so to workaround that, swap to a temporary global-root cred from prepare_kernel_cred(&init_task).

While in theory one could create a less-privileged cred via prepare_creds() and individually raise the required capability bits, just like in oo_cplane_empower_cap_net_raw, unfortunately xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires CAP_IPC_LOCK in the init user ns, and escalating to init user ns while keeping capability bits sounds meaningless to me.

The netns of the process is backed by the nsproxy and not the cred so this should not affect netns. Though I'm not sure about the security implications of this, considering onload is able to run rootless in non-AF_XDP mode already.

@zhuyifei1999 zhuyifei1999 requested a review from a team as a code owner February 19, 2025 02:40
Copy link
Contributor Author

I looked more at the security aspect of this. There are two concerns I can think of

  • AF_XDP rings gives non-root full access to a TX/RX ring (especially in zerocopy mode), allowing the user to masquarade as any sender, essentially CAP_NET_RAW. However, in sfc mode onload already provides CAP_NET_RAW on various code paths via oo_cplane_empower_cap_net_raw, so I don't think this is breaking an existing security boundary.
  • But somewhat more concerning is af_xdp.c's use of sys_call_area, especially for the XDP_MMAP_OFFSETS syscall. The syscall provides an memory offset, and if unchecked, gives path to a OOB memory access if userspace manages to race writing to that area before it is unmapped. I thought of two ways around this:
    • Invoke the syscall though a "usermode driver" embedded to the kernel module. This incurs somewhat of an overhead with all the IPCs and actually starting the process.
    • Embed the offsets in the module with a couple offsetof() calls. The offsets are literally a bunch of struct offsets. Considering a module compile for one kernel isn't supposed to work on a different kernel anyways, there's not much reason onload absolutely have to use the getsockopt syscall to obtain the offsets.

Wdyt?

Copy link
Contributor Author

Embed the offsets in the module with a couple offsetof() calls.

It looks like struct xdp_rxtx_ring is in net/xdp/xsk_queue.h which is not exported. Grabbing the offset from UMH or BTF is complex, so rather than that, I'll grab it in af_xdp_nic_init_hardware which is trusted to be root.

Copy link
Contributor

AF_XDP rings gives non-root full access to a TX/RX ring (especially in zerocopy mode), allowing the user to masquarade as any sender, essentially CAP_NET_RAW. However, in sfc mode onload already provides CAP_NET_RAW on various code paths via oo_cplane_empower_cap_net_raw, so I don't think this is breaking an existing security boundary.

This seems fine to me.

But somewhat more concerning is af_xdp.c's use of sys_call_area, especially for the XDP_MMAP_OFFSETS syscall. The syscall provides an memory offset, and if unchecked, gives path to a OOB memory access if userspace manages to race writing to that area before it is unmapped. I thought of two ways around this:

I don't really understand the issue here - is the problem that an onloaded userspace app can make this syscall while privileges are breifly raised for the task. Why change in privileges cause this issue? Anyway, it sounds like you've found a solution to this and will update the PR accordingly.

There's a TODO saying how it fails when the process does not have
CAP_NET_RAW. This is true, so to workaround that, swap to a
temporary global-root cred from `prepare_kernel_cred(&init_task)`.
While in theory one could create a less-privileged cred via
`prepare_creds()` and individually raise the required capability
bits, just like in oo_cplane_empower_cap_net_raw, unfortunately
xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires
CAP_IPC_LOCK in the init user ns, and escalating to init user ns
while keeping capability bits sounds meaningless to me.
The netns of the process is backed by the nsproxy and not the cred
so this should not affect netns. The security concern of
XDP_MMAP_OFFSETS being modified by untrusted userspace is addressed
in the follow-up patch
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
...dware
getsockopt writes to untrusted user memory, and if this is performed
during stack initialization in af_xdp_init by an adversarial user,
it's possible for them to race the memory being unmapped causing
potential OOB in the kernel.
On the other hand, af_xdp_nic_init_hardware may only be called by root.
The getsockopt may be performed earlier here. A temporary socket is
created to perform the getsockopt. Since the mmap offsets are just
struct offsets, the offsets will not change during runtime.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Copy link
Contributor Author

I don't really understand the issue here - is the problem that an onloaded userspace app can make this syscall while privileges are breifly raised for the task. Why change in privileges cause this issue?

Looking at the code, specifically these two comments:

/* sys_call_area: a process-mapped area which can be used to perform
* system calls from a module.
*
* There is no attempt to prevent the process from tampering this data.
* This is why sys_call_area MUST NOT be used when executing a system call
* with escalated privileges. I.e. any system call which reads from or
* writes to this area MUST be subject to normal security checks by kernel.
*
* The area is always mapped read-write, because in both cases the data is
* written to the area (written by module and read by syscall or
* vise-versa). See also some comments about FOLL_WRITE in linux/mm/gup.c.
*/

/* Security consideration: mmap_offsets is located in untrusted user
* memory. I.e. the process can overwrite all this data.
* However this is the process which can create an AF_XDP Onload stack,
* so it runs with the root account, and it already can do
* anything bad: reboot, execute arbitrary code, etc.
*
* However we do our best: unmap the area from UL ASAP, before use.
*/

The assumption is that non-root isn't allowed to create af_xdp onload stacks, so userspace racing and overwriting mmap_offsets before it's unmapped isn't too bad, "because they are already root". However, if non-root is allowed to create af_xdp onload stacks, like what this patch does, it's much worse because now unprivileged users might break the kernel.

Anyway, it sounds like you've found a solution to this and will update the PR accordingly.

Yep, just pushed. Sorry for the delay.

Copy link
Contributor

@ligallag-amd ligallag-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - I like what you've done in the second commit. Could you please explain the following for me? I'm fairly unfamiliar with this area. Why does having the capability for CAP_IPC_LOCK mean we'd may as well raise the task capability further than strictly necessary?

"unfortunately xdp_umem_create -> xdp_umem_reg -> xdp_umem_account_pages requires CAP_IPC_LOCK in the init user ns, and escalating to init user ns while keeping capability bits sounds meaningless to me."

Also, may I ask what sort of testing you've done?

Copy link
Contributor Author

Why does having the capability for CAP_IPC_LOCK mean we'd may as well raise the task capability further than strictly necessary?

The difference between the CAP_NET_RAW check and CAP_IPC_LOCK is the namespace in which the check is performed against.

The CAP_NET_RAW checks against the user namespace owning the network namespace: https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/net/xdp/xsk.c#L1704-L1711

While the CAP_IPC_LOCK checks against the initial user namespace (initial user namespace being the namespace that's outside any containers, i.e. "global root", root of the machine rather than root of a container): https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/net/xdp/xdp_umem.c#L128-L133

The check here is not completely fatal, but rather constrains the umem size to the RLIMIT_MEMLOCK (max locked memory) rlimit. On a Rocky Linux 9 machine this was tested on, the max locked memory by default is hard limited to 8MiB:

$ ulimit -aH | grep locked
max locked memory (kbytes, -l) 8192

.. which was insufficient.

So if onload wants to reliably run in af_xdp mode across distros, it might want to raise the rlimit, but to do that, it requires another capability, CAP_SYS_RESOURCE, in the initial user namespace, https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/kernel/sys.c#L1492-L1498

... which makes it redundant to just having CAP_IPC_LOCK to ignore the rlimit rather than having CAP_SYS_RESOURCE and raising the limit.

Now why checking against initial user namespace is so problematic? It's because of the interaction between capabilities and user namespaces. To quote the man page https://man7.org/linux/man-pages/man7/user_namespaces.7.html

On the other hand, that process has no capabilities in the parent (in the case of clone(2)) or previous (in the case of unshare(2)) and setns(2)) user namespace, even if the new namespace is created or joined by the root user (i.e., a process with user ID 0 in the root namespace).

This means that if onload, running in a container with userns, were to set CAP_IPC_LOCK on itself, it would not work, because the initial user namespace is a parent and it has no capabilities in a parent user namespace. So if we'd want only strictly necessary capabilities, it would still mean, somehow jump to the initial user namespace, setting CAP_IPC_LOCK on itself. But then, to quote the man page again:

If a process has a capability in a user namespace, then it has that capability in all child (and further removed descendant) namespaces as well.

It means it gains that capability across all namespaces (because the initial user ns is ancestor of everything else). And not to mention, there's no existing implementation of "jumping to initial user namespace with only certain capability bits set" in the kernel. I would want to empty the effective capability bits before applying bits like CAP_NET_RAW and CAP_IPC_LOCK. I'd need to implement "temporary switching user namespace but keep everything else", which... the closest thing in the kernel would be the setns syscall:

https://github.com/torvalds/linux/blob/b4432656b36e5cc1d50a1f2dc15357543add530e/kernel/nsproxy.c#L569-L581

... which, is not some exported kernel functions where I can say "ok ignore permissions and just do it" (normally you're not allowed return to an ancestor user namespace), but rather full of implementation details that I'm definitely uncomfortable messing around in, especially in an out-of-tree module.

Considering that the alternative is complex and would already give more permissions than strictly necessary (i.e. CAP_NET_RAW CAP_IPC_LOCK across all user namespaces), I just don't think it makes sense to pursue that direction instead of just using prepare_kernel_cred / override_creds. I could go and lower unneeded capability bits from the cred returned by prepare_kernel_cred, but tbh I don't see the value in that, given that nothing outside this function should ever be able to utilize that cred. It's functionally the same as "a root daemon created that fd and passed to this process with SCM_RIGHTS".

Also, may I ask what sort of testing you've done?

This was tested a while back without the second patch so I don't fully remember, but IIRC we tested latency and WODA. We observed back packet losses to sshd (non-onload-accelerated), which I found to be related to flow steering not being fully functional and the data path of onload -> kernel was failing due to inject_kernel_gid restrictions. It was workarounded with echo -1 > /sys/module/onload/parameters/inject_kernel_gid. IIRC after that everything worked as expected.

With the second patch I only did basic functionality testing with netcat.

Copy link
Contributor

Thanks for your detailed explanation YiFei - I've merged this internally and it should sync to our public repo.

zhuyifei1999 reacted with heart emoji

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@ligallag-amd ligallag-amd ligallag-amd approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /