Does iouring do DMA? #1277
-
Forgive my naive terminology, I'm quite new to the topic:
I've worked with io_uring a little, but I've always wondered what optimizations are in place for DMA transfers, either from other hosts (RDMA) or from disk to memory. Will io_uring make use of available DMA capabilities if, for example, I ask the interface to read some disk data into a specific virtual memory location in userspace?
I hope this even makes sense.
Replies: 4 comments 5 replies
-
Not sure I follow the question at all... For normal read/write operations, whether DMA is used or not is entirely up to the subsystem that is hosting the IO. Let's say it's a read of a file on an NVMe device: yes, DMA is certainly used, as PIO would be very slow. But io_uring doesn't know or care, it simply asks "please put data from location X into memory Y". For normal buffered IO, Y is a page cache location, i.e. kernel memory. For O_DIRECT IO, Y is a userspace address, so the device generally DMAs directly into the userspace page.
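A minimal sketch of the O_DIRECT case described above, using plain `open()` and `pread()` so it stays self-contained. The helper names `dio_alloc` and `dio_read` and the 4096-byte alignment are assumptions for illustration; real alignment requirements depend on the device and filesystem.

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment; the real requirement depends on device and filesystem. */
#define DIO_ALIGN 4096

/* Allocate a buffer whose address is a multiple of DIO_ALIGN.
 * Returns NULL on failure; release with free(). */
void *dio_alloc(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, DIO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Read len bytes at offset off from path, bypassing the page cache.
 * len and off should be multiples of DIO_ALIGN.
 * Returns the number of bytes read, or -1 on error. */
ssize_t dio_read(const char *path, void *buf, size_t len, off_t off)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;   /* missing file, or a filesystem without O_DIRECT support */
    ssize_t n = pread(fd, buf, len, off);   /* buf is the userspace target "Y" */
    close(fd);
    return n;
}
```

The same file descriptor and aligned buffer could be passed to liburing's `io_uring_prep_read()` instead of `pread()`; nothing about the O_DIRECT setup is specific to io_uring.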
-
Thanks for your answer @axboe. So am I understanding correctly that all I can do from the userspace side with the io_uring API is specify O_DIRECT and let the kernel decide whether DMA is possible or not? I thought there were mechanisms in the hands of the programmer to initiate DMA transfers directly. I'm just trying to figure out whether I would need to step down into the kernel somehow to enable DMA, or whether io_uring does that for me. Also, maybe I should mention that I'm very interested in the RDMA capabilities of io_uring as well. If you say that io_uring does not care about those details, who does? Will the kernel make sure that RDMA is used whenever possible (e.g. given a supported network card, bus, etc.)? This would be the best case for me. But maybe RDMA is a whole different can of worms?
To give some context: I would like to give our database the capability to retrieve data from a storage backend regardless of whether that backend is a local disk or another machine entirely. If I could use io_uring for both (DMA disk reads, RDMA reads from other network nodes), that would significantly simplify everything for me.
-
I'm still interested in this for our application. We really would like to enable RDMA for exchanging big binary data over Ethernet (RoCE) or InfiniBand, but it seems that, at least for InfiniBand, one needs to use a specialized library (libibverbs) to set up the connection. On the surface, though, it looks like something that liburing or its kernel-side implementation might be able to abstract away, so my question is: does it?
-
@MartyMcFlyInTheSky the kernel doesn't do any RDMA by itself. You need a library and special hardware with the capabilities and the driver to match it. Then, you have to set it up from userspace entirely.
Since RDMA requires a supporting network protocol (bare TCP doesn't work), io_uring can't do it. io_uring replicates the base kernel networking interface. The closest you can get is io_uring_prep_send_zc(), but that's still mimicking a normal send(), so the data won't automatically be placed in the user memory of the receiving end.
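To illustrate the send() semantics mentioned above, here is a small sketch using plain `send()` and `recv()` on a local `socketpair()` rather than io_uring, purely to keep it self-contained. The function name `two_sided_demo` is hypothetical. It shows that the receiver's buffer stays untouched until the receiver itself calls `recv()` and names that buffer as the destination.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send msg over a local socket pair and capture what the receiver's buffer
 * holds before and after the receiver calls recv().
 * before and after must each have room for len bytes (len <= 256).
 * Returns 0 on success, -1 on error. */
int two_sided_demo(const char *msg, size_t len, char *before, char *after)
{
    int sv[2];
    char rxbuf[256] = {0};            /* the receiver's user memory */

    if (len > sizeof(rxbuf) || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Sender side: the data is handed to the kernel, not to rxbuf. */
    if (send(sv[0], msg, len, 0) != (ssize_t)len)
        goto fail;
    memcpy(before, rxbuf, len);       /* snapshot: rxbuf is still all zeros */

    /* Receiver side: only this call names rxbuf as the destination. */
    if (recv(sv[1], rxbuf, len, MSG_WAITALL) != (ssize_t)len)
        goto fail;
    memcpy(after, rxbuf, len);        /* snapshot: rxbuf now holds msg */

    close(sv[0]);
    close(sv[1]);
    return 0;
fail:
    close(sv[0]);
    close(sv[1]);
    return -1;
}
```

A zero-copy send removes a copy on the sending side only; the receiving application still has to post its own receive, which is the difference from a one-sided RDMA write.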
-
Thanks for enlightening me, I was really hoping to get an answer. So basically, from a practical perspective, that means I could do "normal" networking and local disk IO using io_uring, but for RDMA I would have to configure my own stack? By any chance, do you know where I can read more about that? There must be documentation somewhere on how to set this up from userspace, or libraries that do that for me.
Of course it would be amazing if io_uring also served as an abstraction over that, but I guess that's a bit much to ask.
-
Learning about RDMA is still a work in progress on my side, and there's also the need for hardware that supports it (which I have in my homelab, but not configured yet, lol).
I would start by reading Nvidia's and Mellanox's (acquired by Nvidia) documentation, mostly around InfiniBand. It seems like Nvidia is the only real player in the InfiniBand world right now, but IB has been widely deployed in supercomputers and datacenters; it's older and in general better documented.
Then you could look into iWARP and RoCEv2 (you probably want IB or RoCEv2). They are less well documented but should be more widely supported, as they run over standard transports (iWARP over TCP, RoCEv2 over UDP).
To be honest, this is one of those topics I have pending for a deep dive; the idea of sending data straight into another node's memory seems so cool to me.
Btw, in my original comment, I forgot to mention that the kernel does have some notion of RDMA (at least for IB, called the verbs API). So you need a kernel that is compiled with that enabled, although the standard server distros should have it.
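One rough way to check whether the running kernel exposes any RDMA devices is to look at sysfs. The sketch below assumes the conventional `/sys/class/infiniband` location, where the RDMA core lists devices once it is loaded; the helper name `count_class_devices` is hypothetical.

```c
#include <dirent.h>
#include <stddef.h>

/* Count entries in a directory, skipping names that start with a dot.
 * Returns -1 if the directory cannot be opened, which for a sysfs class
 * directory usually means the subsystem is not built in or not loaded. */
int count_class_devices(const char *class_dir)
{
    DIR *d = opendir(class_dir);
    if (d == NULL)
        return -1;

    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                 /* skip ".", ".." and hidden entries */
        n++;
    }
    closedir(d);
    return n;
}
```

Calling `count_class_devices("/sys/class/infiniband")` returning -1 or 0 would suggest that no RDMA-capable device is currently registered with the kernel.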
-
Note that libibverbs already has submission and completion queues inside the queue pair, so it's spiritually already similar to io_uring.
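The shared pattern can be sketched as a toy model: one ring the application fills with requests, and one ring it drains for completions. This is not the API of either library; every name below is hypothetical. In io_uring the two rings hold SQEs and CQEs, while in libibverbs the application posts with calls such as `ibv_post_send()` and reaps with `ibv_poll_cq()`.

```c
#include <stddef.h>
#include <stdint.h>

#define QDEPTH 8   /* hypothetical queue depth; must be a power of two */

struct work { uint64_t id; int op; };       /* a posted request */
struct done { uint64_t id; int status; };   /* its completion */

struct toy_queues {
    struct work sq[QDEPTH]; unsigned sq_head, sq_tail;  /* submission ring */
    struct done cq[QDEPTH]; unsigned cq_head, cq_tail;  /* completion ring */
};

/* Application side: post a request. Returns 0, or -1 if the ring is full. */
int toy_post(struct toy_queues *q, uint64_t id, int op)
{
    if (q->sq_tail - q->sq_head == QDEPTH)
        return -1;
    q->sq[q->sq_tail++ % QDEPTH] = (struct work){ id, op };
    return 0;
}

/* Stand-in for the kernel or NIC: consume one request and produce its
 * completion. Returns 1 if a request was processed, 0 otherwise. */
int toy_process_one(struct toy_queues *q)
{
    if (q->sq_head == q->sq_tail)
        return 0;                                  /* nothing posted */
    if (q->cq_tail - q->cq_head == QDEPTH)
        return 0;                                  /* completion ring full */
    struct work w = q->sq[q->sq_head++ % QDEPTH];
    q->cq[q->cq_tail++ % QDEPTH] = (struct done){ w.id, 0 };
    return 1;
}

/* Application side: poll for a completion. Returns 1 if one was read. */
int toy_poll(struct toy_queues *q, struct done *out)
{
    if (q->cq_head == q->cq_tail)
        return 0;
    *out = q->cq[q->cq_head++ % QDEPTH];
    return 1;
}
```

In both real interfaces the consumer of the submission ring is the kernel or the hardware rather than a function the application calls, but the post-then-poll flow is the same.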
-
Is there a possibility that io_uring could swallow libibverbs? I don't see the need for two players in the same area. It would also make everything a lot easier and more manageable. Since RoCEv2 runs over UDP (and iWARP over TCP), this should also be technically possible, no? Maybe @axboe could comment on the current direction of io_uring. Could we expect something like this in the next few years?
-
@MartyMcFlyInTheSky, it's not likely, for a bunch of reasons. You can try out io_uring zcrx, which David and I merged several months ago. And there are other host-memory-avoidance use cases that are already WIP.