-
I'm using liburing via the Java Foreign Function Interface to do random writes across a set of about 2200 files. I am seeing unexpectedly high CPU usage compared to fio and Java's FileChannel, and I was hoping someone could give me some pointers on where to look to find out why this happens.
Setup:
- ext4 on nvme
- 2211 files shared by all threads
- Each thread gets its own ring
- Flags used: IORING_SETUP_SINGLE_ISSUER, IORING_SETUP_COOP_TASKRUN, IORING_SETUP_DEFER_TASKRUN
- Buffered IO
- write size 512B to 16KB
- using fixed files
- kernel 6.14.0-27-generic
Scaling behavior:
My bindings scale well up to 4 threads, but each additional thread adds less performance than the previous one, and beyond 8 threads performance gets worse.
Using perf record -e 'lock:*' -g --call-graph=dwarf -F 997 -p 23523 I am seeing some contention:
That would explain the iostat output I am seeing. My bindings:
avg-cpu: %user %system %iowait %idle
5.63 81.40 7.00 5.97
vs. fio:
avg-cpu: %user %system %iowait %idle
0.00 0.34 99.66 0.00
Fio job:
[global]
ioengine=io_uring
rw=randwrite
direct=0
bs=512
size=512
runtime=60
time_based=1
group_reporting=0
randrepeat=0
norandommap=1
refill_buffers=0
iodepth=32
fixedbufs=0
hipri=0
registerfiles=1
sqthread_poll=0
[shared_files_job]
numjobs=8
nrfiles=400
filesize=10M
filename_format=testfile.$filenum
file_service_type=roundrobin
sqthread_poll=0
I understand Java bindings won't match fio's raw speed, but I expected the profile to be I/O-bound like fio's, not the opposite. In essence, each benchmark thread in Java does the following:
setup_ring(...);
loop until done {
submit N io_uring_prep_write() calls using a fixed file/buffer; // keep N tasks in flight
io_uring_submit();
io_uring_wait_cqe_nr(N);
}
I tried different versions of the previous example: submitting less or more often, peeking versus waiting for N CQEs, and different queue sizes, but nothing makes it perform more like fio. I also created a random-read version of this benchmark, and that one does not suffer from the scaling issue.
I'm looking for guidance on how to structure multithreaded buffered I/O with shared files in a way that avoids the CPU bottlenecks I'm seeing.
Any insight into how best to mitigate it would be appreciated.
Thanks
UPDATE:
Some perf top results:
Single threaded:
Overhead Shared Object Symbol
7.57% [kernel] [k] try_to_wake_up
6.14% [kernel] [k] io_init_req
5.72% [kernel] [k] io_issue_sqe
5.31% [kernel] [k] llist_reverse_order
5.30% [kernel] [k] _raw_spin_lock
4.56% [kernel] [k] __schedule
3.10% [kernel] [k] kfree
3.03% [kernel] [k] __pi_memset_generic
2.94% [kernel] [k] __slab_free
2.58% [kernel] [k] io_clean_op
6 threads:
Overhead Shared Object Symbol
10.56% [kernel] [k] io_init_req
6.32% [kernel] [k] io_clean_op
5.83% [kernel] [k] io_issue_sqe
3.57% [kernel] [k] io_prep_async_work
3.18% [kernel] [k] llist_reverse_order
3.13% [kernel] [k] __pi_memset_generic
3.11% libc.so.6 [.] _int_malloc
2.73% [kernel] [k] __pi_clear_page
2.18% [JIT] tid 298374 [.] 0x0000e559d4171db0
2.17% [kernel] [k] __slab_free
2.12% liburing-ffi.so.2.9 [.] io_uring_prep_write
-
Since you closed without further comments, anything else to add here? You have a lot of io-wq contention, which is most likely due to either the bigger write sizes or just the storage and file system used. You will probably see better performance if the threads share the io-wq backend, for example by using IORING_SETUP_ATTACH_WQ to set up each subsequent thread's ring with struct io_uring_params->wq_fd set to the first ring's fd.
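For illustration, a minimal sketch of that setup using the raw io_uring_setup syscall, so it compiles without liburing. The helper names ring_create and ring_create_attached are hypothetical, and error handling is reduced to returning -1 with errno set. With liburing, the same flags and wq_fd would be passed through io_uring_queue_init_params(), using the first ring's ring_fd. Whether sharing the backend actually helps depends on the kernel version and the workload.

```c
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create a ring with default settings.
 * Returns the ring fd, or -1 with errno set. */
static int ring_create(unsigned entries)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    return (int)syscall(__NR_io_uring_setup, entries, &p);
}

/* Hypothetical helper: create a ring that asks the kernel to share the
 * async worker backend of an already existing ring (first_ring_fd).
 * Returns the new ring fd, or -1 with errno set. */
static int ring_create_attached(unsigned entries, int first_ring_fd)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = (__u32)first_ring_fd;
    return (int)syscall(__NR_io_uring_setup, entries, &p);
}
```

A typical use would be to call ring_create() once, then call ring_create_attached() with that fd for every additional thread's ring. Other setup flags (such as the SINGLE_ISSUER and DEFER_TASKRUN flags from the original post) would be OR-ed into p.flags in the same place.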
-
Thanks, I will try to see if IORING_SETUP_ATTACH_WQ helps with multithreading. I tried writing 10 bytes instead of 4096, but that doesn't make a difference. The Java code I am comparing against uses the same files on the same storage and file system. Maybe Java does some trickery, but I think I can rule out those two variables(?).
The weird thing is that even with a single ring the CPU usage jumps from an idle 4% to 47% during the benchmark, while Java's FileChannel only goes up to 17% CPU. This only happens with writes. The read benchmarks (uring vs. FileChannel), which are almost the same as the write ones, do not show this behavior and stay within a few percent of each other.
I guess my question is what would be a good way to see what those worker threads are up to?
-
Tried running it with 20 threads, each with its own ring attached to a single ring using IORING_SETUP_ATTACH_WQ. The scores stay about the same, but with the flag it uses around 100 more threads: 2600 vs. 2500 (across multiple runs). I'm guessing it's the work itself causing the issue.
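For anyone wanting to reproduce thread counts like these, here is a rough sketch of one way to count io-wq workers from inside or outside the process. It assumes a Linux /proc filesystem and that the kernel names io-wq worker threads with an "iou-wrk" prefix, which is the case on recent kernels. The function name count_threads_with_prefix is made up for this example.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), for the usage shown below */

/* Count the threads of `pid` whose name (comm) starts with `prefix`.
 * Returns -1 if /proc/<pid>/task cannot be opened. */
static int count_threads_with_prefix(pid_t pid, const char *prefix)
{
    char task_dir[64];
    snprintf(task_dir, sizeof(task_dir), "/proc/%d/task", (int)pid);

    DIR *dir = opendir(task_dir);
    if (!dir)
        return -1;

    int count = 0;
    size_t prefix_len = strlen(prefix);
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                      /* skip "." and ".." */

        char comm_path[512];
        char comm[64];
        snprintf(comm_path, sizeof(comm_path), "%s/%s/comm",
                 task_dir, de->d_name);

        FILE *f = fopen(comm_path, "r");
        if (!f)
            continue;                      /* thread exited in the meantime */
        if (fgets(comm, sizeof(comm), f) &&
            strncmp(comm, prefix, prefix_len) == 0)
            count++;
        fclose(f);
    }
    closedir(dir);
    return count;
}
```

Calling count_threads_with_prefix(getpid(), "iou-wrk") periodically during the benchmark (or passing the JVM's pid from a separate tool) gives the number of io-wq workers at that moment, while an empty prefix counts all threads of the process.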
-
2500 or 2600 threads is literally insane! That would be a huge efficiency issue, it's way too many threads for driving any kind of IO. What sq/cq depths are you setting up the ring with?
-
I ran the benchmark with different number of threads and queue depths to see how many worker threads it creates (using