High CPU usage with Java FFI + liburing (vs fio and FileChannel) #1449

Unanswered
davidtos asked this question in Q&A

I'm using liburing via Java's Foreign Function Interface (FFI) to do random writes across a set of about 2200 files. I am seeing unexpectedly high CPU usage compared to fio and Java's FileChannel, and I was hoping someone could give me some pointers on where to look to find out why this happens.

Setup:

  • ext4 on nvme
  • 2211 files shared by all threads
  • Each thread gets its own ring
  • Flags used: IORING_SETUP_SINGLE_ISSUER, IORING_SETUP_COOP_TASKRUN, IORING_SETUP_DEFER_TASKRUN
  • Buffered I/O
  • write size 512B to 16KB
  • using fixed files
  • kernel 6.14.0-27-generic

Scaling behavior:
My bindings scale well up to 4 threads, but each additional thread adds less performance than the previous one, and beyond 8 threads performance gets worse.

Using perf record -e 'lock:*' -g --call-graph=dwarf -F 997 -p 23523, I am seeing some contention:

[image: perf output showing the lock contention]

This partly explains the iostat output I am seeing. With my bindings:

avg-cpu: %user %system %iowait %idle
 5.63 81.40 7.00 5.97

With fio:

avg-cpu: %user %system %iowait %idle
 0.00 0.34 99.66 0.00

Fio job:

[global]
ioengine=io_uring
rw=randwrite
direct=0
bs=512
size=512
runtime=60
time_based=1
group_reporting=0
randrepeat=0
norandommap=1
refill_buffers=0
iodepth=32
fixedbufs=0
hipri=0
registerfiles=1
sqthread_poll=0
[shared_files_job]
numjobs=8
nrfiles=400
filesize=10M
filename_format=testfile.$filenum
file_service_type=roundrobin
sqthread_poll=0

I understand Java bindings won't match fio's raw speed, but I expected the profile to be more I/O-bound like fio's, not the opposite. In essence, each benchmark thread in Java does the following:

setup_ring(...);
loop until done {
 submit N io_uring_prep_write() calls using a fixed file/buffer; // keep N tasks in flight
 io_uring_submit();
 io_uring_wait_cqe_nr(N);
}

I tried different variants of this loop: submitting less or more often, peeking versus waiting for N CQEs, and different queue sizes, but nothing makes it behave more like fio. I also created a random-read version of the benchmark, and that one does not suffer from the scaling issue.

I’m looking for guidance on how to structure multithreaded buffered I/O with shared files in a way that avoids the CPU bottlenecks I’m seeing.

Any insight into how best to mitigate it would be appreciated.

Thanks


UPDATE:

Some perf top results:

Single threaded:

Overhead Shared Object Symbol 
 7.57% [kernel] [k] try_to_wake_up 
 6.14% [kernel] [k] io_init_req 
 5.72% [kernel] [k] io_issue_sqe 
 5.31% [kernel] [k] llist_reverse_order 
 5.30% [kernel] [k] _raw_spin_lock 
 4.56% [kernel] [k] __schedule 
 3.10% [kernel] [k] kfree 
 3.03% [kernel] [k] __pi_memset_generic 
 2.94% [kernel] [k] __slab_free 
 2.58% [kernel] [k] io_clean_op 

6 threads:

Overhead Shared Object Symbol 
 10.56% [kernel] [k] io_init_req 
 6.32% [kernel] [k] io_clean_op 
 5.83% [kernel] [k] io_issue_sqe 
 3.57% [kernel] [k] io_prep_async_work 
 3.18% [kernel] [k] llist_reverse_order 
 3.13% [kernel] [k] __pi_memset_generic 
 3.11% libc.so.6 [.] _int_malloc 
 2.73% [kernel] [k] __pi_clear_page 
 2.18% [JIT] tid 298374 [.] 0x0000e559d4171db0 
 2.17% [kernel] [k] __slab_free 
 2.12% liburing-ffi.so.2.9 [.] io_uring_prep_write 

Replies: 3 comments 2 replies


Since you closed without further comments, is there anything else to add here? You have a lot of io-wq contention, which is most likely due to either the bigger write sizes or simply the storage and file system used. You will probably see better performance if the threads share the io-wq backend, for example by using IORING_SETUP_ATTACH_WQ to set up each subsequent thread's ring with struct io_uring_params->wq_fd set to the file descriptor of the first ring.
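To make the IORING_SETUP_ATTACH_WQ suggestion concrete, here is a minimal sketch of how the setup parameters for the second and later rings could be filled in. The helper name attach_wq_params is hypothetical and not part of liburing; it only relies on struct io_uring_params and IORING_SETUP_ATTACH_WQ from the kernel's <linux/io_uring.h>, and it assumes the first ring already exists.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper (not a liburing function): build the setup
 * parameters for a ring that should share the io-wq backend of an
 * already created ring.
 *
 * first_ring_fd: file descriptor of the first ring
 *                (ring.ring_fd when using liburing).
 * base_flags:    the IORING_SETUP_* flags the ring would use anyway.
 */
struct io_uring_params attach_wq_params(int first_ring_fd,
                                        unsigned int base_flags)
{
    struct io_uring_params p;

    assert(first_ring_fd >= 0);

    /* Start from all zeroes so fields we do not set stay unset. */
    memset(&p, 0, sizeof(p));

    /* Ask the kernel to reuse the worker backend of the ring in wq_fd. */
    p.flags = base_flags | IORING_SETUP_ATTACH_WQ;
    p.wq_fd = (unsigned int)first_ring_fd;
    return p;
}
```

With liburing, the result would then be passed to io_uring_queue_init_params(entries, &ring, &p) for every ring after the first one. Whether sharing the backend actually reduces contention for this workload is something only a measurement can show.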


Thanks, I will try IORING_SETUP_ATTACH_WQ to see if it helps with multithreading. I tried writing 10 bytes instead of 4096, but that doesn't make a difference. The Java code I am comparing against uses the same files on the same storage and file system. Maybe Java does some trickery, but I think I can rule out those two variables(?).

The weird thing is that even with a single ring, CPU usage jumps from 4% at idle to 47% during the benchmark, while Java's FileChannel only goes up to 17%. This only happens with writes. The read benchmarks (io_uring vs. FileChannel), which are almost identical to the write ones, do not show this behavior and stay within a few percent of each other.

I guess my question is what would be a good way to see what those worker threads are up to?
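As a first step, the worker threads can at least be counted from code. The sketch below assumes Linux with /proc mounted and assumes that io_uring kernel threads have names starting with "iou-" (for example "iou-wrk-<pid>"). The function names are hypothetical. It only reports how many such threads exist, not what they are doing.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a thread name (the contents of a comm file) starts
 * with "iou-", which is assumed to mark an io_uring kernel thread. */
int is_iou_thread(const char *comm)
{
    assert(comm != NULL);
    return strncmp(comm, "iou-", 4) == 0;
}

/* Counts the threads of process `pid` whose name starts with "iou-".
 * Returns -1 if /proc/<pid>/task cannot be opened. */
int count_iou_threads(int pid)
{
    char task_dir[64];
    snprintf(task_dir, sizeof(task_dir), "/proc/%d/task", pid);

    DIR *dir = opendir(task_dir);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue; /* skip "." and ".." */

        char comm_path[512];
        char comm[64] = "";
        snprintf(comm_path, sizeof(comm_path), "%s/%s/comm",
                 task_dir, ent->d_name);

        FILE *f = fopen(comm_path, "r");
        if (f == NULL)
            continue; /* the thread may have exited in the meantime */
        if (fgets(comm, sizeof(comm), f) != NULL && is_iou_thread(comm))
            count++;
        fclose(f);
    }
    closedir(dir);
    return count;
}
```

Calling count_iou_threads(pid) periodically while the benchmark runs would show whether the number of workers keeps growing or levels off.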


I tried running it with 20 threads, each with its own ring attached to a single ring using IORING_SETUP_ATTACH_WQ. The scores stay about the same, but with the flag it uses around 100 more threads: 2600 vs. 2500 (across multiple runs). I'm guessing it's the work itself that causes the issue.


2500 or 2600 threads is literally insane! That would be a huge efficiency issue, it's way too many threads for driving any kind of IO. What sq/cq depths are you setting up the ring with?


I ran the benchmark with different numbers of threads and queue depths to see how many worker threads it creates (using ps -o pid,tid,comm -p 107901 -L | grep iou -c). It is running on a 16C/32T machine. Seeing your reaction, I guess anything more than 100 threads is a bit excessive. I tried io_uring_register_iowq_max_workers, but that just lowers the benchmark scores.

Threads Depth Workers
1 16 14
32 16 608
1 32 19
32 32 1111
1 64 30
32 64 2096
1 128 100
32 128 4096
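
One way to read the table is to divide the worker count by threads × depth, that is, by the total number of queue slots. The sketch below is only descriptive arithmetic on the numbers reported above, with hypothetical names, and it makes no claim about why the kernel creates this many workers.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the table: benchmark threads, queue depth, worker count. */
struct run {
    int threads;
    int depth;
    int workers;
};

/* Worker threads per queue slot: workers / (threads * depth). */
double workers_per_slot(struct run r)
{
    assert(r.threads > 0 && r.depth > 0);
    return (double)r.workers / ((double)r.threads * (double)r.depth);
}

/* Prints the ratio for every run reported in the table above. */
void print_ratios(void)
{
    static const struct run runs[] = {
        {1, 16, 14},  {32, 16, 608},  {1, 32, 19},   {32, 32, 1111},
        {1, 64, 30},  {32, 64, 2096}, {1, 128, 100}, {32, 128, 4096},
    };
    size_t n = sizeof(runs) / sizeof(runs[0]);

    for (size_t i = 0; i < n; i++) {
        printf("threads=%2d depth=%3d workers=%4d ratio=%.3f\n",
               runs[i].threads, runs[i].depth, runs[i].workers,
               workers_per_slot(runs[i]));
    }
}
```

For the 32-thread runs the ratio stays between 1.0 and about 1.19, so there is roughly one worker thread per queue slot. The single-threaded runs stay below 1.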