UDP GSO support #135
Conversation
Thank you -- this is interesting!
joho-lab commented on Sep 6, 2020
How can I add a new congestion algorithm in lsquic?
How can I add a new congestion algorithm in lsquic?
This question isn't really relevant to this PR; I suggest raising it as a separate, general issue. That said, from my experience with congestion-algorithm handling: lsquic decouples these algorithms nicely and provides clear entry points for congestion events. Check the existing Cubic and BBR implementations and it should be easy to figure out how to add a new algorithm.
When two or more connections send packets, the lsquic engine interleaves the packets. In that scenario, you'd end up sending single packets most of the time. I believe a better approach is to check whether all the specs match in sport_packets_out() and, if they don't, use the non-GSO sending method.
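For illustration, the suggested approach could look roughly like this inside the test apps' sport_packets_out() callback. This is only a sketch: specs_all_match, send_with_gso, and send_without_gso are hypothetical helper names, not code from this PR.

```c
/* Hypothetical sketch: inspect the specs up front and pick the sending
 * path once.  Helper names below are illustrative, not from this PR. */
static int
specs_all_match (const struct lsquic_out_spec *specs, unsigned count)
{
    for (unsigned i = 1; i < count; ++i)
        /* Real matching would also compare addresses and packet sizes,
         * as this PR's match_spec() does. */
        if (specs[i].peer_ctx != specs[0].peer_ctx
                || specs[i].ecn != specs[0].ecn)
            return 0;
    return 1;
}

static int
sport_packets_out (void *ctx, const struct lsquic_out_spec *specs,
                                                        unsigned count)
{
    if (count > 1 && specs_all_match(specs, count))
        return send_with_gso(ctx, specs, count);    /* one sendmsg + UDP_SEGMENT */
    else
        return send_without_gso(ctx, specs, count); /* existing per-packet path */
}
```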
Yep, I was aware that two connections' packets could be interleaved in the same call. The match_spec(&newspec, &specs[i]) check takes care of identifying that. If the spec differs, batching stops and whatever batch has accumulated so far is sent. Does that make sense?
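In rough terms, the batching behavior described here works like the following simplified sketch; send_gso stands in for the PR's GSO send path, and match_spec's real arguments may differ from what is shown.

```c
/* Simplified sketch of the described batching, not the PR's actual code:
 * walk the spec array, keep batching while consecutive specs match, and
 * flush the current batch with one GSO send as soon as a spec differs.
 * send_gso() and match_spec() stand in for this PR's helpers. */
static int
send_batched_gso (void *ctx, const struct lsquic_out_spec *specs,
                                                        unsigned count)
{
    unsigned start = 0, i;

    for (i = 1; i <= count; ++i)
        if (i == count || !match_spec(&specs[start], &specs[i]))
        {
            /* Flush specs[start .. i-1] as a single GSO burst: */
            if (send_gso(ctx, &specs[start], i - start) < 0)
                return start > 0 ? (int) start : -1;   /* partial send */
            start = i;
        }
    return (int) count;
}
```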
Yes, it makes sense. But I wanted to emphasize that with more than one connection, the specs are likely to never match, so given N packets to send, we'd send N batches of one. That's why I suggested performing the check earlier and using the result to pick the sending function: GSO or non-GSO.
Do I read the benchmarking results correctly that the "sendmmsg" approach reduces CPU usage more than GSO?
with user-space lsquic pacing disabled
This is interesting. Did you do this to get better results?
Do I read the benchmarking results correctly that the "sendmmsg" approach reduces CPU usage more than GSO?
Yes, in this context sendmmsg was actually more efficient than GSO. This may be because I could not achieve good batching with the default HTTP file-transfer example. Batching improves a lot when the application flushes the stream after writing multiple pieces of data. I wrote a different app that does multiple stream writes followed by a flush, and there GSO performs much better than sendmmsg, but I didn't include that data since that sample app isn't public.
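For illustration, the write pattern being described maps onto the lsquic stream API roughly like this. This is only a sketch; the chunk type and helper are placeholders, not code from either app.

```c
#include <stddef.h>
#include "lsquic.h"

/* Placeholder chunk type, for illustration only. */
struct chunk { const void *data; size_t len; };

/* Write several pieces of a response and flush once at the end, so the
 * engine can build several full packets back-to-back and the packets-out
 * callback receives a burst it can coalesce into a single GSO send. */
static void
write_response (lsquic_stream_t *stream, const struct chunk *chunks,
                unsigned n_chunks)
{
    for (unsigned i = 0; i < n_chunks; ++i)
        lsquic_stream_write(stream, chunks[i].data, chunks[i].len);
    lsquic_stream_flush(stream);    /* one flush after several writes */
}
```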
with user-space lsquic pacing disabled
This is interesting. Did you do this to get better results?
There were two reasons. First, lsquic was not able to fully utilize the bandwidth with pacing enabled (with or without GSO); with pacing disabled, utilization improved considerably. Second, I found that with pacing enabled the iovec batches lsquic produced were also smaller. I never got around to properly debugging these points or locating the root cause.
I wrote a different app that does multiple stream write and then a flush and the performance of GSO is much better than sendmmsg
I wonder why this would be. I'll have to think about this.
but I didn't showcase this data since I didn't have that sample app in public.
Does it mean you have at least one app in public (with source available)? I'd be curious to see how others use lsquic.
lsquic was not able to fully utilize the bandwidth with pacing enabled
What were the path characteristics? Did you use BBR or Cubic?
but I didn't showcase this data since I didn't have that sample app in public.
Does it mean you have at least one app in public (with source available)? I'd be curious to see how others use lsquic.
The public app I used is the same HTTP client/server app that lsquic has.
lsquic was not able to fully utilize the bandwidth with pacing enabled
What were the path characteristics? Did you use BBR or Cubic?
I used both BBR and Cubic. Performance was slightly worse under no-loss conditions, but with >= 1% loss it was much worse with lsquic than with Linux kernel TCP Cubic, which I used as the comparison. I wish I could have shared those numbers at the time, but I was waiting to analyze the root cause myself and never got there. I have since left that organization and no longer have access to those results.
Force-pushed from 1f39d53 to 7686d8f
Implementation highlights
* send_packets_out callback: packets of equal length are batched together; the last packet in a batch may be shorter.
* setup_ctl_msg is updated to set the UDP_SEGMENT CMSG header, with gso_size passed as pktlen.
* send_gso internally leverages send_packets_one_by_one() to send the batched iovecs with the GSO CMSG.
* Even if -O is specified, a runtime check verifies UDP GSO support, since the binaries may be moved to another system running a different kernel.
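For reference, attaching the UDP_SEGMENT ancillary data and probing the kernel for UDP GSO support (Linux 4.18+) look roughly like the following generic sketch. This is not the PR's setup_ctl_msg/send_gso code, just the underlying Linux mechanism it builds on.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/udp.h>        /* SOL_UDP; UDP_SEGMENT on newer headers */

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103          /* fallback for older libc headers */
#endif

/* Runtime probe: setting UDP_SEGMENT on the socket only succeeds on
 * kernels with UDP GSO support, so -O can fall back gracefully. */
static int
udp_gso_supported (int fd)
{
    int val = 0;
    return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &val, sizeof(val)) == 0;
}

/* Attach the per-call UDP_SEGMENT cmsg.  `gso_size` is the length of each
 * segment (the common packet length): one sendmsg() carrying several
 * packets' worth of iovecs is split by the kernel into datagrams of
 * `gso_size` bytes, the last of which may be shorter. */
static void
set_udp_segment_cmsg (struct msghdr *msg, void *cmsg_buf,
                      size_t cmsg_buf_sz, uint16_t gso_size)
{
    struct cmsghdr *cm;

    memset(cmsg_buf, 0, cmsg_buf_sz);
    msg->msg_control    = cmsg_buf;
    msg->msg_controllen = cmsg_buf_sz;   /* must cover CMSG_SPACE(2) */
    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type  = UDP_SEGMENT;
    cm->cmsg_len   = CMSG_LEN(sizeof(gso_size));
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));
    msg->msg_controllen = CMSG_SPACE(sizeof(gso_size));
}
```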
Preliminary Data
Scenario: transfer a 1 GB file locally using the http_client/http_server apps and use perf record to profile only the server that serves the file.
Note: a lot depends on how the sample application calls the stream_write()/stream_flush() APIs. However, without any changes to the sample HTTP client/server app I was able to get an aggregation of roughly ~6 packets. I tried GSO with burst sizes of 10 and 20 packets. One can check the aggregation efficiency by enabling debug logs and looking for LSQ_DEBUG("GSO with burst:%d", vcnt). Following is the perf record output and the diff:
[image: perf record output and diff]
I have my own application that simulates an RPC scenario with user-space lsquic pacing disabled and long responses ranging from 1.2 KB to 100 KB. With it I easily get batching/aggregation of 10, 20, or 30 packets and a much larger reduction in CPU usage.