In this track we want to discuss topics related to bufferbloat and networking performance, including latency reduction, congestion control, queue management, and TCP algorithm enhancements.
When submitting a proposal for this track, use the prefix "lpc2012-net-" for the "Name" field.
Submissions to the micro conference:
The schedule of the 2012 networking and bufferbloat Plumbers Micro Conference will be posted...
Notes from the Networking MC, Linux Plumbers Conference 2012
CoDel and FQ CoDel
Eric Dumazet
- Queuing is needed and necessary
- What size of queue do we want?
- What about controlling delays?
- Idea is to control delay over a given interval
- Want a simple control loop
- CoDel works great; numbers with fq_codel are great
- Question: Should we replace pfifo_fast with CoDel?
- Noted that CoDel does not burn too much CPU, which shows we don't need to be stuck with FIFOs forever
- Question: Can we pull wireless complexity into the host? (Wireless devices implement complex, somewhat arcane packet scheduling on their own.)
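The "simple control loop" above can be sketched as follows: a minimal Python model of CoDel's dequeue-time decision, following the published algorithm. The 5 ms target, 100 ms interval, and state names are illustrative, not the kernel implementation.

```python
# Simplified sketch of CoDel's control loop (illustrative constants).
TARGET = 0.005     # 5 ms of acceptable standing-queue delay
INTERVAL = 0.100   # 100 ms sliding window

class CoDelState:
    def __init__(self):
        self.first_above_time = 0.0  # when sojourn first stayed above TARGET
        self.drop_next = 0.0         # time of the next scheduled drop
        self.count = 0               # drops in the current dropping cycle
        self.dropping = False

def should_drop(state, sojourn_time, now):
    """Decide whether to drop the packet being dequeued.
    sojourn_time: how long the packet sat in the queue (seconds)."""
    if sojourn_time < TARGET:
        # Delay is fine: leave dropping state, reset the window.
        state.first_above_time = 0.0
        state.dropping = False
        return False
    if state.first_above_time == 0.0:
        # Delay must stay above TARGET for a full INTERVAL before acting.
        state.first_above_time = now + INTERVAL
        return False
    if now < state.first_above_time:
        return False
    if not state.dropping:
        # Sustained delay: enter dropping state and drop this packet.
        state.dropping = True
        state.count = 1
        state.drop_next = now + INTERVAL
        return True
    if now >= state.drop_next:
        # Control law: shrink the drop interval as 1/sqrt(count).
        state.count += 1
        state.drop_next = now + INTERVAL / (state.count ** 0.5)
        return True
    return False
```

The key property is that it keys on how long packets waited, not how many are queued, which is why it needs no per-link tuning.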
Byte Queue Limits revisited
Tomas Hruby
- Motivation for BQL: limit queuing in the NIC
- Set the right limit in the NIC
- Based on bytes, not number of packets
- Dynamic and adaptive algorithm
- Completion events need to be periodic
- Two conditions to increase the limit, based on history
- When decreasing the limit, avoid hysteresis
- BQL performance is much better than the default configuration; HW priority is still better in some cases
- Issue: Some cliffs where we adjust down too much; slack algorithm overestimating
- Periodicity required
- Retiring when exiting NAPI may lead to an excessively long interval
- Limit based on load and timing
- Possibility of using NIC-generated timestamps to make the algorithm more precise
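A rough sketch of the kind of dynamic byte-limit adjustment described above. This is Python for illustration only; the function name and thresholds are assumptions, and the real kernel algorithm (lib/dynamic_queue_limits.c) tracks slack history and handles hysteresis far more carefully.

```python
def dql_adjust(limit, completed, inflight, starved,
               max_limit=64 * 1024, min_limit=0):
    """One adjustment step of a BQL-style dynamic byte limit (toy model).

    completed: bytes the NIC reported as transmitted this round.
    inflight:  bytes still queued to the NIC after completion processing.
    starved:   the NIC queue drained while the stack still had data,
               i.e. the limit was too small to keep the link busy.
    """
    if starved:
        # Grow the limit so the NIC does not run dry again.
        limit = min(limit + completed, max_limit)
    else:
        slack = limit - inflight  # headroom that went unused
        if slack > completed:
            # Persistently over-provisioned: shrink toward observed need,
            # but only by half the slack to damp oscillation.
            limit = max(limit - slack // 2, min_limit)
    return limit
```

The asymmetry (grow fast on starvation, shrink slowly on slack) is what keeps the queue just large enough to avoid starving the hardware.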
Data Direct I/O
John Ronciak
- Idea: place RX data directly in cache
- I/OAT had DCA, but that required specialized HW
- 10G Ethernet is really driving this
- Benefits are lower power consumption and lower latency
- Have run workloads with significant cache pressure
- Question: How well does this work with multiple sockets?
- Based on Non-Uniform I/O Architecture
- Going across sockets loses all the benefit (working out the details to deal with this)
- Don't use a user-space IRQ balancer
- No relationship to InfiniBand, but it is somewhat similar
Ethernet Audio/Video Bridging (AVB)
Eric Mann
- Proof-of-concept Audio/Video Bridging device using IEEE 802.1Qav
- Solution for streaming media
- Demo showing fundamental operation
- Audio latency is sound propagation (ears do echo cancellation at about 10 msecs)
- Synchronization between audio and video (time synchronization)
- Auto companies want to use AVB to reduce cost
- More complicated than synchronized multicast
- Stream reservation protocol (802.1Qat)
- 802.1Qav: Hardware-based traffic shapers (need to be very precise)
- 802.1AS: Time synchronization
- Streams (video, audio) separated into classes
- Hardware implements the transmission selection scheme
- Gap 1 in Linux: Want to say "transmit this packet at time X" (not just get a timestamp)
- Use cmsg for timed TX
- Gap 2 in Linux: ethtool options to display and configure AVB state
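The 802.1Qav traffic shapers mentioned above are credit-based. A toy model of the semantics, with hiCredit/loCredit clamping omitted and names chosen for illustration (this is not the hardware algorithm, just its basic behavior):

```python
def cbs_step(credit, dt, transmitting, idle_slope, send_slope):
    """One time-step of an 802.1Qav-style credit-based shaper (toy model).

    Slopes are in bits/sec; send_slope is negative, so credit drains
    while a frame from this class is on the wire and accrues while
    the class waits. Clamping to hiCredit/loCredit is omitted."""
    if transmitting:
        return credit + send_slope * dt
    return credit + idle_slope * dt

def may_transmit(credit, queue_nonempty):
    """A class may start a frame only with non-negative credit."""
    return queue_nonempty and credit >= 0
```

The effect is that a reserved stream class is paced to its reserved bandwidth (idle_slope) instead of bursting, which is what gives AVB its bounded latency.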
TCP Congestion manager
Yuchung Cheng
- Everybody is familiar with slow start
- nytimes.com: Downloading a page takes 53 connections, 195 requests, 2 MB
- TCP as used for the web is very different from bulk transfer: it is bursty
- Every TCP connection (re)measures the network
- Server-side congestion manager (some already implemented): cache certain parameters for connections
- A server-side approach behind a load balancer loses information
- Client IP might imply different paths; client IPs can change because of NAT
- Client could be the hub of its connections
- Great Snipe: client knows what's important, can do prioritization
- For instance, could set the initial cwnd as it wants
- Aggregate window division (macroflow_cwnd/active_flows)
- Congestion signals on the receiver; receiver can give the server a cwnd
- Implemented on Android, ~1300 LOC
- Two congestion control algorithms being experimented with
- Graphs showed that Great Snipe benefits from caching congestion state
- Great Snipe helped the nytimes download only a little bit. Why? The browser didn't use received data because it was waiting for more important things. Need to work with browsers on this.
- This is still experimental
- Middlebox interactions are a headache
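The aggregate window division noted above is a simple even split of the macroflow's congestion window across its active flows. A minimal sketch (the floor of one segment is an assumption for illustration):

```python
def subflow_cwnd(macroflow_cwnd, active_flows):
    """Split an aggregate (macroflow) cwnd evenly across active flows,
    never handing a flow less than one segment (assumed floor)."""
    assert active_flows > 0
    return max(macroflow_cwnd // active_flows, 1)
```

This is what lets many short web connections to the same client share one measurement of the path instead of each probing from scratch.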
TCP Loss Probe (TLP)
Nandita Dukkipati
- Talking about the 10% of connections that experience losses
- TLP: Convert RTOs into fast recovery
- Timeout recovery is 10-100x longer compared to fast recovery
- Tail losses are the majority
- Retransmit the last segment in 2 RTT to trigger SACK information and fast recovery
- Loss probe kicks in when the sender has nothing more to send
- Probe forces SACK and recovery
- TLP gives a 6% avg. latency reduction in HTTP experiments
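The probe timing described above might look roughly like this. The constants follow the later IETF TLP draft, so treat them as assumptions rather than the exact implementation presented:

```python
def tlp_timeout(srtt, rto, one_segment_outstanding,
                min_pto=0.010, delack=0.200):
    """Probe timeout (PTO) for a tail loss probe, in seconds (sketch).

    Fires at roughly 2*SRTT -- well before the RTO -- so a tail loss is
    repaired via SACK/fast recovery instead of a slow timeout. With only
    one segment in flight, leave room for a delayed ACK."""
    pto = max(2 * srtt, min_pto)
    if one_segment_outstanding:
        pto = max(pto, 1.5 * srtt + delack)
    return min(pto, rto)  # never fire later than the RTO would
```

The point of the min()/max() bounds is that the probe is cheap insurance: it should arrive early enough to beat the RTO but late enough not to duplicate data the receiver is about to ACK.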
TCP Forward Error Correction
Nandita Dukkipati
- Want zero-RTT loss recovery
- Need a significant impact on tail latency
- Want low cost and a simple design
- Components: encoding scheme, middlebox issues, FEC integration into TCP
- Encoding uses simple XOR-based encoding
- Use interleaved FEC to protect against back-to-back losses
- Done on MSS-sized blocks (not packet boundaries)
- Assuming at most 2 back-to-back losses in a train, this allows recovering over 60% of loss patterns
- Want the receiver to know when it saw a loss but recovered it; mechanism very similar to ECN
- Middleboxes: can potentially muck with fields and options, dislike not seeing all packets on a connection
- Experimenting with FEC; pursuing IETF standardization
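The simple XOR encoding over equal-sized blocks can be illustrated with a toy encoder/decoder: any single missing block in a group is recoverable from the parity block, with no retransmission round trip.

```python
def xor_parity(blocks):
    """Compute an XOR parity block over equal-sized data blocks
    (toy model of the MSS-block XOR code described in the talk)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Recover the single missing block (the None entry) by XORing
    the parity with every block that did arrive."""
    missing = bytearray(parity)
    for block in received:
        if block is not None:
            for i, byte in enumerate(block):
                missing[i] ^= byte
    return bytes(missing)
```

Interleaving (protecting blocks 0,2,4 with one parity and 1,3,5 with another) is what turns this one-loss-per-group code into protection against two back-to-back losses.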
Multipath TCP
Christoph Paasch
- Networks are becoming multi-path
- Wireless: increase performance by using WiFi and 3G at the same time
- TCP does not support this
- Data center: same issues; want to use multiple paths for load balancing
- Solution: Multipath TCP
- Works over the existing Internet
- Middleboxes block packets for connections they don't understand
- Need to use separate TCP connections
- MP_JOIN option
- Each subflow has its own cwnd and sequence number space
- Use a separate data-level sequence number space
- Additional sequence number is carried in TCP option space
- Subflows can be created/destroyed dynamically
- 10K LOC in the TCP stack
- Intercepts in common TCP functions
- Two-level hierarchy of sockets
- Challenge: sequence numbers increase the size of the skb
- Want to be transparent to userspace
- Question: How to handle socket options? Many functions would potentially be affected by changes
- Complex patch; needs to be split into smaller pieces
- Note: The congestion-sharing principle is common with the congestion manager; maybe some common elements
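The two sequence spaces mentioned above can be illustrated with a toy mapping, roughly in the spirit of MPTCP's data sequence mapping: each subflow numbers its own bytes, and a (data_seq, subflow_seq, length) tuple ties a subflow range back to the connection-level stream. The class and names are assumptions for illustration.

```python
class SubflowMap:
    """Toy data-sequence mapping: relates a contiguous range of one
    subflow's sequence space to the connection-level data sequence space."""

    def __init__(self, data_seq, subflow_seq, length):
        self.data_seq = data_seq        # connection-level start
        self.subflow_seq = subflow_seq  # subflow-level start
        self.length = length            # bytes covered by this mapping

    def to_data_seq(self, sseq):
        """Translate a subflow sequence number into the data-level
        sequence number, provided it falls inside this mapping."""
        assert self.subflow_seq <= sseq < self.subflow_seq + self.length
        return self.data_seq + (sseq - self.subflow_seq)
```

Because middleboxes may rewrite or window-check subflow sequence numbers, keeping them independent per subflow and carrying the data-level mapping in option space is what lets MPTCP survive on the existing Internet.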
Linux Traffic classification and Shaping
John Fastabend
- Talking about multiqueue
- Many queues available
- Possibly consolidate with open-vswitch code
- HW QoS is coming; SW QoS is not great
- No way to map flows to a qdisc
- Idea: map skb to qdisc, implement HW QoS for queues
- mqprio uses skb→priority to steer packets to queues
- Ability to map flows to queues, different qdiscs
- Map to HW queues
- Qdiscs that are multiqueue aware are much better
- Question: Do we need to map by more than skb→priority?
- Hard to use qdiscs with global state
- Question: How can we rectify this with XPS?
- Question: Can this kill select_queue?
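The skb→priority steering mqprio does can be sketched as a two-step mapping: priority selects a traffic class, and each class owns a contiguous range of hardware queues. Hashing within the class's range and all the names here are assumptions for illustration, not the kernel code.

```python
def select_queue(skb_priority, prio_tc_map, tc_queue_offset,
                 tc_queue_count, flow_hash):
    """mqprio-style queue selection (toy model).

    prio_tc_map:    16-entry table mapping priority -> traffic class
                    (mqprio masks priority to 4 bits).
    tc_queue_offset/tc_queue_count: each class's contiguous queue range.
    flow_hash:      spreads a class's flows across its queues (assumed)."""
    tc = prio_tc_map[skb_priority & 15]
    return tc_queue_offset[tc] + flow_hash % tc_queue_count[tc]
```

This keeps per-class state confined to per-queue qdiscs, which is exactly why qdiscs with global state are awkward in this model.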
Interface to hardware transmit rate limiting
Jesse Brandeburg
- Idea: put the rate limit into a sysfs variable
- One rate limit value per queue
- Do we need more hierarchy?
- Note: The use case is that applications/VMs are rate limited, not necessarily queues
- Question: Is putting more in sysfs the right approach?
- Question: Can we define classes which describe QoS properties, and then link queues to these classes?
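The semantics a per-queue transmit rate limit implies can be illustrated with a token bucket. This is a sketch of the behavior only, not the proposed sysfs interface or any particular hardware mechanism:

```python
class TokenBucket:
    """Toy token bucket: tokens (bytes) accrue at a fixed rate up to a
    burst cap; a packet may be sent only if enough tokens are available."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start full
        self.last = 0.0

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may be sent at `now`."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False
```

Hardware rate limiting does this per queue below the stack, which is what makes a simple per-queue knob attractive compared to a full software shaper.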
Harmonizing Multiqueue, Vmdq, virtio-net, macvtap with open-vswitch
Shyam Iyer
- Optimized traffic flow via e-switch
- Various use cases
- macvtap use cases that improve performance
- Devices only seem to be doing L2 in the embedded switch
- Question: How to do QoS in this?
- Reference: linux-kvm.org/page/Multiqueue
- Rate limit per queue; use rate limiting
- Open vSwitch used tc for tagging
- Different flow tables for tables
- Not all adapters are multiqueue aware
- macvtap is good because it simplifies the interface to the lower device
- Use case of having multiple MAC addresses in a VM
Proposal added by therbert@google.com