In this track we want to discuss topics related to bufferbloat and networking performance, including latency reduction, congestion control, queue management, and TCP algorithm enhancements.
When submitting a proposal for this track, use the prefix "lpc2012-net-" for the "Name" field.
Submissions to the micro conference:
The schedule of the 2012 networking and bufferbloat Plumbers Micro Conference will be posted...
Notes from the Networking MC, Linux Plumbers Conference 2012
CoDel and FQ CoDel
Eric Dumazet
- Queuing is needed and necessary
- What size of queue do we want?
- What about controlling delays?
- Idea is to control delay over a given interval
- Want a simple control loop
- CoDel works great; numbers with fq_codel are great
- Question: Should we replace pfifo_fast with CoDel?
- Noted that CoDel does not burn too much CPU, which shows we don't need to be stuck with FIFOs forever
- Question: Can we pull wireless complexity into the host? (Wireless devices implement complex, somewhat arcane packet scheduling on their own.)
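The "simple control loop" above can be sketched as follows: a minimal Python model of CoDel's dequeue-time decision, following the published algorithm. The 5 ms target, 100 ms interval, and state names are illustrative, not the kernel implementation.

```python
# Simplified sketch of CoDel's control loop (illustrative constants).
TARGET = 0.005     # 5 ms of acceptable standing-queue delay
INTERVAL = 0.100   # 100 ms sliding window

class CoDelState:
    def __init__(self):
        self.first_above_time = 0.0  # when sojourn first stayed above TARGET
        self.drop_next = 0.0         # time of the next scheduled drop
        self.count = 0               # drops in the current dropping cycle
        self.dropping = False

def should_drop(state, sojourn_time, now):
    """Decide whether to drop the packet being dequeued.
    sojourn_time: how long the packet sat in the queue (seconds)."""
    if sojourn_time < TARGET:
        # Delay is fine: leave dropping state, reset the window.
        state.first_above_time = 0.0
        state.dropping = False
        return False
    if state.first_above_time == 0.0:
        # Delay must stay above TARGET for a full INTERVAL before acting.
        state.first_above_time = now + INTERVAL
        return False
    if now < state.first_above_time:
        return False
    if not state.dropping:
        # Sustained delay: enter dropping state and drop this packet.
        state.dropping = True
        state.count = 1
        state.drop_next = now + INTERVAL
        return True
    if now >= state.drop_next:
        # Control law: shrink the drop interval as 1/sqrt(count).
        state.count += 1
        state.drop_next = now + INTERVAL / (state.count ** 0.5)
        return True
    return False
```

The key property is that it keys on how long packets waited, not how many are queued, which is why it needs no per-link tuning.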
Byte Queue Limits revisited
Tomas Hruby
- Motivation for BQL: limit queuing in the NIC
- Set the right limit in the NIC
- Based on bytes, not number of packets
- Dynamic and adaptive algorithm
- Completion events need to be periodic
- Two conditions to increase the limit, based on history
- When decreasing the limit, avoid hysteresis
- BQL performance is much better than the default configuration; HW priority is still better in some cases
- Issue: Some cliffs where we adjust down too much; slack algorithm overestimating
- Periodicity required
- Retiring when exiting NAPI may lead to an excessively long interval
- Limit based on load and timing
- Possibility of using NIC-generated timestamps to make the algorithm more precise
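A rough sketch of the kind of dynamic byte-limit adjustment described above. This is Python for illustration only; the function name and thresholds are assumptions, and the real kernel algorithm (lib/dynamic_queue_limits.c) tracks slack history and handles hysteresis far more carefully.

```python
def dql_adjust(limit, completed, inflight, starved,
               max_limit=64 * 1024, min_limit=0):
    """One adjustment step of a BQL-style dynamic byte limit (toy model).

    completed: bytes the NIC reported as transmitted this round.
    inflight:  bytes still queued to the NIC after completion processing.
    starved:   the NIC queue drained while the stack still had data,
               i.e. the limit was too small to keep the link busy.
    """
    if starved:
        # Grow the limit so the NIC does not run dry again.
        limit = min(limit + completed, max_limit)
    else:
        slack = limit - inflight  # headroom that went unused
        if slack > completed:
            # Persistently over-provisioned: shrink toward observed need,
            # but only by half the slack to damp oscillation.
            limit = max(limit - slack // 2, min_limit)
    return limit
```

The asymmetry (grow fast on starvation, shrink slowly on slack) is what keeps the queue just large enough to avoid starving the hardware.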
Data Direct I/O
John Ronciak
- Idea: place RX data directly in cache
- I/OAT had DCA, but that required specialized HW
- 10G Ethernet is really driving this
- Benefits are lower power consumption and lower latency
- Have run workloads with significant cache pressure
- Question: How well does this work with multiple sockets?
- Based on Non-Uniform I/O Architecture
- Going across sockets loses all the benefit (working out the details to deal with this)
- Don't use a user-space IRQ balancer
- No relationship to InfiniBand, but it is somewhat similar
Ethernet Audio/Video Bridging (AVB)
Eric Mann
- Proof-of-concept Audio/Video Bridging device using IEEE 802.1Qav
- Solution for streaming media
- Demo showing fundamental operation
- Audio latency is sound propagation (ears do echo cancellation at about 10 msecs)
- Synchronization between audio and video (time synchronization)
- Auto companies want to use AVB to reduce cost
- More complicated than synchronized multicast
- Stream reservation protocol (802.1Qat)
- 802.1Qav: Hardware-based traffic shapers (need to be very precise)
- 802.1AS: Time synchronization
- Streams (video, audio) separated into classes
- Hardware implements the transmission selection scheme
- Gap 1 in Linux: Want to say "transmit this packet at time X" (not just get a timestamp)
- Use cmsg for timed TX
- Gap 2 in Linux: ethtool options to display and configure AVB state
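The 802.1Qav traffic shapers mentioned above are credit-based. A toy model of the semantics, with hiCredit/loCredit clamping omitted and names chosen for illustration (this is not the hardware algorithm, just its basic behavior):

```python
def cbs_step(credit, dt, transmitting, idle_slope, send_slope):
    """One time-step of an 802.1Qav-style credit-based shaper (toy model).

    Slopes are in bits/sec; send_slope is negative, so credit drains
    while a frame from this class is on the wire and accrues while
    the class waits. Clamping to hiCredit/loCredit is omitted."""
    if transmitting:
        return credit + send_slope * dt
    return credit + idle_slope * dt

def may_transmit(credit, queue_nonempty):
    """A class may start a frame only with non-negative credit."""
    return queue_nonempty and credit >= 0
```

The effect is that a reserved stream class is paced to its reserved bandwidth (idle_slope) instead of bursting, which is what gives AVB its bounded latency.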
TCP Congestion manager
Yuchung Cheng
- Everybody is familiar with slow start
- nytimes.com: Downloading a page takes 53 connections, 195 requests, 2 MB
- TCP as used for the web is very different from bulk transfer: it is bursty
- Every TCP connection (re)measures the network
- Server-side congestion manager (some already implemented): cache certain parameters for connections
- A server-side approach behind a load balancer loses information
- Client IP might imply different paths; client IPs can change because of NAT
- Client could be the hub of its connections
- Great Snipe: client knows what's important, can do prioritization
- For instance, could set the initial cwnd as it wants
- Aggregate window division (macroflow_cwnd/active_flows)
- Congestion signals on the receiver; receiver can give the server a cwnd
- Implemented on Android, ~1300 LOC
- Two congestion control algorithms being experimented with
- Graphs showed that Great Snipe benefits from caching congestion state
- Great Snipe helped the nytimes download only a little bit. Why? The browser didn't use received data because it was waiting for more important things. Need to work with browsers on this.
- This is still experimental
- Middlebox interactions are a headache
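The aggregate window division noted above is a simple even split of the macroflow's congestion window across its active flows. A minimal sketch (the floor of one segment is an assumption for illustration):

```python
def subflow_cwnd(macroflow_cwnd, active_flows):
    """Split an aggregate (macroflow) cwnd evenly across active flows,
    never handing a flow less than one segment (assumed floor)."""
    assert active_flows > 0
    return max(macroflow_cwnd // active_flows, 1)
```

This is what lets many short web connections to the same client share one measurement of the path instead of each probing from scratch.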
TCP Loss Probe (TLP)
Nandita Dukkipati
- Talking about the 10% of connections that experience losses
- TLP: Convert RTOs into fast recovery
- Timeout recovery is 10-100x longer compared to fast recovery
- Tail losses are the majority
- Retransmit the last segment in 2 RTT to trigger SACK information and fast recovery
- Loss probe kicks in when the sender has nothing more to send
- Probe forces SACK and recovery
- TLP gives a 6% avg. latency reduction in HTTP experiments
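The probe timing described above might look roughly like this. The constants follow the later IETF TLP draft, so treat them as assumptions rather than the exact implementation presented:

```python
def tlp_timeout(srtt, rto, one_segment_outstanding,
                min_pto=0.010, delack=0.200):
    """Probe timeout (PTO) for a tail loss probe, in seconds (sketch).

    Fires at roughly 2*SRTT -- well before the RTO -- so a tail loss is
    repaired via SACK/fast recovery instead of a slow timeout. With only
    one segment in flight, leave room for a delayed ACK."""
    pto = max(2 * srtt, min_pto)
    if one_segment_outstanding:
        pto = max(pto, 1.5 * srtt + delack)
    return min(pto, rto)  # never fire later than the RTO would
```

The point of the min()/max() bounds is that the probe is cheap insurance: it should arrive early enough to beat the RTO but late enough not to duplicate data the receiver is about to ACK.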
TCP Forward Error Correction
Nandita Dukkipati
- Want zero-RTT loss recovery
- Need a significant impact on tail latency
- Want low cost and a simple design
- Components: encoding scheme, middlebox issues, FEC integration into TCP
- Encoding uses simple XOR-based encoding
- Use interleaved FEC to protect against back-to-back losses
- Done on MSS-sized blocks (not packet boundaries)
- Assuming at most 2 back-to-back losses in a train, this allows recovering over 60% of loss patterns
- Want the receiver to know when it saw a loss but recovered it; mechanism very similar to ECN
- Middleboxes: can potentially muck with fields and options, dislike not seeing all packets on a connection
- Experimenting with FEC; pursuing IETF standardization
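The simple XOR encoding over equal-sized blocks can be illustrated with a toy encoder/decoder: any single missing block in a group is recoverable from the parity block, with no retransmission round trip.

```python
def xor_parity(blocks):
    """Compute an XOR parity block over equal-sized data blocks
    (toy model of the MSS-block XOR code described in the talk)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Recover the single missing block (the None entry) by XORing
    the parity with every block that did arrive."""
    missing = bytearray(parity)
    for block in received:
        if block is not None:
            for i, byte in enumerate(block):
                missing[i] ^= byte
    return bytes(missing)
```

Interleaving (protecting blocks 0,2,4 with one parity and 1,3,5 with another) is what turns this one-loss-per-group code into protection against two back-to-back losses.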
Multipath TCP
Christoph Paasch
- Networks are becoming multi-path
- Wireless: increase performance by using WiFi and 3G at the same time
- TCP does not support this
- Data center: same issues; want to use multiple paths for load balancing
- Solution: Multipath TCP
- Works over the existing Internet
- Middleboxes block packets for connections they don't understand
- Need to use separate TCP connections
- MP_JOIN option
- Each subflow has its own cwnd and sequence number space
- Use a separate data-level sequence number space
- Additional sequence number is carried in TCP option space
- Subflows can be created/destroyed dynamically
- 10K LOC in the TCP stack
- Intercepts in common TCP functions
- Two-level hierarchy of sockets
- Challenge: sequence numbers increase the size of the skb
- Want to be transparent to userspace
- Question: How to handle socket options? Many functions would potentially be affected by changes
- Complex patch; needs to be split into smaller pieces
- Note: The congestion-sharing principle is common with the congestion manager; maybe some common elements
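The two sequence spaces mentioned above can be illustrated with a toy mapping, roughly in the spirit of MPTCP's data sequence mapping: each subflow numbers its own bytes, and a (data_seq, subflow_seq, length) tuple ties a subflow range back to the connection-level stream. The class and names are assumptions for illustration.

```python
class SubflowMap:
    """Toy data-sequence mapping: relates a contiguous range of one
    subflow's sequence space to the connection-level data sequence space."""

    def __init__(self, data_seq, subflow_seq, length):
        self.data_seq = data_seq        # connection-level start
        self.subflow_seq = subflow_seq  # subflow-level start
        self.length = length            # bytes covered by this mapping

    def to_data_seq(self, sseq):
        """Translate a subflow sequence number into the data-level
        sequence number, provided it falls inside this mapping."""
        assert self.subflow_seq <= sseq < self.subflow_seq + self.length
        return self.data_seq + (sseq - self.subflow_seq)
```

Because middleboxes may rewrite or window-check subflow sequence numbers, keeping them independent per subflow and carrying the data-level mapping in option space is what lets MPTCP survive on the existing Internet.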
Linux Traffic classification and Shaping
John Fastabend
- Talking about multiqueue
- Many queues available
- Possibly consolidate with open-vswitch code
- HW QoS is coming; SW QoS is not great
- No way to map flows to a qdisc
- Idea: map skb to qdisc, implement HW QoS for queues
- mqprio uses skb→priority to steer packets to queues
- Ability to map flows to queues, different qdiscs
- Map to HW queues
- Qdiscs that are multiqueue aware are much better
- Question: Do we need to map by more than skb→priority?
- Hard to use qdiscs with global state
- Question: How can we rectify this with XPS?
- Question: Can this kill select_queue?
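The skb→priority steering mqprio does can be sketched as a two-step mapping: priority selects a traffic class, and each class owns a contiguous range of hardware queues. Hashing within the class's range and all the names here are assumptions for illustration, not the kernel code.

```python
def select_queue(skb_priority, prio_tc_map, tc_queue_offset,
                 tc_queue_count, flow_hash):
    """mqprio-style queue selection (toy model).

    prio_tc_map:    16-entry table mapping priority -> traffic class
                    (mqprio masks priority to 4 bits).
    tc_queue_offset/tc_queue_count: each class's contiguous queue range.
    flow_hash:      spreads a class's flows across its queues (assumed)."""
    tc = prio_tc_map[skb_priority & 15]
    return tc_queue_offset[tc] + flow_hash % tc_queue_count[tc]
```

This keeps per-class state confined to per-queue qdiscs, which is exactly why qdiscs with global state are awkward in this model.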
Interface to hardware transmit rate limiting
Jesse Brandeburg
- Idea: put the rate limit into a sysfs variable
- One rate limit value per queue
- Do we need more hierarchy?
- Note: The use case is that applications/VMs are rate limited, not necessarily queues
- Question: Is putting more in sysfs the right approach?
- Question: Can we define classes which describe QoS properties, and then link queues to these classes?
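The semantics a per-queue transmit rate limit implies can be illustrated with a token bucket. This is a sketch of the behavior only, not the proposed sysfs interface or any particular hardware mechanism:

```python
class TokenBucket:
    """Toy token bucket: tokens (bytes) accrue at a fixed rate up to a
    burst cap; a packet may be sent only if enough tokens are available."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start full
        self.last = 0.0

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may be sent at `now`."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False
```

Hardware rate limiting does this per queue below the stack, which is what makes a simple per-queue knob attractive compared to a full software shaper.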
Harmonizing Multiqueue, Vmdq, virtio-net, macvtap with open-vswitch
Shyam Iyer
- Optimized traffic flow via e-switch
- Various use cases
- macvtap use cases that improve performance
- Devices only seem to be doing L2 in the embedded switch
- Question: How to do QoS in this?
- Reference: linux-kvm.org/page/Multiqueue
- Rate limit per queue; use rate limiting
- Open vSwitch used tc for tagging
- Different flow tables for tables
- Not all adapters are multiqueue aware
- macvtap is good because it simplifies the interface to the lower device
- Use case of having multiple MAC addresses in a VM
Proposal added by therbert@google.com