What is the Zero-Copy Loopback Stack?

Last updated 2 weeks ago

"Your TCP packet didn't travel the world. It never even left your L1 cache."
– Someone who knows what sk_buff does

TL;DR

  • The zero-copy loopback stack is Linux's internal optimization for sending data to yourself over TCP (usually via localhost or 127.0.0.1).
  • When the kernel sees that both ends of a TCP connection live on the same host, it short-circuits the data path, bypassing the NIC, the hardware driver stack, checksum computation, and routing beyond the local table.
  • In the ideal case, the sender's sk_buff is handed to the receiving socket by reference, so nothing is copied between the two sockets inside the kernel.
  • Result: near-RAM speed socket communication, in a POSIX-compliant API.

History: From syscalls to shared memory

Before this optimization, localhost TCP followed the full stack:

  • Send via write()
  • Packetized into sk_buff
  • Routed to the loopback interface (lo)
  • Re-assembled and delivered to the socket receive buffer

Multiple copies happen:

  • User → kernel (send)
  • Kernel → socket buffer
  • Socket buffer → kernel receive
  • Kernel receive → user (read())

That's 4 copies – for a local call! Completely unnecessary if both sockets are local.

So kernel devs asked: "What if we cheat?"

Enter: The Loopback Fast Path

Modern Linux (since ~3.10+, improved through 5.x) includes an internal optimization:

If both endpoints of a TCP socket are local and on the loopback interface, and certain conditions are met:

  • No IPsec
  • No netfilter (iptables)
  • No QoS
  • No congestion control tricks

... then the kernel will internally pass the data directly from sender socket to receiver socket, bypassing the entire IP stack.
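
The first condition (both endpoints on loopback) is easy to check from userspace. Here is a minimal sketch, assuming IPv4 only; the helper names is_loopback_ipv4() and socket_is_loopback_pair() are made up for this post, and the check says nothing about IPsec, netfilter, or QoS:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* True if an IPv4 address (network byte order) lies in 127.0.0.0/8. */
static int is_loopback_ipv4(uint32_t addr_be)
{
    return (ntohl(addr_be) >> 24) == 127;
}

/* Hypothetical helper: returns 1 if both ends of a connected IPv4
 * socket use loopback addresses, 0 if not, -1 on error. */
static int socket_is_loopback_pair(int fd)
{
    struct sockaddr_in local, peer;
    socklen_t llen = sizeof(local), plen = sizeof(peer);

    if (getsockname(fd, (struct sockaddr *)&local, &llen) < 0)
        return -1;
    if (getpeername(fd, (struct sockaddr *)&peer, &plen) < 0)
        return -1;
    if (local.sin_family != AF_INET || peer.sin_family != AF_INET)
        return -1;
    return is_loopback_ipv4(local.sin_addr.s_addr) &&
           is_loopback_ipv4(peer.sin_addr.s_addr);
}
```

Call socket_is_loopback_pair(fd) on a connected socket. Note that it only looks at addresses, so a connection to the host's own non-loopback IP reports 0 even though that traffic also stays on the machine.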

What Does Zero-Copy Actually Mean?

You have probably heard of:

  • mmap()
  • splice()
  • sendfile()
  • io_uring

These are user-to-kernel zero-copy techniques.

Loopback zero-copy is even more fundamental:

It's in-kernel socket-to-socket zero-copy.
No user interaction. No extra syscalls beyond your write() and read().
Data is enqueued in one socket and instantly dequeued from the other.

Code Example: How to Trigger It

This is all you need:

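// Needs <arpa/inet.h> and <sys/socket.h>. Port 5555 is an arbitrary choice.
struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(5555) };
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // 127.0.0.1
int listener = socket(AF_INET, SOCK_STREAM, 0);
bind(listener, (struct sockaddr *)&addr, sizeof(addr));
listen(listener, 1);
int client = socket(AF_INET, SOCK_STREAM, 0);
connect(client, (struct sockaddr *)&addr, sizeof(addr));  // completes via the listen backlog
int server = accept(listener, NULL, NULL);
// Now client <--> server is a local TCP pair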

Then:

char msg[] = "zero-copy!";
write(client, msg, sizeof(msg));
char buf[32];
read(server, buf, sizeof(buf));

If your kernel is smart (Linux 4.16+), this write() and read() pair hands the data from one socket buffer to the other without an intermediate copy.
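
Putting the two snippets together with error handling gives something you can compile as-is. This is a sketch, not a reference implementation: loopback_pair() and loopback_roundtrip() are names invented for this post, and the listener binds to port 0 so the kernel picks a free port.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Builds a connected TCP pair over 127.0.0.1.
 * fds[0] is the client end, fds[1] the accepted server end. */
static int loopback_pair(int fds[2])
{
    struct sockaddr_in addr = { .sin_family = AF_INET };
    socklen_t len = sizeof(addr);
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    fds[0] = fds[1] = -1;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */
    if (listener < 0 ||
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *)&addr, &len) < 0 ||
        (fds[0] = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        connect(fds[0], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        (fds[1] = accept(listener, NULL, NULL)) < 0) {
        if (fds[0] >= 0)
            close(fds[0]);
        if (listener >= 0)
            close(listener);
        return -1;
    }
    close(listener);
    return 0;
}

/* Sends msg (including its NUL terminator) from client to server and
 * reads it back into out. Returns 0 on success, -1 on failure. */
static int loopback_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    size_t want = strlen(msg) + 1, got = 0;
    int rc = -1;

    if (want > outlen || loopback_pair(fds) < 0)
        return -1;
    if (write(fds[0], msg, want) != (ssize_t)want)
        goto done;
    while (got < want) {                 /* TCP may deliver in pieces */
        ssize_t n = read(fds[1], out + got, want - got);
        if (n <= 0)
            goto done;
        got += (size_t)n;
    }
    rc = 0;
done:
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Call loopback_roundtrip("zero-copy!", buf, sizeof(buf)) with a 32-byte buf and check for a return value of 0. The read loop matters: TCP is a byte stream, so a single read() is not guaranteed to return the whole message.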

Kernel Path Dissection

Internally, Linux checks this path:

  • tcp_sendmsg(): the core TCP send routine
  • Detects loopback and skips hardware device output
  • Data is passed directly to the peer's receive queue (sk_receive_queue)
  • sk->sk_data_ready fires on the receiving socket
  • If the receiver is blocked in recv(), it is woken immediately
  • Data appears in the user buffer without ever being "routed"

Check this file in the kernel:

drivers/net/loopback.c

Elsewhere in the stack, you'll find logic like:

if (dst->dev->flags & IFF_LOOPBACK) {
 // short-circuit the stack
}

Performance? It's Insane.

Let's benchmark. Start a server, then point a client at it:

iperf3 -s -D
iperf3 -c 127.0.0.1

On a fast machine you may see something like:

[ ID] Interval Transfer Bandwidth
[ 5] 0.00-1.00 sec 11.2 GBytes 96.3 Gbits/sec

Yes, 96 Gbit/sec over TCP.

No, you're not imagining it.

Deep Mode: Using splice() Over Loopback

Try this:

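// Needs #define _GNU_SOURCE and <fcntl.h>; filefd and sockfd are already open.
int pipefds[2];
pipe(pipefds);
// file -> pipe: page references move into the pipe, no data copy
ssize_t n = splice(filefd, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE);
// pipe -> socket: forward exactly what the first splice produced
if (n > 0)
    splice(pipefds[0], NULL, sockfd, NULL, n, SPLICE_F_MOVE);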

If sockfd is a loopback TCP socket and the kernel supports it, the data goes:

  • From file → kernel page cache
  • Into pipe buffer
  • Into the peer socket
  • All without copying

It's zero-copy loopback + zero-copy syscall = kernel-magic streaming.
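
For a version that handles short splices and checks its own result, here is a sketch. The names loopback_pair(), splice_file_to_socket(), and splice_demo() are invented for this post, the temp file path is an arbitrary choice, and whether every hop truly avoids a copy depends on your kernel and filesystem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Builds a connected TCP pair over 127.0.0.1 (client, server). */
static int loopback_pair(int fds[2])
{
    struct sockaddr_in addr = { .sin_family = AF_INET };
    socklen_t len = sizeof(addr);
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    fds[0] = fds[1] = -1;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */
    if (listener < 0 ||
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *)&addr, &len) < 0 ||
        (fds[0] = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
        connect(fds[0], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        (fds[1] = accept(listener, NULL, NULL)) < 0) {
        if (fds[0] >= 0)
            close(fds[0]);
        if (listener >= 0)
            close(listener);
        return -1;
    }
    close(listener);
    return 0;
}

/* Moves up to len bytes from the start of filefd to sockfd through a
 * pipe using splice(). Returns bytes forwarded, or -1 on error. */
static ssize_t splice_file_to_socket(int filefd, int sockfd, size_t len)
{
    int p[2];
    loff_t off = 0;
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;
    while ((size_t)total < len) {
        ssize_t in = splice(filefd, &off, p[1], NULL,
                            len - (size_t)total, SPLICE_F_MOVE);
        if (in <= 0)
            break;                       /* 0 = end of file, <0 = error */
        while (in > 0) {                 /* drain what the pipe now holds */
            ssize_t out = splice(p[0], NULL, sockfd, NULL,
                                 (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(p[0]);
    close(p[1]);
    return total;
}

/* Writes a message to a temp file, splices it over loopback, and checks
 * that the peer reads the same bytes. Returns 0 on success. */
static int splice_demo(void)
{
    char path[] = "/tmp/splice-demo-XXXXXX";
    const char msg[] = "zero-copy via splice";
    char buf[sizeof(msg)] = {0};
    int fds[2], ok = 0, filefd = mkstemp(path);
    size_t got = 0;

    if (filefd < 0)
        return -1;
    unlink(path);                        /* file lives until filefd closes */
    if (write(filefd, msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
        loopback_pair(fds) < 0) {
        close(filefd);
        return -1;
    }
    if (splice_file_to_socket(filefd, fds[0], sizeof(msg)) == (ssize_t)sizeof(msg)) {
        while (got < sizeof(msg)) {
            ssize_t n = read(fds[1], buf + got, sizeof(msg) - got);
            if (n <= 0)
                break;
            got += (size_t)n;
        }
        ok = got == sizeof(msg) && memcmp(buf, msg, sizeof(msg)) == 0;
    }
    close(filefd);
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

splice_demo() returns 0 when the receiver sees exactly the bytes that were written to the file. The inner loop is the part people forget: splice() into a socket can move fewer bytes than the pipe holds.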

Zero-Copy in io_uring

Newer Linux (6.0+, with liburing 2.3+) supports:

io_uring_prep_send_zc()

With loopback optimization, this means:

  • App never copies
  • Kernel never copies
  • Data path = user pages → socket queue → peer buffer → completion event

You're basically writing userland TCP that beats most RPC frameworks.

Security Implication: Same-Host Visibility

Because data never leaves the host, loopback sockets can:

  • Avoid TLS (if you dare)
  • Rely on host-level isolation (any local process can still connect, unlike a permission-guarded Unix domain socket)
  • Skip MTU fragmentation logic
  • Skip route tables, NAT, IP rules

This makes them ideal for internal microservice RPC, e.g., using gRPC over loopback.

When Zero-Copy Loopback Fails

It degrades to regular copy path if:

  • You use iptables rules on lo
  • You force software checksumming on lo
  • You route via a different VRF or namespace
  • You swap in a non-default congestion control module (for example bbr)
  • You mix blocking and non-blocking socket flags incorrectly

Run strace, use perf, use tcpdump – and verify you're not hitting the slow path.
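
One item on that list can be checked from inside your program: the congestion control algorithm attached to a socket. A minimal sketch, using the standard TCP_CONGESTION socket option; the helper name congestion_control_name() is made up for this post.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Writes the congestion control algorithm name of a TCP socket into
 * name as a NUL-terminated string. Returns 0 on success, -1 on error. */
static int congestion_control_name(int fd, char *name, size_t size)
{
    socklen_t len;

    if (size < 2)
        return -1;
    memset(name, 0, size);
    len = (socklen_t)(size - 1);         /* keep room for the terminator */
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) < 0)
        return -1;
    return 0;
}
```

A 16-byte buffer is enough for the names Linux uses. On most distributions a fresh socket reports cubic unless net.ipv4.tcp_congestion_control has been changed.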

Experimental: Building In-Memory TCP via veth

Want to emulate loopback across two namespaces?

  1. Create veth0 <--> veth1
  2. Assign each to a netns
  3. Use TCP_NODELAY, TCP_NOTSENT_LOWAT, and TCP_CORK to control batching and queue depth, getting as close to loopback behavior as a veth pair allows

It's not quite loopback-fast, but lets you measure stack behavior under controlled topologies.
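
Step 3 comes down to three setsockopt() calls. Here is a sketch of how they might be wrapped; tune_latency() and set_cork() are hypothetical helper names, and the 16 KiB low-water mark in the usage note is an arbitrary starting point rather than a recommendation.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* Linux value; older libc headers may lack it */
#endif

/* Disables Nagle and caps the amount of unsent data the kernel will
 * queue before the socket stops reporting writable. 0 on success. */
static int tune_latency(int fd, unsigned int notsent_lowat)
{
    int one = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                   &notsent_lowat, sizeof(notsent_lowat)) < 0)
        return -1;
    return 0;
}

/* Corks (on = 1) or uncorks (on = 0) the socket. While corked, the
 * kernel holds back partial segments so small writes go out together. */
static int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}
```

A typical pattern is tune_latency(fd, 16384) once after connecting, then set_cork(fd, 1), several small write() calls, and set_cork(fd, 0) to flush them as one burst.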

Why It Matters: Rethinking IPC

Most people use:

  • Unix domain sockets
  • Named pipes
  • Shared memory
  • gRPC over HTTP/2

But:

TCP over loopback with zero-copy offers a mix of flexibility and performance that none of them matches on its own.

You get:

  • Stream semantics
  • POSIX compliance
  • Congestion control
  • No serialization step
  • Kernel-managed queues
  • Transparent upgrade path to real TCP

It's basically shared memory with TCP semantics.

Final Thoughts

The zero-copy loopback stack is the ultimate example of kernel optimization: a place where network semantics and memory locality collide, and the kernel says:

"Oh, you're just talking to yourself? Fine. I won't even hit RAM."

It gives you:

  • Wire-speed localhost RPC
  • Real TCP semantics, with none of the NIC pain
  • A fast path through the kernel that mimics what RDMA and DPDK do, without leaving the standard socket API.

You're not sending packets anymore.

You're passing pointers inside the kernel.

Further Reading

  • Linux Kernel: net/ipv4/tcp_output.c (see tcp_write_xmit()), drivers/net/loopback.c
  • perf record + perf trace on 127.0.0.1
  • Use strace -e trace=network -T to see syscall timings
  • io_uring man pages

Closing

Want more? I can write about:

  • How loopback interacts with epoll()
  • Benchmarking AF_UNIX vs loopback
  • Using TCP_FASTOPEN over localhost

Say the word. We'll go even deeper.
