Kernel bypass
Quick definition
Kernel bypass refers to an approach used to improve the performance of an application where hardware resources (like network or storage devices) are accessed directly, without going through the operating system's kernel.
What is Kernel bypass?
In the context of financial trading systems, kernel bypass usually refers to the use of userspace networking, which is one form of kernel bypass. Most often, on trading systems, kernel bypass implies the use of libraries such as Solarflare OpenOnload, AMD TCPDirect, EtherFabric Virtual Interface (ef_vi), or Mellanox VMA.
A typical reason for the use of kernel bypass is that the standard Linux kernel networking stack is usually not capable of handling traffic rates of over 1 million packets per second, which are frequently encountered on large exchanges during microbursts. Kernel bypass allows one to achieve lower latencies on software-based trading platforms built on commodity server hardware. It also allows one to achieve line rate processing at 10 Gbps or higher, avoiding packet loss.
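To put these rates in perspective, the per-packet time budget at line rate can be estimated from Ethernet framing overhead. The sketch below is a back-of-the-envelope calculation, not a benchmark. It assumes minimum-size 64-byte frames plus the standard 8 bytes of preamble and 12 bytes of inter-frame gap, and the function names are illustrative.

```python
# Per-packet time budget on an Ethernet link (back-of-the-envelope).
PREAMBLE_AND_SFD_BYTES = 8   # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # minimum idle gap between frames, in byte times


def max_packets_per_second(link_bps: float, frame_bytes: int = 64) -> float:
    """Upper bound on frames per second for a given frame size (FCS included)."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


def ns_per_packet(packets_per_second: float) -> float:
    """Average time available to process one packet, in nanoseconds."""
    return 1e9 / packets_per_second


if __name__ == "__main__":
    pps = max_packets_per_second(10e9)
    print(f"10 Gbps, 64-byte frames: {pps / 1e6:.2f} Mpps, "
          f"{ns_per_packet(pps):.1f} ns per packet")
    print(f"1 Mpps: {ns_per_packet(1e6):.0f} ns per packet")
```

Under these assumptions, a 10 Gbps link carries up to about 14.88 million minimum-size frames per second, leaving roughly 67 ns per packet. Real traffic usually uses larger frames, so this is a worst case, but it shows why per-packet system calls and copies become the bottleneck.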
Kernel bypass techniques may differ in implementation. The typical approach provides a way to skip the kernel's network stack and interact directly with the TX/RX buffers of network cards. This is usually complemented with zero-copy I/O, where packets are directly steered to and read from the application buffer, bypassing intermediate copies, which further reduces CPU overhead and memory bandwidth usage. Such buffers are also usually implemented as lock-free or wait-free ring buffers which ensure that separate producer and consumer threads don't block each other while accessing the buffer.
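The ring buffer idea can be illustrated with a minimal single-producer/single-consumer (SPSC) queue. The sketch below shows only the index discipline: the producer is the sole writer of the tail index and the consumer is the sole writer of the head index, so neither side needs a lock. The class and method names are hypothetical. A production implementation would be written in C or C++ with atomic operations and explicit memory ordering, which this Python sketch does not model.

```python
class SpscRing:
    """Bounded single-producer/single-consumer queue over preallocated slots.

    Items must not be None, because try_pop() returns None when empty.
    """

    def __init__(self, capacity: int):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1   # index wrap via bitmask instead of modulo
        self._head = 0              # next slot to read; written only by the consumer
        self._tail = 0              # next slot to write; written only by the producer

    def try_push(self, item) -> bool:
        tail = self._tail
        if tail - self._head == len(self._slots):
            return False            # full: the caller decides whether to spin or drop
        self._slots[tail & self._mask] = item
        self._tail = tail + 1       # publish only after the slot has been filled
        return True

    def try_pop(self):
        head = self._head
        if head == self._tail:
            return None             # empty
        item = self._slots[head & self._mask]
        self._slots[head & self._mask] = None
        self._head = head + 1       # release the slot only after it has been read
        return item
```

Both operations return immediately instead of waiting, so a full or empty buffer never blocks the other side. The power-of-two capacity is a common design choice that lets the indices grow monotonically and wrap with a cheap bitmask.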
Other kernel bypass techniques include:
- Data Plane Development Kit (DPDK), an open source framework managed by the Linux Foundation. DPDK provides libraries and drivers that allow applications to directly interface with NICs, skipping the kernel's involvement in packet processing.
- Remote Direct Memory Access (RDMA), which allows one server to access the memory on another server over a network, bypassing the kernel on both ends.
- eXpress Data Path (XDP), a Linux kernel feature built on the extended Berkeley Packet Filter (eBPF) framework that runs eBPF programs at the NIC driver level, before most of the kernel's networking stack is involved. Combined with AF_XDP sockets, it can deliver packets directly to userspace.
- PF_RING, a Linux kernel module developed by ntop.
ef_vi, the EtherFabric Virtual Interface, is a low-level, high-performance API provided by Solarflare (now part of AMD, which acquired Solarflare's parent company Xilinx) that gives applications direct and fine-grained control over Solarflare network interface cards (NICs).
OpenOnload (or simply Onload) is a userspace network stack that accelerates TCP and UDP networking for applications that use BSD sockets on Linux. Onload uses ef_vi under the hood, but its key advantage over ef_vi is that it maintains full compatibility with POSIX and existing socket-based applications, making it easier to integrate.
Onload consists of a userspace shared library that implements the TCP and UDP protocol stack. The library intercepts network syscalls and redirects them through this userspace stack instead of the kernel.
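Because interception happens at the sockets API, an application written against plain BSD sockets should need no source changes. The sketch below is an ordinary UDP round trip using Python's standard socket module, and nothing in it is Onload-specific. Loopback is used only so the example runs anywhere, since acceleration applies to traffic on supported NICs. The function name is illustrative.

```python
import socket


def udp_round_trip(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send one datagram to a local receiver and return what it received."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
        rx.settimeout(timeout)        # fail fast instead of hanging forever
        tx.sendto(payload, rx.getsockname())
        data, _addr = rx.recvfrom(2048)
        return data


if __name__ == "__main__":
    print(udp_round_trip(b"ping"))
```

Assuming Onload is installed, a program like this would typically be started through the `onload` wrapper, for example `onload python3 app.py`, where `app.py` is a hypothetical file name. The same socket, bind, sendto, and recvfrom calls are then serviced by the userspace stack where possible.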
TCPDirect is designed to serve a middle ground between Onload and ef_vi, providing much lower latency than Onload at the expense of a more limited feature set, while providing an interface that's easier to use than ef_vi.
Like Onload, TCPDirect relies on ef_vi under the hood. However, TCPDirect requires more integration effort than Onload, because applications must be adapted to use the TCPDirect API rather than the standard BSD sockets calls that Onload intercepts automatically. TCPDirect is still generally easier to integrate than ef_vi, as its API includes analogs for several standard BSD sockets functions.
References
- Kernel bypass. Cloudflare Blog.
- What is VMA?. NVIDIA.
- OpenOnload. GitHub.
- Knight, M. (2016) "Introducing TCPDirect." Solarflare.