Pyrmont Brewery NICShaders: GPU Compute Shaders for High-Throughput HTTP Networking
Pyrmont Brewery NICShaders is pioneering the use of GPU compute shaders to accelerate HTTP traffic processing, bypassing traditional Linux kernel and CPU bottlenecks. Built for CDN providers and large-scale web infrastructure, it runs micro-tasks such as video timestamp manipulation, C2PA authentication, watermarking, and dynamic HTTP header modification at wire speed.
Why Kernel Bypass Matters
While the Linux kernel networking stack offers plenty of tuning (BBR and CUBIC congestion control, larger MTUs, buffer sizing), every packet still crosses the CPU, paying for interrupts, context switches, and kernel-to-user copies. For CDN and edge compute, processing packets without kernel or CPU intervention unlocks a different class of performance.
Can GPU Compute Shaders Output HTTP Directly to NIC?
Yes. By combining several existing technologies, NICShaders moves data from GPU compute shaders straight to the NIC for HTTP output, bypassing the kernel:
- GPU Compute Shader Processing: Incoming network data is processed by CUDA/OpenCL kernels (a sketch follows this list).
- Direct DMA Transfer: Modified data is written to GPU memory buffers; NIC TX descriptors are pointed at those buffers so the NIC can DMA the payload directly out of GPU memory.
- NIC Transmission: NIC hardware transmits HTTP-formatted packets directly from GPU memory.
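To make the processing stage concrete, here is a minimal sketch of the kind of kernel this pipeline implies: one thread per packet, each scanning its packet in GPU memory and rewriting a header value in place. The fixed buffer stride and the X-Edge header are illustrative assumptions, not NICShaders' actual packet format.

```cuda
#include <cstdint>

// Illustrative layout: packets sit in one GPU buffer at a fixed stride
// (established when the RX descriptors were configured).
constexpr int PKT_STRIDE = 2048;

// Overwrite the value of a hypothetical "X-Edge: 0" header with "1",
// demonstrating in-place HTTP rewriting on the GPU. One thread owns
// one packet, matching the thread-per-descriptor model described below.
__global__ void rewrite_header(uint8_t *pkts, const uint16_t *pkt_len, int n_pkts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pkts) return;

    uint8_t *p = pkts + (size_t)i * PKT_STRIDE;
    int len = pkt_len[i];

    const char key[] = "X-Edge: ";            // 8 key bytes (assumed header)
    for (int j = 0; j + 9 <= len; ++j) {      // 8 key bytes + 1 value byte
        bool match = true;
        for (int k = 0; k < 8; ++k)
            if (p[j + k] != (uint8_t)key[k]) { match = false; break; }
        if (match) {
            p[j + 8] = '1';                    // flip the one-byte value
            break;
        }
    }
}
```

Launched as rewrite_header<<<(n_pkts + 255) / 256, 256>>>(...), one kernel invocation touches an entire batch of packets in a single pass, which is exactly the batching that makes GPU-side packet processing pay off.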
Key Technologies
- GPU-Ether Framework: Enables direct NIC-to-GPU DMA by mapping RX/TX descriptor rings and doorbell registers into GPU memory, so each GPU thread can own one descriptor and batches can be tuned for throughput.
- GPUDirect RDMA: Allows third-party PCIe devices such as NICs to access GPU memory directly, provided both share the same PCIe root complex. Zero-copy is achieved by pointing NIC TX descriptors at GPU buffers (see the registration sketch after this list).
- DPDK with GPU Integration: DPDK's gpudev library supports GPU packet processing, direct NIC access, and low-latency CPU-GPU communication for user-space HTTP protocol handling (a GPU-backed mempool sketch also follows).
- Direct NIC Programming: User-space I/O and flow steering rules enable complete kernel bypass for selected traffic flows.
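To illustrate the GPUDirect RDMA item: with NVIDIA's nvidia-peermem module loaded, a buffer from cudaMalloc can be registered with an RDMA-capable NIC through the standard verbs call ibv_reg_mr, after which the NIC can DMA straight from GPU memory. A minimal sketch with error handling pared down:

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

// Register a cudaMalloc'd buffer with the NIC so TX descriptors can
// point at GPU memory. Requires nvidia-peermem and a GPUDirect
// RDMA-capable NIC driver (e.g. mlx5).
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return NULL;

    // The peer-memory module lets ibv_reg_mr accept this device pointer;
    // without it, registration of GPU memory fails.
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        cudaFree(gpu_buf);
    return mr;
}
```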
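For the gpudev item, DPDK's programmer's guide shows the pattern of allocating GPU memory, registering it as external memory, DMA-mapping it for the port, and building an mbuf pool on top, so RX lands directly in device buffers. A condensed sketch following that pattern; sizes are illustrative, and rte_gpu_mem_alloc's align argument assumes DPDK 22.03 or later:

```c
#include <rte_dev.h>
#include <rte_ethdev.h>
#include <rte_gpudev.h>
#include <rte_mbuf.h>
#include <rte_memory.h>

#define GPU_PAGE 65536
#define N_MBUFS  8192

// Build an mbuf pool backed by GPU memory so the NIC receives packets
// straight into device buffers.
struct rte_mempool *gpu_mbuf_pool(int16_t gpu_id, uint16_t port_id)
{
    struct rte_pktmbuf_extmem ext;

    ext.elt_size = RTE_MBUF_DEFAULT_BUF_SIZE;
    ext.buf_len  = RTE_ALIGN_CEIL((size_t)N_MBUFS * ext.elt_size, GPU_PAGE);
    ext.buf_iova = RTE_BAD_IOVA;
    ext.buf_ptr  = rte_gpu_mem_alloc(gpu_id, ext.buf_len, 0);
    if (ext.buf_ptr == NULL)
        return NULL;

    // Make the GPU region visible to DPDK, then DMA-map it for the NIC
    // (same call sequence as the gpudev programmer's guide).
    rte_extmem_register(ext.buf_ptr, ext.buf_len, NULL,
                        ext.buf_iova, GPU_PAGE);
    rte_dev_dma_map(rte_eth_devices[port_id].device,
                    ext.buf_ptr, ext.buf_iova, ext.buf_len);

    return rte_pktmbuf_pool_create_extbuf("gpu_pool", N_MBUFS, 0, 0,
                                          ext.elt_size, rte_socket_id(),
                                          &ext, 1);
}
```

RX queues configured with this pool deliver packets whose payloads already live in GPU memory, so kernels like the rewrite sketch above run without any host-side copy.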
Implementation Approach
- GPU Compute Shader Processing: Use CUDA/OpenCL to process packets, writing results to mapped GPU memory buffers.
- HTTP Protocol Handling: Format HTTP responses directly in GPU memory or hand them to a user-space HTTP stack (first sketch below).
- Direct Transmission: Point NIC TX descriptors at the GPU buffers and trigger transmission by writing the NIC's doorbell register.
- Zero-Copy Optimization: Use pinned, mapped host memory (cudaHostAlloc with cudaHostAllocMapped) so the NIC and GPU share a single buffer with no staging copies (second sketch below).
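As a sketch of in-GPU HTTP formatting: a kernel can stamp a response template into per-packet TX slots. The fixed slot size and canned response are assumptions for illustration; real code would splice in payload bytes from the processing stage.

```cuda
#include <cstdint>

// Hypothetical fixed-size TX slots; real descriptor layouts vary by NIC.
constexpr int SLOT = 2048;

__device__ const char RESP[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 2\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
    "ok";

// Each thread writes one complete HTTP response into its TX slot.
__global__ void fill_http_responses(uint8_t *tx_buf, int n_slots)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_slots) return;

    uint8_t *slot = tx_buf + (size_t)i * SLOT;
    for (int k = 0; k < (int)sizeof(RESP) - 1; ++k)  // skip trailing NUL
        slot[k] = (uint8_t)RESP[k];
}
```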
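And for the zero-copy item: cudaHostAlloc with cudaHostAllocMapped pins one host buffer and maps it into the GPU's address space, so kernels and NIC DMA can share it without staging copies. A short sketch:

```c
#include <cuda_runtime.h>
#include <stddef.h>

// One pinned, mapped buffer shared by the CPU, GPU kernels, and NIC DMA.
int alloc_shared_buffer(size_t len, void **host_ptr, void **dev_ptr)
{
    // cudaHostAllocMapped pins the pages and maps them into the GPU's
    // address space; kernels read/write them over PCIe with no copy.
    if (cudaHostAlloc(host_ptr, len, cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer(dev_ptr, *host_ptr, 0) != cudaSuccess) {
        cudaFreeHost(*host_ptr);
        return -1;
    }
    return 0;  // hand *host_ptr to the NIC driver, *dev_ptr to kernels
}
```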
Current Limitations
- PCIe topology: GPU and NIC must share the same root complex.
- Driver support: Requires GPUDirect RDMA-enabled NIC drivers.
- Software complexity: Custom drivers and user-space protocol stacks are required.
Practical Path Forward
- Leverage NVIDIA DOCA GPUNetIO for production deployments.
- Implement DPDK-based user-space HTTP servers with GPU integration.
- Use kernel-bypass stacks (mTCP, TAS, F-Stack) for HTTP protocol handling (see the sketch after this list).
- Integrate with GPUDirect RDMA-enabled NICs for direct GPU-NIC communication.
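To ground the kernel-bypass bullet above: mTCP exposes a sockets-like API bound to a per-core context, so a user-space HTTP fast path reads much like ordinary socket code while running over DPDK. A heavily condensed sketch; the calls are mTCP's published API, but exact signatures and the epoll-style event loop that real deployments need should be taken from the mTCP headers and samples:

```c
#include <mtcp_api.h>
#include <netinet/in.h>
#include <string.h>

static const char RESP[] =
    "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";

// Minimal single-core accept/respond loop over mTCP. Error paths and
// the mtcp_epoll event loop of production code are elided.
void serve(int core)
{
    mtcp_init("mtcp.conf");                 /* per-process stack config */
    mctx_t mctx = mtcp_create_context(core);

    int lsock = mtcp_socket(mctx, AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8080);
    mtcp_bind(mctx, lsock, (struct sockaddr *)&addr, sizeof(addr));
    mtcp_listen(mctx, lsock, 4096);

    for (;;) {
        int c = mtcp_accept(mctx, lsock, NULL, NULL);
        if (c < 0) continue;                /* non-blocking: retry */
        char buf[4096];
        mtcp_read(mctx, c, buf, sizeof(buf));        /* drain request */
        mtcp_write(mctx, c, (char *)RESP, sizeof(RESP) - 1);
        mtcp_close(mctx, c);
    }
}
```

TAS and F-Stack offer analogous socket-style APIs over their own bypass data paths, so the same response loop ports across all three with mostly mechanical changes.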
This approach enables NICShaders to deliver ultra-high-throughput, low-latency HTTP networking for modern CDN and edge compute workloads, pushing the boundaries of what's possible in web-scale content delivery and beyond.