

Dissecting and Improving Communication Performance in Multi-Node LLM Inference

Prajwal Singhania (University of Maryland), Siddharth Singh (University of Maryland), Lannie Dalton Hough (University of Maryland), Akarsh Srivastava (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory), Charles Fredrick Jekel (Lawrence Livermore National Laboratory), Abhinav Bhatele (University of Maryland)

System Optimization & Efficiency

Abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves 1.9×–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
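
For context, the recursive-doubling pattern that NVRAR builds on completes an all-reduce over p ranks in log2(p) rounds: in round k, each rank exchanges its full buffer with the rank whose ID differs in bit k, and both sides reduce. Because the message size stays constant per round while the round count grows only logarithmically, the pattern suits the latency-bound message sizes where the abstract reports NVRAR's gains. The sketch below simulates this communication pattern in plain Python over in-memory buffers; it is a textbook rendering of recursive doubling, not the paper's hierarchical NVSHMEM implementation, and all names in it are hypothetical.

```python
# Minimal simulation of the recursive-doubling all-reduce pattern.
# Each "rank" is an in-memory list; real implementations (e.g., NVRAR)
# exchange GPU buffers over the interconnect instead.

import math


def recursive_doubling_allreduce(buffers):
    """Return the state of all ranks after a recursive-doubling all-reduce.

    buffers[i] is rank i's local vector; on return, every rank holds the
    elementwise sum across all ranks after log2(p) exchange rounds.
    """
    p = len(buffers)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    rounds = int(math.log2(p))
    for k in range(rounds):
        # In round k, rank i pairs with rank i XOR 2^k. Both ranks send
        # their full buffer and reduce, so after round k each rank holds
        # the partial sum of a 2^(k+1)-rank group.
        stride = 1 << k
        new_buffers = []
        for rank in range(p):
            partner = rank ^ stride
            new_buffers.append(
                [a + b for a, b in zip(buffers[rank], buffers[partner])]
            )
        buffers = new_buffers
    return buffers


if __name__ == "__main__":
    # Eight simulated ranks, each contributing a two-element vector.
    data = [[float(rank), float(rank * 10)] for rank in range(8)]
    result = recursive_doubling_allreduce(data)
    expected = [sum(r[0] for r in data), sum(r[1] for r in data)]
    assert all(buf == expected for buf in result)
    print("all 8 simulated ranks hold", result[0])
```

The design point the simulation makes visible is step count: recursive doubling finishes in log2(p) exchanges of the full message, whereas a ring all-reduce takes 2(p-1) smaller steps, so for small messages the fixed per-step latency dominates and the logarithmic schedule wins.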
