Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS SageMaker HyperPod
Perplexity needed to transfer non-contiguous GPU memory regions between machines as fast as the hardware allows on AWS p5 instances, while supporting dynamic node addition and removal without disrupting operations. NCCL, the de facto standard library for GPU collective communication, was unsuitable because it requires a static cluster world and uses a synchronous communication model that fits poorly with Perplexity's asynchronous workload.
Perplexity's custom libfabric-based RDMA solution achieved 3,108 Gbps, or 97.1% of the theoretical 3,200 Gbps maximum, across all network cards on AWS p5 instances.
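The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the two throughput numbers quoted above. The network card count of 32 is an assumption based on the published p5.48xlarge specification, not something stated in this summary, and the constant and function names are illustrative.

```python
# Figures quoted in the summary above.
THEORETICAL_GBPS = 3200.0
ACHIEVED_GBPS = 3108.0

# Assumption (not stated in the summary): a p5.48xlarge instance exposes
# 32 network cards, which share the aggregate bandwidth evenly.
NUM_NETWORK_CARDS = 32


def link_efficiency(achieved_gbps: float, theoretical_gbps: float) -> float:
    """Return achieved throughput as a fraction of the theoretical maximum."""
    if theoretical_gbps <= 0:
        raise ValueError("theoretical_gbps must be positive")
    return achieved_gbps / theoretical_gbps


efficiency = link_efficiency(ACHIEVED_GBPS, THEORETICAL_GBPS)
shortfall_gbps = THEORETICAL_GBPS - ACHIEVED_GBPS
per_card_gbps = ACHIEVED_GBPS / NUM_NETWORK_CARDS
achieved_gigabytes_per_s = ACHIEVED_GBPS / 8  # 8 bits per byte

print(f"efficiency:        {efficiency:.1%}")
print(f"shortfall:         {shortfall_gbps:.0f} Gbps")
print(f"per-card average:  {per_card_gbps:.1f} Gbps")
print(f"aggregate rate:    {achieved_gigabytes_per_s:.1f} GB/s")
```

Under that assumption, each card averages a little over 97 Gbps, so the 92 Gbps gap to the theoretical maximum is spread thinly across all cards rather than concentrated in a few idle ones.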
NCCL fell short on three counts: it requires a static cluster world, so any change in node membership forces a full cluster restart; its synchronous model adds complexity for an asynchronous workload; and it does not permit direct control over memory transfer patterns.