
Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS SageMaker HyperPod

Perplexity needed to efficiently transfer non-contiguous GPU memory regions between machines at maximum possible speed on AWS p5 instances, while supporting dynamic node addition and removal without disrupting operations. NCCL, the de facto standard library, was unsuitable because it requires a static cluster world and uses a synchronous communication model incompatible with their asynchronous workload.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages are listed under "Tools used" below.

Stage 1 · Basic SEND/RECV transfer

Development began with implementing basic unidirectional message transfer using SEND/RECV.
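To make the two-sided SEND/RECV pattern concrete, here is a toy in-process model in Python. It is not libfabric code and not Perplexity's implementation: `ToyEndpoint`, `post_recv`, and `send` are hypothetical names invented for illustration. In libfabric the corresponding operations are `fi_recv`, `fi_send`, and completion-queue reads via `fi_cq_read`. The sketch only shows the ordering constraint: the receiver posts a buffer first, the sender's message lands in it, and both sides observe a completion.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyEndpoint:
    """Hypothetical in-process stand-in for one side of a SEND/RECV channel."""
    posted_recvs: deque = field(default_factory=deque)  # buffers awaiting data
    completions: list = field(default_factory=list)     # (kind, nbytes) events

    def post_recv(self, buf: bytearray) -> None:
        # The receiver must hand over a buffer before the data arrives.
        self.posted_recvs.append(buf)


def send(src: ToyEndpoint, dst: ToyEndpoint, payload: bytes) -> None:
    """Copy payload into the oldest receive buffer posted on dst."""
    if not dst.posted_recvs:
        raise RuntimeError("no receive buffer posted on the destination")
    if len(payload) > len(dst.posted_recvs[0]):
        raise ValueError("payload larger than the posted receive buffer")
    buf = dst.posted_recvs.popleft()
    buf[: len(payload)] = payload
    # Each side learns about the transfer through its own completion event.
    src.completions.append(("send", len(payload)))
    dst.completions.append(("recv", len(payload)))


# Unidirectional transfer: client -> server.
client, server = ToyEndpoint(), ToyEndpoint()
inbox = bytearray(64)
server.post_recv(inbox)
send(client, server, b"hello over the fabric")

# The completion tells the receiver how many bytes of the buffer are valid.
nbytes = server.completions[-1][1]
received = bytes(inbox[:nbytes])
```

A real transport delivers completions asynchronously and the copy is performed by the network card rather than the CPU; the model collapses both into a single function call to keep the ordering visible.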

Tools used

libfabric, EFA, RDMA, GPUDirect RDMA, NCCL, Kubernetes

Outcome

Perplexity's custom libfabric-based RDMA solution achieved 3,108 Gbps — 97.1% of the theoretical 3,200 Gbps maximum — across all network cards on AWS p5 instances.
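The reported efficiency can be checked with a few lines of arithmetic. The two throughput figures come from the writeup; the card count of 32 is an assumption based on the published p5 instance specification (32 EFA network cards at 100 Gbps each), and the per-card average assumes traffic is spread evenly across cards, which the summary does not state.

```python
ACHIEVED_GBPS = 3108     # from the writeup
THEORETICAL_GBPS = 3200  # from the writeup
NUM_CARDS = 32           # assumption: EFA network cards per p5 instance

# Fraction of the theoretical maximum that was reached, as a percentage.
efficiency_pct = 100 * ACHIEVED_GBPS / THEORETICAL_GBPS

# Per-card figures, assuming an even spread across all cards.
per_card_peak_gbps = THEORETICAL_GBPS / NUM_CARDS
per_card_avg_gbps = ACHIEVED_GBPS / NUM_CARDS

# Same aggregate rate in gigabytes per second (8 bits per byte).
achieved_gigabytes_per_s = ACHIEVED_GBPS / 8

print(f"efficiency: {efficiency_pct} %")
print(f"per card: {per_card_avg_gbps} of {per_card_peak_gbps} Gbps")
print(f"aggregate: {achieved_gigabytes_per_s} GB/s")
```

The computed efficiency is 97.125%, which the writeup reports as 97.1%.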

What failed first

NCCL was not ideal for three reasons: it requires a static cluster world, so any node change forces a full cluster restart; its synchronous model added complexity for an asynchronous workload; and it did not permit direct control over memory transfer patterns.

Results

Volume: 3,108 Gbps

Source

https://www.perplexity.ai/hub/blog/high-performance-gpu-memory-transfer-on-aws


Grounding & classification

Source type: technical build writeup

17 fields verified against source quotes.

failure mode described, metric backed, named customer, tools described, workflow described, software, throughput increase, technical build writeup