Scaling LLM-based prefill-only ranking systems with SGLang at LinkedIn
LinkedIn's LLM-based ranking workloads for AI Job Search and AI People Search suffered from high latency and low throughput because the existing serving infrastructure was optimized for generative LLMs rather than prefill-only scoring. The mismatch showed up as sequential tokenization, fragmented batch execution, and unnecessary decode loops, all under strict SLA pressure.
Staged optimizations to SGLang raised text-based ranking throughput roughly 3x, from 750 to 2,200 items/s/GPU, and cut P99 latency on the scoring path from 6,220 ms to 454 ms (13.7x). The system now powers AI Job Search and AI People Search for millions of LinkedIn members.
The default SGLang serving path had several specific failure modes: batch boundaries were lost in ZMQ socket transmission, which fragmented GPU execution; the full decode and sampling loop ran even though ranking needs only prefill; the per-query prefix KV was recomputed for every candidate item; and Python GC stalls caused 100–300 ms pauses under sustained load.
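To make the cost of recomputing the per-query prefix concrete, the following is a minimal sketch that counts how many tokens pass through prefill for one query and its candidates, with and without a shared prefix. It is not SGLang code: the function name `prefill_tokens` and the workload numbers (a 1,000-token query prefix, 100 candidates of 50 tokens each) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def prefill_tokens(prefix_len: int, item_lens: Sequence[int], share_prefix: bool) -> int:
    """Count tokens that go through prefill for one query and its candidate items.

    prefix_len   -- tokens in the per-query prefix (hypothetical value)
    item_lens    -- token counts of the candidate items (hypothetical values)
    share_prefix -- True if the prefix KV is computed once and reused
    """
    if share_prefix:
        # The prefix is prefilled once; each candidate adds only its own tokens.
        return prefix_len + sum(item_lens)
    # Each candidate is treated as an independent request, so the prefix
    # is prefilled again for every item.
    return sum(prefix_len + n for n in item_lens)


# Hypothetical workload: one query prefix, 100 candidates.
prefix_len = 1000
item_lens = [50] * 100

naive = prefill_tokens(prefix_len, item_lens, share_prefix=False)
shared = prefill_tokens(prefix_len, item_lens, share_prefix=True)

print(f"recomputed prefix: {naive} tokens")
print(f"shared prefix:     {shared} tokens")
print(f"reduction:         {naive / shared:.1f}x")
```

Under these assumed numbers the recomputed path prefills 105,000 tokens against 6,000 with a shared prefix. This counts prefill tokens only, so it should not be read as a prediction of the measured throughput or latency gains, which also depend on batching, tokenization, and GC behavior.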