recruiting · saas · workflow

Scaling LLM-based prefill-only ranking systems with SGLang at LinkedIn

LinkedIn's LLM-based ranking workloads for AI Job Search and AI People Search suffered from high latency and low throughput because the existing serving infrastructure was optimized for generative LLMs rather than prefill-only scoring. The result was sequential tokenization, fragmented batch execution, and unnecessary decode loops, all under strict SLA pressure.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Member query triggers ranking

A member query arrives and triggers scoring of hundreds of items.

Tools used

SGLang, ZMQ, gRPC, H100 GPUs

Outcome

Staged optimizations to SGLang raised text-based ranking throughput approximately 3x, from 750 to 2,200 items/s/GPU, and cut P99 latency for the scoring path from 6,220 ms to 454 ms (13.7x). The system now powers AI Job Search and AI People Search for millions of LinkedIn members.
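As a quick check on the figures quoted above, the ratios can be recomputed directly. This is only arithmetic on the numbers in the Outcome paragraph; the variable names are illustrative.

```python
# Figures quoted in the Outcome paragraph.
throughput_before = 750    # items/s/GPU
throughput_after = 2200    # items/s/GPU
p99_before_ms = 6220
p99_after_ms = 454

# Throughput improves by the ratio after/before; latency by before/after.
throughput_gain = throughput_after / throughput_before
latency_gain = p99_before_ms / p99_after_ms
latency_cut_pct = 100 * (1 - p99_after_ms / p99_before_ms)

print(f"throughput gain: {throughput_gain:.2f}x")
print(f"P99 latency reduction: {latency_gain:.1f}x ({latency_cut_pct:.1f}% lower)")
```

The throughput ratio comes out just under 3x, which matches the "approximately 3x" wording, and the latency ratio rounds to the stated 13.7x.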

What failed first

The default SGLang serving path had several specific failure modes: batch boundaries were lost in ZMQ socket transmission, causing fragmented GPU execution; the full decode and sampling loop ran unnecessarily for ranking; the per-query prefix KV was recomputed for every candidate item; and Python GC stalls caused 100–300 ms pauses under sustained load.
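To see why recomputing the per-query prefix KV for every candidate is costly, the following minimal sketch counts the tokens that must be prefilled with and without prefix reuse. The function name and the token counts are hypothetical, chosen only for illustration; they are not taken from the source and do not describe SGLang's internals.

```python
def prefill_tokens(prefix_len: int, item_lens: list[int], share_prefix: bool) -> int:
    """Count tokens the model must prefill to score all candidates for one query."""
    if share_prefix:
        # The query prefix is prefilled once and its KV reused for each candidate.
        return prefix_len + sum(item_lens)
    # Without reuse, every candidate repeats the full prefix before its own tokens.
    return sum(prefix_len + n for n in item_lens)


# Hypothetical sizes: a 1,000-token query prefix and 200 candidates of 100 tokens each.
prefix_len = 1000
item_lens = [100] * 200

naive = prefill_tokens(prefix_len, item_lens, share_prefix=False)
shared = prefill_tokens(prefix_len, item_lens, share_prefix=True)
print(f"without reuse: {naive} tokens, with reuse: {shared} tokens")
```

Under these assumed sizes, most of the prefill work without reuse is spent on repeated copies of the same prefix, and the gap widens as the prefix grows relative to each candidate.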

Results

Time saved: 4583 ms to 464 ms

Volume: 41.5%

Source

https://www.linkedin.com/blog/engineering/ai/scaling-llm-based-ranking-systems-with-sglang-at-linkedin


Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

enterprise search · recommendation system · resume · failure mode described · metric backed · named customer · production runtime claimed · source backed · tools described · workflow described · software · cycle time reduction · response time reduction · throughput increase · technical build writeup · recruiting · extract classify route