Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang

DeepSeek R1's massive model size and unique MoE architecture normally require many high-end AI accelerators to deploy; the Intel PyTorch team proposed a CPU-only solution as an alternative, at a fraction of the cost.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Prompt split into prefix and extend

SGLang divides the query sequence into a prefix part and an extend part before processing.
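The split can be illustrated with a minimal sketch. It assumes the prefix is the leading run of prompt tokens that matches an already-processed (cached) sequence and the extend part is whatever remains; the function name `split_prefix_extend` and the token lists are hypothetical and are not SGLang's actual API.

```python
from typing import List, Tuple


def split_prefix_extend(tokens: List[int], cached: List[int]) -> Tuple[List[int], List[int]]:
    """Split `tokens` into (prefix, extend).

    prefix: longest leading run of `tokens` that matches `cached`
            (assumption: this part was already processed earlier)
    extend: the remaining tokens that still need to be processed
    """
    n = 0
    limit = min(len(tokens), len(cached))
    # Walk forward while the prompt still agrees with the cached sequence.
    while n < limit and tokens[n] == cached[n]:
        n += 1
    return tokens[:n], tokens[n:]


# Hypothetical token ids: the first three match the cached sequence.
prefix, extend = split_prefix_extend([1, 2, 3, 4, 5], [1, 2, 3, 9])
print(prefix, extend)
```

Only the extend part needs fresh computation in this sketch; the prefix part is reused as-is.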

Tools used

SGLang, Intel® Advanced Matrix Extensions (AMX), llama.cpp, PyTorch, AVX512, KTransformers

Outcome

The optimized SGLang CPU backend delivers substantially faster LLM inference on Intel Xeon CPUs: memory bandwidth efficiency reaches 85% for INT8 MoE, and both time-to-first-token and time-per-output-token are significantly lower than with llama.cpp. The work has been upstreamed into the SGLang main branch.

What failed first

Existing CPU tools like llama.cpp processed MoE experts sequentially rather than in parallel, leading to substantially slower inference for large MoE models.
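The difference between the two dispatch patterns can be sketched in a toy form. Everything here is hypothetical: the "experts" are simple scaling functions standing in for real expert computations, and Python threads only illustrate the structure (real kernels parallelize across cores in native code). This is not how llama.cpp or SGLang is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Toy "experts": each one scales its input by a fixed weight
# (hypothetical stand-in for an expert's feed-forward computation).
EXPERT_WEIGHTS = [0.5, 2.0, 3.0, 10.0]


def run_expert(expert_id: int, inputs: List[float]) -> List[float]:
    w = EXPERT_WEIGHTS[expert_id]
    return [w * x for x in inputs]


def moe_sequential(routed: Dict[int, List[float]]) -> Dict[int, List[float]]:
    # Process one expert at a time, in order.
    return {e: run_expert(e, xs) for e, xs in routed.items()}


def moe_parallel(routed: Dict[int, List[float]], max_workers: int = 4) -> Dict[int, List[float]]:
    # Submit every expert's work up front, then collect the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {e: pool.submit(run_expert, e, xs) for e, xs in routed.items()}
        return {e: f.result() for e, f in futures.items()}


# Hypothetical routing: expert id -> inputs assigned to that expert.
routed = {0: [1.0, 2.0], 2: [4.0], 3: [0.5]}
assert moe_sequential(routed) == moe_parallel(routed)
```

Both functions return the same outputs; only the scheduling differs, which is where the sequential approach loses time on models with many experts.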

Results

Time saved: 3%

Volume: 6-14x

Source

https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/

Grounding & classification

Source type: technical build writeup

29 fields verified against source quotes.

failure mode described, metric backed, production runtime claimed, source backed, tools described, workflow described, software, cost reduction, cycle time reduction, throughput increase, technical build writeup