
Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang

DeepSeek R1's massive model size and unique MoE architecture normally require many high-end AI accelerators to deploy; the Intel PyTorch team proposed a CPU-only solution, at a fraction of the cost, as an alternative.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Prompt split into prefix and extend
SGLang divides the query sequence into a prefix part and an extend part before processing.
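The split described in this stage can be sketched as follows. This is a minimal illustration, not SGLang's implementation: it assumes the prefix is the leading run of tokens that matches a previously processed sequence (so its cached state could be reused) and the extend part is everything after it. The function name `split_prefix_extend` and its parameters are hypothetical.

```python
from typing import List, Tuple


def split_prefix_extend(
    tokens: List[int], cached_tokens: List[int]
) -> Tuple[List[int], List[int]]:
    """Split `tokens` into (prefix, extend).

    Assumption: the prefix is the longest leading run of `tokens` that
    matches `cached_tokens`; the extend part is the remainder that still
    needs to be processed.
    """
    match_len = 0
    for new_tok, cached_tok in zip(tokens, cached_tokens):
        if new_tok != cached_tok:
            break
        match_len += 1
    return tokens[:match_len], tokens[match_len:]


# Hypothetical token ids: the first two tokens were seen before.
prefix, extend = split_prefix_extend([1, 2, 3, 4], [1, 2, 9])
```

With this framing, only the extend part requires fresh computation, while the prefix can rely on work done earlier.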
Tools used
SGLang, Intel® Advanced Matrix Extensions (AMX), llama.cpp, PyTorch, AVX512, KTransformers
Outcome

The optimized SGLang CPU backend achieves substantially faster LLM inference on Intel Xeon CPUs, with memory bandwidth efficiency of 85% for INT8 MoE and significantly reduced time-to-first-token and time-per-output-token compared to llama.cpp; the work has been upstreamed into the SGLang main branch.

What failed first

Existing CPU tools like llama.cpp processed MoE experts sequentially rather than in parallel, leading to substantially slower inference for large MoE models.
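The difference between sequential and parallel expert processing can be sketched as below. This is a toy illustration only: `run_expert` is a hypothetical stand-in for an expert's feed-forward computation, and Python threads here show the scheduling pattern rather than the actual CPU kernels used by llama.cpp or SGLang.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def group_by_expert(assignments: List[int]) -> Dict[int, List[int]]:
    """Map each expert id to the indices of the tokens routed to it."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for token_idx, expert_id in enumerate(assignments):
        groups[expert_id].append(token_idx)
    return dict(groups)


def run_expert(expert_id: int, values: List[float]) -> List[float]:
    # Hypothetical stand-in for an expert's feed-forward computation.
    return [v * (expert_id + 1) for v in values]


def moe_sequential(values: List[float], assignments: List[int]) -> List[float]:
    """Process one expert after another."""
    out = [0.0] * len(values)
    for expert_id, idxs in group_by_expert(assignments).items():
        results = run_expert(expert_id, [values[i] for i in idxs])
        for i, r in zip(idxs, results):
            out[i] = r
    return out


def moe_parallel(
    values: List[float], assignments: List[int], workers: int = 4
) -> List[float]:
    """Submit every expert's work at once, then scatter results back."""
    out = [0.0] * len(values)
    groups = group_by_expert(assignments)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            expert_id: pool.submit(run_expert, expert_id, [values[i] for i in idxs])
            for expert_id, idxs in groups.items()
        }
        for expert_id, future in futures.items():
            for i, r in zip(groups[expert_id], future.result()):
                out[i] = r
    return out
```

Both functions produce the same output; the parallel version differs only in that the work for all experts is dispatched before any result is awaited.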

Results
Time saved: 3%
Volume: 6-14x
Source

https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/


Grounding & classification
Source type: technical build writeup
29 fields verified against source quotes.
Tags: failure mode described, metric backed, production runtime claimed, source backed, tools described, workflow described, software, cost reduction, cycle time reduction, throughput increase, technical build writeup