
Snowflake achieves 16x embedding inference throughput improvement with Arctic Inference optimizations

When Snowflake profiled embedding models running on vLLM, GPU utilization fell well short of what a PyTorch-native implementation could achieve: the embed function accounted for only 10% of compute time, while the remaining 90% went to CPU-side work, chiefly tokenization and serialization.
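As a back-of-the-envelope reading of that 10%/90% split, Amdahl's law gives the overall speedup available from shrinking only the CPU-side overhead. The sketch below takes the 90% figure from the profile above; the 2x and 10x overhead speedups are hypothetical inputs chosen for illustration, not numbers from the source.

```python
def overall_speedup(overhead_fraction: float, overhead_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the overhead portion gets faster."""
    remaining = (1.0 - overhead_fraction) + overhead_fraction / overhead_speedup
    return 1.0 / remaining

cpu_fraction = 0.90  # share of time spent outside the embed function, per the profile

halved = overall_speedup(cpu_fraction, 2.0)            # hypothetical: overhead made 2x faster
tenfold = overall_speedup(cpu_fraction, 10.0)          # hypothetical: overhead made 10x faster
ceiling = overall_speedup(cpu_fraction, float("inf"))  # overhead removed entirely
```

Under these assumptions, removing the overhead entirely caps the gain at about 10x for the same GPU work; anything beyond that has to come from speeding up the GPU side as well.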

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Embedding request via gRPC
Embedding inference requests with prompts as strings arrive at the vLLM service behind a gRPC frontend.
Tools used
vLLM · Arctic Inference · gRPC · NumPy · snowflake-arctic-embed-m-v1.5 · Text Embeddings Inference (TEI) · Cortex AI · Cortex Search
Outcome

Snowflake applied three optimizations: little-endian byte serialization, disaggregated tokenization, and multi-replica GPU execution. Together they delivered 16x higher throughput than vLLM for short sequences and 4.2x for long sequences, a 3x improvement in Snowflake Cortex AI, reaching 230,000 tokens per second, and 16x cost savings versus vLLM on H200 hardware.
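To make the first of these optimizations concrete, here is a minimal sketch of little-endian byte serialization with NumPy: the embedding is shipped as one raw byte payload rather than as a list of individual floats. The function names, the 768-dimension vector, and the float32 dtype are assumptions for illustration; the source does not show Snowflake's actual code or wire format.

```python
import numpy as np

def encode_embedding(vec) -> bytes:
    """Pack an embedding as raw little-endian float32 bytes."""
    # "<f4" pins the byte order to little-endian regardless of the host CPU.
    return np.asarray(vec, dtype="<f4").tobytes()

def decode_embedding(payload: bytes) -> np.ndarray:
    """Recover the float32 vector from the raw byte payload."""
    # frombuffer interprets the bytes directly, with no per-element Python loop.
    return np.frombuffer(payload, dtype="<f4")

# Assumed example vector: 768 float32 values (dimension chosen for illustration).
embedding = np.random.default_rng(0).standard_normal(768).astype(np.float32)

payload = encode_embedding(embedding)   # 768 * 4 bytes on the wire
restored = decode_embedding(payload)    # read-only view over the payload
```

A payload like this could travel in a single bytes field of a message, so the per-float encoding work is replaced by one bulk memory copy on each side.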

What failed first

vLLM's sequential tokenization-then-inference design left the GPU idle during tokenization, and Python Protobuf serialization over gRPC lacked SIMD vectorization and suffered from GIL contention, together consuming 90% of total processing time.
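The idle-GPU problem described here can be pictured with a small pipelining sketch: tokenization of the next batch is started in the background before the current batch is embedded, so the two stages overlap instead of running strictly one after the other. This is a generic illustration and not Arctic Inference's implementation; `tokenize` and `embed` are hypothetical stand-ins, and with pure-Python stand-ins the GIL still limits real parallelism, so a real setup would need a separate process or a tokenizer that releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(batch):
    # Hypothetical stand-in for a real tokenizer (CPU-side work).
    return [text.lower().split() for text in batch]

def embed(token_batch):
    # Hypothetical stand-in for the GPU forward pass: one number per prompt.
    return [float(len(tokens)) for tokens in token_batch]

def embed_pipelined(batches):
    """Tokenize batch i+1 in the background while batch i is being embedded."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(tokenize, batches[0])
        for next_batch in batches[1:]:
            tokens = pending.result()
            # Kick off the next tokenization before running the current embed.
            pending = pool.submit(tokenize, next_batch)
            results.append(embed(tokens))
        results.append(embed(pending.result()))
    return results

batches = [["Hello world", "a b c"], ["one"], ["x y", "z"]]
outputs = embed_pipelined(batches)
```

The output order matches the input order because each batch is embedded only after its own tokenization result has been collected.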

Results
Time saved: 10%
Volume: 16x
Cost replaced: 16x
Running since: at time of publishing (A10g, production)
Source

https://www.snowflake.com/en/engineering-blog/embedding-inference-arctic-16x-faster/


Grounding & classification
Source type: technical build writeup
33 fields verified against source quotes.
enterprise search · fraud detection · recommendation system · knowledge base · failure mode described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · cost reduction · cycle time reduction · throughput increase · technical build writeup · back office ops · rag answering