When Snowflake profiled embedding models running on vLLM, GPU utilization was far below what a PyTorch-native implementation could achieve: the embed function accounted for only 10% of compute time, while the other 90% went to CPU-side tokenization and serialization.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Embedding request via gRPC
Embedding inference requests with prompts as strings arrive at the vLLM service behind a gRPC frontend.
After three optimizations (little-endian byte serialization, disaggregated tokenization, and multi-replica GPU execution), Snowflake achieved 16x higher throughput for short sequences and 4.2x for long sequences versus vLLM, a 3x improvement in Snowflake Cortex AI, delivering 230,000 tokens per second, and 16x cost savings versus vLLM on H200 hardware.
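To make the first optimization concrete, here is a minimal sketch of packing an embedding vector as contiguous little-endian float32 bytes instead of encoding each float as a separate message field. The function names `encode_embedding` and `decode_embedding` are hypothetical, and the source does not show Snowflake's actual wire format, so treat this as an illustration of the general idea only.

```python
import struct


def encode_embedding(values):
    """Pack floats into one contiguous little-endian float32 byte string.

    Hypothetical helper: the "<" prefix forces little-endian byte order
    regardless of the host CPU, and "f" is a 4-byte IEEE 754 float.
    """
    return struct.pack(f"<{len(values)}f", *values)


def decode_embedding(payload):
    """Unpack a little-endian float32 byte string back into a list of floats."""
    count, remainder = divmod(len(payload), 4)
    if remainder:
        raise ValueError("payload length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{count}f", payload))


if __name__ == "__main__":
    vector = [1.0, -2.5, 0.25]  # values exactly representable in float32
    payload = encode_embedding(vector)
    print(len(payload))
    print(decode_embedding(payload))
```

A payload like this can travel in a single bytes field, so the serializer copies one buffer rather than visiting every element in Python. With an array library, a call such as NumPy's `ndarray.tobytes()` on a little-endian float32 array would play the same role as `encode_embedding`; whether Snowflake used that exact call is not stated in the source.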
What failed first
vLLM's sequential tokenization-then-inference design left the GPU idle during tokenization, and Python Protobuf serialization over gRPC lacked SIMD vectorization and suffered from GIL contention, together consuming 90% of total processing time.
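The idle-GPU problem described above comes from running tokenization and inference back to back. A minimal sketch of the alternative, in which a tokenizer stage feeds a bounded queue so the inference stage can start on one request while the next is still being tokenized, might look like the following. `fake_tokenize`, `fake_embed`, and `run_pipeline` are hypothetical stand-ins, not vLLM or Snowflake APIs, and the sketch uses threads only for brevity: because of the GIL contention the source mentions, a real deployment would more plausibly run tokenization in separate processes.

```python
import queue
import threading

_DONE = object()  # sentinel telling the inference stage to stop


def fake_tokenize(prompt):
    # Hypothetical stand-in for a real tokenizer: one integer per word.
    return [len(word) for word in prompt.split()]


def fake_embed(token_ids):
    # Hypothetical stand-in for the GPU forward pass.
    return [float(sum(token_ids)), float(len(token_ids))]


def run_pipeline(prompts, max_pending=8):
    """Tokenize and embed prompts in two overlapping stages."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so tokenization cannot run far ahead
    results = []

    def tokenizer_stage():
        for prompt in prompts:
            pending.put(fake_tokenize(prompt))
        pending.put(_DONE)

    def inference_stage():
        while True:
            token_ids = pending.get()
            if token_ids is _DONE:
                break
            results.append(fake_embed(token_ids))

    stages = [
        threading.Thread(target=tokenizer_stage),
        threading.Thread(target=inference_stage),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["hello world", "a bc def"]))
```

Results come back in request order because a single consumer drains a FIFO queue. With several inference replicas, each result would need to carry a request ID instead.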
Results
Time saved: 10%
Volume: 16x
Cost replaced: 16x
Running since: at time of publishing (A10g, production)