customer_support · saas · workflow
How ElevenLabs engineered RAG to be 50% faster with model racing
ElevenLabs built RAG directly into every query for consistent accuracy, but the query rewriting step relied on a single externally-hosted LLM, creating a hard latency dependency that accounted for more than 80% of total RAG latency.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Query enters pipeline
RAG is built directly into the request pipeline and runs on every query.
Tools used
RAG · Qwen 3-4B
Outcome
Model racing cut median RAG latency from 326ms to 155ms, with p75 dropping from 436ms to 250ms and p95 from 629ms to 426ms, while provider outages no longer interrupted conversations.
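As a quick check on the headline figure, the relative reductions implied by the latencies quoted above can be worked out directly. This is plain arithmetic on the reported numbers; the function name `reduction_pct` is only illustrative.

```python
# Reported RAG latencies in milliseconds: (before, after) per percentile.
reported = {
    "median": (326, 155),
    "p75": (436, 250),
    "p95": (629, 426),
}


def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the old latency."""
    return round((before_ms - after_ms) / before_ms * 100, 1)


reductions = {name: reduction_pct(b, a) for name, (b, a) in reported.items()}
print(reductions)  # {'median': 52.5, 'p75': 42.7, 'p95': 32.3}
```

The median reduction of about 52% is what the "50% faster" headline rounds to. The gains at p75 and p95 are smaller in relative terms.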
What failed first
The prior architecture's single externally-hosted LLM for query rewriting was vulnerable to peak-demand slowdowns and provider outages, making the system fragile.
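A minimal sketch of the general model-racing idea, assuming it means sending the same rewrite request to several models at once and using the first successful reply. Everything here is hypothetical: the backend names, the simulated delays, the one-second deadline, and the fallback to the unmodified query are illustration choices, not details confirmed by the writeup.

```python
import asyncio
from typing import Awaitable, Callable

Backend = Callable[[str], Awaitable[str]]


def make_backend(name: str, delay_s: float, fail: bool = False) -> Backend:
    """Build a fake rewriting backend (a stand-in for a real model call)."""
    async def call(query: str) -> str:
        await asyncio.sleep(delay_s)  # simulated network + inference time
        if fail:
            raise RuntimeError(f"{name} unavailable")
        return f"{name}: {query}"
    return call


async def race_rewrite(query: str, backends: list[Backend],
                       timeout_s: float = 1.0) -> str:
    """Query all backends concurrently and return the first successful result.

    Assumption: if every backend fails or the deadline passes, fall back to
    the original, unrewritten query.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    pending = {asyncio.create_task(call(query)) for call in backends}
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # overall deadline reached
            done, pending = await asyncio.wait(
                pending, timeout=remaining,
                return_when=asyncio.FIRST_COMPLETED,
            )
            for task in done:
                if task.exception() is None:
                    return task.result()  # first success wins
                # A failed backend is ignored; keep waiting for the others.
        return query
    finally:
        for task in pending:
            task.cancel()  # stop the slower backends once a result is chosen


def demo() -> str:
    backends = [
        make_backend("hosted-llm", 0.05, fail=True),  # simulated outage
        make_backend("self-hosted-a", 0.02),
        make_backend("self-hosted-b", 0.08),
    ]
    return asyncio.run(race_rewrite("what about refunds?", backends))
```

In this sketch a slow or failing backend only costs latency when every other backend is slower still, which is how racing removes the hard dependency on a single provider.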
Results
Time saved: 326ms → 155ms
Volume: more than 80% of RAG latency
Grounding & classification
Source type: technical build writeup
18 fields verified against source quotes.
agentic workflow · conversational ai · rag · knowledge base · metric backed · production runtime claimed · tools described · workflow described · software · response time reduction · technical build writeup · customer support · rag answering