customer_support · saas · workflow

How ElevenLabs engineered RAG to be 50% faster with model racing

ElevenLabs built RAG directly into every query for consistent accuracy. The query rewriting step, however, relied on a single externally hosted LLM, a hard dependency that accounted for more than 80% of total RAG latency.

How it works

Common implementation structure

The structure below shows how this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Query enters pipeline

RAG is built directly into the request pipeline and runs on every query.

Tools used

RAG, Qwen 3-4B

Outcome

Model racing cut median RAG latency from 326ms to 155ms, with p75 dropping from 436ms to 250ms and p95 from 629ms to 426ms, while provider outages no longer interrupted conversations.
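
The racing idea can be sketched in a few lines of Python: send the same query-rewriting request to several models at once, return the first usable answer, and cancel the rest. This is a minimal illustration, not ElevenLabs' implementation. The function names, the one-second budget, and the fallback to the unrewritten query are assumptions made for the sketch, and the three stand-in rewriters only simulate model calls with sleeps.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# A rewriter takes the user query and returns a rewritten query (hypothetical type).
Rewriter = Callable[[str], Awaitable[str]]


async def race_rewrite(query: str, rewriters: Sequence[Rewriter],
                       timeout_s: float = 1.0) -> str:
    """Run all rewriters concurrently and return the first non-empty result.

    Assumption for this sketch: if every model fails or the time budget
    runs out, the original query is returned unchanged.
    """
    pending = {asyncio.ensure_future(r(query)) for r in rewriters}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining,
                return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                # Skip models that raised; accept the first non-empty answer.
                if task.exception() is None and task.result().strip():
                    return task.result()
        return query
    finally:
        # Losing (or timed-out) requests are cancelled so they stop consuming resources.
        for task in pending:
            task.cancel()


# Hypothetical stand-ins for real model calls.
async def slow_model(query: str) -> str:
    await asyncio.sleep(0.2)
    return f"slow: {query}"


async def fast_model(query: str) -> str:
    await asyncio.sleep(0.01)
    return f"fast: {query}"


async def broken_model(query: str) -> str:
    raise RuntimeError("provider outage")


if __name__ == "__main__":
    print(asyncio.run(race_rewrite("reset my password",
                                   [slow_model, fast_model, broken_model])))
```

With this shape, a slow or failing provider only costs the time it takes another model to answer, which is the property the outcome above attributes to racing.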

What failed first

The prior architecture's single externally-hosted LLM for query rewriting was vulnerable to peak-demand slowdowns and provider outages, making the system fragile.

Results

Time saved: 326ms → 155ms

Volume: more than 80% of RAG latency
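
To relate the reported latencies to the "50% faster" headline, the percentage reduction at each percentile can be computed directly from the figures quoted under Outcome. The helper name below is illustrative; the numbers are the ones reported above.

```python
def pct_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage by which latency dropped from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100


# (before, after) latencies in milliseconds, as reported in the Outcome section.
reported = {"median": (326, 155), "p75": (436, 250), "p95": (629, 426)}

reductions = {name: round(pct_reduction(before, after), 1)
              for name, (before, after) in reported.items()}

for name, value in reductions.items():
    print(f"{name}: {value}% lower")
```

The median reduction comes out slightly above 50%, while the gains at p75 and p95 are smaller.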

Source

https://elevenlabs.io/blog/engineering-rag


Grounding & classification

Source type: technical build writeup

18 fields verified against source quotes.

agentic workflow · conversational ai · rag · knowledge base · metric backed · production runtime claimed · tools described · workflow described · software · response time reduction · technical build writeup · customer support · rag answering