Perplexity AI serves 435 million search queries a month using NVIDIA H100 GPUs, Triton Inference Server, and TensorRT-LLM
Perplexity AI's inference team faced growing pressure to provision the hardware and software needed to serve hundreds of millions of AI-powered search queries each month while balancing cost efficiency against user experience.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User search query arrives
Perplexity AI receives more than 435 million user search queries each month, with each query representing multiple AI inference requests.
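To put the monthly figure in serving terms, the arithmetic below converts it to an average per-second rate. It is a rough sketch only: it assumes a 30-day month and perfectly even traffic, and the fan-out factor `ASSUMED_REQUESTS_PER_QUERY` is a hypothetical placeholder, since the source says only that each query represents multiple inference requests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(monthly_count: float, days_in_month: int = 30) -> float:
    """Average events per second, assuming traffic is spread evenly over the month."""
    return monthly_count / (days_in_month * SECONDS_PER_DAY)


def inference_rate_per_second(query_rate: float, requests_per_query: float) -> float:
    """Inference requests per second for a given per-query fan-out."""
    return query_rate * requests_per_query


# From the case study: "more than 435 million" queries each month.
MONTHLY_QUERIES = 435_000_000

# Hypothetical fan-out; the source does not state how many inference
# requests a single query produces.
ASSUMED_REQUESTS_PER_QUERY = 3

if __name__ == "__main__":
    qps = average_rate_per_second(MONTHLY_QUERIES)
    rps = inference_rate_per_second(qps, ASSUMED_REQUESTS_PER_QUERY)
    print(f"average queries per second: {qps:.1f}")
    print(f"inference requests per second at assumed fan-out: {rps:.1f}")
```

Under these assumptions the average comes to roughly 168 queries per second. Real traffic peaks well above its average, so capacity planning would start from peak load rather than this number.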
Tools used
NVIDIA H100 Tensor Core GPUs, NVIDIA Triton Inference Server, NVIDIA TensorRT-LLM, Kubernetes (partner), Llama 3.1, CUDA kernels
Outcome
Perplexity AI serves more than 435 million queries per month across over 20 simultaneous AI models under strict SLAs, and saved approximately $1 million annually by self-hosting models for the Related-Questions feature on cloud-hosted NVIDIA GPUs rather than using third-party LLM provider APIs.
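The self-hosting comparison behind a savings figure like this can be sketched as simple arithmetic: yearly API spend minus yearly GPU spend. The function names and every input value below are hypothetical placeholders for illustration. The source reports only the approximate $1 million annual result, not the request volume, API pricing, or GPU count behind it.

```python
HOURS_PER_YEAR = 24 * 365


def annual_api_cost(requests_per_year: float, price_per_1k_requests: float) -> float:
    """Yearly spend when every request goes to a third-party LLM API."""
    return requests_per_year / 1_000 * price_per_1k_requests


def annual_self_host_cost(gpu_count: int, price_per_gpu_hour: float) -> float:
    """Yearly spend for cloud GPUs kept running around the clock."""
    return gpu_count * price_per_gpu_hour * HOURS_PER_YEAR


def annual_savings(requests_per_year: float, price_per_1k_requests: float,
                   gpu_count: int, price_per_gpu_hour: float) -> float:
    """Positive when self-hosting is cheaper than the API under these inputs."""
    return (annual_api_cost(requests_per_year, price_per_1k_requests)
            - annual_self_host_cost(gpu_count, price_per_gpu_hour))


if __name__ == "__main__":
    # Placeholder inputs only; none of these values come from the case study.
    savings = annual_savings(
        requests_per_year=1_000_000_000,
        price_per_1k_requests=2.0,
        gpu_count=8,
        price_per_gpu_hour=4.0,
    )
    print(f"illustrative annual savings: ${savings:,.0f}")
```

A fuller comparison would also count engineering time, idle capacity, and traffic peaks, all of which this sketch leaves out.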
Results
Query volume: more than 435 million queries each month
Models served: over 20 AI models simultaneously
Cost replaced: approximately $1 million annually
Grounding & classification
Source type: platform led case
23 fields verified against source quotes, 1 dropped as unverifiable.
conversational ai · enterprise search · summarization · knowledge base · metric backed · named customer · production runtime claimed · tools described · workflow described · software · cost reduction · throughput increase · platform led case · back office ops · extract classify route