Perplexity AI serves 435 million search queries a month using NVIDIA H100 GPUs, Triton Inference Server, and TensorRT-LLM

Perplexity AI's inference team faced increasing pressure to provision the hardware and software needed to serve hundreds of millions of AI-powered search queries each month while simultaneously balancing cost efficiency with optimal user experience.

How it works

Common implementation structure

This section describes how this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · User search query arrives

Perplexity AI receives more than 435 million user search queries each month, with each query representing multiple AI inference requests.
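To put the monthly figure in serving terms, the following back-of-the-envelope sketch converts 435 million queries per month into an average per-second rate. It assumes a 30-day month and evenly spread traffic, so real peaks would be higher. The source says only that each query triggers "multiple" inference requests; the fan-out value of 3 is a hypothetical placeholder, not a figure from the source.

```python
# Rough load estimate from the monthly query count quoted in the source.
# Assumptions (not from the source): a 30-day month, evenly spread traffic,
# and a hypothetical number of inference requests per query.

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(monthly_total: float, days_in_month: int = 30) -> float:
    """Average events per second if the monthly total is spread evenly."""
    return monthly_total / (days_in_month * SECONDS_PER_DAY)


def inference_requests_per_second(
    monthly_queries: float,
    requests_per_query: float,
    days_in_month: int = 30,
) -> float:
    """Average inference requests per second for a given per-query fan-out."""
    return average_rate_per_second(monthly_queries, days_in_month) * requests_per_query


MONTHLY_QUERIES = 435_000_000      # figure quoted in the source
ASSUMED_REQUESTS_PER_QUERY = 3     # hypothetical; the source says only "multiple"

if __name__ == "__main__":
    qps = average_rate_per_second(MONTHLY_QUERIES)
    rps = inference_requests_per_second(MONTHLY_QUERIES, ASSUMED_REQUESTS_PER_QUERY)
    print(f"average queries per second: {qps:.1f}")
    print(f"average inference requests per second (assumed fan-out): {rps:.1f}")
```

Because this is an average, it is only a lower bound for capacity planning. Provisioning would need to account for peak traffic and for the latency targets in the SLAs.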

Tools used

NVIDIA H100 Tensor Core GPUs, NVIDIA Triton Inference Server, NVIDIA TensorRT-LLM, Kubernetes (partner), Llama 3.1, CUDA kernels

Outcome

Perplexity AI serves more than 435 million queries per month across more than 20 AI models running simultaneously, all under strict SLAs. It also saved approximately $1 million annually by self-hosting the models behind its Related-Questions feature on cloud-hosted NVIDIA GPUs instead of calling third-party LLM provider APIs.

Results

Time saved: more than 435 million queries each month

Volume: over 20 AI models simultaneously

Cost replaced: approximately $1 million annually

Source

https://developer.nvidia.com/blog/spotlight-perplexity-ai-serves-400-million-search-queries-a-month-using-nvidia-inference-stack

Grounding & classification

Source type: platform led case

23 fields verified against source quotes, 1 dropped as unverifiable.

Tags: conversational ai, enterprise search, summarization, knowledge base, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cost reduction, throughput increase, platform led case, back office ops, extract classify route