
Nextdoor CoreML team achieves 4x latency reduction and 3x throughput increase by tuning ML inference in shared hosting environments

Running ML inference in shared hosting environments (ECS, K8s) introduces non-obvious pitfalls that can significantly degrade latency and throughput.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · ML inference request arrives

ML inference requests arrive at a shared hosting environment such as ECS or K8s.

Tools used

ECS, K8s, OpenMP

Outcome

After fixing request queue management and tuning OpenMP parameters, Nextdoor reduced latency by a factor of 4, increased throughput 3x, and raised CPU utilization from 10% to 50% while maintaining model performance.
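The summary above does not say which OpenMP parameters were changed or to what values. As an illustration only, one common adjustment in containers is to size the OpenMP thread pool from the container's CPU quota rather than the host's core count. The sketch below assumes a Linux host with cgroup v2 (quota exposed at `/sys/fs/cgroup/cpu.max`) and uses the standard `OMP_NUM_THREADS` environment variable; the helper names `parse_cpu_max`, `pick_omp_threads`, and `configure_openmp` are hypothetical.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max contents: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit in cores, or None when no limit is set.
    """
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None
    quota, period = int(parts[0]), int(parts[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def container_cpu_limit(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[float]:
    """Read the container CPU limit; None if the file is missing or malformed."""
    try:
        return parse_cpu_max(Path(path).read_text())
    except (OSError, ValueError):
        return None


def pick_omp_threads(limit: Optional[float], host_cpus: int) -> int:
    """Choose a thread count no larger than the quota, and at least 1."""
    if limit is None:
        return max(1, host_cpus)
    return max(1, min(host_cpus, math.floor(limit)))


def configure_openmp() -> int:
    """Set OMP_NUM_THREADS unless the operator already set it.

    Assumption: this runs before any OpenMP-backed library is imported,
    since many runtimes read the variable only once at load time.
    """
    threads = pick_omp_threads(container_cpu_limit(), os.cpu_count() or 1)
    os.environ.setdefault("OMP_NUM_THREADS", str(threads))
    return threads
```

Calling `configure_openmp()` at the top of the service entry point, before importing the model runtime, is the intended usage. Rounding the quota down is a conservative choice for this sketch; the right value for a given service should be found by measurement.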

What failed first

Nextdoor's ML team first saw latency and throughput degrade. The team traced the degradation to poor request queue management and suboptimal OpenMP parameter configuration, then resolved both root causes.
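The summary does not describe how the request queue was fixed. As a generic sketch of one way request queue management can be made explicit, the example below bounds the number of pending requests and skips requests whose caller has already timed out, so workers do not spend CPU on answers nobody is waiting for. The class `BoundedInferenceQueue`, its methods, and the `Request` fields are hypothetical names, not taken from the source.

```python
import queue
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Request:
    payload: Any
    deadline: float  # time.monotonic() value after which the caller has given up


class BoundedInferenceQueue:
    """Bounded FIFO that rejects new work when full and drops expired work."""

    def __init__(self, max_pending: int) -> None:
        self._q = queue.Queue(maxsize=max_pending)

    def submit(self, payload: Any, timeout_s: float) -> bool:
        """Enqueue a request; return False (shed load) if the queue is full."""
        req = Request(payload, time.monotonic() + timeout_s)
        try:
            self._q.put_nowait(req)
            return True
        except queue.Full:
            return False

    def next_live(self) -> Optional[Request]:
        """Return the oldest request still within its deadline, or None."""
        while True:
            try:
                req = self._q.get_nowait()
            except queue.Empty:
                return None
            if time.monotonic() <= req.deadline:
                return req
            # Expired: the caller is gone, so skip it and try the next one.
```

A worker loop would call `next_live()` and run inference only on the request it returns, while the serving layer would map a `False` from `submit()` to a fast rejection. The queue bound and timeout are tuning knobs that this sketch leaves to measurement.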

Results

Time saved: 30+

Volume: factor of 4

Source

https://engblog.nextdoor.com/running-ml-inference-services-in-shared-hosting-environments-6176b39bc9b7


Grounding & classification

Source type: technical build writeup

19 fields verified against source quotes.

builder submitted, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cycle time reduction, throughput increase, technical build writeup, back office ops