
Nextdoor CoreML team achieves 4x latency reduction and 3x throughput increase by tuning ML inference in shared hosting environments

Running ML inference in shared hosting environments (ECS, K8s) introduces non-obvious pitfalls that can significantly degrade latency and throughput.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Tools seen at each stage are listed under “Tools used”.
Stage 1 · ML inference request arrives
ML inference requests arrive at a shared hosting environment such as ECS or K8s.
Tools used
ECS, K8s, OpenMP
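When a request lands in a container on ECS or K8s and the CPU limit is enforced as a CFS quota, the process can usually still see all of the host's cores (unless CPUs are pinned). OpenMP runtimes commonly size their default thread pool from the visible CPUs, which oversubscribes the quota. A minimal sketch of a quota-aware CPU count follows, assuming Linux with cgroup v2 (where the limit lives in `/sys/fs/cgroup/cpu.max`); `parse_cpu_max` and `effective_cpu_count` are hypothetical helpers, not Nextdoor's code.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max value such as "200000 100000" or "max 100000".

    Returns the CPU limit in cores (quota / period), or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpu_count(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """CPUs this process can really use: visible CPUs capped by the cgroup quota."""
    # CPUs the scheduler lets this process run on (respects cpuset pinning, not quotas).
    if hasattr(os, "sched_getaffinity"):
        visible = len(os.sched_getaffinity(0))
    else:
        visible = os.cpu_count() or 1

    path = Path(cpu_max_path)
    if path.exists():
        limit = parse_cpu_max(path.read_text())
        if limit is not None:
            # Round down so a 1.5-CPU quota gets 1 thread rather than 2, never below 1.
            return max(1, min(visible, math.floor(limit)))
    return visible
```

A service could then call `os.environ.setdefault("OMP_NUM_THREADS", str(effective_cpu_count()))` before importing its inference library, since the OpenMP runtime generally reads that variable once at startup.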
Outcome

After fixing request queue management and tuning OpenMP parameters, Nextdoor achieved a 4x latency reduction, a 3x throughput increase, and a rise in CPU utilization from 10% to 50%, while maintaining model performance.
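The OpenMP side of this tuning can be reasoned about as a thread budget: each worker process in a replica runs its own OpenMP pool, so total threads should stay at or below the CPUs the container may use. The back-of-envelope sketch below uses hypothetical numbers (64 host cores, a 4-CPU quota, 2 worker processes) that are not from Nextdoor's writeup; `threads_per_worker` and `oversubscription` are illustrative helpers.

```python
def threads_per_worker(cpu_budget: int, workers: int) -> int:
    """OpenMP threads per worker so that workers * threads <= cpu_budget when possible."""
    if cpu_budget < 1 or workers < 1:
        raise ValueError("cpu_budget and workers must be positive")
    # Never below 1; if workers exceed the budget, some oversubscription is unavoidable.
    return max(1, cpu_budget // workers)


def oversubscription(workers: int, threads: int, cpu_budget: float) -> float:
    """Runnable threads per available CPU; above 1.0 means threads contend for CPU time."""
    return workers * threads / cpu_budget


# Hypothetical replica, not taken from the source.
HOST_CORES = 64
CPU_QUOTA = 4
WORKERS = 2

untuned = oversubscription(WORKERS, HOST_CORES, CPU_QUOTA)
tuned = oversubscription(WORKERS, threads_per_worker(CPU_QUOTA, WORKERS), CPU_QUOTA)
print(f"untuned: {untuned:.1f} threads per CPU, tuned: {tuned:.1f} threads per CPU")
# prints: untuned: 32.0 threads per CPU, tuned: 1.0 threads per CPU
```

Assuming each worker defaults to one thread per host core, the untuned replica runs 128 threads against 4 CPUs; in the tuned case each worker would be started with `OMP_NUM_THREADS=2`.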

What failed first

Nextdoor's ML team first saw degraded latency and throughput, which they eventually traced to two root causes: poor request queue management and a suboptimal OpenMP parameter configuration.
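This card does not say exactly how the queue was fixed. One common pattern, shown here as a hypothetical sketch rather than Nextdoor's implementation, is to cap how many requests a replica accepts (running plus waiting) and reject the rest immediately, so the load balancer can route them to a less busy replica instead of letting them pile up behind slow inferences. `AdmissionControlledPool`, `workers`, and `max_pending` are illustrative names; threads are used for brevity, whereas CPU-bound Python inference would usually run in processes or in native code that releases the GIL.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AdmissionControlledPool:
    """Hypothetical bounded inference pool that sheds load instead of queueing without limit."""

    def __init__(self, handler: Callable[[Any], Any], workers: int, max_pending: int) -> None:
        self._handler = handler
        self._executor = ThreadPoolExecutor(max_workers=workers)
        # One permit per request that is running or waiting: at most workers + max_pending.
        self._slots = threading.BoundedSemaphore(workers + max_pending)

    def submit(self, request: Any) -> Optional[Future]:
        """Return a Future for the result, or None if this replica is already full."""
        if not self._slots.acquire(blocking=False):
            return None  # caller would typically answer 503 so the request is retried elsewhere
        try:
            future = self._executor.submit(self._handler, request)
        except BaseException:
            self._slots.release()
            raise
        # Free the permit once the request finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)
```

With `workers=1` and `max_pending=1`, a third concurrent request is rejected at once rather than waiting behind two in-flight inferences.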

Results
Time saved: 30+
Volume: factor of 4
Source

https://engblog.nextdoor.com/running-ml-inference-services-in-shared-hosting-environments-6176b39bc9b7


Grounding & classification
Source type: technical build writeup
19 fields verified against source quotes.
builder submitted, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cycle time reduction, throughput increase, technical build writeup, back office ops