Pinterest Feature Trimmer reduces root-leaf ML serving network bandwidth and saves over $4M annually
Pinterest's root-leaf ML serving architecture passed the full union of ML features from the root to every leaf partition, regardless of which features each model actually needed. The resulting network bandwidth bottleneck forced infrastructure scaling based on network utilization rather than compute and left GPU resources underutilized.
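The trimming idea described above can be illustrated with a minimal sketch: before the root fans a request out, it keeps only the features that each leaf's models declare they need. Everything here is an assumption made for illustration, including the feature names, the leaf names, the `trim_features` and `build_leaf_requests` helpers, and the use of value counts as a stand-in for payload size. It is not Pinterest's implementation.

```python
from typing import Dict, List, Set

# Hypothetical representation: feature name -> list of float values.
FeatureMap = Dict[str, List[float]]


def trim_features(features: FeatureMap, required: Set[str]) -> FeatureMap:
    """Return only the features whose names appear in `required`."""
    return {name: values for name, values in features.items() if name in required}


def build_leaf_requests(
    features: FeatureMap, leaf_requirements: Dict[str, Set[str]]
) -> Dict[str, FeatureMap]:
    """Build one trimmed feature map per leaf instead of sending the full union."""
    return {
        leaf: trim_features(features, required)
        for leaf, required in leaf_requirements.items()
    }


def payload_size(features: FeatureMap) -> int:
    """Rough proxy for payload size: the total number of feature values."""
    return sum(len(values) for values in features.values())


# Full union of features held at the root (hypothetical names and sizes).
union_features: FeatureMap = {
    "pin_embedding": [0.1] * 64,
    "user_embedding": [0.2] * 64,
    "ads_bid": [1.5],
    "query_tokens": [3.0, 7.0],
}

# Features each leaf's models actually consume (hypothetical).
leaf_requirements: Dict[str, Set[str]] = {
    "ads_leaf": {"pin_embedding", "ads_bid"},
    "search_leaf": {"pin_embedding", "query_tokens"},
}

leaf_requests = build_leaf_requests(union_features, leaf_requirements)

for leaf, trimmed in leaf_requests.items():
    print(leaf, sorted(trimmed), payload_size(trimmed), "of", payload_size(union_features))
```

In this sketch each leaf receives roughly half of the values in the union, which is the kind of egress reduction that trimming targets. A real system would also need a way to discover each model's required features, which this example simply hard-codes.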
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Score request from client service
Client service sends a score request to the online ML serving root to have candidate Pins scored by ML models for relevancy.
Tools used
fbthrift, lz4, TorchScript, PyTorch, GFlags
Outcome
Feature Trimmer saved over $4M in annual infrastructure costs at Pinterest, enabled a 27% Ads root cluster downsizing, reduced the Homefeed root cluster fleet by 33%, achieved roughly 45% and 65% egress drops for Search and Notification clusters, and improved Related Pins p99 latency by about 25–30%.
What failed first
Enabling lz4 compression in fbthrift reduced root-leaf network usage by 20%, but it cost 5% more CPU and added 5 ms (~10%) to p90 latency, and it did not address the underlying problem of transmitting unused features.