Netflix optimizes Ranker serendipity scoring CPU by ~7% using JDK Vector API batching
Netflix's Ranker service had a CPU hotspot in serendipity scoring, the logic that measures how different a candidate title is from a member's viewing history. The original implementation ran an O(M×N) loop that computed cosine similarity one candidate-history pair at a time. It consumed about 7.5% of total CPU per node because of sequential work, repeated embedding lookups, and poor cache locality.
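To make the hotspot concrete, here is a minimal sketch of what a per-pair loop of this kind could look like. It is not Netflix's code: the class name, the map-based embedding store, and the definition of serendipity as one minus the highest similarity to any history item are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical baseline: one cosine similarity per (candidate, history) pair.
final class NaiveSerendipity {

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Scores each candidate as 1 - max cosine similarity to the history
     * (an assumed definition). An empty history yields 1.0 for every candidate.
     */
    static double[] score(List<String> candidateIds,
                          List<String> historyIds,
                          Map<String, double[]> embeddings) {
        double[] out = new double[candidateIds.size()];
        for (int c = 0; c < candidateIds.size(); c++) {
            double[] cand = embeddings.get(candidateIds.get(c));
            double maxSim = Double.NEGATIVE_INFINITY;
            for (String historyId : historyIds) {
                // The same history embedding is looked up again for every
                // candidate, and both norms are recomputed for every pair.
                double[] hist = embeddings.get(historyId);
                maxSim = Math.max(maxSim, cosine(cand, hist));
            }
            out[c] = historyIds.isEmpty() ? 1.0 : 1.0 - maxSim;
        }
        return out;
    }
}
```

Each of the M×N iterations repeats a map lookup and two norm computations, and the embeddings live in separately allocated arrays, which is the kind of redundant work and scattered memory access the summary describes.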
With batching, flat buffers, ThreadLocal reuse, and the JDK Vector API in place, Netflix achieved a ~7% drop in CPU utilization, a ~12% drop in average latency, and a ~10% improvement in CPU per request-per-second. The serendipity encoder's share of CPU fell from 7.5% to ~1%.
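The batching, flat-buffer, and ThreadLocal ideas can be sketched as follows. This is an illustrative sketch under stated assumptions, not the production implementation: the class and method names are hypothetical, embeddings are assumed to be L2-normalized once when packed so that a dot product equals cosine similarity, and serendipity is again assumed to be one minus the highest similarity. The inner dot product is left as a scalar loop so the sketch compiles without the `jdk.incubator.vector` module. That loop is the natural place for a Vector API kernel.

```java
// Hypothetical batched variant: row-major flat buffers plus a reusable
// per-thread scratch array for the M x N similarity matrix.
final class BatchedSerendipity {

    // Grown on demand and reused across calls on the same thread,
    // so steady-state scoring allocates no new similarity matrix.
    private static final ThreadLocal<double[]> SCRATCH =
            ThreadLocal.withInitial(() -> new double[0]);

    /** Packs rows into one contiguous row-major array, L2-normalizing each row. */
    static double[] packNormalized(double[][] rows, int dim) {
        double[] flat = new double[rows.length * dim];
        for (int r = 0; r < rows.length; r++) {
            double normSq = 0.0;
            for (int i = 0; i < dim; i++) {
                normSq += rows[r][i] * rows[r][i];
            }
            double inv = normSq == 0.0 ? 0.0 : 1.0 / Math.sqrt(normSq);
            for (int i = 0; i < dim; i++) {
                flat[r * dim + i] = rows[r][i] * inv;
            }
        }
        return flat;
    }

    /**
     * cand holds m normalized rows and hist holds n normalized rows, each of
     * length dim. Returns 1 - max similarity per candidate (1.0 if n == 0).
     */
    static double[] score(double[] cand, int m, double[] hist, int n, int dim) {
        double[] sims = SCRATCH.get();
        if (sims.length < m * n) {
            sims = new double[m * n];
            SCRATCH.set(sims);
        }
        // Batch step: fill the whole m x n similarity matrix.
        for (int c = 0; c < m; c++) {
            int candOff = c * dim;
            for (int h = 0; h < n; h++) {
                int histOff = h * dim;
                double dot = 0.0;
                for (int i = 0; i < dim; i++) { // candidate loop for SIMD
                    dot += cand[candOff + i] * hist[histOff + i];
                }
                sims[c * n + h] = dot;
            }
        }
        // Reduction step: the best match per candidate becomes its score.
        double[] out = new double[m];
        for (int c = 0; c < m; c++) {
            double maxSim = Double.NEGATIVE_INFINITY;
            for (int h = 0; h < n; h++) {
                maxSim = Math.max(maxSim, sims[c * n + h]);
            }
            out[c] = n == 0 ? 1.0 : 1.0 - maxSim;
        }
        return out;
    }
}
```

A caller would pack the history once per request with `packNormalized`, pack the candidate batch the same way, and pass both flat arrays to `score`. Normalizing at pack time removes the per-pair norm computation, and the contiguous layout keeps consecutive rows adjacent in memory.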
An initial batching attempt caused a ~5% performance regression: allocating double[][] matrices created GC pressure, and their non-contiguous memory hurt cache behavior. A subsequent BLAS integration failed to deliver gains in production because of a fallback to the pure-Java F2J implementation, JNI overhead, and a row-major vs. column-major layout mismatch.
https://netflixtechblog.com/optimizing-recommendation-systems-with-jdks-vector-api-30d2830401ec