Evaluating AI at Scale: How Thumbtack Approaches Reliability, Safety, and Quality in GenAI
Generative AI outputs are probabilistic and can contain subtle errors in tone, accuracy, and safety, which makes evaluation uniquely challenging. Thumbtack's early decentralized evaluation approach, in which individual product teams ran their own evaluations, created duplicated effort and siloed learnings as AI features multiplied across the company.
How it works
Common implementation structure
This section describes how this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Curate representative datasets
Datasets reflecting realistic customer and pro interactions are curated and continually expanded as new use cases appear.
Thumbtack consolidated into a dedicated Evals team owning shared tooling, content guidelines, and human oversight, with the aspiration that every AI workflow ships with evaluation gates covering accuracy, trust, safety, latency, and in-product effectiveness.
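As a rough illustration of what an evaluation gate could look like, the sketch below checks a workflow's measured results against one threshold per dimension named above (accuracy, trust, safety, latency, in-product effectiveness). It is a minimal sketch and not Thumbtack's implementation: the `Gate` class, the metric names, the score scales, and every threshold value are hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One release gate: a metric name and the threshold it must meet."""
    metric: str
    threshold: float
    higher_is_better: bool = True  # latency-style metrics set this to False

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical gates, one per dimension mentioned in the text.
# The numbers are placeholders, not values reported by Thumbtack.
GATES = (
    Gate("accuracy", 0.90),
    Gate("trust", 0.95),
    Gate("safety", 0.99),
    Gate("latency_p95_seconds", 2.0, higher_is_better=False),
    Gate("in_product_effectiveness", 0.50),
)


def failed_gates(results: dict[str, float], gates: tuple[Gate, ...] = GATES) -> list[str]:
    """Return the names of gates that fail; a missing metric counts as a failure."""
    failures = []
    for gate in gates:
        value = results.get(gate.metric)
        if value is None or not gate.passes(value):
            failures.append(gate.metric)
    return failures


def can_ship(results: dict[str, float]) -> bool:
    """A workflow ships only when every gate passes."""
    return not failed_gates(results)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a workflow that was never measured on a dimension should not pass that gate by default.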
What failed first
The initial decentralized evaluation model, in which each product team ran its own evaluations independently, became unsustainable as AI surfaces multiplied, producing duplicated effort and siloed learnings company-wide.