
Evaluating AI at Scale: How Thumbtack Approaches Reliability, Safety, and Quality in GenAI

Generative AI outputs are probabilistic and prone to subtle errors in tone, accuracy, and safety, which makes evaluation uniquely challenging. Thumbtack's early decentralized evaluation approach, in which individual product teams ran their own evaluations, created duplicated effort and siloed learnings as AI features multiplied across the company.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Curate representative datasets

Datasets reflecting realistic customer and pro interactions are curated and continually expanded as new use cases appear.
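
As a rough illustration of this stage, the sketch below shows one way a shared evaluation dataset could grow over time without accumulating duplicates, while reporting how many examples each use case has. It is not Thumbtack's implementation: the class names (EvalExample, EvalDataset), the use-case labels, and the whitespace and case normalization rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass(frozen=True)
class EvalExample:
    use_case: str  # hypothetical label, e.g. "quote_request"
    role: str      # "customer" or "pro"
    text: str      # the interaction text an AI feature is evaluated on


@dataclass
class EvalDataset:
    # Maps a content fingerprint to the stored example (assumed design).
    examples: dict = field(default_factory=dict)

    @staticmethod
    def _fingerprint(ex: EvalExample) -> str:
        # Collapse whitespace and case so trivially different copies match.
        normalized = " ".join(ex.text.lower().split())
        key = f"{ex.use_case}|{ex.role}|{normalized}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def add(self, ex: EvalExample) -> bool:
        """Add an example; return False if an equivalent one already exists."""
        fp = self._fingerprint(ex)
        if fp in self.examples:
            return False
        self.examples[fp] = ex
        return True

    def coverage(self) -> dict:
        """Count examples per use case to reveal thinly covered areas."""
        counts = {}
        for ex in self.examples.values():
            counts[ex.use_case] = counts.get(ex.use_case, 0) + 1
        return counts
```

Calling coverage() after each batch of additions gives a quick view of which new use cases still lack representative examples.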

Tools used

MLflow, DeepEval, MTurk, Databricks, BigQuery, gpt-4o-mini, gpt-4o, PromptRefiner

Outcome

Thumbtack consolidated into a dedicated Evals team owning shared tooling, content guidelines, and human oversight, with the aspiration that every AI workflow ships with evaluation gates covering accuracy, trust, safety, latency, and in-product effectiveness.
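
To make the idea of an evaluation gate concrete, here is a minimal sketch of a release check across the five dimensions named above. The metric names, the threshold values, and the evaluate_release helper are hypothetical placeholders chosen for illustration; the source does not state how Thumbtack scores or thresholds these dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical thresholds, one gate per dimension named in the text.
GATES = [
    Gate("accuracy", 0.90),
    Gate("trust", 0.95),
    Gate("safety", 0.99),
    Gate("latency_p95_seconds", 2.0, higher_is_better=False),
    Gate("in_product_effectiveness", 0.50),
]


def evaluate_release(metrics: dict) -> tuple:
    """Return (ok, failures); a missing metric counts as a failure."""
    failures = []
    for gate in GATES:
        value = metrics.get(gate.metric)
        if value is None:
            failures.append(f"{gate.metric}: missing")
        elif not gate.passes(value):
            failures.append(
                f"{gate.metric}: {value} vs threshold {gate.threshold}"
            )
    return (not failures, failures)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a workflow that was never measured on a dimension should not pass that dimension's gate by default.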

What failed first

The initial decentralized evaluation model, in which each product team ran its own evaluations independently, became unsustainable as AI surfaces multiplied, producing duplicated effort and siloed learnings company-wide.

Results

Volume: 5%

Source

https://medium.com/thumbtack-engineering/evaluating-ai-at-scale-how-thumbtack-approaches-reliability-safety-and-quality-in-genai-f75d0211ac54


Grounding & classification

Source type: technical build writeup

30 fields verified against source quotes.

content generation, enterprise search, quality inspection, summarization, chat transcript, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, error reduction, technical build writeup, marketing ops, quality assurance, ai draft human approval, human review queue, monitor detect alert