compliance_monitoring · saas · workflow

Thumbtack uses a fine-tuned LLM to improve message review precision by 3.7x

Thumbtack's message review system struggled to detect subtle policy violations — nuanced language, sarcasm, and implied threats — that keyword rules and a CNN-based model could not reliably catch, limiting the platform's ability to protect service professionals at scale.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Message sent on platform
Each message sent on the platform triggers Thumbtack's message review pipeline.
Tools used
LLM, CNN, LangChain
Outcome

The fine-tuned LLM reached an AUC of 0.93, with precision improving by a factor of 3.7 and recall improving 1.5 times over the old system. Using the CNN as a pre-filter reduced LLM processing to around 20% of messages. The system has since processed tens of millions of messages.
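
The pre-filter arrangement described above can be sketched as a two-stage cascade: a cheap model scores every message, and only messages it considers suspicious are passed to the more expensive LLM. This is a minimal illustration, not Thumbtack's implementation. The function names, the keyword-based stand-in scorers, and both threshold values are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def review_messages(
    messages: List[str],
    cnn_score: Callable[[str], float],
    llm_score: Callable[[str], float],
    prefilter_threshold: float = 0.5,  # assumed value, not from the source
    flag_threshold: float = 0.5,       # assumed value, not from the source
) -> Tuple[List[str], int]:
    """Return (flagged messages, number of LLM calls made)."""
    flagged: List[str] = []
    llm_calls = 0
    for msg in messages:
        # Stage 1: the cheap pre-filter drops messages it scores as benign.
        if cnn_score(msg) < prefilter_threshold:
            continue
        # Stage 2: only the remaining messages reach the expensive model.
        llm_calls += 1
        if llm_score(msg) >= flag_threshold:
            flagged.append(msg)
    return flagged, llm_calls


# Hypothetical stand-ins for the real models, used only to run the sketch.
def toy_cnn_score(msg: str) -> float:
    suspicious = ("threat", "or else")
    return 0.9 if any(word in msg.lower() for word in suspicious) else 0.1


def toy_llm_score(msg: str) -> float:
    return 0.95 if "or else" in msg.lower() else 0.05


MESSAGES = [
    "See you at 9am",
    "Pay me now or else",
    "That quote is a threat to my budget",
    "Thanks!",
    "Great work",
]

if __name__ == "__main__":
    flagged, calls = review_messages(MESSAGES, toy_cnn_score, toy_llm_score)
    print(f"LLM calls: {calls} of {len(MESSAGES)} messages")
    print(f"Flagged: {flagged}")
```

In this toy run the pre-filter forwards two of five messages, so the second stage handles a fraction of the traffic. The trade-off in any such cascade is that a message the pre-filter wrongly drops never reaches the second model, so the pre-filter threshold is usually set to favor recall.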

What failed first

An off-the-shelf LLM with prompt engineering achieved only an AUC of 0.56, far below production requirements, and the legacy CNN model also struggled with nuanced language, sarcasm, and implied threats.
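
For context on the AUC figures, AUC can be read as the probability that a randomly chosen violating message receives a higher score than a randomly chosen benign one, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes it directly from that definition. The labels and scores are made-up illustration data, not Thumbtack's.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(P * N) and meant for
    illustration, not for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative data only: 1 = policy violation, 0 = benign.
    example_labels = [0, 0, 1, 1]
    example_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(example_labels, example_scores))
```

On this scale, the 0.56 reported for the prompt-engineered model is only slightly better than ranking messages at random.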

Results
Volume: 0.56
Source

https://medium.com/thumbtack-engineering/using-genai-to-enhance-trust-and-safety-at-thumbtack-2b8355556f1f


Grounding & classification
Source type: technical build writeup
28 fields verified against source quotes.
document classification, chat transcript, failure mode described, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, cost reduction, error reduction, technical build writeup, compliance monitoring, escalation workflow, extract classify route, human review queue