compliance_monitoring · saas · workflow

Thumbtack uses a fine-tuned LLM to improve message review precision by 3.7x

Thumbtack's message review system struggled to detect subtle policy violations — nuanced language, sarcasm, and implied threats — that keyword rules and a CNN-based model could not reliably catch, limiting the platform's ability to protect service professionals at scale.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.

Stage 1 · Message sent on platform

For each message sent on the platform, Thumbtack's message review pipeline is triggered.

Tools used

LLM · CNN · LangChain

Outcome

The fine-tuned LLM reached an AUC of 0.93, with precision improving by a factor of 3.7 and recall improving 1.5 times over the old system. Using the CNN as a pre-filter reduced LLM processing to around 20% of messages. The system has since processed tens of millions of messages.
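The pre-filter arrangement described above can be sketched as a two-stage cascade: a cheap first-stage score decides whether a message is worth sending to the more expensive model. This is a minimal illustration, not Thumbtack's implementation. The threshold value, the function names, and the two toy scorers (`toy_cnn_score`, `toy_llm_flag`) are hypothetical stand-ins for the real CNN and fine-tuned LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReviewResult:
    message: str
    cnn_score: float
    sent_to_llm: bool
    flagged: bool


def review_messages(
    messages: List[str],
    cnn_score: Callable[[str], float],
    llm_flag: Callable[[str], bool],
    cnn_threshold: float = 0.5,  # assumed value, not taken from the source
) -> List[ReviewResult]:
    """Run the cheap scorer on every message; call the LLM only above the threshold."""
    results = []
    for msg in messages:
        score = cnn_score(msg)
        if score < cnn_threshold:
            # Low first-stage score: treat as clean and skip the LLM call.
            results.append(ReviewResult(msg, score, False, False))
        else:
            # Suspicious enough: let the second-stage model make the final call.
            results.append(ReviewResult(msg, score, True, llm_flag(msg)))
    return results


def llm_fraction(results: List[ReviewResult]) -> float:
    """Share of messages that reached the second-stage model."""
    return sum(r.sent_to_llm for r in results) / len(results) if results else 0.0


# Hypothetical stand-ins for the real models, for demonstration only.
def toy_cnn_score(msg: str) -> float:
    return 0.9 if any(w in msg.lower() for w in ("regret", "or else")) else 0.1


def toy_llm_flag(msg: str) -> bool:
    return "or else" in msg.lower()


sample = [
    "Thanks, see you Tuesday at 9am",
    "Can you send a quote for the deck?",
    "Pay me now or else you will regret it",
    "You will regret missing this great deal!",
    "Great, thank you!",
]
results = review_messages(sample, toy_cnn_score, toy_llm_flag)
```

In this toy run two of the five messages reach the second stage and one of them is flagged. In a real deployment the threshold would be tuned so that the first stage rarely drops true violations, since anything it filters out is never seen by the LLM.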

What failed first

An off-the-shelf LLM with prompt engineering achieved only an AUC of 0.56, far below production requirements, and the legacy CNN model also struggled with nuanced language, sarcasm, and implied threats.
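For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen violating message receives a higher score than a randomly chosen clean one, so 0.5 is chance level and 1.0 is a perfect ranking. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the labels and scores are made up for illustration and do not reproduce Thumbtack's data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N), which is fine for
    small illustrative inputs but not for production-scale evaluation.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = violation) and model scores.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
uninformative = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

Read this way, an AUC of 0.56 means the model ranks a violating message above a clean one only slightly more often than a coin flip would.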

Results

Volume: 0.56

Source

https://medium.com/thumbtack-engineering/using-genai-to-enhance-trust-and-safety-at-thumbtack-2b8355556f1f


Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

document classification · chat transcript · failure mode described · human review described · metric backed · named customer · production runtime claimed · source backed · tools described · workflow described · software · accuracy improvement · cost reduction · error reduction · technical build writeup · compliance monitoring · escalation workflow · extract classify route · human review queue