quality_assurance · services · workflow

Harvey scales AI evaluation for legal work through expert feedback, automated pipelines, and custom data infrastructure

Ensuring that Harvey's AI systems consistently deliver accurate, helpful, and properly sourced legal answers requires evaluation that scales beyond manual expert review. Manual review is constrained by data scarcity, feedback latency, fragmented expertise across jurisdictions, and the risk of regressions, where a change improves one area but degrades another.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Domain expert review

Legal specialists and tax experts collaborate directly with engineers to review AI outputs and ground improvements in real-world professional needs.

Tools used

GPT-4.1 · GPT-4o · LLM · custom embedding pipeline

Outcome

Harvey's evaluation system validated shifting workloads to GPT-4.1, which improved mean answer ratings by over 10%, and the citation verification system achieved over 95% accuracy on an internal benchmark validated by attorneys.
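The kind of comparison behind a claim like "improved mean answer ratings by over 10%" can be sketched as follows. This is a minimal illustration, not Harvey's pipeline: the function names, the 1-5 rating scale, the practice areas, and all scores are hypothetical. It computes the relative change in mean rating between a baseline and a candidate model, then flags areas where the candidate scored lower on average, which is the regression risk described above.

```python
from statistics import mean


def relative_change(baseline, candidate):
    """Relative change in mean rating of candidate versus baseline."""
    base = mean(baseline)
    return (mean(candidate) - base) / base


def find_regressions(baseline_by_area, candidate_by_area, tolerance=0.0):
    """Return areas whose mean rating fell by more than `tolerance` (relative)."""
    regressions = {}
    for area, base_scores in baseline_by_area.items():
        change = relative_change(base_scores, candidate_by_area[area])
        if change < -tolerance:
            regressions[area] = change
    return regressions


# Hypothetical expert ratings (1-5 scale) per practice area; made-up data.
baseline = {"tax": [3, 4, 3, 4], "litigation": [4, 4, 5, 4]}
candidate = {"tax": [4, 5, 4, 4], "litigation": [4, 3, 4, 4]}

# Pool all ratings for the headline number.
overall = relative_change(
    [s for scores in baseline.values() for s in scores],
    [s for scores in candidate.values() for s in scores],
)
regressions = find_regressions(baseline, candidate)

print(f"overall change in mean rating: {overall:+.1%}")
for area, change in regressions.items():
    print(f"regression in {area}: {change:+.1%}")
```

In this made-up data the pooled mean rises while one area declines, which is why a per-area breakdown is worth running alongside the headline number.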

Results

Volume: over 10%

Source

https://www.harvey.ai/blog/scaling-ai-evaluation-through-expertise


Grounding & classification

Source type: technical build writeup

26 fields verified against source quotes.

agentic workflow · document ai · rag · summarization · knowledge base · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · legal · software · accuracy improvement · error reduction · technical build writeup · legal document review · quality assurance · human review queue · monitor detect alert