Harvey scales AI evaluation for legal work through expert feedback, automated pipelines, and custom data infrastructure
Harvey's AI systems must consistently deliver accurate, helpful, and properly sourced legal answers. Ensuring this requires evaluation that scales beyond manual expert review, which is constrained by data scarcity, feedback latency, expertise fragmented across jurisdictions, and regression risk, where a change improves one area but degrades another.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Domain expert review
Legal specialists and tax experts collaborate directly with engineers to review AI outputs and ground improvements in real-world professional needs.
Tools used
GPT-4.1, GPT-4o, LLM, custom embedding pipeline
Outcome
Harvey's evaluation system validated shifting workloads to GPT-4.1, which improved mean answer ratings by over 10%, and the citation verification system achieved over 95% accuracy on an internal benchmark validated by attorneys.
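To make the "mean answer rating" comparison and the regression risk concrete, here is a minimal sketch of how a baseline and a candidate model might be compared per practice area. It is an illustration under stated assumptions, not Harvey's pipeline: the function names, the 1-5 rating scale, the practice areas, and every rating value are hypothetical.

```python
from statistics import mean


def relative_change(baseline, candidate):
    """Relative change in mean rating; 0.10 means a 10% improvement."""
    base = mean(baseline)
    return (mean(candidate) - base) / base


def compare_by_area(baseline, candidate, tolerance=0.0):
    """Return per-area relative changes and the areas that regressed.

    An area counts as a regression when its mean rating drops by more
    than `tolerance` (expressed as a fraction of the baseline mean).
    """
    changes = {
        area: relative_change(baseline[area], candidate[area])
        for area in baseline
    }
    regressions = sorted(a for a, c in changes.items() if c < -tolerance)
    return changes, regressions


# Hypothetical expert ratings on an assumed 1-5 scale, grouped by area.
baseline = {
    "tax": [3, 4, 3, 4],
    "litigation": [4, 4, 3, 3],
    "contracts": [4, 5, 4, 4],
}
candidate = {
    "tax": [4, 4, 4, 5],
    "litigation": [4, 5, 4, 4],
    "contracts": [4, 4, 4, 4],
}

changes, regressions = compare_by_area(baseline, candidate)

# Overall change pools every rating, ignoring area boundaries.
overall = relative_change(
    [r for ratings in baseline.values() for r in ratings],
    [r for ratings in candidate.values() for r in ratings],
)

print(f"overall change: {overall:.1%}")
print(f"regressed areas: {regressions}")
```

With these made-up numbers the pooled mean rating improves by roughly 11%, yet the "contracts" area drops by about 6%. This is the situation the problem statement warns about: an aggregate gain can hide a regression in one area, so a per-area breakdown is worth checking alongside the headline number.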
Results
Volume: over 10%
Grounding & classification
Source type: technical build writeup
26 fields verified against source quotes.
agentic workflow, document ai, rag, summarization, knowledge base, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, legal, software, accuracy improvement, error reduction, technical build writeup, legal document review, quality assurance, human review queue, monitor detect alert