quality_assurance · saas · workflow

Patronus AI trains Lynx hallucination detection model on Databricks Mosaic AI, outperforming GPT-4o

LLMs used in RAG applications produce hallucinations that expose users to misinformation. Existing LLM-as-a-judge evaluators, including top-performing closed-source models like GPT-4, frequently fail on complex reasoning tasks. A significant performance gap also separates open-source from closed-source evaluation models, owing to a lack of challenging domain-specific datasets.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Dataset construction via perturbation

Training and evaluation datasets for a hallucination identification task were constructed using a perturbation process.
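
As a rough illustration of what a perturbation process can look like, the sketch below takes question-answering examples whose answers are supported by their context and derives a hallucinated counterpart for each one by altering a number in the answer. This is a minimal sketch under stated assumptions, not the method used for Lynx: the `Example` type, the `PASS`/`FAIL` labels, and the helpers `perturb_number` and `build_pairs` are hypothetical names, and the source does not say how its perturbations were generated.

```python
import random
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    """One labelled item for a hallucination identification task (hypothetical schema)."""
    question: str
    context: str
    answer: str
    label: str  # "PASS" if the answer is supported by the context, "FAIL" otherwise


def perturb_number(answer: str, rng: random.Random) -> Optional[str]:
    """Swap the first integer in the answer for a different one.

    Returns None when the answer contains no integer. This rule-based
    perturbation is an assumption chosen for illustration only.
    """
    match = re.search(r"\d+", answer)
    if match is None:
        return None
    # Adding 1..9 guarantees the replacement differs from the original value.
    replacement = int(match.group()) + rng.randint(1, 9)
    return answer[:match.start()] + str(replacement) + answer[match.end():]


def build_pairs(examples: List[Example], seed: int = 0) -> List[Example]:
    """Keep every original example and add a FAIL-labelled perturbed copy where possible."""
    rng = random.Random(seed)
    dataset: List[Example] = []
    for ex in examples:
        dataset.append(ex)
        perturbed = perturb_number(ex.answer, rng)
        if perturbed is not None:
            dataset.append(Example(ex.question, ex.context, perturbed, "FAIL"))
    return dataset
```

Calling `build_pairs` on a list of `PASS` examples returns the originals interleaved with their `FAIL` counterparts. The perturbed answer stays fluent while no longer matching the context, which is the kind of subtle error a hallucination detector has to catch.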

Tools used

LLM Foundry · Composer · Databricks Model Training · Databricks Mosaic AI · Llama-3-70B-Instruct · HuggingFace

Outcome

Lynx outperformed all existing LLM-as-a-judge evaluators on HaluBench. It surpassed GPT-4o by almost 1% in accuracy across all tasks, showed a 7.5% difference in medical question-answering, and is the best-performing open-source model by a wide margin.
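
Accuracy comparisons of this kind are typically computed per task and then across all tasks. The sketch below shows one way to do that arithmetic for a single evaluator. It is a hedged illustration only: the record layout `(task, gold_label, predicted_label)` and the function names are assumptions, and pooling all records for the overall figure is one possible reading of "across all tasks" (averaging the per-task accuracies is another).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (task name, gold label, label predicted by the evaluator); hypothetical layout
Record = Tuple[str, str, str]


def accuracy_by_task(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each task."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, gold, pred in records:
        total[task] += 1
        correct[task] += int(gold == pred)
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(records: Iterable[Record]) -> float:
    """Return the fraction of correct predictions pooled over all tasks."""
    rows = list(records)
    if not rows:
        raise ValueError("no records to score")
    return sum(gold == pred for _, gold, pred in rows) / len(rows)
```

Running both functions on each evaluator's predictions over the same benchmark items and subtracting the results gives per-task and overall accuracy differences in percentage points.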

What failed first

Even top-performing closed-source models like GPT-4, when used as LLM-as-a-judge evaluators, frequently fail to evaluate complex reasoning tasks accurately. Closed-source LLMs also raise concerns about quality, transparency, and cost.

Results

Volume: almost 1%

Source

https://www.databricks.com/blog/patronus-ai-lynx

Grounding & classification

Source type: technical build writeup

24 fields verified against source quotes.

quality inspection · rag · knowledge base · builder submitted · failure mode described · metric backed · named customer · source backed · tools described · workflow described · software · accuracy improvement · error reduction · technical build writeup · quality assurance · monitor detect alert