Patronus AI trains Lynx hallucination detection model on Databricks Mosaic AI, outperforming GPT-4o
LLMs used in RAG applications produce hallucinations that expose users to misinformation. Existing LLM-as-a-judge evaluators, including top-performing closed-source models like GPT-4, frequently fail on complex reasoning tasks. A significant performance gap also exists between open-source and closed-source evaluation models, owing to a lack of challenging domain-specific datasets.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Dataset construction via perturbation
Training and evaluation datasets for a hallucination identification task were constructed using a perturbation process.
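To make the perturbation idea concrete, here is a minimal sketch of how such a dataset could be assembled. It is not the procedure Patronus AI used, which is not detailed here. The rule-based `perturb_answer` function, the `Example` record, and the `PASS`/`FAIL` label names are all assumptions made for illustration. In practice the perturbation step might well be carried out by an LLM rather than a regular expression. The sketch only shows the general shape: each seed example yields one faithful answer and one perturbed answer that the context no longer supports.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    context: str
    answer: str
    label: str  # assumed labels: "PASS" = supported by context, "FAIL" = hallucinated


def perturb_answer(answer: str, rng: random.Random) -> str:
    """Hypothetical stand-in perturbation: alter the first number in the answer.

    If the answer contains no number, fall back to negating the statement.
    Either way, the result is meant to contradict the original answer.
    """
    match = re.search(r"\d+", answer)
    if match is None:
        return "It is not the case that " + answer[:1].lower() + answer[1:]
    # Adding 1..9 guarantees the new number differs from the original.
    replacement = int(match.group()) + rng.randint(1, 9)
    return answer[:match.start()] + str(replacement) + answer[match.end():]


def build_pairs(seed_examples, seed=0):
    """Turn (question, context, answer) triples into labeled PASS/FAIL pairs."""
    rng = random.Random(seed)
    dataset = []
    for question, context, answer in seed_examples:
        dataset.append(Example(question, context, answer, "PASS"))
        dataset.append(
            Example(question, context, perturb_answer(answer, rng), "FAIL")
        )
    return dataset


if __name__ == "__main__":
    # Made-up seed example, for illustration only.
    seeds = [
        (
            "How many moons does Mars have?",
            "Mars has 2 moons, Phobos and Deimos.",
            "Mars has 2 moons.",
        )
    ]
    for example in build_pairs(seeds):
        print(example.label, "|", example.answer)
```

Keeping the question and context fixed while changing only the answer means the two examples in each pair differ in exactly one respect, so an evaluator has to check the answer against the context rather than rely on surface cues.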
Tools used
LLM Foundry, Composer, Databricks Model Training, Databricks Mosaic AI, Llama-3-70B-Instruct, HuggingFace
Outcome
Lynx outperformed all existing LLM-as-a-judge evaluators on HaluBench. It surpassed GPT-4o by almost 1% in accuracy across all tasks, showed a 7.5% difference in medical question-answering, and is the best-performing open-source model by a wide margin.
What failed first
Even top-performing closed-source models like GPT-4, when used as LLM-as-a-judge evaluators, frequently fail to evaluate complex reasoning tasks accurately. Closed-source LLMs also raise concerns about quality, transparency, and cost.