incident_management · saas · workflow
Behind the scenes of Elastic Security's generative AI features: a quantitative approach to prompt tuning and LLM evaluation
Elastic's Security GenAI team needed a robust, reproducible way to evaluate LLM prompt quality and compare model providers at scale. The initial approach relied on manual, spreadsheet-based testing, which was effective but time-intensive and did not scale as more features were added.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.
Stage 1 · User submits request
The workflow begins when the system receives a user request.
Tools used
Elastic AI Assistant · Attack Discovery · Automatic Import · LangSmith · LangGraph · Elasticsearch · ES|QL
Outcome
Elastic transitioned from manual to automated LLM evaluations using LangSmith and LangGraph, building a framework that enables quantitative comparison of prompts and models, with a real-time rubric check in production that regenerates responses if they fall below quality standards.
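The production rubric check described above can be pictured as a generate, score, retry loop. The following is a minimal sketch under stated assumptions, not Elastic's implementation: the names `check_rubric` and `generate_with_rubric`, the threshold, and the attempt limit are all hypothetical, and the rubric criteria are plain Python predicates standing in for whatever grader (for example, an LLM-based judge) a real system would use. No LangSmith or LangGraph APIs are called.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RubricResult:
    score: float   # fraction of rubric criteria the response satisfied
    passed: bool   # True when score meets the threshold


def check_rubric(response: str,
                 criteria: List[Callable[[str], bool]],
                 threshold: float) -> RubricResult:
    """Score a response as the fraction of criteria it meets (hypothetical rubric)."""
    if not criteria:
        raise ValueError("at least one rubric criterion is required")
    met = sum(1 for criterion in criteria if criterion(response))
    score = met / len(criteria)
    return RubricResult(score=score, passed=score >= threshold)


def generate_with_rubric(generate: Callable[[int], str],
                         criteria: List[Callable[[str], bool]],
                         threshold: float = 0.8,
                         max_attempts: int = 3) -> Tuple[str, RubricResult, int]:
    """Call `generate` until a response passes the rubric or attempts run out.

    `generate` receives the zero-based attempt number and returns a response.
    Returns the best-scoring response seen, its rubric result, and the
    number of attempts made.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_response, best_result = "", None
    attempts = 0
    for attempt in range(max_attempts):
        attempts = attempt + 1
        response = generate(attempt)
        result = check_rubric(response, criteria, threshold)
        if best_result is None or result.score > best_result.score:
            best_response, best_result = response, result
        if result.passed:
            break  # quality bar met, no regeneration needed
    return best_response, best_result, attempts


# Usage with a stubbed generator in place of a real model call.
drafts = ["short", "Summary: host-1 shows suspicious login activity."]
criteria = [
    lambda r: r.startswith("Summary:"),  # illustrative format criterion
    lambda r: len(r) > 20,               # illustrative length criterion
]
response, result, attempts = generate_with_rubric(
    lambda i: drafts[min(i, len(drafts) - 1)], criteria, threshold=1.0
)
```

Keeping the best-scoring response means the caller still has something to return when no attempt clears the threshold; whether to surface that fallback or an error is a design choice the source does not specify.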
Results
Volume: 85%
Running since: June 2023
Grounding & classification
Source type: technical build writeup
27 fields verified against source quotes.
agentic workflow · anomaly detection · conversational ai · rag · summarization · knowledge base · builder submitted · human review described · production runtime claimed · tools described · workflow described · software · accuracy improvement · employee productivity · technical build writeup · incident management · monitor detect alert · rag answering