compliance_monitoring · finance · workflow
Capital One refines LLM input guardrails with chain-of-thought prompting and fine-tuned alignment
LLM-powered applications at Capital One faced adversarial attacks—including jailbreak prompts and prompt injections—that could cause unsafe outputs, while base open-source LLMs lacked sufficient detection capability.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases, not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User input intercepted
Input moderation guardrails intercept user inputs before they reach the main conversation-driving LLM.
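A minimal sketch of this interception pattern, under stated assumptions: the names `Verdict`, `keyword_guardrail`, and `moderated_chat` are hypothetical, and the keyword check is only a stand-in for a call to a fine-tuned guardrail model. It is not how Capital One's system is implemented. The sketch shows only the control flow: the guardrail sees the input first, and the main LLM is called only if the input is judged safe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of the guardrail check (hypothetical structure)."""
    safe: bool
    reason: str


def keyword_guardrail(user_input: str) -> Verdict:
    """Toy stand-in for a guardrail LLM; a real system would call a classifier model."""
    suspicious = ("ignore previous instructions", "disregard your rules")
    lowered = user_input.lower()
    for phrase in suspicious:
        if phrase in lowered:
            return Verdict(safe=False, reason=f"matched phrase: {phrase!r}")
    return Verdict(safe=True, reason="no suspicious phrase found")


def moderated_chat(
    user_input: str,
    guardrail: Callable[[str], Verdict],
    main_llm: Callable[[str], str],
) -> str:
    """Run the guardrail first; forward to the main LLM only when the input is safe."""
    verdict = guardrail(user_input)
    if not verdict.safe:
        # The main conversation-driving LLM never sees a blocked input.
        return "Request blocked by input guardrail."
    return main_llm(user_input)
```

Passing the guardrail and the main model in as callables keeps the interception step independent of whichever model serves each role.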
Tools used
Mistral 7B Instruct v2, Mixtral 8x7B Instruct v1, Llama2 13B Chat, Llama3 8B Instruct, LoRA, CoT, SFT, DPO, KTO
Outcome
Fine-tuning with SFT, DPO, and KTO improved F1 score and attack detection ratio by over 50%, with at most a 1.5% increase in false positive rate. The best model, DPO-aligned Llama3 8B, outperformed LlamaGuard-2 and other public guardrail models by wide margins.
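For readers unfamiliar with the three metrics named here, the sketch below shows how they are commonly computed for a binary attack detector. It assumes that "attack detection ratio" means the fraction of adversarial inputs that were flagged (that is, recall on the attack class); the source does not define it, so treat that mapping as an assumption. The function name `guardrail_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def guardrail_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Compute F1, attack detection ratio, and false positive rate.

    labels and preds use 1 for "adversarial" and 0 for "benign".
    """
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: attack detection ratio = recall on the adversarial class.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Share of benign inputs wrongly flagged as attacks.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "f1": f1,
        "attack_detection_ratio": recall,
        "false_positive_rate": fpr,
    }
```

A low false positive rate matters in this setting because every false positive is a legitimate user request that gets blocked.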
What failed first
Base open-source LLMs achieved F1 scores well below 80% on adversarial input detection, and performance gaps on jailbreaks and prompt injections persisted even in many-shot settings.
Results
Volume: 50+%
Grounding & classification
Source type: technical build writeup
28 fields verified against source quotes.
anomaly detection, document classification, chat transcript, failure mode described, metric backed, named customer, source backed, tools described, workflow described, banking, accuracy improvement, error reduction, technical build writeup, compliance monitoring, quality assurance, extract classify route