
Capital One refines LLM input guardrails with chain-of-thought prompting and fine-tuned alignment

LLM-powered applications at Capital One faced adversarial attacks—including jailbreak prompts and prompt injections—that could cause unsafe outputs, while base open-source LLMs lacked sufficient detection capability.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · User input intercepted

Input moderation guardrails intercept user inputs before they reach the main conversation-driving LLM.
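The interception step can be sketched as a thin wrapper that runs a guardrail check before the main model is called. This is a minimal illustration, not Capital One's implementation: `Verdict`, `keyword_guardrail`, `guarded_chat`, and `echo_llm` are hypothetical names, and the keyword matcher is only a stand-in for a fine-tuned guardrail LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of a guardrail check on one user input."""
    safe: bool
    reason: str


def keyword_guardrail(user_input: str) -> Verdict:
    """Hypothetical stand-in for a guardrail LLM: flags a few known attack phrases."""
    suspicious = ("ignore previous instructions", "disregard your rules")
    lowered = user_input.lower()
    for phrase in suspicious:
        if phrase in lowered:
            return Verdict(safe=False, reason=f"matched phrase: {phrase!r}")
    return Verdict(safe=True, reason="no known attack pattern")


def guarded_chat(
    user_input: str,
    guardrail: Callable[[str], Verdict],
    main_llm: Callable[[str], str],
    refusal: str = "Sorry, I can't help with that request.",
) -> str:
    """Run the guardrail first; only safe inputs reach the main LLM."""
    verdict = guardrail(user_input)
    if not verdict.safe:
        # Blocked inputs never reach the conversation-driving model.
        return refusal
    return main_llm(user_input)


def echo_llm(prompt: str) -> str:
    """Hypothetical placeholder for the main conversation-driving LLM."""
    return f"LLM response to: {prompt}"
```

In a real deployment the `guardrail` callable would wrap a call to the moderation model; keeping it as a plain function argument lets the classifier be swapped without touching the routing logic.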

Tools used

Mistral 7B Instruct v2 · Mixtral 8x7B Instruct v1 · Llama2 13B Chat · Llama3 8B Instruct · LoRA · CoT · SFT · DPO · KTO

Outcome

Fine-tuning with SFT, DPO, and KTO yielded improvements of over 50% in F1 score and attack detection ratio, with the false positive rate rising by at most 1.5%. The best model, DPO-aligned Llama3 8B, outperformed LlamaGuard-2 and other public guardrail models by wide margins.
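For readers who want to reproduce these kinds of numbers on their own data, the three metrics can be computed from binary labels as sketched below. This assumes that label 1 means "adversarial" and that attack detection ratio is the fraction of adversarial inputs the guardrail flags (i.e. recall on the attack class); the source does not spell out its exact definition here, and `guardrail_metrics` is a hypothetical helper.

```python
from typing import Dict, Sequence


def guardrail_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute F1, attack detection ratio, and false positive rate.

    Labels: 1 = adversarial input, 0 = benign input.
    Attack detection ratio is assumed to equal recall on the adversarial class.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0

    return {"f1": f1, "attack_detection_ratio": recall, "false_positive_rate": fpr}


# Toy labels for illustration only; not data from the source.
example = guardrail_metrics(
    y_true=[1, 1, 1, 1, 0, 0, 0, 0],
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0],
)
```

Tracking the false positive rate alongside F1 matters for a guardrail because every false positive is a legitimate user request that gets blocked.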

What failed first

Base open-source LLMs achieved F1 scores well below 80% on adversarial input detection, and performance gaps on jailbreaks and prompt injections persisted even in many-shot settings.

Results

Volume: 50+%

Source

https://medium.com/capital-one-tech/refining-input-guardrails-for-safer-llm-applications-capital-one-715c1c440e6b


Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

anomaly detection · document classification · chat transcript · failure mode described · metric backed · named customer · source backed · tools described · workflow described · banking · accuracy improvement · error reduction · technical build writeup · compliance monitoring · quality assurance · extract classify route