How Grab built a custom vision LLM to improve document processing for eKYC
Traditional OCR systems struggled with the diversity of Southeast Asian languages and document formats. Proprietary LLMs produced errors and hallucinations and responded with high latency, while open-source Vision LLMs were not accurate enough for production use in eKYC workflows.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User document submission
User-submitted documents such as ID cards, driver's licenses, and registration certificates initiate the eKYC process.
Tools used
Qwen2.5 0.5B, Documint, Common Crawl
Outcome
Grab's custom ~1B-parameter Vision LLM achieved accuracy within 3 percentage points (pp) of the larger 2B model. Over baseline, Thai document accuracy improved by 70 pp and Vietnamese by 40 pp, with far lower latency than traditional OCR models and external APIs.
What failed first
LoRA fine-tuning of Qwen2-VL showed promising results for Latin-script documents but still struggled with Thai and Vietnamese documents and with unstructured layouts containing small, dense text. The cause was that open-source Vision LLMs lacked exposure to visual text in Southeast Asian (SEA) languages during vision encoder training.
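For readers unfamiliar with LoRA, the sketch below illustrates the general idea in plain Python: the pretrained weight matrix stays frozen and only a small low-rank update is trained on top of it. This is the standard LoRA formulation, not Grab's code; the matrix sizes, rank, and scaling value are hypothetical and chosen only to keep the example small.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                      # A has shape r x d_in
    scale = alpha / r
    delta = matmul(b, a)            # B (d_out x r) @ A (r x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def trainable_params(d_out, d_in, r):
    """Number of trainable values in the adapter (A and B only)."""
    return r * (d_in + d_out)

# Hypothetical sizes for illustration; real layers are far larger.
random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero-initialized

w_eff = lora_effective_weight(w, a, b, alpha)

full_count = d_out * d_in
adapter_count = trainable_params(d_out, d_in, r)
print(f"full fine-tuning: {full_count} values, LoRA adapter: {adapter_count} values")
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training changes it only through the low-rank product. That constraint keeps fine-tuning cheap, but it also means the adapter can only make a limited adjustment to whatever the frozen base weights already encode.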