MercadoLibre's Financial Data Enrichment: from handcrafted regex to LLMs and custom semantic embeddings in LATAM
MercadoLibre (MELI) categorized transactions with handcrafted regex rules and manually reported merchant category codes (MCCs). This approach produced frequent inconsistencies, required constant country-specific updates, and could not scale to the daily volume of new financial data across LATAM's diverse languages.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Raw transaction data arrives
Raw financial transaction data enters the enrichment pipeline to be turned into structured insights.
Adopting GPT-3.5 Turbo lifted categorization accuracy from around 60% to over 80%, cut operational costs by 75%, and raised processing volume from tens of millions per quarter to tens of millions per week. Custom BERT-style embeddings then pushed accuracy to 90%, with a further cost reduction of more than 30%, a 10x increase in scalability, and near real-time processing.
What failed first
MELI's first deployed categorization model, built entirely on regex and MCC rules, was limited to debit transactions in Portuguese and proved impossible to maintain as data volume grew.
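To make the maintenance problem concrete, here is a minimal sketch of what a regex-plus-MCC categorizer of this kind might look like. It is not MELI's actual code: the MCC table, the Portuguese patterns, the category names, and the `categorize` function are all hypothetical, chosen only to illustrate the failure modes described above.

```python
import re
from typing import Optional

# Hypothetical MCC-to-category table (assumption for illustration only).
MCC_CATEGORIES = {
    "5411": "groceries",
    "5812": "restaurants",
    "4121": "transport",
}

# Hypothetical handcrafted patterns covering Portuguese descriptions only.
REGEX_RULES = [
    (re.compile(r"\b(supermercado|mercado|padaria)\b", re.IGNORECASE), "groceries"),
    (re.compile(r"\b(restaurante|lanchonete|pizzaria)\b", re.IGNORECASE), "restaurants"),
    (re.compile(r"\b(uber|taxi|onibus)\b", re.IGNORECASE), "transport"),
]


def categorize(description: str, mcc: Optional[str] = None) -> str:
    """Return a category from the MCC if known, else from the first matching regex."""
    # The reported MCC wins whenever it is in the table, even if it is wrong.
    if mcc in MCC_CATEGORIES:
        return MCC_CATEGORIES[mcc]
    # Otherwise fall back to the handcrafted patterns, in order.
    for pattern, category in REGEX_RULES:
        if pattern.search(description):
            return category
    return "uncategorized"


# A Portuguese description matches a handcrafted rule.
print(categorize("SUPERMERCADO EXTRA SP"))
# A Spanish description falls through: it needs a new, country-specific rule.
print(categorize("VERDULERIA LA HUERTA"))
# A misreported MCC overrides an otherwise clear description.
print(categorize("RESTAURANTE DA PRACA", mcc="5411"))
```

In this toy version, every new language, merchant naming convention, or misreported code requires another hand-edited rule or table entry, which is the scaling problem the case describes.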