Databricks builds a bespoke fine-tuned LLM for AI-generated data catalog documentation in 1 month for under $1,000
In virtually every organization, the vast majority of database tables are undocumented, making it difficult for humans to discover data and for AI agents to automatically find datasets. An initial prototype using off-the-shelf SaaS LLMs ran into challenges with quality, performance, and cost that blocked production launch.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Schema-based doc generation trigger
The workflow automatically generates documentation for tables and their columns based on their schema.
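As a rough illustration of this stage, the sketch below turns a table schema into a text prompt that a documentation model could consume. It is a minimal sketch under stated assumptions, not Databricks' actual implementation: the `Column` class, the `build_doc_prompt` function, the prompt wording, and the `sales.orders` example table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Column:
    # Hypothetical schema record: a column name and its declared type.
    name: str
    data_type: str


def build_doc_prompt(table_name: str, columns: list[Column]) -> str:
    """Render a table schema as a plain-text prompt for a documentation model.

    The prompt layout is an assumption made for illustration only.
    """
    schema_lines = [f"- {c.name} ({c.data_type})" for c in columns]
    return (
        f"Table: {table_name}\n"
        "Columns:\n"
        + "\n".join(schema_lines)
        + "\n\nWrite a one-sentence description of the table, "
        "then one short comment per column."
    )


# Example with a made-up table; the resulting text would be sent to the model.
prompt = build_doc_prompt(
    "sales.orders",
    [Column("order_id", "BIGINT"), Column("placed_at", "TIMESTAMP")],
)
print(prompt)
```

Because the schema is the only input, a trigger of this kind can run whenever a table is created or altered, without waiting for a person to request documentation.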
Tools used
Unity Catalog, MPT-7B, Databricks Data Intelligence Platform
Outcome
Databricks built and deployed a bespoke fine-tuned LLM that delivered better quality, higher throughput, and a more than 10-fold reduction in cost. More than 80% of table metadata updates are now AI-assisted in production on Amazon Web Services and Google Cloud.
What failed first
Every SaaS LLM version tested showed the same problems: as general-purpose models, they were too slow and costly at scale, and they risked regressing on the narrow documentation use case as they evolved to serve other use cases.