Multimodal RAG with Vision: Microsoft ISE experiments on image-enriched document retrieval for enterprise Q&A
Enterprise documents contain both textual and image content such as photographs, diagrams, and screenshots; standard text-only RAG pipelines cannot surface image information, limiting the relevance of LLM responses to image-related queries.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Document ingestion via custom loader
The ingestion process extracts both text and image data from source documents using a custom loader.
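As a rough illustration of what such a loader does, the sketch below separates body text from image references. It assumes a Markdown-like source in which images appear as `![alt](url)`; the source format, the `LoadedDocument` type, and the `load_document` function are all hypothetical and not taken from the original experiments.

```python
import re
from dataclasses import dataclass, field

# Assumption: images are embedded as Markdown references, ![alt](url).
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


@dataclass
class LoadedDocument:
    source: str  # hypothetical identifier, e.g. a file name
    text: str    # body text with image markup removed
    image_urls: list[str] = field(default_factory=list)


def load_document(source: str, raw: str) -> LoadedDocument:
    """Split raw content into plain text and the list of referenced images."""
    image_urls = [match.group(2) for match in IMAGE_PATTERN.finditer(raw)]
    text = IMAGE_PATTERN.sub("", raw)
    # Collapse the runs of blank lines left behind by removed image markup.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return LoadedDocument(source=source, text=text, image_urls=image_urls)
```

Keeping text and image references in one record lets later stages chunk the text and annotate the images independently while still tracing both back to the same source document.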
Tools used
GPT-4V, GPT-4o, Azure AI Search, Azure Computer Vision Image Analysis, Azure OpenAI Service, Azure AI Services
Outcome
Including document metadata produced a statistically significant improvement in source recall; storing image annotations as separate chunks yielded notable statistical improvements in both source document and image retrieval metrics; and introducing an image classifier substantially reduced ingestion time while maintaining statistically similar recall.
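The three findings can be pictured together in a minimal sketch: document metadata is attached to every chunk, each image annotation becomes its own chunk, and a classifier decides which images are annotated at all. The chunk field names and the `is_informative` and `annotate` callables are hypothetical stand-ins. The assumption here is that the classifier saves time by skipping images before the slower annotation call; the summary does not spell out the mechanism.

```python
from typing import Callable


def build_chunks(
    doc_id: str,
    metadata: dict,
    text_chunks: list[str],
    images: list[dict],
    is_informative: Callable[[dict], bool],  # hypothetical image classifier
    annotate: Callable[[dict], str],         # hypothetical vision-model call
) -> list[dict]:
    """Build index records with text and image annotations kept as separate chunks."""
    chunks = []
    for i, text in enumerate(text_chunks):
        chunks.append(
            {"id": f"{doc_id}-text-{i}", "kind": "text",
             "content": text, "metadata": metadata}
        )
    for j, image in enumerate(images):
        if not is_informative(image):
            continue  # skip before paying for the annotation call
        chunks.append(
            {"id": f"{doc_id}-image-{j}", "kind": "image",
             "content": annotate(image), "image_url": image["url"],
             "metadata": metadata}
        )
    return chunks
```

Because image chunks carry their own `image_url` and the shared metadata, a retrieval hit on an annotation can be traced back to both the image and its source document.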
What failed first
Multimodal embeddings such as CLIP were considered first but rejected because of their word-count limits and their inability to capture detailed visual information. The inference model also did not reliably return parsable JSON output.
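One common way to cope with unreliable JSON output is to parse defensively rather than trust the raw response. The sketch below is an illustration, not the approach documented for these experiments: it scans the response for the first substring that decodes as a JSON object, which tolerates code fences and surrounding chatter. The function name `parse_model_json` is hypothetical.

```python
import json


def parse_model_json(raw: str):
    """Return the first JSON object found in a model response, or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(raw, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = raw.find("{", start + 1)
    return None
```

A caller can treat a `None` result as a signal to retry the request or to fall back to storing the raw text.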
Results
Time saved: substantially reduced ingestion time
Volume: statistically significant improvement in source recall performance