field_service · energy · workflow

Multimodal RAG with Vision: Microsoft ISE experiments on image-enriched document retrieval for enterprise Q&A

Enterprise documents contain both textual and image content such as photographs, diagrams, and screenshots; standard text-only RAG pipelines cannot surface image information, limiting the relevance of LLM responses to image-related queries.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.

Stage 1 · Document ingestion via custom loader

The ingestion process extracts both text and image data from source documents using a custom loader.
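A minimal sketch of what such a loader might look like, assuming the source documents are HTML-like and that images appear as `img` tags. The source does not describe the loader's internals, so `LoadedDocument` and `load_document` are hypothetical names and the input format is an assumption. Only the Python standard library is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class LoadedDocument:
    """Hypothetical container for one ingested document."""
    text: str
    image_refs: list = field(default_factory=list)


class _TextAndImageParser(HTMLParser):
    """Collects visible text fragments and the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_refs = []

    def handle_starttag(self, tag, attrs):
        # Record image references so they can be annotated in a later stage.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_refs.append(src)

    def handle_data(self, data):
        # Keep non-empty text fragments only.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def load_document(html: str) -> LoadedDocument:
    """Extract text and image references from one HTML-like document."""
    parser = _TextAndImageParser()
    parser.feed(html)
    parser.close()
    return LoadedDocument(" ".join(parser.text_parts), parser.image_refs)


doc = load_document('<p>Reset the valve.</p><img src="valve.png"><p>Then restart.</p>')
print(doc.text)
print(doc.image_refs)
```

Keeping image references alongside the text lets a later stage decide which images to describe and how to index them.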

Tools used

GPT-4V, GPT-4o, Azure AI Search, Azure Computer Vision Image Analysis, Azure OpenAI Service, Azure AI Services

Outcome

Including document metadata produced a statistically significant improvement in source recall; storing image annotations as separate chunks yielded notable statistical improvements in both source document and image retrieval metrics; and introducing an image classifier substantially reduced ingestion time while maintaining statistically similar recall.
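The three findings above can be illustrated with a short sketch: document metadata attached to every chunk, image annotations stored as their own chunks rather than inlined into the text, and a classifier gate that skips images before the costly annotation step. All names here (`build_chunks`, `filter_images`, the chunk fields, and the `is_informative` predicate) are hypothetical; the source does not specify the chunk schema or the classifier used.

```python
def filter_images(image_refs, is_informative):
    """Keep only images a classifier judges worth annotating.

    `is_informative` stands in for whatever image classifier is used;
    skipping images here avoids paying for their annotation later.
    """
    return [ref for ref in image_refs if is_informative(ref)]


def build_chunks(doc_id, title, text_sections, image_descriptions):
    """Build index-ready chunks for one document.

    Every chunk carries the document metadata, and each image
    description becomes a separate chunk of kind "image".
    """
    metadata = {"doc_id": doc_id, "title": title}
    chunks = []
    for section in text_sections:
        chunks.append({"kind": "text", "content": section, **metadata})
    for image_ref, description in image_descriptions.items():
        chunks.append(
            {
                "kind": "image",
                "content": description,
                "image_ref": image_ref,
                **metadata,
            }
        )
    return chunks


# Assumed toy rule: treat logos as uninformative.
kept = filter_images(["logo.png", "valve.png"], lambda ref: "logo" not in ref)
chunks = build_chunks(
    "doc-1",
    "Valve manual",
    ["Step 1: close the inlet."],
    {ref: "Diagram of the inlet valve" for ref in kept},
)
for chunk in chunks:
    print(chunk["kind"], chunk["title"])
```

Separate image chunks let retrieval match a query against an image description directly, while the shared metadata ties each chunk back to its source document.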

What failed first

Multi-modal embeddings like CLIP were initially considered but rejected due to word-count limits and inability to capture detailed visual information; the inference model also did not reliably return parsable JSON output.
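The summary does not say how the unparsable JSON was handled. One common mitigation, sketched here as an assumption rather than as the team's actual fix, is to parse the model reply defensively: try the raw reply first, then the span between the first `{` and the last `}`, and signal failure so the caller can retry. `parse_model_json` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_model_json(reply: str) -> Optional[dict]:
    """Return the JSON object contained in a model reply, or None.

    Tries the whole reply first, then the outermost {...} span, which
    covers replies that wrap the JSON in prose or a code fence.
    """
    candidates = [reply]
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start : end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Nothing parsable: the caller can retry the request or skip the item.
    return None


print(parse_model_json('Sure! Here it is: {"answer": "x"} Hope this helps.'))
print(parse_model_json("no json here"))
```

Returning `None` instead of raising keeps the retry policy in the caller's hands, for example re-prompting a fixed number of times before skipping an image.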

Results

Time saved: ingestion time substantially reduced

Volume: statistically significant improvement in source recall performance

Source

https://devblogs.microsoft.com/ise/multimodal-rag-with-vision/

Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

computer vision, document ai, enterprise search, rag, summarization, knowledge base, failure mode described, metric backed, production runtime claimed, tools described, workflow described, telecom, accuracy improvement, cycle time reduction, technical build writeup, back office ops, field service, document to record, extract classify route, rag answering