
Multimodal RAG with Vision: Microsoft ISE experiments on image-enriched document retrieval for enterprise Q&A

Enterprise documents contain both textual and image content such as photographs, diagrams, and screenshots; standard text-only RAG pipelines cannot surface image information, limiting the relevance of LLM responses to image-related queries.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages are listed under "Tools used" below.
Stage 1 · Document ingestion via custom loader
The ingestion process extracts both text and image data from source documents using a custom loader.
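As a rough illustration of what such a loader does, the sketch below splits a raw document into its text and its image references. It assumes Markdown-style image syntax; the source does not describe the input format. The names `LoadedDocument` and `load_document` are hypothetical and not taken from the source.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: images appear as Markdown references like ![alt](path).
IMAGE_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<url>[^)\s]+)\)")


@dataclass
class LoadedDocument:
    source: str
    text: str
    image_urls: List[str] = field(default_factory=list)


def load_document(source: str, raw: str) -> LoadedDocument:
    """Separate the text content from the image references in one document."""
    image_urls = [m.group("url") for m in IMAGE_PATTERN.finditer(raw)]
    # Remove the image references so the text can be chunked on its own.
    text = IMAGE_PATTERN.sub("", raw).strip()
    return LoadedDocument(source=source, text=text, image_urls=image_urls)


doc = load_document("manual.md", "Intro text.\n![diagram](img/a.png)\nMore text.")
```

A real loader would read PDFs or HTML and pull out embedded image bytes. The point here is only that text and images leave the loader as separate streams.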
Tools used
GPT-4V, GPT-4o, Azure AI Search, Azure Computer Vision Image Analysis, Azure OpenAI Service, Azure AI Services
Outcome

Including document metadata produced a statistically significant improvement in source recall; storing image annotations as separate chunks yielded notable statistical improvements in both source document and image retrieval metrics; and introducing an image classifier substantially reduced ingestion time while maintaining statistically similar recall.
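The three ingestion choices described above can be sketched together: document metadata attached to every chunk, image annotations stored as their own chunks, and a classifier that filters images before the costly description step. This is a minimal sketch under assumptions. `describe_image` stands in for a vision-model call and `is_informative` for the image classifier; both are hypothetical callables, and using the classifier to skip images is an assumed reading of how it reduces ingestion time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chunk:
    content: str
    kind: str  # "text" or "image_annotation"
    metadata: Dict[str, str]


def build_chunks(
    source: str,
    title: str,
    paragraphs: List[str],
    image_urls: List[str],
    describe_image: Callable[[str], str],   # hypothetical vision-model call
    is_informative: Callable[[str], bool],  # hypothetical image classifier
) -> List[Chunk]:
    base = {"source": source, "title": title}
    # Every text chunk carries the document metadata.
    chunks = [Chunk(p, "text", dict(base)) for p in paragraphs if p.strip()]
    for url in image_urls:
        # Skip images the classifier rejects, avoiding the description call.
        if not is_informative(url):
            continue
        # Each image annotation becomes its own chunk, not inline text.
        chunks.append(
            Chunk(describe_image(url), "image_annotation", {**base, "image_url": url})
        )
    return chunks
```

Keeping annotations as separate chunks lets a retriever match an image description directly and return the image URL from the chunk metadata.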

What failed first

Multi-modal embeddings like CLIP were initially considered but rejected due to word-count limits and inability to capture detailed visual information; the inference model also did not reliably return parsable JSON output.
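One generic way to cope with unreliable JSON from a model is to parse defensively and retry on failure. The sketch below is an illustration of that general mitigation, not the approach the source team took. `call_model` is a hypothetical callable that returns the raw model output as a string.

```python
import json
from typing import Callable, Optional


def parse_model_json(raw: str) -> Optional[dict]:
    """Return the JSON object embedded in a model reply, or None if unparsable."""
    # Models often wrap JSON in prose or code fences, so slice from the
    # first "{" to the last "}" before parsing.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def call_with_retries(
    call_model: Callable[[str], str],  # hypothetical model call
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Re-ask the model until its reply parses, up to max_attempts times."""
    for _ in range(max_attempts):
        result = parse_model_json(call_model(prompt))
        if result is not None:
            return result
    return None
```

Returning `None` instead of raising lets the ingestion pipeline decide whether to skip the image or fall back to a plain-text description.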

Results
Time saved: ingestion time substantially reduced
Volume: statistically significant improvement in source recall performance
Source

https://devblogs.microsoft.com/ise/multimodal-rag-with-vision/


Grounding & classification
Source type: technical build writeup
28 fields verified against source quotes.
computer vision, document ai, enterprise search, rag, summarization, knowledge base, failure mode described, metric backed, production runtime claimed, tools described, workflow described, telecom, accuracy improvement, cycle time reduction, technical build writeup, back office ops, field service, document to record, extract classify route, rag answering