Building a RAG system for internal engineering knowledge search from 1 TB of project documents
An engineering company needed an internal natural language chat tool to search across nearly a decade of project history totaling 1 TB of mixed technical documents—including OrcaFlex simulation files central to the offshore industry—without relying on external APIs for confidentiality reasons.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Document filtering by extension and pattern
A filtering system excluded files by extension and name patterns before indexing to prevent non-text assets from entering the pipeline.
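As an illustration of this stage, here is a minimal sketch of an extension and name-pattern filter. The source does not list the actual rules, so the extensions, the patterns, and the names `should_index` and `filter_files` are all hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# Hypothetical exclusion rules; the real lists are not given in the source.
EXCLUDED_EXTENSIONS = {".png", ".jpg", ".mp4", ".zip", ".exe"}
EXCLUDED_NAME_PATTERNS = ["~$*", "backup_*", "*.tmp"]


def should_index(path: str) -> bool:
    """Return True if the file passes both the extension and name filters."""
    p = PurePath(path)
    # Compare extensions case-insensitively so ".PNG" matches ".png".
    if p.suffix.lower() in EXCLUDED_EXTENSIONS:
        return False
    # Match the lowercased file name against each shell-style pattern.
    name = p.name.lower()
    return not any(fnmatchcase(name, pat) for pat in EXCLUDED_NAME_PATTERNS)


def filter_files(paths: list[str]) -> tuple[list[str], float]:
    """Return the kept paths and the fraction of files removed."""
    kept = [p for p in paths if should_index(p)]
    reduction = 1 - len(kept) / len(paths) if paths else 0.0
    return kept, reduction
```

Running the filter before any loader touches the files keeps large binary assets out of memory, and the returned fraction gives a quick check on how much of the corpus the rules remove.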
The RAG system reached production with 738,470 vectors in a 54 GB ChromaDB index. Filtering cut the number of files to index by 54%, and the system is described as fast, reliable, and useful for colleagues. The GPU indexing phase cost 184 euros on Hetzner.
What failed first
LlamaIndex overflowed the laptop's RAM when processing large non-text files. A custom checkpoint system suffered data corruption and was too slow. The laptop's integrated GPU required 4 to 5 hours per 500 MB. The production VM had only 100 GB of disk, far short of the 1 TB document corpus.
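To put the laptop throughput in perspective, here is a rough back-of-the-envelope estimate. It assumes the quoted rate applies uniformly to the whole 1 TB corpus, uses decimal units, and ignores the reduction from filtering, so it is an upper bound rather than a measured figure. The function name `hours_to_index` is hypothetical.

```python
def hours_to_index(corpus_mb: float, hours_per_batch: float,
                   batch_mb: float = 500.0) -> float:
    """Extrapolate total hours from a per-batch rate (linear scaling assumed)."""
    return corpus_mb / batch_mb * hours_per_batch


# Assumption: 1 TB = 1,000,000 MB (decimal units), unfiltered corpus.
CORPUS_MB = 1_000_000

low = hours_to_index(CORPUS_MB, 4)   # at 4 hours per 500 MB
high = hours_to_index(CORPUS_MB, 5)  # at 5 hours per 500 MB

print(f"{low:.0f} to {high:.0f} hours, "
      f"about {low / 24:.0f} to {high / 24:.0f} days of continuous running")
```

Under these assumptions the laptop would need roughly 8,000 to 10,000 hours, which is consistent with the indexing phase being moved to rented GPU hardware.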