How Not to Get Lost in the Information Labyrinth

I have been working with semantic search for enterprise environments for quite some time. The data sources involved are highly diverse—not only in format, but also in structure and dynamics. They range from simple static PDF documents to “living” SharePoint ecosystems, where content is continuously added in the form of files, Microsoft Teams conversations, and OneNote documents.
The core challenge of such large-scale information systems is heterogeneity. You approach a collection of invoices very differently from, say, technical manuals for complex information systems. Unsurprisingly, the queries users ask—and the answers they expect—vary dramatically across these domains.
The Easy Part: Building a Vector Database
Creating a vector database is, in itself, relatively straightforward:
- Extract text and metadata from documents of various formats
- Split the content into chunks (with chunk size depending on document type)
- Generate embeddings (vector representations) for each chunk
- Enable semantic search by retrieving chunks with similar vectors
At this point, semantic search works. But unfortunately… not always well enough.
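As a rough sketch, the whole pipeline fits in a few dozen lines. The OpenAI client and the text-embedding-3-small model below are just one possible choice; any embedding model and vector store would slot in the same way:

```python
# Minimal indexing + search sketch. The OpenAI client and text-embedding-3-small
# are assumptions; any embedding model and vector store (instead of the
# in-memory NumPy matrix) work the same way.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Naive fixed-size chunking; in practice chunk size depends on document type
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

documents = ["...extracted text of document 1...", "...extracted text of document 2..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = embed(chunks)  # one vector per chunk

def search(query: str, k: int = 5) -> list[str]:
    q = embed([query])[0]
    # cosine similarity between the query vector and every chunk vector
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```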
Why Simple RAG Often Fails
If your dataset consists of a small number of documents with similar structure—such as product manuals—basic RAG (vector similarity + a well-tuned system prompt) can be perfectly adequate.
But consider a different scenario. You have 10,000 invoices and ask:
“How much did company XY invoice us for service Z?”
The answer will likely be incorrect or incomplete. Standard RAG pipelines typically operate on a few dozen of the most relevant chunks, which may be insufficient for queries that require aggregation, completeness, or exhaustive coverage. Simply retrieving more chunks is not a fix either:
- 💸 Each response becomes more expensive (token-based pricing with paid LLMs)
- 📉 Even hundreds of chunks may still fail to cover all relevant documents at scale
When Simple RAG Is Not Enough
Fortunately, there are well-established techniques that significantly improve retrieval quality when naïve RAG reaches its limits. Below is a practical overview of approaches that help select better candidate passages for the LLM to synthesize an answer from.
Advanced Retrieval Strategies for Enterprise RAG
1. Basic RAG: Pure Vector Similarity (kNN)
- query_embedding → top-K nearest chunks (cosine similarity / inner product)
- Optional filters: tenant, language, collection, allowed folders, document type
Works well when:
- Chunking is well designed
- Queries are descriptive rather than factual or numeric
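A sketch of this filtered kNN step, assuming each chunk carries a small metadata dictionary; the filter field names are illustrative, not a fixed schema:

```python
# Filtered kNN sketch: restrict the candidate set by metadata, then rank by
# cosine similarity. The filter field names are purely illustrative.
import numpy as np

def knn_with_filters(query_vec: np.ndarray, vectors: np.ndarray,
                     metadata: list[dict], k: int = 10,
                     filters: dict | None = None) -> list[int]:
    allowed = [i for i, m in enumerate(metadata)
               if not filters or all(m.get(f) == v for f, v in filters.items())]
    if not allowed:
        return []
    sub = vectors[allowed]
    sims = sub @ query_vec / (np.linalg.norm(sub, axis=1) * np.linalg.norm(query_vec))
    return [allowed[i] for i in np.argsort(-sims)[:k]]

# knn_with_filters(q_vec, vecs, meta, filters={"tenant": "acme", "language": "en"})
```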
2. Hybrid RAG: Vector Search + Full-Text Search (BM25 / FTS)
- Combines semantic similarity with keyword-based search
- Especially effective for invoice numbers, product codes, named fields, and legal references
Typical implementation:
- Retrieve top-K candidates from vector search
- Retrieve top-K candidates from full-text search
- Merge the result sets
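A sketch of such a merge, using the rank_bm25 package for the keyword side and cosine similarity for the vector side; in a real deployment the full-text part usually lives in the database's own FTS engine:

```python
# Hybrid retrieval sketch: union of BM25 top-K and vector top-K candidates.
# The rank_bm25 package stands in for a production full-text engine.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_candidates(query: str, query_vec: np.ndarray,
                      chunks: list[str], vectors: np.ndarray,
                      k: int = 20) -> list[int]:
    # Full-text side: BM25 over naively tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    fts_top = np.argsort(-bm25.get_scores(query.lower().split()))[:k]

    # Vector side: cosine similarity
    sims = vectors @ query_vec / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec))
    vec_top = np.argsort(-sims)[:k]

    # Merge the two candidate sets (deduplicated, order preserved)
    seen, merged = set(), []
    for i in list(vec_top) + list(fts_top):
        if int(i) not in seen:
            seen.add(int(i))
            merged.append(int(i))
    return merged
```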
3. Rank Fusion: RRF (Reciprocal Rank Fusion)
- Combines rankings from multiple retrievers (vector, FTS, metadata-only, recency-based, …)
- Produces a stable and robust final ranking
In practice: Hybrid retrieval + RRF is often the default production setup.
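RRF itself is only a few lines: every retriever contributes 1 / (k + rank) for each document it returns, and the summed scores define the fused ranking (k = 60 is the constant from the original paper):

```python
# Reciprocal Rank Fusion: each retriever contributes 1 / (k + rank) per document.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([vector_ranking, fulltext_ranking, recency_ranking])
```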
4. Re-ranking (Two-Stage Retrieval)
Pipeline:
- Fast retrieval (e.g. top-50 or top-100 candidates)
- Re-ranking using a cross-encoder or an LLM acting as a relevance judge
Pros: dramatically improved relevance, especially for long or ambiguous queries.
Cons: higher cost than pure database retrieval.
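A sketch of the second stage with a cross-encoder; the sentence-transformers package and the specific model name are one common public option, not a recommendation:

```python
# Two-stage sketch: a fast retriever produces candidates, a cross-encoder
# re-scores each (query, chunk) pair.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

# stage 1: candidates = hybrid_candidates(...)    # top-50 / top-100
# stage 2: context    = rerank(query, candidates) # top-10 handed to the LLM
```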
5. MMR / Context Diversification
- Selects chunks that are not only relevant, but also diverse
- Reduces redundancy in the final context window
Result: better topical coverage and less repetition.
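A minimal MMR sketch over candidate vectors, where lambda_ balances relevance to the query against redundancy with chunks already selected:

```python
# MMR sketch: each pick maximizes (relevance to the query) minus (similarity
# to what is already selected); lambda_ controls the trade-off.
import numpy as np

def mmr(query_vec: np.ndarray, cand_vecs: np.ndarray,
        k: int = 10, lambda_: float = 0.7) -> list[int]:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    relevance = [cos(v, query_vec) for v in cand_vecs]
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cos(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```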
6. Metadata-First / Self-Query RAG
- Translate the natural-language query into structured filters (rules or an LLM), e.g.:
- time range
- author
- department
- document type
- client, project, folder
- Run retrieval only within the narrowed document set
Key benefit: massive precision gains in enterprise data.
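A sketch of the self-query step; call_llm is a hypothetical stand-in for whatever chat-completion client you use, and the filter keys are illustrative:

```python
# Self-query sketch: an LLM turns the question into structured filters that
# narrow the document set before retrieval. `call_llm` is a hypothetical helper.
import json

FILTER_PROMPT = """Extract search filters from the question as JSON with keys
"document_type", "department", "client", "date_from", "date_to" (null if absent).
Question: {question}"""

def self_query_filters(question: str) -> dict:
    raw = call_llm(FILTER_PROMPT.format(question=question))  # hypothetical helper
    return {key: val for key, val in json.loads(raw).items() if val is not None}

# filters = self_query_filters("ACME invoices for cloud services from 2023")
# hits = knn_with_filters(query_vec, vectors, metadata, filters=filters)
```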
7. Parent-Child / Hierarchical RAG
- Store fine-grained child chunks with references to their parent sections/documents
- Retrieve on child chunks, but inject into the prompt:
- surrounding context
- or parent-level summaries
Benefits: fewer hallucinations, better citations, improved coherence.
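A sketch of the child-to-parent expansion, assuming each child chunk stores a parent_id and parent texts (or summaries) are kept in a separate lookup:

```python
# Parent-child sketch: search runs on small child chunks, but the prompt gets
# the parent section (or its summary) they belong to. Data shapes are illustrative.
def expand_to_parents(child_hits: list[int], child_meta: list[dict],
                      parents: dict[str, str], max_parents: int = 5) -> list[str]:
    """child_meta[i]["parent_id"] points at the parent section/document;
    parents maps parent_id -> full section text or a summary of it."""
    seen, context = set(), []
    for i in child_hits:
        pid = child_meta[i]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
        if len(context) >= max_parents:
            break
    return context
```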
8. Multi-Hop RAG (Iterative Retrieval)
- Retrieve initial evidence → derive follow-up queries → retrieve more evidence
Useful for: cross-document reasoning (e.g., “What is the impact of X on Y in project Z?”).
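A sketch of the iterative loop, reusing the search() helper from the first example and the hypothetical call_llm wrapper:

```python
# Multi-hop sketch: retrieve, let the LLM derive a follow-up query from the
# evidence gathered so far, retrieve again. `search` is the vector-search helper
# sketched earlier; `call_llm` is a hypothetical chat-completion wrapper.
def multi_hop(question: str, hops: int = 2, k: int = 5) -> list[str]:
    evidence: list[str] = []
    query = question
    for _ in range(hops):
        evidence += search(query, k=k)
        query = call_llm(
            f"Question: {question}\n\nEvidence so far:\n" + "\n---\n".join(evidence)
            + "\n\nWrite ONE follow-up search query that fills the biggest gap."
        )
    return evidence
```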
9. Query Rewriting / Multi-Query / HyDE
- The LLM generates multiple query variants:
- synonyms
- expanded formulations
- HyDE: hypothetical answer → embedded → used for retrieval
Effective for: short or vague queries.
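A sketch combining multi-query expansion with HyDE, again reusing the search() helper and the hypothetical call_llm wrapper:

```python
# Multi-query + HyDE sketch: retrieve with several query variants plus a
# hypothetical answer and pool the results.
def multi_query_retrieve(question: str, k: int = 5) -> list[str]:
    variants = [question]
    variants += call_llm(
        f"Rewrite this search query in 3 different ways, one per line:\n{question}"
    ).splitlines()
    # HyDE: retrieve with a hypothetical answer instead of the question itself
    variants.append(call_llm(f"Write a short, plausible answer to: {question}"))

    pooled, seen = [], set()
    for variant in variants:
        for chunk_text in search(variant, k=k):
            if chunk_text not in seen:
                seen.add(chunk_text)
                pooled.append(chunk_text)
    return pooled
```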
10. GraphRAG / Entity-Aware RAG
- Extract entities and relationships from chunks
- Build a knowledge graph (or at least an entity index)
- Guide retrieval via entity neighborhoods and relationships
Ideal for: contracts, knowledge bases, and domains with strong entity structure (people, companies, products, processes).
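A lightweight sketch of the entity-index variant, assuming entities have already been extracted per chunk (with an NER model or an LLM):

```python
# Entity-aware sketch: retrieval is guided by the 1-hop neighborhood of the
# entities found in the query. `chunk_entities` holds the pre-extracted
# entity sets, one per chunk.
from collections import defaultdict

def build_entity_index(chunk_entities: list[set[str]]):
    entity_to_chunks = defaultdict(set)   # entity -> chunk ids mentioning it
    related = defaultdict(set)            # entity -> co-occurring entities
    for chunk_id, ents in enumerate(chunk_entities):
        for e in ents:
            entity_to_chunks[e].add(chunk_id)
            related[e] |= ents - {e}
    return entity_to_chunks, related

def entity_guided_candidates(query_entities: set[str],
                             entity_to_chunks, related) -> set[int]:
    neighborhood = set(query_entities)
    for e in query_entities:
        neighborhood |= related.get(e, set())
    # Candidates mention the query entities or their direct neighbors;
    # rank them afterwards with vector similarity, RRF, or a re-ranker.
    candidates: set[int] = set()
    for e in neighborhood:
        candidates |= entity_to_chunks.get(e, set())
    return candidates
```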
11. PageIndex (Coarse-to-Fine Retrieval)
- Generate embeddings for entire pages or chapters
- Generate standard fine-grained chunks
- Two-phase retrieval:
- Identify relevant pages/chapters
- Search only within those sections
Benefits:
- 📉 fewer random chunks
- 📈 higher precision
- ⚡ faster queries
- 🧠 more coherent context for the LLM
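A sketch of the two-phase retrieval, assuming one embedding per page/chapter plus the usual chunk embeddings and a chunk-to-page mapping:

```python
# Coarse-to-fine sketch: phase 1 ranks page/chapter embeddings, phase 2 runs
# the usual chunk-level search restricted to chunks from the winning pages.
import numpy as np

def coarse_to_fine(query_vec: np.ndarray, page_vecs: np.ndarray,
                   chunk_vecs: np.ndarray, chunk_page: list[int],
                   top_pages: int = 3, k: int = 10) -> list[int]:
    def cos_all(mat, q):
        return mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))

    # Phase 1: pick the most relevant pages/chapters
    pages = set(np.argsort(-cos_all(page_vecs, query_vec))[:top_pages])

    # Phase 2: fine-grained search only within those pages
    allowed = [i for i, p in enumerate(chunk_page) if p in pages]
    sims = cos_all(chunk_vecs[allowed], query_vec)
    return [allowed[i] for i in np.argsort(-sims)[:k]]
```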
Implementation and Cost Considerations
Implementing pure vector search is relatively simple and inexpensive. Document embedding is typically cheap compared to LLM-heavy processing steps such as re-ranking or graph construction.
With the text-embedding-3-small model, the rough scale is on the order of tens of millions of tokens per $1 (exact page-equivalents depend heavily on language, formatting, and what you count as an “A4 page”).
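A quick back-of-envelope check (the per-token price below is an assumption; verify against current pricing):

```python
# Back-of-envelope embedding cost. The price per million tokens is an
# assumption, and ~500 tokens per A4-style page is a rough heuristic that
# varies a lot with language and formatting.
price_per_million_tokens = 0.02   # USD, assumed list price
tokens_per_page = 500             # rough heuristic
pages = 100_000

total_tokens = pages * tokens_per_page
cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens ≈ ${cost:.2f}")   # 50,000,000 tokens ≈ $1.00
```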
More advanced techniques—such as GraphRAG or large-scale re-ranking—are significantly more demanding in both implementation effort and cost, but can deliver substantially better results.
Final Thoughts
In practice, the best results come from combining multiple retrieval strategies. However, selecting the right mix for a specific document domain is far from trivial. It requires experimentation, domain knowledge, and a deep understanding of user behavior.
I am happy to share further experiences, patterns, and practical recommendations in this area.