Vectorising Corporate Documents in Practice

How I combined MCP, Python and PostgreSQL into a single workflow
Why the “real world” is tougher than demo projects
When I built a proof-of-concept for vectorising contracts stored on SharePoint, I immediately ran into three classic obstacles:
- Diverse formats — PDF, DOC, DOCX …
- Multiple languages — every document can be in a different language
- Scanned pages — part of the content is just a bitmap, so OCR is required
On top of that, files are constantly added, removed and modified, which means the workflow has to be fully incremental.
Indexing pipeline step by step
I split the whole process into nine independent modules connected via the Model Context Protocol (MCP). Each module consumes and returns JSON and is written either in Python or C#.
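To make that concrete, here is roughly the kind of JSON payload the change watcher (module 1) hands downstream. The field names are illustrative, not a fixed schema:

```python
# Roughly what module 1 emits for each detected change (illustrative fields).
change_event = {
    "file_url": "https://tenant.sharepoint.com/sites/legal/Contracts/nda.docx",
    "change_type": "modified",  # added | modified | removed
    "detected_at": "2025-01-15T08:30:00Z",
}
```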
| # | Module | Key tech | Responsibility |
|---|---|---|---|
| 1 | SharePoint change watcher | office365 | Returns file URL + change type |
| 2 | Export to PDF | headless LibreOffice | Keeps pagination for precise links |
| 3 | OCR | PyTesseract | Converts scans to text |
| 4 | PDF text extractor | PyMuPDF | Fast & reliable parser |
| 5 | Language detection | model o4-mini | Distinguishes EN/CZ … |
| 6 | AI post-processing | OpenAI agent | Fixes common OCR errors (“l” ↔ “1”, “o” ↔ “0” …) |
| 7 | Translation to one language | OpenAI | Simplifies searching |
| 8 | Chunking + overlap | langchain, tiktoken | Better context boundaries |
| 9 | Embedding & storage | PostgreSQL 17 + pgvector, text-embedding-3-small | Normalises text, computes vector, writes/updates DB |
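Modules 3 and 4 are where most demo projects fall over, so here is a minimal sketch of the per-page logic: try PyMuPDF’s text layer first and OCR only real scans. The 20-character heuristic and the ces+eng language pack are my assumptions, not fixed values:

```python
# Sketch of modules 3-4: extract text per page, falling back to OCR for
# scanned pages. Library calls are real (PyMuPDF, pytesseract); the
# threshold and OCR languages are illustrative.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pages(pdf_path: str) -> list[dict]:
    pages = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            text = page.get_text("text").strip()
            is_scan = len(text) < 20  # heuristic: (almost) no text layer => bitmap
            if is_scan:
                pix = page.get_pixmap(dpi=300)  # rasterise the page for OCR
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(img, lang="ces+eng")
            pages.append({"page": number, "text": text, "is_scan": is_scan})
    return pages
```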
Pro tip: Before embedding I convert everything to lowercase and strip extra spaces – it noticeably improves chunk matching.
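A compact sketch of module 8 together with that normalisation step; the chunk size and overlap below are illustrative, not the production values:

```python
# Sketch of module 8 plus the pro tip above: token-aware chunking with
# overlap, then lowercase/whitespace normalisation before embedding.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="text-embedding-3-small",  # counts tokens the way the embedder does
    chunk_size=500,
    chunk_overlap=50,
)

def normalise(chunk: str) -> str:
    # lowercase + collapse runs of whitespace into single spaces
    return " ".join(chunk.lower().split())

def chunk_page(page_text: str) -> list[str]:
    return [normalise(c) for c in splitter.split_text(page_text)]
```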
The embeddings table schema
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id SERIAL PRIMARY KEY,
    file_name TEXT NOT NULL,
    page INTEGER NOT NULL,
    chunk_id INTEGER NOT NULL,
    text_original TEXT NOT NULL,
    text_cs TEXT,
    language TEXT,
    tokens_nr INTEGER NOT NULL,
    is_scan BOOLEAN,
    embed_vector VECTOR(1536)
);

-- cosine operator class, so the index serves <=> distance queries
CREATE INDEX ON embeddings USING ivfflat (embed_vector vector_cosine_ops);
```
This gives us a vector index ready for lightning-fast queries.
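Module 9 then ties it together. A minimal sketch, assuming psycopg 3 with the pgvector Python adapter; the DSN is a placeholder, and a real pipeline would upsert on (file_name, page, chunk_id) rather than blindly insert:

```python
# Minimal sketch of module 9: embed normalised chunks, write them to Postgres.
import numpy as np
import psycopg
import tiktoken
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.encoding_for_model("text-embedding-3-small")

conn = psycopg.connect("dbname=docs")  # placeholder DSN
register_vector(conn)  # teaches psycopg to send numpy arrays as vectors

INSERT_SQL = """
    INSERT INTO embeddings
        (file_name, page, chunk_id, text_original, tokens_nr, is_scan, embed_vector)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
"""

def store_chunks(conn, file_name, page, chunks, is_scan):
    # One API call embeds the whole batch of chunks for the page.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    with conn.cursor() as cur:
        for chunk_id, (chunk, item) in enumerate(zip(chunks, resp.data)):
            cur.execute(INSERT_SQL, (
                file_name, page, chunk_id, chunk,
                len(enc.encode(chunk)), is_scan,
                np.array(item.embedding),
            ))
    conn.commit()
```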
Search layer: vectors + full-text + SQL
Vector similarity is a necessary foundation, but it isn’t sufficient on its own. On top of the table I expose three functions:
- get_vector – classic ANN search; parameters: normalised prompt, number of chunks, similarity threshold
- get_fulltext – full-text search on text_cs with unaccent & a language stemmer
- get_sql – an open field for any SQL query the LLM builds itself
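One possible shape of get_vector, written here as a Python helper over the same connection. The <=> operator is pgvector’s cosine distance; the parameter names and the 0.5 default threshold are illustrative:

```python
# Sketch of get_vector: cosine-distance ANN search with a distance threshold.
ANN_SQL = """
    SELECT file_name, page, chunk_id, text_cs,
           embed_vector <=> %s AS distance
    FROM embeddings
    WHERE (embed_vector <=> %s) < %s   -- drop chunks that are too far away
    ORDER BY embed_vector <=> %s      -- lets the ivfflat index drive the scan
    LIMIT %s
"""

def get_vector(conn, prompt_vector, top_k: int = 10, threshold: float = 0.5):
    with conn.cursor() as cur:
        cur.execute(ANN_SQL, (prompt_vector, prompt_vector, threshold,
                              prompt_vector, top_k))
        return cur.fetchall()
```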
Above that runs an orchestrator (OpenAI model) – it picks the right method, converts the query to embeddings and formats an HTML answer with a list of “clickable” links back to SharePoint. Fun fact: the system prompt alone is several pages long – the LLM truly knows what it’s doing!
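I can only hint at the orchestrator here, but a sketch of how the three functions might be exposed to the model via OpenAI tool calling looks like this; the tool schemas are trimmed, and the model name and prompts are placeholders:

```python
# Sketch of the orchestrator's method selection via OpenAI tool calling.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a contract-search assistant..."  # the real one spans pages
user_question = "Which contracts mention a 2025 termination date?"

tools = [
    {"type": "function", "function": {
        "name": "get_vector",
        "description": "Semantic ANN search over embedded chunks",
        "parameters": {"type": "object", "properties": {
            "prompt": {"type": "string"},
            "top_k": {"type": "integer"},
            "threshold": {"type": "number"},
        }, "required": ["prompt"]},
    }},
    {"type": "function", "function": {
        "name": "get_fulltext",
        "description": "Full-text search on text_cs with unaccent and stemming",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"},
        }, "required": ["query"]},
    }},
    {"type": "function", "function": {
        "name": "get_sql",
        "description": "Run a read-only SQL query against the embeddings table",
        "parameters": {"type": "object", "properties": {
            "sql": {"type": "string"},
        }, "required": ["sql"]},
    }},
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any tool-calling model works
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": user_question}],
    tools=tools,
)
# response.choices[0].message.tool_calls shows which search method the model chose.
```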
What proved valuable in practice 💡
- PDF as a unifying format – pagination gives explainability in a single click.
- Single-language embeddings – translating into one language unified the vector space and improved recall.
- Hybrid queries – combining vector + full-text + SQL gives analysts and the LLM orchestrator much more room to manoeuvre.
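As an example of what “hybrid” means here, a sketch of one query that pre-filters with full-text and ranks by vector distance; the 'simple' text-search config and unaccent() assume those extensions are installed:

```python
# Hybrid search sketch: full-text pre-filter, vector ranking.
HYBRID_SQL = """
    SELECT file_name, page, chunk_id, text_cs,
           embed_vector <=> %s AS distance
    FROM embeddings
    WHERE to_tsvector('simple', unaccent(text_cs))
          @@ plainto_tsquery('simple', unaccent(%s))
    ORDER BY distance
    LIMIT %s
"""

def hybrid_search(conn, prompt_vector, keywords: str, top_k: int = 10):
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, (prompt_vector, keywords, top_k))
        return cur.fetchall()
```

In production the tsvector expression would get its own GIN index so the pre-filter stays cheap.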
Conclusion
Building robust enterprise search over hundreds of living documents isn’t a matter of one script but of a modular, scalable architecture. The result, however, massively extends data-mining possibilities and saves hours of manual digging in contracts and reports.
TL;DR — If you face a similar task, separate concerns, modularise and combine techniques. MCP + Python + PostgreSQL with pgvector form a rock-solid foundation for a long-term vector infrastructure.