Vectorising Corporate Documents in Practice

02.06.2025

How I combined MCP, Python and PostgreSQL into a single workflow

Why the “real world” is tougher than demo projects

When I built a proof-of-concept for vectorising contracts stored on SharePoint, I immediately ran into three classic obstacles:

  • Diverse formats — PDF, DOC, DOCX …
  • Multiple languages — every document can be in a different language
  • Scanned pages — part of the content is just a bitmap, so OCR is required

On top of that, files are constantly added, removed, and modified, which means the workflow has to be fully incremental.


Indexing pipeline step by step

I split the whole process into nine independent modules connected via the Model Context Protocol (MCP). Each module consumes and returns JSON and is written either in Python or C#.
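To make the "consumes and returns JSON" contract concrete, here is a minimal sketch of how one module's output might feed the next. The field names (`file_url`, `change_type`, `action`) are illustrative assumptions, not the actual MCP message schema used in this pipeline:

```python
import json

# Hypothetical event emitted by the SharePoint change watcher (module 1).
# Field names are illustrative only.
change_event = {
    "file_url": "https://contoso.sharepoint.com/contracts/nda-2024.docx",
    "change_type": "modified",   # added | modified | removed
}

def to_pdf_export_request(event: dict) -> dict:
    """Translate a change-watcher event into input for the next module.

    A removed file short-circuits the pipeline: its embeddings are
    deleted instead of being re-exported to PDF.
    """
    if event["change_type"] == "removed":
        return {"action": "delete_embeddings", "file_url": event["file_url"]}
    return {"action": "export_pdf", "file_url": event["file_url"]}

print(json.dumps(to_pdf_export_request(change_event)))
```

Keeping every hop in plain JSON is what makes the modules swappable between Python and C#.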

 # | Module                      | Key tech                                         | Responsibility
 1 | SharePoint change watcher   | office365                                        | Returns file URL + change type
 2 | Export to PDF               | headless LibreOffice                             | Keeps pagination for precise links
 3 | OCR                         | PyTesseract                                      | Converts scans to text
 4 | PDF text extractor          | PyMuPDF                                          | Fast & reliable parser
 5 | Language detection          | model o4-mini                                    | Distinguishes EN/CZ …
 6 | AI post-processing          | OpenAI agent                                     | Fixes common OCR errors (“l” ↔ “1”, “o” ↔ “0” …)
 7 | Translation to one language | OpenAI                                           | Simplifies searching
 8 | Chunking + overlap          | langchain, tiktoken                              | Better context boundaries
 9 | Embedding & storage         | PostgreSQL 17 + pgvector, text-embedding-3-small | Normalises text, computes vector, writes/updates DB
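The chunking step (module 8) deserves a closer look. In the real pipeline it is handled by langchain's splitters with tiktoken-based token counting; the simplified sketch below illustrates the core idea only, operating on a pre-tokenised list, with window sizes chosen for illustration:

```python
def chunk_with_overlap(tokens: list[str], size: int = 200, overlap: int = 40) -> list[list[str]]:
    """Split a token list into windows of `size` tokens, each window
    sharing `overlap` tokens with its predecessor (step = size - overlap).
    The shared margin keeps sentences that straddle a boundary searchable
    from both neighbouring chunks.
    """
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Each chunk later becomes one row in the `embeddings` table, which is why `chunk_id` is part of the schema below.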
Pro tip: Before embedding I convert everything to lowercase and strip extra spaces – it noticeably improves chunk matching.
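The normalisation from the pro tip fits in a few lines; a minimal version might look like this:

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace before embedding,
    so that trivially different spellings map to the same chunks."""
    return re.sub(r"\s+", " ", text).strip().lower()
```

Applying the same function to both the stored chunks and the incoming query keeps the two sides of the similarity comparison consistent.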

Table embeddings schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id            SERIAL PRIMARY KEY,
    file_name     TEXT    NOT NULL,
    page          INTEGER NOT NULL,
    chunk_id      INTEGER NOT NULL,
    text_original TEXT    NOT NULL,
    text_cs       TEXT,
    language      TEXT,
    tokens_nr     INTEGER NOT NULL,
    is_scan       BOOLEAN,
    embed_vector  VECTOR(1536)
);
CREATE INDEX ON embeddings USING ivfflat (embed_vector vector_cosine_ops);

This gives us a vector index ready for lightning-fast queries.
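The query behind `get_vector` can be sketched in Python. The `<=>` operator is pgvector's cosine-distance operator (matching the `vector_cosine_ops` index above); actually executing the statement needs a driver such as psycopg and a live database, so here we only compose it. The parameter names are illustrative:

```python
# ANN query sketch: nearest chunks by cosine distance, filtered by a
# distance threshold, mirroring get_vector's parameters.
ANN_SQL = """
SELECT file_name, page, chunk_id, text_cs,
       embed_vector <=> %(query_vec)s AS distance
FROM embeddings
WHERE embed_vector <=> %(query_vec)s < %(threshold)s
ORDER BY distance
LIMIT %(top_k)s;
"""

def ann_params(query_vec: list[float], top_k: int = 5, threshold: float = 0.5) -> dict:
    """Bind parameters for ANN_SQL; pgvector accepts the vector
    serialised as a '[x1,x2,...]' string literal."""
    return {
        "query_vec": "[" + ",".join(str(x) for x in query_vec) + "]",
        "top_k": top_k,
        "threshold": threshold,
    }
```

Passing the vector as a named parameter (rather than string-formatting it into the SQL) keeps the statement safe and cacheable.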


Search layer: vectors + full-text + SQL

Vector similarity is a necessary foundation, but on its own it is not sufficient. On top of the table I expose three functions:

  1. get_vector – classic ANN search
    parameters: normalised prompt, number of chunks, threshold
  2. get_fulltext – full-text on text_cs with unaccent & language stemmer
  3. get_sql – open field for any SQL query the LLM builds itself

Above that runs an orchestrator (OpenAI model) – it picks the right method, converts the query to embeddings and formats an HTML answer with a list of “clickable” links back to SharePoint. Fun fact: the system prompt alone is several pages long – the LLM truly knows what it’s doing!
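In the real system that routing decision is made by the LLM itself, guided by the system prompt. As a toy stand-in, the three-way choice could be approximated by a keyword heuristic like this (an illustration of the dispatch, not the production logic):

```python
# Toy router illustrating the orchestrator's choice between the three
# exposed search functions. The real pipeline delegates this decision
# to an OpenAI model.
def pick_method(question: str) -> str:
    q = question.lower()
    if any(k in q for k in ("count", "how many", "sum", "average")):
        return "get_sql"        # aggregation -> let the LLM write SQL
    if '"' in question:
        return "get_fulltext"   # quoted phrase -> exact-match full-text
    return "get_vector"         # default: semantic ANN search
```

The benefit of an LLM router over such hard-coded rules is that it degrades gracefully: ambiguous questions still land on a reasonable method instead of falling through a rule list.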


What proved valuable in practice 💡

  • PDF as a unifying format – pagination gives explainability in a single click.
  • Single-language embeddings – translating into one language unified the vector space and improved recall.
  • Hybrid queries – combining vector + full-text + SQL gives analysts and the LLM orchestrator much more room to manoeuvre.

Conclusion

Building robust enterprise search over hundreds of living documents isn’t a matter of one script but of a modular, scalable architecture. The result, however, massively extends data-mining possibilities and saves hours of manual digging in contracts and reports.

TL;DR — If you face a similar task, separate concerns, modularise, and combine techniques. MCP + Python + PostgreSQL with pgvector form a rock-solid foundation for a long-term vector infrastructure.