Vectorising Corporate Documents in Practice

02.06.2025

How I combined MCP, Python and PostgreSQL into a single workflow

Why the “real world” is tougher than demo projects

When I built a proof-of-concept for vectorising contracts stored on SharePoint, I immediately ran into three classic obstacles:

  • Diverse formats — PDF, DOC, DOCX …
  • Multiple languages — every document can be in a different language
  • Scanned pages — part of the content is just a bitmap, so OCR is required

On top of that, files are constantly added, removed, and modified, which means the workflow has to be fully incremental.


Indexing pipeline step by step

I split the whole process into nine independent modules connected via the Model Context Protocol (MCP). Each module consumes and returns JSON and is written either in Python or C#.
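To make the "consumes and returns JSON" contract concrete, here is a minimal sketch of how one module's output might feed the next. The field names (`file_url`, `change_type`, `action`) are illustrative assumptions, not the actual MCP message schema used in this pipeline:

```python
import json

# Hypothetical event emitted by the SharePoint change watcher (module 1).
# Field names are illustrative only.
change_event = {
    "file_url": "https://contoso.sharepoint.com/contracts/nda-2024.docx",
    "change_type": "modified",   # added | modified | removed
}

def to_pdf_export_request(event: dict) -> dict:
    """Translate a change-watcher event into input for the next module.

    A removed file short-circuits the pipeline: its embeddings are
    deleted instead of being re-exported to PDF.
    """
    if event["change_type"] == "removed":
        return {"action": "delete_embeddings", "file_url": event["file_url"]}
    return {"action": "export_pdf", "file_url": event["file_url"]}

print(json.dumps(to_pdf_export_request(change_event)))
```

Keeping every hop in plain JSON is what makes the modules swappable between Python and C#.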

 # | Module                      | Key tech                                         | Responsibility
 1 | SharePoint change watcher   | office365                                        | Returns file URL + change type
 2 | Export to PDF               | headless LibreOffice                             | Keeps pagination for precise links
 3 | OCR                         | PyTesseract                                      | Converts scans to text
 4 | PDF text extractor          | PyMuPDF                                          | Fast & reliable parser
 5 | Language detection          | model o4-mini                                    | Distinguishes EN/CZ …
 6 | AI post-processing          | OpenAI agent                                     | Fixes common OCR errors (“l” ↔ “1”, “o” ↔ “0” …)
 7 | Translation to one language | OpenAI                                           | Simplifies searching
 8 | Chunking + overlap          | langchain, tiktoken                              | Better context boundaries
 9 | Embedding & storage         | PostgreSQL 17 + pgvector, text-embedding-3-small | Normalises text, computes vector, writes/updates DB
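The chunking step (module 8) deserves a closer look. In the real pipeline it is handled by langchain's splitters with tiktoken-based token counting; the simplified sketch below illustrates the core idea only, operating on a pre-tokenised list, with window sizes chosen for illustration:

```python
def chunk_with_overlap(tokens: list[str], size: int = 200, overlap: int = 40) -> list[list[str]]:
    """Split a token list into windows of `size` tokens, each window
    sharing `overlap` tokens with its predecessor (step = size - overlap).
    The shared margin keeps sentences that straddle a boundary searchable
    from both neighbouring chunks.
    """
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Each chunk later becomes one row in the `embeddings` table, which is why `chunk_id` is part of the schema below.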
Pro tip: Before embedding I convert everything to lowercase and strip extra spaces – it noticeably improves chunk matching.
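The normalisation from the pro tip fits in a few lines; a minimal version might look like this:

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace before embedding,
    so that trivially different spellings map to the same chunks."""
    return re.sub(r"\s+", " ", text).strip().lower()
```

Applying the same function to both the stored chunks and the incoming query keeps the two sides of the similarity comparison consistent.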

Table embeddings schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id            SERIAL PRIMARY KEY,
    file_name     TEXT    NOT NULL,
    page          INTEGER NOT NULL,
    chunk_id      INTEGER NOT NULL,
    text_original TEXT    NOT NULL,
    text_cs       TEXT,
    language      TEXT,
    tokens_nr     INTEGER NOT NULL,
    is_scan       BOOLEAN,
    embed_vector  VECTOR(1536)
);
CREATE INDEX ON embeddings USING ivfflat (embed_vector vector_cosine_ops);

This gives us a vector index ready for lightning-fast queries.
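The query behind `get_vector` can be sketched in Python. The `<=>` operator is pgvector's cosine-distance operator (matching the `vector_cosine_ops` index above); actually executing the statement needs a driver such as psycopg and a live database, so here we only compose it. The parameter names are illustrative:

```python
# ANN query sketch: nearest chunks by cosine distance, filtered by a
# distance threshold, mirroring get_vector's parameters.
ANN_SQL = """
SELECT file_name, page, chunk_id, text_cs,
       embed_vector <=> %(query_vec)s AS distance
FROM embeddings
WHERE embed_vector <=> %(query_vec)s < %(threshold)s
ORDER BY distance
LIMIT %(top_k)s;
"""

def ann_params(query_vec: list[float], top_k: int = 5, threshold: float = 0.5) -> dict:
    """Bind parameters for ANN_SQL; pgvector accepts the vector
    serialised as a '[x1,x2,...]' string literal."""
    return {
        "query_vec": "[" + ",".join(str(x) for x in query_vec) + "]",
        "top_k": top_k,
        "threshold": threshold,
    }
```

Passing the vector as a named parameter (rather than string-formatting it into the SQL) keeps the statement safe and cacheable.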


Search layer: vectors + full-text + SQL

Vector similarity is a necessary foundation, but on its own it is not sufficient. On top of the table I expose three functions:

  1. get_vector – classic ANN search
    parameters: normalised prompt, number of chunks, threshold
  2. get_fulltext – full-text on text_cs with unaccent & language stemmer
  3. get_sql – open field for any SQL query the LLM builds itself

Above that runs an orchestrator (OpenAI model) – it picks the right method, converts the query to embeddings and formats an HTML answer with a list of “clickable” links back to SharePoint. Fun fact: the system prompt alone is several pages long – the LLM truly knows what it’s doing!
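In the real system that routing decision is made by the LLM itself, guided by the system prompt. As a toy stand-in, the three-way choice could be approximated by a keyword heuristic like this (an illustration of the dispatch, not the production logic):

```python
# Toy router illustrating the orchestrator's choice between the three
# exposed search functions. The real pipeline delegates this decision
# to an OpenAI model.
def pick_method(question: str) -> str:
    q = question.lower()
    if any(k in q for k in ("count", "how many", "sum", "average")):
        return "get_sql"        # aggregation -> let the LLM write SQL
    if '"' in question:
        return "get_fulltext"   # quoted phrase -> exact-match full-text
    return "get_vector"         # default: semantic ANN search
```

The benefit of an LLM router over such hard-coded rules is that it degrades gracefully: ambiguous questions still land on a reasonable method instead of falling through a rule list.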


What proved valuable in practice 💡

  • PDF as a unifying format – pagination gives explainability in a single click.
  • Single-language embeddings – translating into one language unified the vector space and improved recall.
  • Hybrid queries – combining vector + full-text + SQL gives analysts and the LLM orchestrator much more room to manoeuvre.

Conclusion

Building robust enterprise search over hundreds of living documents isn’t a matter of one script but of a modular, scalable architecture. The result, however, massively extends data-mining possibilities and saves hours of manual digging in contracts and reports.

TL;DR — If you face a similar task, separate concerns, modularise, and combine techniques. MCP + Python + PostgreSQL with pgvector form a rock-solid foundation for a long-term vector infrastructure.