Vectorising Corporate Documents in Practice
How I combined MCP, Python and PostgreSQL into a single workflow
Why the “real world” is tougher than demo projects
When I built a proof-of-concept for vectorising contracts stored on SharePoint, I immediately ran into three classic obstacles:
- Diverse formats – PDF, DOC, DOCX …
- Multiple languages – every document can be in a different language
- Scanned pages – part of the content is just a bitmap, so OCR is required
On top of that, files are constantly added, removed and modified, which means the workflow has to be fully incremental.
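A minimal sketch of that incremental logic, assuming the watcher can list file URLs together with a change stamp (ETag or last-modified date); the snapshot file and function names are hypothetical, not the actual module:

```python
import json
from pathlib import Path

STATE_FILE = Path("indexed_files.json")  # hypothetical local snapshot of the last run

def diff_listing(current: dict[str, str]) -> dict[str, list[str]]:
    """current maps file URL -> change stamp (ETag or last-modified)."""
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    added    = [url for url in current if url not in previous]
    removed  = [url for url in previous if url not in current]
    modified = [url for url, stamp in current.items()
                if url in previous and previous[url] != stamp]
    STATE_FILE.write_text(json.dumps(current))
    return {"added": added, "removed": removed, "modified": modified}
```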
Indexing pipeline step by step
I split the whole process into nine independent modules connected via the Model Context Protocol (MCP). Each module consumes and returns JSON and is written in either Python or C#.
| # | Module | Key tech | Responsibility | 
|---|---|---|---|
| 1 | SharePoint change watcher | office365 | Returns file-URL + change type | 
| 2 | Export to PDF | headless LibreOffice | Keeps pagination for precise links | 
| 3 | OCR | PyTesseract | Converts scans to text | 
| 4 | PDF text extractor | PyMuPDF | Fast & reliable parser | 
| 5 | Language detection | model o4-mini | Distinguishes EN/CZ … | 
| 6 | AI post-processing | OpenAI agent | Fixes common OCR errors (“l” ↔ “1”, “o” ↔ “0”…) | 
| 7 | Translation to one language | OpenAI | Simplifies searching | 
| 8 | Chunking + overlap | langchain, tiktoken | Better context boundaries | 
| 9 | Embedding & storage | PostgreSQL 17 + pgvector, text-embedding-3-small | Normalises text, computes vector, writes/updates DB | 
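To make modules 3 and 4 concrete, here is a rough sketch of per-page extraction with an OCR fallback – the 50-character scan heuristic and the `ces+eng` Tesseract language packs are illustrative choices, not fixed settings:

```python
import fitz                 # PyMuPDF
import pytesseract
from PIL import Image

def extract_pages(pdf_path: str) -> list[dict]:
    pages = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            text = page.get_text().strip()
            is_scan = len(text) < 50        # almost no embedded text -> treat as a scan
            if is_scan:
                # render the page as a bitmap and OCR it
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(img, lang="ces+eng")
            pages.append({"page": number, "text": text, "is_scan": is_scan})
    return pages
```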
Pro tip: Before embedding I convert everything to lowercase and strip extra spaces – it noticeably improves chunk matching.
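In code, that normalisation plus the token-window chunking of module 8 can be sketched with tiktoken alone – the chunk size and overlap below are illustrative values:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-3-small

def normalise(text: str) -> str:
    return " ".join(text.lower().split())   # lowercase + collapse extra whitespace

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    tokens = enc.encode(normalise(text))
    step = size - overlap                   # consecutive chunks share `overlap` tokens
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```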
The embeddings table schema

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id            SERIAL PRIMARY KEY,
    file_name     TEXT    NOT NULL,
    page          INTEGER NOT NULL,
    chunk_id      INTEGER NOT NULL,
    text_original TEXT    NOT NULL,
    text_cs       TEXT,
    language      TEXT,
    tokens_nr     INTEGER NOT NULL,
    is_scan       BOOLEAN,
    embed_vector  VECTOR(1536)
);

-- Cosine-distance ANN index. text-embedding-3-small vectors are
-- unit-normalised, so cosine and L2 distance rank results identically.
CREATE INDEX ON embeddings
    USING ivfflat (embed_vector vector_cosine_ops) WITH (lists = 100);
```
This gives us a vector index ready for lightning-fast queries.
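Module 9 then embeds each chunk and writes or updates the row. A sketch, assuming psycopg2 and the openai v1 SDK; the unique index on (file_name, page, chunk_id) that the upsert relies on is an assumption, not part of the schema above:

```python
import psycopg2
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def store_chunk(conn, file_name: str, page: int, chunk_id: int, text: str) -> None:
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO embeddings (file_name, page, chunk_id,
                                       text_original, tokens_nr, embed_vector)
               VALUES (%s, %s, %s, %s, %s, %s::vector)
               ON CONFLICT (file_name, page, chunk_id)  -- assumes a unique index
               DO UPDATE SET text_original = EXCLUDED.text_original,
                             tokens_nr     = EXCLUDED.tokens_nr,
                             embed_vector  = EXCLUDED.embed_vector""",
            (file_name, page, chunk_id, text,
             len(enc.encode(text)),
             "[" + ",".join(map(str, vec)) + "]"),  # pgvector's text input format
        )
    conn.commit()
```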
Search layer: vectors + full-text + SQL
Vector similarity alone is a necessary but not sufficient foundation. On top of the table I expose three functions:
- `get_vector` – classic ANN search; parameters: normalised prompt, number of chunks, similarity threshold (a sketch follows below)
- `get_fulltext` – full-text search on `text_cs` with `unaccent` & a language stemmer
- `get_sql` – an open field for any SQL query the LLM builds itself
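As a sketch, `get_vector` can be a thin helper over pgvector's cosine-distance operator `<=>` – the default k and threshold below are illustrative:

```python
def get_vector(conn, prompt_vec: list[float], k: int = 8,
               threshold: float = 0.25) -> list[tuple]:
    """ANN search: return the k nearest chunks within the distance threshold."""
    vec = "[" + ",".join(map(str, prompt_vec)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """SELECT file_name, page, chunk_id, text_original,
                      embed_vector <=> %s::vector AS distance
               FROM embeddings
               WHERE embed_vector <=> %s::vector < %s
               ORDER BY distance
               LIMIT %s""",
            (vec, vec, threshold, k),
        )
        return cur.fetchall()
```

Keeping the distance in the SELECT list makes it easy for the orchestrator to report match confidence alongside each link.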
Above that runs an orchestrator (OpenAI model) – it picks the right method, converts the query to embeddings and formats an HTML answer with a list of “clickable” links back to SharePoint. Fun fact: the system prompt alone is several pages long – the LLM truly knows what it’s doing!
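For the wiring, the orchestrator can expose the three functions to the model as OpenAI tool schemas – the model name, parameter shapes and placeholder system prompt below are illustrative:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You orchestrate contract search..."  # the real one is several pages long

def tool(name: str, description: str, properties: dict, required: list) -> dict:
    return {"type": "function", "function": {
        "name": name, "description": description,
        "parameters": {"type": "object",
                       "properties": properties, "required": required}}}

TOOLS = [
    tool("get_vector", "ANN search over embedded chunks",
         {"prompt": {"type": "string"}, "k": {"type": "integer"},
          "threshold": {"type": "number"}}, ["prompt"]),
    tool("get_fulltext", "Full-text search on text_cs (unaccent + stemming)",
         {"query": {"type": "string"}}, ["query"]),
    tool("get_sql", "Run an arbitrary read-only SQL query",
         {"sql": {"type": "string"}}, ["sql"]),
]

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": "Which contracts expire in 2025?"}],
    tools=TOOLS,
)
print(response.choices[0].message.tool_calls)  # which function the model picked
```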
What proved valuable in practice 💡
- PDF as a unifying format – pagination gives explainability in a single click.
- Single-language embeddings – translating into one language unified the vector space and improved recall.
- Hybrid queries – combining vector + full-text + SQL gives analysts and the LLM orchestrator much more room to manoeuvre.
Conclusion
Building robust enterprise search over hundreds of living documents isn’t a matter of one script but of a modular, scalable architecture. The result, however, massively extends data-mining possibilities and saves hours of manual digging in contracts and reports.
TL;DR – If you face a similar task: separate concerns, modularise, and combine techniques. MCP + Python + PostgreSQL with pgvector form a rock-solid foundation for a long-term vector infrastructure.