Tool for Embedding Files into Your Own Database

07.04.2025

One of the most common uses of AI in companies is semantic search over their own documents. At this URL, I present a tool that converts a set of documents in various formats into a PostgreSQL vector database.

Why Use OpenAI Tools for Creating Vectors?

The reason is simple: in my experience, these tools handle non-English text most effectively.

Why a PostgreSQL Vector Database?

There are many other vector databases, such as Redis, Milvus, and Weaviate, and in some cases they may be more suitable (for example, when you need higher performance or have very large volumes of data). See this comparison.

With this tool, you can vectorize a set of files into a single table and then run semantic search, clustering, and similar operations over that table with plain SQL queries. Installation and usage instructions are in the project's README file.
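To illustrate what such a SQL query can look like, here is a minimal sketch that prepares a nearest-neighbour search using pgvector's cosine distance operator (<=>). The table name document_chunks and the columns file_name, page, chunk_text, and embedding are hypothetical; the real schema is described in the project's README.

```python
from typing import Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical table and column names, used for illustration only.
# "<=>" is pgvector's cosine distance operator (smaller means more similar).
SEARCH_SQL = """
SELECT file_name, page, chunk_text,
       embedding <=> %s::vector AS cosine_distance
FROM document_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def build_search(query_embedding: Sequence[float], limit: int = 5) -> Tuple[str, tuple]:
    """Return the SQL text and the parameters for a parameterized query."""
    literal = to_vector_literal(query_embedding)
    return SEARCH_SQL, (literal, literal, limit)
```

The returned pair can be passed to the execute method of a PostgreSQL driver cursor. The query embedding has to come from the same model that was used to embed the documents, otherwise the distances are meaningless.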

Notes:

For initial tests, use the text-embedding-3-small model, which is cheaper than text-embedding-3-large.
If you have a lot of data, first calculate how much the conversion will cost by using the -t switch.
Indexes are created with the ivfflat algorithm, but there are other options (for example, HNSW); see the pgvector documentation.
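The -t switch does the cost calculation for you. For a rough back-of-the-envelope check before you even install the tool, the arithmetic can be sketched as below. This is not the tool's implementation: the 4 characters per token ratio is only a common rule of thumb for English (non-English text usually produces more tokens), and the price is a parameter you must fill in from the current OpenAI price list.

```python
def estimate_embedding_cost(
    total_chars: int,
    price_per_million_tokens: float,
    chars_per_token: float = 4.0,  # assumption: rough average for English text
) -> tuple:
    """Return (estimated_tokens, estimated_cost) for embedding a text corpus."""
    tokens = total_chars / chars_per_token
    cost = tokens / 1_000_000 * price_per_million_tokens
    return round(tokens), cost


# Example: 4 million characters at an assumed price of 0.02 USD per 1M tokens.
tokens, cost = estimate_embedding_cost(4_000_000, price_per_million_tokens=0.02)
print(f"about {tokens} tokens, about {cost:.2f} USD")
```

Remember that overlapping chunks are embedded more than once, so a 20% overlap increases the token count (and the cost) by roughly a quarter.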

Recommended Chunk Size and Overlap for Different Use Cases:

200–500 tokens for regular document search (overlap 20–40%)
500–1000 tokens for summarization (overlap 10–20%)
50–100 tokens for detailed semantic search (overlap 0%)
200–500 tokens for code search (overlap 0–10%)
1000–1500 tokens for coarse summarization of large datasets (overlap 10–20%)
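To make the relationship between chunk size and overlap concrete, here is a minimal sketch of a sliding-window chunker. It works on any list of tokens; the function name and parameters are illustrative and are not taken from the tool. For real token counts you would first split the text with the tokenizer that matches your embedding model.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int, overlap_ratio: float) -> List[list]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # tokens shared with the previous chunk
    step = chunk_size - overlap                # how far the window moves each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Example: 10 tokens, chunks of 4, 25% overlap (1 shared token per boundary).
for chunk in chunk_tokens(list(range(10)), chunk_size=4, overlap_ratio=0.25):
    print(chunk)
```

With a chunk size of 400 and an overlap of 25%, each new chunk starts 300 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.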

Additional Tips:

For large amounts of data, embedding can take a very long time, up to tens of hours.
If you have a large amount of data, expect the vector database to be significantly larger than the text data itself.
All documents are first converted to PDF so that the approximate location of a search hit can be identified (at the page level).
The search tool is available in the vector_get project.
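As a rough illustration of the tip about database size, the raw storage for the vectors alone can be estimated as follows. The sketch assumes 1536 dimensions (the default for text-embedding-3-small) and pgvector's documented storage of 4 bytes per dimension plus 8 bytes per vector; it ignores the stored text, row overhead, and indexes, so the real table will be larger.

```python
def estimate_vector_bytes(num_chunks: int, dimensions: int = 1536) -> int:
    """Raw pgvector storage for the embeddings only: 4 bytes per dimension + 8 bytes each."""
    return num_chunks * (4 * dimensions + 8)


# Example: 100,000 chunks embedded with a 1536-dimensional model.
size = estimate_vector_bytes(100_000)
print(f"{size / 1024 ** 2:.0f} MiB for the vectors alone")
```

One vector therefore takes about 6 KB, while a 300-token chunk of English text is typically only a little over 1 KB, which is why the database grows well beyond the size of the source text.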

Good luck with vectorization, and let me know how you like it and what you would like to add! 😊