Tool for Embedding Files into Your Own Database

07.04.2025

One of the most common uses of AI in companies is semantic search over their own documents. At this URL, I present a tool that converts a set of documents in various formats into a PostgreSQL vector database.

Why Use OpenAI Tools for Creating Vectors?

The reason is simple: in my experience, these tools handle non-English text most effectively.

Why PostgreSQL Vector Database?

There are many vector databases (Redis, Milvus, Weaviate, etc.), and in some cases one of them may be more suitable, for example when performance or very large data volumes matter most. See this comparison.

With this tool, you can vectorize a series of files into a single table and then run semantic search, clustering, etc., over that table using plain SQL queries. Installation and usage instructions are in the README file of this project.
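To give an idea of what a semantic search over such a table can look like, here is a minimal sketch that builds a pgvector nearest-neighbour query. The table and column names (documents, chunk_text, embedding) are assumptions for illustration only; the real schema is described in the project's README. The `<=>` operator is pgvector's cosine distance.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(table="documents", text_column="chunk_text",
                       vector_column="embedding", limit=5):
    """Build a cosine-distance search query with one %s placeholder.

    All identifiers here are hypothetical defaults, not the tool's actual schema.
    The placeholder receives the query embedding as a vector literal.
    """
    return (
        f"SELECT {text_column}, {vector_column} <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT {int(limit)}"
    )


query = build_search_query()
param = to_vector_literal([0.1, 2, -3.5])
```

With a driver such as psycopg, this would be run roughly as cursor.execute(query, (param,)), where the query embedding comes from the same model that was used to embed the documents.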

Notes:

  • For initial tests, use the text-embedding-3-small model; it is cheaper than the larger embedding models.

  • If you have a lot of data, first calculate how much the conversion will cost by using the -t switch.

  • The ivfflat algorithm is used for creating indexes, but there are other options (such as HNSW); see the pgvector documentation.
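The cost estimate mentioned above comes down to simple arithmetic: number of tokens times the price per token. The sketch below is not what the -t switch does internally; it is a rough back-of-the-envelope version. The 4-characters-per-token ratio is an assumption that fits English text loosely and can be far off for other languages, and the price is a placeholder you would replace with the current figure from the provider's price list.

```python
import math


def rough_token_count(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed average."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(total_tokens, price_per_million_tokens):
    """Cost in the same currency as the price, for a given number of tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical example: 2 million tokens at a placeholder price of 0.5 per million.
example_cost = estimate_cost(2_000_000, 0.5)
```

For real numbers, rely on the tool's own -t output, since it sees the actual files and the actual tokenizer.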

Recommended Chunk Size and Overlap for Different Use Cases:

  • 200–500 tokens for regular document search (overlap 20–40%)

  • 500–1000 tokens for summarization (overlap 10–20%)

  • 50–100 tokens for detailed semantic search (overlap 0%)

  • 200–500 tokens for code search (overlap 0–10%)

  • 1000–1500 tokens for coarse summarization of large datasets (overlap 10–20%)
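To make the relationship between chunk size and overlap concrete, here is a minimal sketch of sliding-window chunking. It works on any list of tokens; how the tool itself tokenizes and splits is not shown here, and the function name and parameters are my own illustration.

```python
def chunk_with_overlap(tokens, chunk_size, overlap_ratio):
    """Split tokens into chunks of chunk_size, each sharing
    int(chunk_size * overlap_ratio) tokens with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window moves each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Ten tokens, chunks of 4 with 50% overlap: each chunk repeats 2 tokens.
example = chunk_with_overlap(list(range(10)), 4, 0.5)
```

For regular document search with, say, 400-token chunks and 25% overlap, each chunk would start 300 tokens after the previous one.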

Additional Tips:

  • For large amounts of data, embedding can take a long time, up to tens of hours.

  • If you have a large amount of data, expect the vector database to be significantly larger than the size of the text data itself.

  • All documents are converted to PDF first so that the approximate location of a search hit can be identified (at the page level).

  • The search tool is available in the vector_get project.
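Why the database ends up larger than the source text can be seen with a little arithmetic. The sketch below assumes 1536-dimensional vectors (the default for text-embedding-3-small) and the storage size given in the pgvector documentation, 4 bytes per dimension plus 8 bytes per vector. The chunk size and the bytes-per-token figure are illustrative assumptions, and index and row overhead are ignored.

```python
def vector_bytes(dimensions):
    """Storage for one pgvector 'vector' value: 4 bytes per dimension + 8."""
    return 4 * dimensions + 8


def size_ratio(dimensions, tokens_per_chunk, bytes_per_token=4.0):
    """Ratio of vector size to the text size of one chunk.

    bytes_per_token is an assumed average, not a measured value.
    """
    text_bytes = tokens_per_chunk * bytes_per_token
    return vector_bytes(dimensions) / text_bytes


# One 1536-dimension vector versus a 300-token chunk of text.
ratio = size_ratio(1536, 300)
```

Under these assumptions, each vector is several times larger than the text chunk it represents, and smaller chunks make the ratio worse.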

Good luck with vectorization, and let me know how you like it and what you would like to add! 😊