Clustering – A Powerful Tool for Categorization

I recently extended my document vectorization tool to include clustering capabilities. But what exactly is clustering, and why might you need it? Let's have our AI explain:

Clustering in a vector database is a way to automatically group similar vectors (for instance, vectors derived from documents or images) based on their similarity. This approach helps identify topics or themes in large amounts of data and can make searching more efficient, because a query can focus on the relevant cluster rather than the entire dataset.
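To make the "focus on the relevant cluster" idea concrete, here is a minimal sketch in plain Python. It is not taken from my tool: the function names, the cluster names, and the 2-D toy vectors are all made up for illustration, and it assumes each cluster is summarized by a centroid vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_in_nearest_cluster(query, centroids, clusters, top_k=2):
    """Pick the cluster whose centroid is most similar to the query,
    then rank only that cluster's vectors (hypothetical helper)."""
    best = max(centroids, key=lambda name: cosine(query, centroids[name]))
    ranked = sorted(clusters[best], key=lambda item: cosine(query, item[1]), reverse=True)
    return best, [doc_id for doc_id, _ in ranked[:top_k]]

# Toy 2-D "embeddings"; real ones have hundreds of dimensions.
centroids = {"budget": [1.0, 0.1], "climate": [0.1, 1.0]}
clusters = {
    "budget": [("doc-a", [0.9, 0.2]), ("doc-b", [1.0, 0.0])],
    "climate": [("doc-c", [0.2, 0.9]), ("doc-d", [0.0, 1.0])],
}
cluster_name, hits = search_in_nearest_cluster([0.05, 0.95], centroids, clusters)
print(cluster_name, hits)
```

The trade-off is that a query routed to a single cluster can miss relevant chunks that ended up in a neighboring cluster, so in practice it may be worth searching the two or three closest clusters.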

How does it work?

Vectorization: First, the raw content (e.g., PDF text) is split into manageable chunks and converted into numerical embeddings via a chosen embedding model.
Similarity Measurement: A measure such as cosine similarity determines how closely these embeddings resemble each other.
Clustering Algorithm: An algorithm like K-means then groups similar embeddings into clusters. Each cluster represents a set of documents or chunks that resemble each other as much as possible.
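The last two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the code in my tool: the embedding step is skipped and replaced by hand-written 2-D toy vectors, and the function names are made up. One detail worth knowing is that K-means minimizes squared Euclidean distance, not cosine similarity directly. If the vectors are first scaled to unit length, the two agree on which vector is closer, which is why the sketch normalizes first.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length so Euclidean distance tracks cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)  # random initial centroids
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-ins for chunk embeddings: two obvious groups.
raw = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0], [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
embeddings = [normalize(v) for v in raw]
labels = kmeans(embeddings, n_clusters=2)
print(labels)
```

The first three vectors end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization, so the numbering itself carries no meaning.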

It's a very handy approach if you're dealing with a large collection of diverse or difficult-to-read documents.

How I'm using clustering

Within my clusteriser tool, you can now set a few key parameters:

n_clusters (line 136): Determines the number of clusters to create.
max_tokens (line 88): Sets the maximum label length (in tokens) for each cluster name.

For more details, see the Clustering.md instructions.

Case study: Poorly readable legal text

As a test, I chose some difficult-to-read legal documents: PDF files from the Draft Budget section on the EUROPA website. The data amounted to roughly 27 MB of PDFs. I vectorized the documents with a chunk size of 500 tokens and a maximum overlap of 50 tokens.
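For readers unfamiliar with chunking, here is a rough sketch of what a chunk size of 500 with an overlap of 50 means. It assumes a simple sliding window with a fixed overlap and uses placeholder strings instead of a real tokenizer; the function name chunk_tokens is made up, and the chunker in my tool may behave differently at the edges.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Placeholder tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.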

Splitting into 5 clusters (max. label length = 4 words):

Cluster	Count
Draft Budget 2025	428
EU Budget Appropriations	1008
EU Budget Overview	1176
EU Regulations and Agencies	585
European Strategic Investments	758
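A table like the one above boils down to counting how many chunks received each cluster label. A minimal sketch, with made-up labels and cluster names rather than the real data:

```python
from collections import Counter

def cluster_counts(labels, names):
    """Count chunks per cluster and return (name, count) rows sorted by name."""
    counts = Counter(labels)
    return sorted((names[idx], n) for idx, n in counts.items())

# Hypothetical cluster index per chunk, plus a hypothetical name per cluster.
labels = [0, 1, 1, 2, 1, 0]
names = {0: "Draft Budget", 1: "Appropriations", 2: "Regulations"}
for name, count in cluster_counts(labels, names):
    print(f"{name}\t{count}")
```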

Splitting into 25 clusters (max. label length = 6 words):

Cluster	Count
Budget Appropriations Overview	217
Budget Appropriations Summary	230
Budget Estimates and Revenue Summary	116
Draft Budget 2025 Summary	176
EU Agricultural Regulations and Budget	157
EU Budget and Staff Expenditures	117
EU Budget and Staffing Overview	83
EU Budget Draft Summary 2025	298
EU Budget Preparatory Actions Summary	199
EU Draft Budget 2025 Summary	85
EU Financial and Procurement Regulations	230
EU Health and Safety Regulations	145
EU Macro-Financial Assistance Decisions	135
EU Pilot Projects and Budget Appropriations	107
Euratom Nuclear Research and Safety	69
EU Regional Development Fund Appropriations	302
EU Regulations on Migration and Borders	67
European Climate Action Initiatives	141
European Ombudsman Draft Budget 2025	135
European Space Programme Budget Overview	110
European Union Budget and Programs	233

With this kind of automated clustering, you can speed up navigation in extensive, hard-to-read, and unfamiliar texts. Because the documents are already vectorized, trying a different grouping only requires tweaking the clustering parameters on the existing database.

Looking ahead

In upcoming blog posts, I'll share more on additional possibilities that vectorized data unlocks. If you'd like to know more, feel free to email me at jaro@nowapp.cz.