Clustering – A Powerful Tool for Categorization

10.04.2025

I recently extended my document vectorization tool to include clustering capabilities. But what exactly is clustering, and why might you need it? Let's have our AI explain:

Clustering in a vector database is a way to automatically group similar vectors (for instance, vectors derived from documents or images) based on their similarity. This approach helps identify topics or themes in large amounts of data and can make searching more efficient, because a query only needs to be compared against the most relevant cluster rather than the entire dataset.
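To illustrate the idea of searching one cluster instead of the whole dataset, here is a minimal sketch in plain Python. It is not taken from the clusteriser tool: the function name routed_search, the toy 2-D vectors, and the chunk ids are all made up for illustration, and it assumes every cluster already has a centroid (a representative "average" vector).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def routed_search(query, centroids, clusters, top_k=3):
    """Pick the cluster whose centroid best matches the query, then rank only its members."""
    best = max(range(len(centroids)), key=lambda c: cosine_similarity(query, centroids[c]))
    ranked = sorted(
        clusters[best],  # list of (chunk_id, vector) pairs
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return best, [chunk_id for chunk_id, _ in ranked[:top_k]]


# Hypothetical toy data: two clusters with two chunks each.
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("budget-1", [0.9, 0.1]), ("budget-2", [0.8, 0.3])],
    [("space-1", [0.1, 0.9]), ("space-2", [0.2, 0.8])],
]
cluster_index, hits = routed_search([0.7, 0.2], centroids, clusters)
print(cluster_index, hits)
```

The trade-off is that a close match sitting in a neighbouring cluster is never looked at, so this kind of routing gives up some accuracy in exchange for comparing against fewer vectors.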

How does it work?

  1. Vectorization: First, the raw content (e.g., PDF text) is split into manageable chunks and converted into numerical embeddings via a chosen embedding model.

  2. Similarity Measurement: A measure such as cosine similarity determines how closely these embeddings resemble each other.

  3. Clustering Algorithm: An algorithm like K-means then groups the embeddings into a fixed number of clusters by assigning each one to its nearest cluster center. Each cluster is a set of documents or chunks that are more similar to each other than to the rest of the collection.
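The last two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the clusteriser tool: the 2-D "embeddings" are toy values, the helper names are made up, and a real setup would more likely rely on a library implementation of K-means. One detail worth knowing: K-means works with Euclidean distance, not cosine similarity. For unit-length vectors the squared Euclidean distance equals 2 - 2 * cosine similarity, so the sketch normalizes the embeddings first, which makes the two measures agree on which chunks are close.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Lloyd's k-means on unit-length vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, n_clusters)  # start from random data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D "embeddings": two chunks pointing one way, two pointing another.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, centroids = kmeans(embeddings, n_clusters=2)
print(labels)
```

The first two chunks end up sharing one label and the last two the other. Which group gets label 0 depends on the random starting centroids.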

It's a very handy approach if you're dealing with a large collection of diverse or difficult-to-read documents.

How I'm using clustering

Within my clusteriser tool, you can now set a few key parameters:

  • n_clusters (line 136): Determines the number of clusters to create.

  • max_tokens (line 88): Sets the maximum label length (in tokens) for each cluster name.

For more details, see the Clustering.md instructions.

Case study: Poorly readable legal text

As a test, I chose some difficult-to-read legal documents: PDF files from the Draft Budget section on the EUROPA website, roughly 27 MB of PDFs in total. I vectorized the documents with a chunk size of 500 tokens and a maximum overlap of 50 tokens.
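For readers unfamiliar with chunk size and overlap, the following sketch shows one common way to split a token sequence into overlapping windows. It is an assumption-laden illustration rather than the tool's actual splitter: chunk_tokens is a hypothetical helper, the tokens are simply numbered placeholders, and the overlap is treated as a fixed value instead of a maximum.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Placeholder "document" of 1200 tokens, numbered so the overlap is easy to see.
tokens = list(range(1200))
chunks = chunk_tokens(tokens, chunk_size=500, overlap=50)
print([len(chunk) for chunk in chunks])
```

With these settings each chunk repeats the final 50 tokens of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.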

Splitting into 5 clusters (max. label length = 4 words):

Cluster                         Count
Draft Budget 2025                 428
EU Budget Appropriations         1008
EU Budget Overview               1176
EU Regulations and Agencies       585
European Strategic Investments    758

Splitting into 25 clusters (max. label length = 6 words):

Cluster                                      Count
Budget Appropriations Overview                 217
Budget Appropriations Summary                  230
Budget Estimates and Revenue Summary           116
Draft Budget 2025 Summary                      176
EU Agricultural Regulations and Budget         157
EU Budget and Staff Expenditures               117
EU Budget and Staffing Overview                 83
EU Budget Draft Summary 2025                   298
EU Budget Preparatory Actions Summary          199
EU Draft Budget 2025 Summary                    85
EU Financial and Procurement Regulations       230
EU Health and Safety Regulations               145
EU Macro-Financial Assistance Decisions        135
EU Pilot Projects and Budget Appropriations    107
Euratom Nuclear Research and Safety             69
EU Regional Development Fund Appropriations    302
EU Regulations on Migration and Borders         67
European Climate Action Initiatives            141
European Ombudsman Draft Budget 2025           135
European Space Programme Budget Overview       110
European Union Budget and Programs             233

With this kind of automated clustering, you can speed up navigation in extensive, hard-to-read, and unfamiliar texts. To get a coarser or finer grouping, simply tweak the clustering parameters on your already-vectorized database.

Looking ahead

In upcoming blog posts, I'll share more on additional possibilities that vectorized data unlocks. If you'd like to know more, feel free to email me at jaro@nowapp.cz.