Clustering – A Powerful Tool for Categorization

I recently extended my document vectorization tool to include clustering capabilities. But what exactly is clustering, and why might you need it? Let's have our AI explain:
Clustering in a vector database is a way to automatically group similar vectors (for instance, vectors derived from documents or images) based on their similarity. This approach helps identify topics or themes in large amounts of data and can make searching more efficient, because a query can focus on the relevant cluster rather than the entire dataset.
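To make the "search only the relevant cluster" idea concrete, here is a minimal sketch in plain Python. It is not taken from the tool: the function names, the toy 2-D vectors, and the chunk IDs are all hypothetical, and real embeddings would have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_in_nearest_cluster(query, centroids, clusters, top_k=2):
    """Route the query to the closest centroid, then rank only that cluster.

    centroids: one centroid vector per cluster
    clusters:  lists of (chunk_id, vector) pairs, aligned with centroids
    """
    best = max(range(len(centroids)), key=lambda i: cosine(query, centroids[i]))
    ranked = sorted(clusters[best], key=lambda item: cosine(query, item[1]), reverse=True)
    return best, [chunk_id for chunk_id, _ in ranked[:top_k]]

# Toy data, purely illustrative (hypothetical IDs and vectors)
centroids = [[1.0, 0.1], [0.1, 1.0]]
clusters = [
    [("budget-1", [0.9, 0.2]), ("budget-2", [1.0, 0.0])],
    [("staff-1", [0.2, 0.9]), ("staff-2", [0.0, 1.0])],
]
cluster_index, hits = search_in_nearest_cluster([0.1, 0.8], centroids, clusters)
print(cluster_index, hits)
```

The trade-off is that this kind of routing is approximate: a close match that happens to sit in a neighbouring cluster is never compared against the query.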
How does it work?
- Vectorization: First, the raw content (e.g., PDF text) is split into manageable chunks and converted into numerical embeddings via a chosen embedding model.
- Similarity Measurement: A measure such as cosine similarity determines how closely these embeddings resemble each other.
- Clustering Algorithm: An algorithm like K-means then groups similar embeddings into clusters. Each cluster represents a set of documents or chunks that resemble each other as much as possible.
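The last two steps can be sketched in a few lines of plain Python. This is an illustrative sketch of standard K-means (Lloyd's algorithm), not the tool's actual implementation; the function names and the toy vectors are assumptions. The vectors are first scaled to unit length, because on unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, so the nearest centroid by distance is also the most similar one by cosine.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain K-means on unit-normalized vectors (illustrative only)."""
    points = [normalize(v) for v in vectors]
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)  # start from random data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Four toy "embeddings": two point mostly along x, two mostly along y.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, centroids = kmeans(embeddings, n_clusters=2)
print(labels)
```

For real workloads a library implementation (for example scikit-learn's `KMeans`) would be the more practical choice; the loop above is only meant to show the two alternating steps.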
It's a very handy approach if you're dealing with a large collection of diverse or difficult-to-read documents.
How I'm using clustering
Within my clusteriser tool, you can now set a few key parameters:
- n_clusters (line 136): Determines the number of clusters to create.
- max_tokens (line 88): Sets the maximum label length (in tokens) for each cluster name.
For more details, see the Clustering.md instructions.
Case study: Poorly readable legal text
As a test, I chose some difficult-to-read legal documents: PDF files from the Draft Budget section of the EUROPA website. The data amounted to roughly 27 MB of PDFs. I vectorized the documents with a chunk size of 500 tokens and a maximum overlap of 50 tokens.
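For readers unfamiliar with chunking, the sketch below shows what a 500-token window with a 50-token overlap looks like. It is a simplified illustration, not the tool's code: `chunk_tokens` is a hypothetical helper, the overlap is treated as fixed rather than as a maximum, and synthetic tokens stand in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

# Synthetic tokens stand in for the output of a real tokenizer (an assumption).
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.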
Splitting into 5 clusters (max. label length = 4 words):
Cluster | Count |
---|---|
Draft Budget 2025 | 428 |
EU Budget Appropriations | 1008 |
EU Budget Overview | 1176 |
EU Regulations and Agencies | 585 |
European Strategic Investments | 758 |
Splitting into 25 clusters (max. label length = 6 words):
Cluster | Count |
---|---|
Budget Appropriations Overview | 217 |
Budget Appropriations Summary | 230 |
Budget Estimates and Revenue Summary | 116 |
Draft Budget 2025 Summary | 176 |
EU Agricultural Regulations and Budget | 157 |
EU Budget and Staff Expenditures | 117 |
EU Budget and Staffing Overview | 83 |
EU Budget Draft Summary 2025 | 298 |
EU Budget Preparatory Actions Summary | 199 |
EU Draft Budget 2025 Summary | 85 |
EU Financial and Procurement Regulations | 230 |
EU Health and Safety Regulations | 145 |
EU Macro-Financial Assistance Decisions | 135 |
EU Pilot Projects and Budget Appropriations | 107 |
Euratom Nuclear Research and Safety | 69 |
EU Regional Development Fund Appropriations | 302 |
EU Regulations on Migration and Borders | 67 |
European Climate Action Initiatives | 141 |
European Ombudsman Draft Budget 2025 | 135 |
European Space Programme Budget Overview | 110 |
European Union Budget and Programs | 233 |
With this kind of automated clustering, you can navigate extensive, hard-to-read, and unfamiliar texts more quickly: simply tweak the clustering parameters on your already-vectorized database.
Looking ahead
In upcoming blog posts, I'll share more on additional possibilities that vectorized data unlocks. If you'd like to know more, feel free to email me at jaro@nowapp.cz.