Detecting Similarities in Vectorized Documents

11.04.2025

One particularly useful technique when working with vectorized documents is similarity detection: in simpler terms, identifying when an author may have copied or heavily borrowed from another source. Whether it's a literal copy/paste from another document or closely paraphrased content, semantic embeddings can help surface it.

How Similar is "Too Similar"?

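The degree of similarity between two text chunks is expressed as a single number. With the <#> operator used in the query below (in pgvector, the negative inner product), lower means more similar: for embeddings normalized to unit length, the closer the value is to -1, the more similar the texts are. What's great is that once the embeddings are stored in the database, this kind of search needs no AI model at all. A simple SQL SELECT will do the trick.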
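
To make the score concrete, here is a minimal Python sketch of what a negative inner product returns. It assumes the embeddings are normalized to unit length, which the post does not state explicitly; the helper names and the toy two-dimensional vectors are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def negative_inner_product(a, b):
    """Negative dot product of two equal-length vectors."""
    return -sum(x * y for x, y in zip(a, b))


# Toy vectors (hypothetical, not real embeddings).
base = normalize([3.0, 4.0])
orthogonal = normalize([-4.0, 3.0])
opposite = normalize([-3.0, -4.0])

score_same = negative_inner_product(base, base)              # close to -1: same direction
score_orthogonal = negative_inner_product(base, orthogonal)  # close to 0: unrelated
score_opposite = negative_inner_product(base, opposite)      # close to 1: opposite direction
```

For unit-length vectors the inner product equals the cosine similarity, so this score is simply the cosine similarity with its sign flipped.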

Here's an example query:

SELECT a.cluster AS cluster_a,
    a.file AS file_a,
    a.page AS page_a,
    a.position AS position_a,
    b.cluster AS cluster_b,
    b.file AS file_b,
    b.page AS page_b,
    b.position AS position_b,
    a.embedding <#> b.embedding AS similarity,
    a.text_chunk AS text_chunk_a,
    b.text_chunk AS text_chunk_b
FROM embeddings a, embeddings b
WHERE a.id < b.id
ORDER BY similarity ASC
LIMIT 20;

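This query surfaces the 20 most similar pairs of text chunks in your dataset. The condition a.id < b.id ensures that each pair is compared only once and that no chunk is compared with itself; sorting in ascending order puts the scores closest to -1 first.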
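
The same logic can be sketched in plain Python, which may help when checking the query on a small sample. This is an illustration under assumptions, not the database's implementation: the rows are hypothetical (id, embedding) tuples, and most_similar_pairs is a made-up helper name.

```python
from itertools import combinations


def negative_inner_product(a, b):
    """Negative dot product of two equal-length vectors."""
    return -sum(x * y for x, y in zip(a, b))


def most_similar_pairs(rows, limit=20):
    """Return up to `limit` tuples (score, id_a, id_b), lowest score first.

    rows: list of (id, embedding) tuples with unique ids.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    # combinations() yields each unordered pair once, like a.id < b.id.
    scored = [
        (negative_inner_product(emb_a, emb_b), id_a, id_b)
        for (id_a, emb_a), (id_b, emb_b) in combinations(ordered, 2)
    ]
    scored.sort(key=lambda item: item[0])  # ORDER BY similarity ASC
    return scored[:limit]                  # LIMIT


# Hypothetical unit-length toy embeddings.
rows = [
    (1, [1.0, 0.0]),
    (2, [0.0, 1.0]),
    (3, [1.0, 0.0]),  # identical to id 1
    (4, [0.6, 0.8]),
]
top = most_similar_pairs(rows, limit=2)
```

Note that comparing every chunk with every other chunk means n(n-1)/2 comparisons for n chunks, so the cost grows quadratically with the size of the table.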

Here is the result of running this SELECT on the documents mentioned in the previous blog post (the two text_chunk columns are omitted):

cluster_a | file_a | page_a | position_a | cluster_b | file_b | page_b | position_b | similarity
EU Budget Appropriations | SEC03.pdf | 831 | 1 | EU Budget Appropriations | SEC03.pdf | 825 | 1 | -0.9989223480224609
Draft Budget 2025 | SEC03.pdf | 1585 | 1 | Draft Budget 2025 | SEC03.pdf | 1431 | 1 | -0.992438793182373
Draft Budget 2025 | SEC03.pdf | 1587 | 1 | Draft Budget 2025 | SEC03.pdf | 1433 | 1 | -0.9922237992286682
EU Regulations and Agencies | SEC03.pdf | 231 | 4 | EU Regulations and Agencies | SEC03.pdf | 241 | 3 | -0.9897857904434204
EU Budget Appropriations | SEC03.pdf | 1041 | 2 | EU Budget Appropriations | SEC03.pdf | 1017 | 2 | -0.9887492656707764
Draft Budget 2025 | SEC03.pdf | 1589 | 1 | Draft Budget 2025 | SEC03.pdf | 1435 | 1 | -0.9879744648933411
EU Budget Overview | SEC00.pdf | 3 | 1 | EU Budget Overview | GenExp.pdf | 3 | 1 | -0.9874448776245117
Draft Budget 2025 | SEC09.pdf | 7 | 1 | Draft Budget 2025 | SEC08.pdf | 7 | 1 | -0.986766517162323
EU Budget Overview | SEC03.pdf | 639 | 1 | EU Budget Overview | SEC03.pdf | 297 | 1 | -0.9862082004547119
Draft Budget 2025 | SEC03.pdf | 1439 | 1 | Draft Budget 2025 | SEC03.pdf | 1593 | 1 | -0.9858852624893188
EU Budget Appropriations | SEC02.pdf | 67 | 1 | EU Budget Appropriations | SEC10.pdf | 65 | 1 | -0.985404908657074
Draft Budget 2025 | SEC07.pdf | 9 | 1 | Draft Budget 2025 | SEC01.pdf | 11 | 1 | -0.984980583190918
Draft Budget 2025 | SEC02.pdf | 9 | 2 | Draft Budget 2025 | SEC10.pdf | 9 | 2 | -0.9844917058944702
EU Budget Appropriations | SEC10.pdf | 9 | 1 | EU Budget Appropriations | SEC07.pdf | 11 | 1 | -0.9843403100967407
EU Budget Appropriations | SEC03.pdf | 1091 | 2 | EU Budget Appropriations | SEC03.pdf | 1041 | 2 | -0.9831921458244324
Draft Budget 2025 | SEC03.pdf | 1445 | 1 | Draft Budget 2025 | SEC03.pdf | 1599 | 1 | -0.9828660488128662
European Strategic Investments | SEC03.pdf | 917 | 3 | European Strategic Investments | SEC03.pdf | 915 | 3 | -0.9826010465621948
Draft Budget 2025 | SEC02.pdf | 11 | 1 | Draft Budget 2025 | SEC10.pdf | 11 | 1 | -0.9824360609054565
EU Regulations and Agencies | SEC03.pdf | 479 | 3 | EU Regulations and Agencies | LR01.pdf | 115 | 6 | -0.9824023842811584
EU Budget Overview | SEC02.pdf | 1 | 1 | EU Budget Overview | SEC01.pdf | 1 | 1 | -0.9817304611206055

When Does Similarity Matter?

The importance of these findings depends on the nature of your documents. For example:

  • In financial reports or structured tables, a high number of similar passages might be perfectly normal.

  • In academic papers or theses, it could be a red flag for plagiarism.

Zooming In: Similarity Within Clusters

You can refine your approach further by restricting similarity checks to within the same cluster. Here's how to count how many high-similarity text pairs exist per cluster:

SELECT
    a.cluster,
    COUNT(*) AS count_above_threshold
FROM embeddings a
JOIN embeddings b ON a.id < b.id AND a.cluster = b.cluster
WHERE a.embedding <#> b.embedding < -0.98 -- set your threshold
GROUP BY a.cluster
ORDER BY count_above_threshold DESC;

Here is the result of running this SELECT on the documents mentioned in the previous blog post:

cluster | count_above_threshold
Draft Budget 2025 | 12
EU Budget Appropriations | 7
EU Budget Overview | 3
EU Regulations and Agencies | 2
European Strategic Investments | 1

This helps you identify clusters with unusually high internal similarity—potentially indicating redundancy or heavy reuse of content.
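
For readers who want to sanity-check the per-cluster count outside the database, the grouping can be sketched in Python as follows. The rows, the cluster labels "A" and "B", and the helper name are hypothetical; the -0.98 threshold is the one used in the query above.

```python
from collections import Counter
from itertools import combinations


def negative_inner_product(a, b):
    """Negative dot product of two equal-length vectors."""
    return -sum(x * y for x, y in zip(a, b))


def count_similar_pairs_per_cluster(rows, threshold=-0.98):
    """Count pairs scoring below `threshold`, grouped by cluster.

    rows: list of (id, cluster, embedding) tuples with unique ids.
    Returns (cluster, count) tuples, highest count first.
    """
    counts = Counter()
    # Each unordered pair is visited once, like a.id < b.id in the query.
    for (_, cluster_a, emb_a), (_, cluster_b, emb_b) in combinations(rows, 2):
        if cluster_a != cluster_b:
            continue  # only compare chunks within the same cluster
        if negative_inner_product(emb_a, emb_b) < threshold:
            counts[cluster_a] += 1
    return counts.most_common()


# Hypothetical unit-length toy embeddings in two clusters.
rows = [
    (1, "A", [1.0, 0.0]),
    (2, "A", [1.0, 0.0]),  # identical to id 1
    (3, "A", [0.0, 1.0]),
    (4, "B", [1.0, 0.0]),
    (5, "B", [0.6, 0.8]),
]
result = count_similar_pairs_per_cluster(rows)
```

As with the SQL version, clusters with no pair below the threshold do not appear in the result at all.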

No AI Needed

These examples highlight how you can work with vectorized text data without invoking AI models for every query. It's a fast, transparent, and surprisingly powerful approach.

If you have ideas, use cases, or questions, I'd love to hear from you.

Happy text mining! 🧠📊
