Near Duplicates

Near duplicates occur when there are only minor differences between multiple versions of a document, such as:

  • Files that use the same template (e.g. bills or order forms)
  • The same file changed over time (drafts and final versions)
  • The original and a forwarded version of the same email

This is an issue because, as humans, we group near duplicates together for investigation purposes, but simple machine deduplication techniques, which typically match only exact copies, do not.
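
To make this gap concrete, here is a minimal Python sketch. It is only an illustration, not a description of how any particular product detects near duplicates. It assumes that "simple deduplication" means comparing SHA-256 fingerprints of the full text, and it swaps in word shingles with Jaccard similarity as one common way to score near duplicates. The function names, the sample texts, and the shingle size of 3 are all hypothetical choices made for this example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Exact-match fingerprint: any changed byte yields a different digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample documents: a draft, its final version, and an unrelated text.
draft = "Please find attached the invoice for your order placed in March"
final = "Please find attached the final invoice for your order placed in March"
other = "The quarterly meeting has been moved to Thursday afternoon next week"

# Exact fingerprints differ, so hash-based deduplication keeps both versions.
print(fingerprint(draft) == fingerprint(final))

# Shingle similarity is high for the draft/final pair and zero for the unrelated text.
print(jaccard(draft, final))
print(jaccard(draft, other))
```

In a sketch like this, documents whose similarity exceeds a chosen threshold would be grouped as near duplicates. The right threshold and shingle size depend on the data and are assumptions to tune, not fixed values.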

If you are interested in this topic, you can read our blog article on Advanced Deduplication, Twins beyond Fingerprints.