Near Duplicates
Near duplicates occur when there are only minor differences between multiple versions of a document, such as:
- Files that use the same template (e.g. bills or order forms)
- The same file changed over time (drafts and final versions)
- Original and a forwarded version of the same email
This is an issue because as humans we group near-duplicates together for investigation purposes, but simple machine deduplication techniques will not.