Parallel Near-Duplicate Document Detection Using General-Purpose GPU

Authors

  • Dimitar Peshevski, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020, Skopje, North Macedonia
  • Vladimir Zdraveski, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020, Skopje, North Macedonia
  • Sashko Ristov, Department of Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020, Innsbruck, Austria

DOI:

https://doi.org/10.31577/cai_2024_3_583

Keywords:

Near-duplicate, document, shingling, similarity, locality-sensitive hashing, MinHash, fingerprint, parallelism, GPU, CUDA

Abstract

In today's data-rich world, one of the most significant challenges is efficiently identifying near-duplicate data, especially when integrating data from various sources. Near-duplicate detection applies to any kind of content and is widely used to improve the efficiency of search engines and to identify plagiarism or spam. Even smaller or specialized search engines can benefit from knowing which documents are near-duplicates. Shingling and MinHash are two state-of-the-art techniques for detecting near-duplicate documents. However, few attempts have been made to improve the performance of this locality-sensitive hashing technique. In this paper, we propose a parallel implementation of the MinHash algorithm for near-duplicate document detection that exploits the massive parallelism offered by general-purpose GPUs. Experimental results show that the GPU-based parallel solution is far more cost-effective than both the sequential and the parallel CPU-based solutions.
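
For readers unfamiliar with the two techniques named in the abstract, the following is a minimal sequential sketch of shingling followed by MinHash. It is not the authors' GPU implementation; it only illustrates the general idea on the CPU. All function names, the word-level shingle size `k`, the number of hash functions, and the use of seeded BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a document shorter than k words yields one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(shingle, seed):
    """Hypothetical hash family: 64-bit BLAKE2b of the seed plus the shingle."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each hash function, keep the minimum hash value over all shingles."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty shingle set")
    return [
        min(seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sh_a, sh_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(sh_a), minhash_signature(sh_b)
    print("exact Jaccard:    ", jaccard(sh_a, sh_b))
    print("MinHash estimate: ", estimate_jaccard(sig_a, sig_b))
```

In this sketch the minimum for each hash function is computed independently of the others, which is one reason MinHash is a natural candidate for parallel execution; how the paper actually maps the work onto GPU threads is not stated in the abstract.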

Published

2024-06-24

How to Cite

Peshevski, D., Zdraveski, V., & Ristov, S. (2024). Parallel Near-Duplicate Document Detection Using General-Purpose GPU. Computing and Informatics, 43(3), 583–610. https://doi.org/10.31577/cai_2024_3_583