Parallel Near-Duplicate Document Detection Using General-Purpose GPU

Dimitar Peshevski; Vladimir Zdraveski; Sashko Ristov

doi:10.31577/cai_2024_3_583

Authors

Dimitar Peshevski, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020, Skopje, North Macedonia
Vladimir Zdraveski, Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Rugjer Boshkovikj 16, 1020, Skopje, North Macedonia
Sashko Ristov, Department of Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020, Innsbruck, Austria

DOI:

https://doi.org/10.31577/cai_2024_3_583

Keywords:

Near-duplicate, document, Shingling, similarity, locality-sensitive hashing, MinHash, fingerprint, parallelism, GPU, CUDA

Abstract

In today's data-rich world, one of the most significant challenges is efficiently identifying near-duplicate data, especially when integrating data from various sources. Near-duplicate document detection applies to any kind of content and is widely used to improve the efficiency of search engines and to identify plagiarism or spam. Even smaller or specialized search engines can benefit from knowing which documents are near-duplicates. Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, few attempts have been made to improve the performance of this locality-sensitive hashing technique. In this paper, we propose a parallel implementation of the MinHash algorithm for near-duplicate document detection that exploits the massive parallelism offered by general-purpose GPUs. Experimental results show that the GPU-based parallel solution is far more cost-effective than the CPU-based sequential and parallel solutions.
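For readers unfamiliar with the two techniques named above, the following is a minimal sequential sketch in Python. It is not the authors' GPU implementation. The function names, the shingle length `k=5`, the 128 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions made for illustration. Each document is split into overlapping character shingles, each of the hash functions keeps the minimum hash value over the shingle set, and the fraction of matching signature positions estimates the Jaccard similarity of the two documents.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-character shingles of a normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; the seed picks the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=128):
    """For each hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [
        min(seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "The quick brown fox jumps over the lazy dog near the river bank."
    doc_b = "The quick brown fox jumped over the lazy dog near the river bank."
    sh_a, sh_b = shingles(doc_a), shingles(doc_b)
    estimate = estimate_jaccard(minhash_signature(sh_a), minhash_signature(sh_b))
    print(f"estimated: {estimate:.3f}  exact: {exact_jaccard(sh_a, sh_b):.3f}")
```

In this sketch the minimum for each hash function is computed independently of the others, and each document's signature is independent of every other document's. That independence is the kind of structure a data-parallel implementation can exploit, although how the work is actually mapped onto GPU threads is described in the paper itself, not here.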

