Document Clustering with Bursty Information

Authors

  • Apirak Hoonlor Faculty of ICT, Mahidol University, Bangkok
  • Bolesław K. Szymanski Rensselaer Polytechnic Institute, Troy, NY
  • Mohamed J. Zaki Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
  • Vineet Chaoji Yahoo! Labs, Bangalore

Keywords

Document clustering, bursty model, web mining

Abstract

Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure. We evaluated it with the UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well on various text collections, but also clustered news articles related to specific events much better than other models.

Published

2013-01-30

How to Cite

Hoonlor, A., Szymanski, B. K., Zaki, M. J., & Chaoji, V. (2013). Document Clustering with Bursty Information. Computing and Informatics, 31(6+), 1533–1555. Retrieved from http://147.213.75.17/ojs/index.php/cai/article/view/1330