Document Clustering with Bursty Information

Apirak Hoonlor; Bolesław K. Szymanski; Mohamed J. Zaki; Vineet Chaoji

Authors

Apirak Hoonlor Faculty of ICT, Mahidol University, Bangkok
Bolesław K. Szymanski Rensselaer Polytechnic Institute, Troy, NY
Mohamed J. Zaki Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
Vineet Chaoji Yahoo! Labs, Bangalore

Keywords:

Document clustering, bursty model, web mining

Abstract

Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure, which we evaluated with the UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well on various text collections, but also clustered news articles related to specific events much better than other models.

