Publication
VLDB 2005
Conference paper

Parameter free bursty events detection in text streams

Abstract

Text classification is a major data mining task. An advanced technique, known as partially supervised text classification, can build a text classifier from only a small set of positive examples. This raises the question of whether it is possible to find a set of features that describes the positive examples, so that users do not even need to specify them. As a first step, in this paper we formalize a new problem, called hot bursty events detection: detecting bursty events in a text stream, which is a sequence of chronologically ordered documents. Here, a bursty event is a set of bursty features and is considered a potential category for building a text classifier. The hot bursty events detection problem studied in this paper differs from TDT (topic detection and tracking), which attempts to cluster documents as events using clustering techniques. In other words, our focus is on detecting the set of bursty features that make up a bursty event. We propose a novel parameter-free probabilistic approach, called feature-pivot clustering. Our main technique is to fully utilize the time information to determine a set of bursty features, which may occur in different time windows. We detect bursty events based on the feature distributions. There is no need to tune or estimate any parameters. We conduct experiments on real-life data from a major English newspaper in Hong Kong and show that the parameter-free feature-pivot clustering approach can detect bursty events with a high success rate.
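
To make the notion of a bursty feature concrete, here is a minimal sketch in Python. It is not the paper's feature-pivot clustering algorithm; it is a simplified illustration under stated assumptions. Each document is taken to be a set of words, the stream is pre-split into time windows, and a feature counts as bursty in a window when its document count there is very unlikely under a binomial model fitted to the whole stream. The function names and the cutoff `alpha` are hypothetical. Note that `alpha` is a tunable parameter, which is exactly what the approach described in the abstract avoids.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bursty_windows(windows, feature, alpha=0.01):
    """Return indices of windows in which `feature` is unusually frequent.

    windows: list of time windows, each a list of documents (sets of words).
    alpha:   hypothetical cutoff for this sketch only; the paper's method
             is described as needing no such parameter.
    """
    total_docs = sum(len(w) for w in windows)
    if total_docs == 0:
        return []
    # Baseline rate: fraction of all documents that contain the feature.
    docs_with_feature = sum(feature in doc for w in windows for doc in w)
    p = docs_with_feature / total_docs

    result = []
    for i, window in enumerate(windows):
        k = sum(feature in doc for doc in window)
        # Flag the window if seeing k or more hits is very unlikely at rate p.
        if k > 0 and binomial_tail(len(window), k, p) < alpha:
            result.append(i)
    return result


def make_window(n_docs, n_hits, word):
    """Build a toy window: n_hits documents contain `word`, the rest do not."""
    return [{word, "news"} if i < n_hits else {"news"} for i in range(n_docs)]


# Toy stream: "typhoon" is rare except in the third window.
stream = [make_window(20, hits, "typhoon") for hits in (0, 1, 15, 1, 0)]
print(bursty_windows(stream, "typhoon"))  # prints [2]
```

Under the abstract's definition, a bursty event would then be a set of such features; grouping features whose bursts fall in the same windows would be one simple, assumed way to form candidate events.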
