Qing Wang, Tao Li, et al.
SDM 2018
Although considerable research has been conducted on hierarchical text categorization, little has been done on automatically collecting labeled corpora for building hierarchical taxonomies. In this paper, we propose an automatic method for collecting training samples to build hierarchical taxonomies. In our method, each category node is initially defined by a few keywords; a web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of these seed documents. We also design a method to check the consistency of the collected corpus. The above steps produce a flat category structure that contains all the categories for building the hierarchical taxonomy. Next, a linear discriminant projection approach is used to construct more meaningful intermediate levels of the hierarchy from the generated flat set of categories. Experimental results show that the training corpus is good enough for statistical classification methods. © 2007 Springer Science+Business Media, LLC.
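The corpus-enlargement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes a simple centroid-based tracker over bag-of-words vectors, with cosine similarity and a fixed acceptance threshold. The names `vectorize`, `cosine`, `centroid`, and `enlarge_corpus`, the keyword `boost` factor, and the `threshold` value are all hypothetical choices made for illustration, and the keyword boost is only a rough stand-in for keyword-based content normalization.

```python
import math
from collections import Counter


def vectorize(text, keywords, boost=2.0):
    """Term counts for one document; category keywords get extra weight (assumed)."""
    counts = Counter(text.lower().split())
    return {t: c * (boost if t in keywords else 1.0) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def enlarge_corpus(keywords, seeds, candidates, threshold=0.3):
    """Grow a category's corpus from seed documents (illustrative only).

    A candidate is accepted when it is close enough to the centroid of the
    documents accepted so far; the centroid is recomputed after each accept.
    """
    if not seeds:
        raise ValueError("at least one seed document is required")
    keywords = {k.lower() for k in keywords}
    accepted = list(seeds)
    vectors = [vectorize(doc, keywords) for doc in seeds]
    for doc in candidates:
        vec = vectorize(doc, keywords)
        if cosine(vec, centroid(vectors)) >= threshold:
            accepted.append(doc)
            vectors.append(vec)
    return accepted


seeds = ["python programming language tutorial", "learn python programming code"]
candidates = ["python code tutorial for beginners", "best chocolate cake recipe"]
corpus = enlarge_corpus(["python", "programming"], seeds, candidates)
```

In this sketch the seed documents stand in for the small labeled set returned by a web search engine, and the threshold plays the role of a crude consistency filter; the paper's own consistency check and the linear discriminant projection step are not modeled here.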
Wei Peng, Chang-Shing Perng, et al.
KDD 2007
Hong Lei Guo, Li Zhang, et al.
EMNLP 2006
Jia Chen, Qin Jin, et al.
ACM TOIS