Estimating accuracy for text classification tasks on large unlabeled data

Snigdha Chaturvedi; Tanveer A. Faruquie; L. Venkata Subramaniam; Mukesh K. Mohania

doi:10.1145/1871437.1871551

CIKM 2010

Workshop paper

01 Dec 2010

Abstract

Rule-based systems for processing text data encode the knowledge of a human expert in a rule base and make decisions based on the interaction between the input data and that rule base. Similarly, supervised learning systems learn patterns present in a given dataset and use them to make decisions on similar and related data. The performance of both classes of models depends largely on the training examples they have seen, on which the learning was based. Even though a trained model may fit its training data well, the accuracy it yields on new test data may be considerably different. Computing the accuracy of a learnt model on a new unlabeled dataset is a challenging problem: it requires costly labeling, which, given the large size of the datasets involved, is still likely to cover only a subset of the new data. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks, document classification and postal address segmentation, using both supervised machine learning methods and human-generated rule-based models. © 2010 ACM.

Conference paper