Deep-learning algorithms are used extensively in question-answering systems built on natural language classifiers, which assign an incoming user question to one of a set of classes, each grouping questions that share the same answer. We treat the natural language classifier as a black box and study its performance with respect to the ground truth used to train and test the system. We have observed that maintaining ground truth is challenging; for example, 1) the number of answer classes can be large (several hundred), 2) manual mapping of questions to answers can be inconsistent, leading to overlap and confusion among classes, and 3) users ask questions within a context that is not apparent from the question alone, leading to erroneous mappings. We propose a methodology for guided evolution of the ground truth, from its initial creation to its ongoing maintenance in a deployed production environment. We measure performance using two metrics: accuracy and confidence. Accuracy is the proportion of classifications judged correct in an assessment, while confidence is a raw score output by the classifier that correlates with accuracy. Confidence can further be used to manage the accuracy of the system as perceived by users, by trading off accuracy against coverage.
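
The trade-off between accuracy and coverage can be illustrated with a minimal sketch. It assumes, hypothetically, that the system answers a question only when the classifier's confidence meets a chosen threshold, and that coverage is the fraction of questions answered under that rule. The function name, the threshold rule, and the sample data are illustrative assumptions, not taken from the system described here.

```python
from typing import List, Tuple


def accuracy_and_coverage(
    results: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (accuracy, coverage) at a given confidence threshold.

    `results` holds one (confidence, is_correct) pair per assessed question.
    Assumption: a question counts as answered only if confidence >= threshold.
    """
    if not results:
        return 0.0, 0.0
    # Correctness flags of the questions the system would actually answer.
    answered = [correct for confidence, correct in results if confidence >= threshold]
    coverage = len(answered) / len(results)
    # Accuracy is computed over answered questions only.
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return accuracy, coverage


# Made-up assessment data, purely for illustration.
sample = [
    (0.95, True), (0.90, True), (0.70, True), (0.60, False),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]

for t in (0.0, 0.5, 0.8):
    acc, cov = accuracy_and_coverage(sample, t)
    print(f"threshold={t:.1f} accuracy={acc:.2f} coverage={cov:.2f}")
```

On this made-up data, raising the threshold increases accuracy among answered questions while reducing the share of questions answered, which is the trade-off a confidence threshold exposes.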