Publication
SOLI 2012
Conference paper
Managing data quality by identifying the noisiest data samples
Abstract
Enterprise datasets are often noisy. Several columns can have non-standard, erroneous, or missing information. Poor-quality data can lead to incorrect reporting and wrong conclusions being drawn. Data cleansing involves standardizing such data to improve its quality. Data cleansing tasks often involve writing rules manually: this step requires understanding the data quality issues and then writing data transformation rules to correct them, which is a human-intensive task. In this study we propose a method to identify noisy subsets of huge unlabelled textual datasets. This is a two-step process, where in the first step we develop an estimation tool to predict the data quality of an unlabelled text dataset as produced by a segmentation model. The accuracy of the proposed method is shown on a real-life dataset. © 2012 IEEE.
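The abstract's core idea of ranking unlabelled records by an estimated quality score and surfacing the worst subset can be sketched as follows. This is a minimal illustration, not the paper's actual estimator: the `toy_score` heuristic (penalizing missing fields and digits embedded in names) is a hypothetical stand-in for the segmentation-model-based quality predictor the authors describe.

```python
# Hypothetical sketch: rank unlabelled records by a proxy quality score
# and flag the lowest-scoring fraction as the "noisiest" candidates for
# manual data cleansing. The scoring function is an assumption for
# illustration, not the method from the paper.

def noisiest_subset(records, score_fn, fraction=0.1):
    """Return the given fraction of records with the lowest quality scores."""
    scored = sorted(records, key=score_fn)   # ascending: worst records first
    k = max(1, int(len(records) * fraction))
    return scored[:k]

def toy_score(record):
    """Toy quality score: penalize missing fields and digits in the name."""
    name, city = record
    score = 1.0
    if not name or not city:
        score -= 0.5   # missing information
    if any(ch.isdigit() for ch in name):
        score -= 0.3   # likely a garbled, non-standard entry
    return score

records = [("John Smith", "Armonk"), ("J0hn Sm1th", ""), ("", "NY")]
print(noisiest_subset(records, toy_score, fraction=0.5))
# → [('J0hn Sm1th', '')]
```

In practice the proxy score would come from a learned model rather than hand-written checks; the ranking-and-thresholding step stays the same, so cleansing effort can be focused on the subset most likely to contain errors.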