Publication
BigData Congress 2017
Conference paper
Bleach: A Distributed Stream Data Cleaning System
Abstract
Existing scalable data cleaning approaches have focused on batch data cleaning. However, batch data cleaning is not suitable for streaming big data systems, in which dynamic data is generated continuously. Despite the increasing popularity of stream-processing systems, few stream data cleaning techniques have been proposed so far. In this paper, we bridge this gap by addressing the problem of rule-based stream data cleaning, which imposes stringent requirements on latency, on rule dynamics, and on the ability to cope with the continuous nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the state necessary to repair data. Additionally, it supports rule dynamics and uses a 'cumulative' sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using both synthetic and real data streams and experimentally validate its high throughput, low latency and high cleaning accuracy, which are preserved even with rule dynamics.
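To make the idea of rule-based violation detection and repair on a stream concrete, the following is a minimal single-process sketch, not Bleach's actual design. It assumes a single rule of functional-dependency form (one attribute determines another), a plain count-based sliding window rather than the 'cumulative' window mentioned in the abstract, and a majority-vote repair; none of these choices is stated in the abstract. The class name `WindowedFdCleaner` and the field names `zip` and `city` are hypothetical.

```python
from collections import Counter, defaultdict, deque


class WindowedFdCleaner:
    """Toy cleaner for one rule lhs -> rhs over a count-based sliding window.

    Assumptions (not taken from the paper): the rule is a functional
    dependency, the window holds the last `window_size` records, and a
    violation is repaired with the most frequent rhs value in the window.
    """

    def __init__(self, lhs, rhs, window_size):
        self.lhs = lhs
        self.rhs = rhs
        self.window_size = window_size
        self.window = deque()                # (lhs value, rhs value) pairs
        self.counts = defaultdict(Counter)   # lhs value -> rhs value counts

    def _evict_oldest(self):
        old_key, old_value = self.window.popleft()
        self.counts[old_key][old_value] -= 1
        if self.counts[old_key][old_value] == 0:
            del self.counts[old_key][old_value]
        if not self.counts[old_key]:
            del self.counts[old_key]

    def process(self, record):
        """Return (repaired_record, violated) for one incoming record."""
        if len(self.window) == self.window_size:
            self._evict_oldest()

        key, value = record[self.lhs], record[self.rhs]
        self.window.append((key, value))
        self.counts[key][value] += 1

        seen = self.counts[key]
        # The rule is violated when one lhs value maps to several rhs values.
        violated = len(seen) > 1

        # Majority-vote repair: emit the most frequent rhs value in the window.
        repaired = dict(record)
        repaired[self.rhs] = seen.most_common(1)[0][0]
        return repaired, violated


cleaner = WindowedFdCleaner(lhs="zip", rhs="city", window_size=3)
stream = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},       # conflicts with the two above
    {"zip": "94105", "city": "San Francisco"},
]
results = [cleaner.process(r) for r in stream]
```

Here the third record conflicts with the two before it, so it is flagged and emitted with the majority value. A distributed system would additionally have to partition this state across workers and handle rules being added or removed at runtime, which this sketch does not attempt.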