Bleach: A Distributed Stream Data Cleaning System
Existing scalable data cleaning approaches have focused on batch data cleaning. However, batch data cleaning is not suitable for streaming big data systems, in which dynamic data is generated continuously. Despite the increasing popularity of stream-processing systems, few stream data cleaning techniques have been proposed so far. In this paper, we bridge this gap by addressing the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the continuous nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data. Additionally, it supports rule dynamics and uses a 'cumulative' sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using both synthetic and real data streams and experimentally validate its high throughput, low latency and high cleaning accuracy, which are preserved even with rule dynamics.