Publication
SOLI 2012
Conference paper
Automated selection of blocking columns for record linkage
Abstract
Record linkage is an essential but expensive step in enterprise data management. Most deployments employ blocking techniques, which reduce the number of record pair comparisons and hence the computational complexity of the task. Blocking algorithms require a careful selection of the column(s) used for blocking. This selection is critical to both the accuracy and the speed-up that blocking offers, and it currently relies on data quality practitioners who exploit prior domain knowledge to analyse a small sample of a huge database and decide on the blocking column(s). However, the optimal choice can depend heavily on the quality of the data and requires extensive analysis by an experienced practitioner. In this paper, we present a data-driven approach to automatically choose blocking column(s), motivated by the modus operandi of data quality practitioners. Our approach produces a ranked list of columns by evaluating their appropriateness for blocking on the basis of factors including data quality and distribution. We evaluate our choice of blocking columns through experiments on real-world and synthetic datasets. We also extend our approach to scenarios where more than one column can be used for blocking. © 2012 IEEE.
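To make the idea of ranking columns for blocking concrete, here is a minimal Python sketch. It is not the method from the paper, whose scoring factors the abstract does not specify. The functions `candidate_pairs`, `column_score` and `rank_columns`, the sample records, and the score itself (completeness multiplied by the fraction of pair comparisons avoided) are all assumptions made for illustration.

```python
from collections import Counter

MISSING = (None, "")  # assumption: how missing values are encoded


def candidate_pairs(records, column):
    """Count the record pairs compared when blocking on `column`.

    Records sharing a value form one block; a block of n records
    yields n * (n - 1) / 2 pairs. Records with a missing value
    are left out of every block.
    """
    sizes = Counter(r[column] for r in records if r.get(column) not in MISSING)
    return sum(n * (n - 1) // 2 for n in sizes.values())


def column_score(records, column):
    """Hypothetical score in [0, 1]: completeness times reduction ratio."""
    total = len(records)
    all_pairs = total * (total - 1) // 2
    if all_pairs == 0:
        return 0.0
    filled = sum(1 for r in records if r.get(column) not in MISSING)
    completeness = filled / total  # crude stand-in for data quality
    reduction = 1 - candidate_pairs(records, column) / all_pairs
    return completeness * reduction


def rank_columns(records, columns):
    """Return the columns ordered from highest to lowest score."""
    return sorted(columns, key=lambda c: column_score(records, c), reverse=True)


# Illustrative records only, not data from the paper.
records = [
    {"zip": "10001", "city": "New York", "gender": "F"},
    {"zip": "10001", "city": "New York", "gender": "F"},
    {"zip": "94105", "city": "San Francisco", "gender": "F"},
    {"zip": "", "city": "San Francisco", "gender": "M"},
]
ranking = rank_columns(records, ["gender", "zip", "city"])
```

This score rewards columns that are well populated and split the data into many small blocks. It does not check whether true duplicates end up in the same block, so a column of unique identifiers would score highly even though it might separate matching records. Measuring that would need labelled matches, which this sketch does not assume.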