Automated selection of blocking columns for record linkage

K. Hima Prasad; Snigdha Chaturvedi; Tanveer A. Faruquie; L. Venkata Subramaniam; Mukesh K. Mohania

doi:10.1109/SOLI.2012.6273508

SOLI 2012

Conference paper

12 Oct 2012

Abstract

Record linkage is an essential but expensive step in enterprise data management. Most deployments employ blocking techniques to reduce the number of record pair comparisons and hence the computational complexity of the task. Blocking algorithms require careful selection of the column(s) to block on. This choice is critical to the accuracy and speed-up the blocking technique offers, and it usually requires experienced data quality practitioners, who exploit prior domain knowledge to analyse a small sample of the huge database and decide the blocking column(s). However, the optimal choice can depend heavily on the quality of the data and requires extensive analysis. In this paper, we present a data-driven approach to automatically choose blocking column(s), motivated by the modus operandi of data quality practitioners. Our approach produces a ranked list of columns by evaluating their appropriateness for blocking on the basis of factors including data quality and distribution. We evaluate our choice of blocking columns through experiments on real-world and synthetic datasets. We also extend our approach to scenarios where more than one column can be used for blocking. © 2012 IEEE.
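
To make the idea of ranking columns by data quality and distribution concrete, here is a minimal sketch. It is not the scoring method from the paper, which the abstract does not specify. The names `blocking_score` and `rank_columns`, the sample records, and the score itself (completeness multiplied by the fraction of pair comparisons that blocking avoids) are assumptions chosen only for illustration.

```python
from collections import Counter


def blocking_score(values):
    """Hypothetical score in [0, 1]: completeness x pair-reduction ratio.

    This is an illustrative stand-in, not the paper's scoring function.
    """
    n = len(values)
    if n < 2:
        return 0.0
    # Data quality proxy: share of records with a usable (non-empty) value.
    present = [v for v in values if v not in (None, "")]
    completeness = len(present) / n
    # Distribution proxy: how many of the n*(n-1)/2 pair comparisons are
    # avoided when only records sharing the same value are compared.
    total_pairs = n * (n - 1) // 2
    block_pairs = sum(c * (c - 1) // 2 for c in Counter(present).values())
    reduction = 1 - block_pairs / total_pairs
    return completeness * reduction


def rank_columns(records):
    """Return (column, score) pairs sorted from best to worst score."""
    columns = list(records[0].keys())
    scores = {
        col: blocking_score([r.get(col) for r in records]) for col in columns
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample data, for illustration only.
records = [
    {"zip": "10001", "gender": "F", "phone": ""},
    {"zip": "10001", "gender": "M", "phone": "555-1"},
    {"zip": "10002", "gender": "F", "phone": ""},
    {"zip": "10003", "gender": "F", "phone": None},
]
ranked = rank_columns(records)
for column, score in ranked:
    print(f"{column}: {score:.3f}")
```

On this sample the sketch ranks `zip` first, then `gender`, then `phone`: `phone` is mostly missing and `gender` forms one large block. Note the limitation of this toy score: it ignores whether true matches actually land in the same block, so a near-unique column such as a record ID would score highly despite being useless for blocking. A practical ranking would need to account for that.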
