Publication
CIKM 2023
Demo paper
NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns
Abstract
Join discovery is a crucial part of data lake exploration. It often involves finding joinable tables that are semantically relevant. However, data lakes often contain numeric tables with unreliable column headers, and ID columns whose text names have been lost. Finding semantically relevant joins over such numeric tables is a challenge. State-of-the-art approaches perform join discovery using semantic similarity, but do not consider purely numeric tables. In this paper, we describe a system, NumJoin, that includes two novel approaches for discovering joinable tables in a data lake: one that maps tables to knowledge graphs, and another that leverages numeric types and distributions. We demonstrate the effectiveness of NumJoin on a large data lake that includes transportation and finance data.
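To give a rough intuition for what "leveraging numeric types and distributions" could look like, the sketch below scores a pair of numeric columns as join candidates. It is not the NumJoin method, which the abstract does not detail. The function names, the coarse integer/float typing, the two-sample Kolmogorov-Smirnov distance, the value-containment measure, and the way they are combined into a score are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def numeric_kind(values):
    """Coarse numeric type: 'integer' if every value is whole, else 'float'."""
    return "integer" if all(float(v).is_integer() for v in values) else "float"


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def containment(a, b):
    """Fraction of the distinct values in a that also appear in b."""
    distinct_a = set(a)
    return len(distinct_a & set(b)) / len(distinct_a)


def join_score(a, b):
    """Hypothetical score in [0, 1]; columns are assumed to be non-empty.

    Returns 0 when the coarse numeric types differ; otherwise rewards
    shared values and similar distributions.
    """
    if numeric_kind(a) != numeric_kind(b):
        return 0.0
    return containment(a, b) * (1.0 - ks_distance(a, b))


# Made-up columns: an ID column, a foreign-key-like column, and unrelated prices.
station_ids = [101, 102, 103, 104]
trip_station = [101, 101, 102, 104, 104, 103]
prices = [9.99, 12.5, 101.25]

print(join_score(station_ids, trip_station))
print(join_score(station_ids, prices))
```

In this toy setup the two ID-like columns share all their values and have similar distributions, so they score highly, while the price column is ruled out by its type alone. A real system would need richer type inference and a more careful treatment of scale and sampling.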