NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns
Abstract
Join discovery is a crucial part of data lake exploration. It often involves finding joinable tables that are semantically relevant. However, data lakes often contain numeric tables with unreliable column headers, and ID columns whose text names have been lost. Finding semantically relevant joins over such numeric tables is a challenge. State-of-the-art approaches perform join discovery using semantic similarity, but they do not consider purely numeric tables. In this paper, we describe a system, NumJoin, that includes two novel approaches for discovering joinable tables in a data lake: one that maps tables to knowledge graphs, and another that leverages numeric types and distributions. We demonstrate the effectiveness of NumJoin on a large data lake that includes transportation and finance data.
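To give a rough sense of what leveraging numeric types and distributions could look like, the sketch below scores a pair of numeric columns as join candidates. It is an illustration under stated assumptions, not NumJoin's actual method: the function names, the integer-versus-fractional type check, the use of value containment and a two-sample Kolmogorov-Smirnov distance, and the equal weighting are all hypothetical choices made for this example.

```python
from bisect import bisect_right


def is_integer_typed(values):
    """Return True when every value is a whole number (a crude numeric-type check)."""
    return all(float(v).is_integer() for v in values)


def containment(query, candidate):
    """Fraction of distinct query values that also occur in the candidate column."""
    q, c = set(query), set(candidate)
    return len(q & c) / len(q) if q else 0.0


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def join_score(query, candidate):
    """Hypothetical score in [0, 1] for how plausible a join between two columns is.

    Assumption: a type mismatch rules the pair out; otherwise value containment
    and distribution similarity (1 - KS distance) are averaged with equal weight.
    """
    if not query or not candidate:
        return 0.0
    if is_integer_typed(query) != is_integer_typed(candidate):
        return 0.0
    similarity = 1.0 - ks_distance(query, candidate)
    return 0.5 * containment(query, candidate) + 0.5 * similarity


if __name__ == "__main__":
    station_ids = [101, 102, 103, 104]        # headerless ID column in one table
    trip_stations = [101, 101, 102, 104, 104, 103]  # candidate foreign-key column
    fares = [9.99, 12.5, 101.25]              # unrelated fractional column
    print(join_score(station_ids, trip_stations))
    print(join_score(station_ids, fares))
```

In this toy setup, the ID-like column pair scores high because the values overlap and are similarly distributed, while the fractional column is rejected by the type check alone. A signal of this kind says nothing about semantic relevance on its own, which is why it would need to be combined with other evidence.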