Unleashing the Potential of Data Lakes with Semantic Enrichment using Foundation Models

Nandana Mihindukulasooriya; Sarthak Dash; Sugato Bagchi; Md Faisal Mahbub Chowdhury; Gaetano Rossiello; Alfio Gliozzo; Ariel Farkash; Michael Glass; Igor Gokhman; Oktie Hassanzadeh; Nhan Pham; Boris Rozenberg; Yehoshua Sagron; Shankar Subramaniam; Toshihiro Takahashi; Takaaki Tateishi; Long Vu

ISWC 2023

Short paper

06 Nov 2023

Abstract

Most organizations now manage data lakes containing heterogeneous data from various sources. However, the lack of adequate metadata often turns these data lakes into data swamps, making it hard to locate relevant data for critical organizational tasks and thereby limiting their utility. Recent advances in large language models and foundation models make it possible to automate metadata generation with generative AI models and to use the generated metadata for mapping tabular data to semantically richer glossaries, taxonomies, or ontologies. In this talk, we will present a semantic enrichment process that generates table metadata, such as descriptive table captions, tags, expanded column names, and column descriptions, and then uses that information to map table columns to concepts in a given business glossary or ontology. During this process, we represent both the table metadata and the business glossary as knowledge graphs and connect them by mapping columns to business concepts. As a result, the enrichment process makes the data in data lakes more meaningful to the organization and benefits downstream tasks, including table search and discovery, table joins, and business analytics.
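To make the column-to-concept mapping step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that expanded column names have already been generated, and it swaps in plain string similarity (Python's difflib) where the process described above would use a foundation model. The column names, glossary entries, `col:` and `bg:` prefixes, predicate names, and the 0.5 threshold are all hypothetical, chosen only for illustration. The knowledge graph is represented as a list of (subject, predicate, object) triples.

```python
from difflib import SequenceMatcher

# Hypothetical output of a metadata-generation step:
# raw column name -> expanded column name.
columns = {
    "cust_nm": "customer name",
    "ord_dt": "order date",
    "tot_amt": "total amount",
}

# Hypothetical business glossary: concept identifier -> concept label.
glossary = {
    "bg:CustomerName": "Customer Name",
    "bg:OrderDate": "Order Date",
    "bg:InvoiceTotal": "Invoice Total Amount",
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1].

    A stand-in for a model-based matcher; this is an assumption of the sketch.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_columns(columns, glossary, threshold=0.5):
    """Return triples describing columns and linking them to glossary concepts."""
    triples = []
    for col, expanded in columns.items():
        subject = f"col:{col}"
        # Record the generated metadata for the column.
        triples.append((subject, "expandedName", expanded))
        # Pick the glossary concept whose label is most similar.
        best_id, best_score = max(
            ((cid, similarity(expanded, label)) for cid, label in glossary.items()),
            key=lambda pair: pair[1],
        )
        # Link the column to the concept only if the match is strong enough.
        if best_score >= threshold:
            triples.append((subject, "mapsTo", best_id))
    return triples


triples = map_columns(columns, glossary)
for triple in triples:
    print(triple)
```

In this sketch each column contributes one metadata triple and, when a sufficiently similar concept exists, one `mapsTo` triple that connects the table graph to the glossary graph. Raising the threshold trades recall for precision in the mapping.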

Conference paper