GORDIAN: Efficient and Scalable Discovery of Composite Keys

Yannis Sismanis; Paul Brown; Peter J. Haas; Berthold Reinwald

VLDB 2006

Conference paper

12 Sep 2006

GORDIAN: Efficient and Scalable Discovery of Composite Keys

Abstract

Identification of (composite) key attributes is of fundamental importance for many different data management tasks such as data modeling, data integration, anomaly detection, query formulation, query optimization, and indexing. However, information about keys is often missing or incomplete in many real-world database scenarios. Surprisingly, the fundamental problem of automatic key discovery has received little attention in the existing literature. Existing solutions ignore composite keys, due to the complexity associated with their discovery. Even for simple keys, current algorithms take a brute-force approach; the resulting exponential CPU and memory requirements limit the applicability of these methods to small datasets. In this paper, we describe GORDIAN, a scalable algorithm for automatic discovery of keys in large datasets, including composite keys. GORDIAN can provide exact results very efficiently for both real-world and synthetic datasets. GORDIAN can be used to find (composite) key attributes in any collection of entities, e.g., key column-groups in relational data, or key leaf-node sets in a collection of XML documents with a common schema. We show empirically that GORDIAN can be combined with sampling to efficiently obtain high quality sets of approximate keys even in very large datasets.

Conference paper