A new two-phase sampling based algorithm for discovering association rules

Bin Chen; Peter Haas; Peter Scheuermann

doi:10.1145/775107.775114

KDD 2002

Conference paper

23 Jul 2002

A new two-phase sampling based algorithm for discovering association rules

View publication

Abstract

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90-95% accuracy using a final sample having a size of only 15-33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database - even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.

Paper