The complexity of mining maximal frequent subgraphs

Benny Kimelfeld; Phokion G. Kolaitis

doi:10.1145/2629550

ACM TODS

Paper

30 Dec 2014

The complexity of mining maximal frequent subgraphs

View publication

Abstract

A frequent subgraph of a given collection of graphs is a graph that is isomorphic to a subgraph of at least as many graphs in the collection as a given threshold. Frequent subgraphs generalize frequent itemsets and arise in various contexts, from bioinformatics to the Web. Since the space of frequent subgraphs is typically extremely large, research in graph mining has focused on special types of frequent subgraphs that can be orders of magnitude smaller in number, yet encapsulate the space of all frequent subgraphs. Maximal frequent subgraphs (i.e., the ones not properly contained in any frequent subgraph) constitute the most useful such type. In this article, we embark on a comprehensive investigation of the computational complexity of mining maximal frequent subgraphs. Our study is carried out by considering the effect of three different parameters: possible restrictions on the class of graphs; a fixed bound on the threshold; and a fixed bound on the number of desired answers.We focus on specific classes of connected graphs: general graphs, planar graphs, graphs of bounded degree, and graphs of bounded treewidth (trees being a special case). Moreover, each class has two variants: that in which the nodes are unlabeled, and that in which they are uniquely labeled. We delineate the complexity of the enumeration problem for each of these variants by determining when it is solvable in (total or incremental) polynomial time and when it is NP-hard. Specifically, for the labeled classes, we show that bounding the threshold yields tractability but, in most cases, bounding the number of answers does not, unless P=NP; an exception is the case of labeled trees, where bounding either of these two parameters yields tractability. The state of affairs turns out to be quite different for the unlabeled classes. The main (and most challenging to prove) result concerns unlabeled trees: we show NP-hardness, even if the input consists of two trees and both the threshold and the number of desired answers are equal to just two. In other words, we establish that the following problem is NP-complete: given two unlabeled trees, do they have more than one maximal subtree in common?

Conference paper