Statistical significance of large gene clusters
Abstract
Consider common gene clusters of closely related species, where a cluster may contain as many as 400 genes drawn from an alphabet of 25,000 genes. This paper addresses the problem of computing the statistical significance of such large clusters, in which each individual element occurs with very low frequency (on the order of the number of species in this case) and the alphabet of elements is relatively large. We present a model that describes the structure of a cluster in terms of the smaller sub-clusters, nested or otherwise, that it contains. We give a probability estimate based on the expected cluster structure for such clusters, rather than on some form of the product of the individual probabilities of the elements. We also give an exact probability computation based on a dynamic programming algorithm that runs in polynomial time. © 2007 Mary Ann Liebert, Inc.