About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
SODA 2000
Conference paper
Pattern discovery on character sets and real-valued data: Linear bound on irredundant motifs and an efficient polynomial time algorithm
Abstract
Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with `dont care' characters. The data could be a sequence of characters or sets of characters or even real values. In the first two cases, the number of motifs could potentially be exponential in the size of the input sequence and in the third case there could be infinite number of motifs (assuming two real numbers are equal if they are within some specified δ>0 of each other). However, it has been shown in some motif based applications, such as multiple sequence alignment [Par98, PFR99], that a small subset of these motifs capture all the relevant information. Also, from an information-theoretic viewpoint, the special motifs represent the sets with low co-relation, hence of higher informational value. In this paper, we show, for any sequence with n characters, there exists only a linear (or no more than 3 n) number of these special motifs and every other motif can be constructed from this set. We term these special motifs as irredundant motifs. We also present an efficient polynomial time algorithm to generate these motifs. This bound on the number of useful motifs gives validation to motif-based approaches, since the total number of irredundant motifs does not explode. This result is potentially of significance to most applications that use pattern discovery as the basic engine such as data mining, clustering, matching and others.