Publication
SDM 2006
Conference paper

Representation is everything: Towards efficient and adaptable similarity measures for biological data

View publication

Abstract

Distance function computation is an important operation in biological data mining. This is because the notion of similarity is often required as a key subroutine in many data mining algorithms such as clustering, search or indexing. Furthermore, since such applications typically require repeated distance function computation, it is desirable to perform this task as efficiently as possible. A wide variety of alignment functions have been designed in the literature in order to determine similarity between biological strings. Typically, most of these methods such as the edit distance or the Smith-Waterman algorithm rely on either a local or global form of alignment between the two string representations of the biological data. Yet, many of these methods suffer from a variety of drawbacks both in terms of effectiveness and efficiency in measuring similarity. In this paper, we examine the key qualitative issues in measurement of similarity among distance functions, and question the fundamental assumptions in using alignment measures over string representations of the biological data. We argue that the alignment approach is rigid and brittle, ignores several relevant compositional characteristics of the strings, is not easily trainable from examples, and is inherently resistant to an efficient data mining process from a representational point of view. We propose a fundamental overhaul in the methods for similarity computation of biological data by changing the underlying representation of the data. Our proposed methodology (PROSAC) uses PRObabilistic SAmpling for similarity Computations, and results in a new representation of the strings at a higher level of abstraction; yet the abstraction process has the effect of removing the noise in the distance measurements of precise alignment techniques. In addition, some forms of string representation in the PROSAC approach map to well known representational and distance computation methods in the data mining field, and therefore existing data mining software can be used directly and efficiently on the new representation. We also show that the approach is amenable to the use of supervision for parametric determination of efficient distance functions for particular tasks in biological analysis.

Date

Publication

SDM 2006

Authors

Share