Publication
Journal of Computational Biology
Paper

CASTOR: Clustering algorithm for sequence taxonomical organization and relationships

Abstract

Given a set of related proteins, two important problems in biology are the inference of protein subsets such that members of one subset share a common function and the identification of protein regions that possess functional significance. The former is typically approached by hierarchical bottom-up clustering based on pairwise sequence similarity and various linkage rules. The latter is typically approached in a supervised manner, based on global multiple sequence alignment. However, the two problems are inextricably linked, since functional subsets are usually characterized by distinctive functional regions. This paper introduces CASTOR, an automatic and unsupervised system that addresses both problems simultaneously and efficiently. It identifies protein regions that are likely to have functional significance by discovering and refining statistically significant motifs. It infers likely functional protein subsets and their relationships based on the presence of the discovered motifs in a top-down and recursive manner, allowing the identification of both hierarchical and nonhierarchical subset relationships. This is, to our knowledge, the first system that approaches both problems simultaneously in a top-down, systematic manner. CASTOR's performance is evaluated against the G-protein coupled receptor superfamily. The identified protein regions lead to a taxonomical organization of this superfamily that is in remarkable agreement with a biologically motivated one and which outperforms those produced by bottom-up clustering methods. We also find that conventional hierarchical representations may fail to accurately describe the complexity of evolutionary development responsible for the final organization of a complex protein family. In particular, many functional relationships governing distant subfamilies of such a protein family may not be represented hierarchically.
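The top-down, motif-driven partitioning described in the abstract can be illustrated with a toy sketch. This is not CASTOR's implementation: the function name `split_by_motifs`, the `min_size` threshold, the regular-expression motifs, and the example sequences are all hypothetical, and the motif discovery, refinement, and statistical significance testing that the paper relies on are omitted. The sketch only shows how recursive splitting on motif presence can yield overlapping (nonhierarchical) subsets, since a sequence carrying two motifs lands in two branches.

```python
import re
from typing import Dict, List


def split_by_motifs(seqs: Dict[str, str], motifs: List[str], min_size: int = 2) -> dict:
    """Recursively partition sequences top-down by motif presence (toy sketch)."""
    node = {"members": sorted(seqs), "children": {}}
    for motif in motifs:
        # Sequences in the current set that contain this motif.
        hits = {name: s for name, s in seqs.items() if re.search(motif, s)}
        # A motif defines a subset only if it covers a proper part of the
        # current set that is at least min_size sequences large.
        if min_size <= len(hits) < len(seqs):
            # Recurse with the remaining motifs, so the recursion always terminates.
            rest = [m for m in motifs if m != motif]
            node["children"][motif] = split_by_motifs(hits, rest, min_size)
    return node


# Hypothetical sequences and motifs, invented purely for illustration.
sequences = {
    "a1": "MKDRYLLA",
    "a2": "GGDRYPPA",
    "b1": "MKNPXXYLL",
    "b2": "TTNPAAYGG",
    "ab": "DRYNPQQY",  # carries both motifs, so it appears in both subsets
}
tree = split_by_motifs(sequences, ["DRY", r"NP..Y"])

for motif, child in tree["children"].items():
    print(motif, child["members"])
```

Because subsets are defined by motif presence rather than by merging nearest neighbours, the same sequence can belong to several branches, which a strict bottom-up hierarchy cannot express.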
