FLASH: a fast look-up algorithm for string homology
Abstract
We are in the middle of a long-range worldwide race to map and sequence the genome of Homo sapiens and that of many other living creatures. About 108 nucleotides and aminoacids of mammals, primates, rodents, bacteria, and other life forms have already been classified and stored in publicly available database such as Genbank. By the end of the century we should be close to the estimated mark of 1 billion nucleotides. A key issue in managing such large amounts of data is the availability of efficient, accurate, and selective techniques to detect homology (similarity) between newly recovered and previously acquired sequences. Unfortunately, even today's most advanced algorithms such as Smith-Waterman, FASTA and BLAST, are designed to scan the contents of the entire database for one or more matches. This results in long search times or in a sharp tradeoff between accuracy and amount of computation. The algorithm we present here is based on a probabilistic indexing framework which requires minimal access to the database for such match. A highly redundant number of descriptive tuples from the sequences of interest are generated and used as indices in table look-up paradigm. Theoretical and experimental results on the sensitivity and accuracy of the approach are provided. This includes the probability of correct and random matches and the storage and computational requirements. An experimental system has been implemented for a database containing the complete genome of the bacteria E.Coli (approximately 2 million nucleotides). The system is being expanded to include the complete Genbank database. Search time is of a few seconds on a workstation class machine. The algorithm is shown to scale well to databases containing billions of nucleotides with performances that are orders of magnitude better even than BLAST, the fastest of the current techniques. The approach is of a very general nature and is being applied to a large number of other domains and data topologies such as speech recognition, sound and image databases.