Intelligent Crawling on the World Wide Web with Arbitrary Predicates

Charu C. Aggarwal; Fatima Al-Garawi; Philip S. Yu

doi:10.1145/371920.371955

WWW 2001

Conference paper

01 Apr 2001



Abstract

The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
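
To make the idea in the abstract concrete, the following is a minimal sketch of a crawler that re-ranks its frontier using statistics gathered during the crawl itself. It is not the paper's statistical model: the scoring rule (an equal-weight mix of Laplace-smoothed URL-token and in-link hit rates), the function names `url_tokens` and `intelligent_crawl`, the in-memory `web` dictionary, and the `a.example` URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def intelligent_crawl(web, seeds, predicate, max_pages):
    """Crawl a toy web (dict: url -> (text, outlinks)).

    Candidates are ranked by an estimate of how likely they are to satisfy
    `predicate`, learned from the pages crawled so far. The estimate is an
    illustrative assumption, not the method of the paper.
    """
    token_seen = defaultdict(int)    # crawled pages whose URL has this token
    token_hits = defaultdict(int)    # ... of which satisfied the predicate
    inlink_seen = defaultdict(int)   # crawled pages linking to a candidate
    inlink_hits = defaultdict(int)   # ... of which satisfied the predicate

    def score(url):
        # Laplace-smoothed hit rates, so unseen evidence starts at 0.5.
        toks = url_tokens(url)
        tok_p = sum((token_hits[t] + 1) / (token_seen[t] + 2) for t in toks)
        tok_p /= max(len(toks), 1)
        link_p = (inlink_hits[url] + 1) / (inlink_seen[url] + 2)
        return 0.5 * tok_p + 0.5 * link_p

    frontier = {s for s in seeds if s in web}
    crawled, order = set(), []
    while frontier and len(order) < max_pages:
        # Linear scan for clarity; ties are broken by URL string.
        url = max(frontier, key=lambda u: (score(u), u))
        frontier.remove(url)
        crawled.add(url)
        text, outlinks = web[url]
        hit = bool(predicate(url, text))
        order.append((url, hit))
        # Update the learned statistics with the outcome for this page.
        for t in set(url_tokens(url)):
            token_seen[t] += 1
            token_hits[t] += hit
        for out in outlinks:
            inlink_seen[out] += 1
            inlink_hits[out] += hit
            if out in web and out not in crawled:
                frontier.add(out)
    return order


# Hypothetical toy graph: one portal page, three sports pages, two news pages.
web = {
    "http://a.example/index": ("portal", ["http://a.example/sports/1", "http://a.example/news/1"]),
    "http://a.example/sports/1": ("football scores", ["http://a.example/sports/2", "http://a.example/news/2"]),
    "http://a.example/sports/2": ("football transfer", ["http://a.example/sports/3"]),
    "http://a.example/sports/3": ("football fixtures", []),
    "http://a.example/news/1": ("election results", ["http://a.example/news/2"]),
    "http://a.example/news/2": ("budget vote", []),
}
order = intelligent_crawl(
    web,
    seeds=["http://a.example/index"],
    predicate=lambda url, text: "football" in text,
    max_pages=4,
)
for url, hit in order:
    print(url, hit)
```

In this sketch the predicate is an arbitrary user-supplied function, so a keyword test could be swapped for a classifier or a combination of conditions without changing the crawl loop. Once the first sports page satisfies the predicate, the `sports` URL token and the in-links from that page raise the scores of the remaining sports pages above those of the news pages.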
