A bag of paths model for measuring structural similarity in Web documents

Sachindra Joshi; Neeraj Agrawal; Raghu Krishnapuram; Sumit Negi

doi:10.1145/956750.956822

KDD 2003

Conference paper

01 Dec 2003

Abstract

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents based on the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children, and siblings, it allows us to define a new family of structural similarity measures that are both meaningful and computationally simple. Our experimental results, based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com, show that the representation is powerful enough to produce good clusters of structurally similar pages. Copyright 2003 ACM.
