Conference paper

Index design for structured documents based on abstraction


HTML has so far been the standard format for delivering information on the World Wide Web. However, automated processing of HTML documents for data exchange and interoperability has been difficult. XML, a subset of SGML, has been proposed as the next standard format; it allows user-defined tags that better describe nested document structures and their associated semantics. Operations on structured documents, such as searching within nested document structures, require functions that most current systems do not provide. We describe a general framework for manipulating structured documents based on document abstractions. An abstraction is an approximation of an actual document that retains properties useful for the analyses of interest. The framework offers a wide design space for trading off cost against capability, and it can be applied to index design, document searching, and categorization. We present the framework by focusing on the indexing and searching of structured documents in the XML domain, and we prove the soundness of both. We also address the issue of rich data types in XML documents.
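
To make the notion of an abstraction concrete, the following is a minimal illustrative sketch and not the construction used in the paper. It assumes one particular abstraction, namely the set of root-to-node tag paths truncated at a chosen depth, and builds an index over it. The names `abstract`, `build_index`, `candidates`, `contains_path`, and `DOCS` are hypothetical. The depth bound stands in for the tradeoff between cost and capability: a smaller bound gives a smaller index that may return extra candidates, while soundness here means that no document actually containing the queried path is ever missed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def abstract(xml_text, max_depth):
    """Assumed abstraction: all root-to-node tag paths of length <= max_depth."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        # Stop descending once the depth bound is reached.
        if len(path) < max_depth:
            for child in node:
                stack.append((child, path + (child.tag,)))
    return frozenset(paths)


def build_index(docs, max_depth):
    """Map each abstract path to the ids of the documents that exhibit it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for path in abstract(xml_text, max_depth):
            index[path].add(doc_id)
    return index


def candidates(index, query_path, max_depth):
    """Documents that may contain query_path (a superset of the true answer)."""
    key = tuple(query_path)[:max_depth]
    return set(index.get(key, set()))


def contains_path(xml_text, query_path):
    """Exact check against the actual document, used to filter candidates."""
    query_path = tuple(query_path)
    root = ET.fromstring(xml_text)
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        if path == query_path:
            return True
        # Only follow branches that are still a prefix of the query.
        if len(path) < len(query_path) and path == query_path[:len(path)]:
            for child in node:
                stack.append((child, path + (child.tag,)))
    return False


# Hypothetical sample documents for illustration only.
DOCS = {
    "a": "<book><chapter><section><title>t</title></section></chapter></book>",
    "b": "<book><chapter><figure/></chapter></book>",
    "c": "<article><section><title>t</title></section></article>",
}

query = ("book", "chapter", "section")
coarse = candidates(build_index(DOCS, 2), query, 2)   # cheaper, less precise
fine = candidates(build_index(DOCS, 3), query, 3)     # costlier, more precise
exact = {d for d in coarse if contains_path(DOCS[d], query)}
```

In this sketch the coarse index cannot distinguish documents that agree on the first two path components, so its candidate set must be filtered against the actual documents; the finer index answers the same query without that extra step, at the price of storing more paths.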
