About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
HT 2018
Conference paper
Joint distributed representation of text and structure of semi-structured documents
Abstract
Majority of textual data over web is in the form of semi-structured documents. Thus, structural skeleton of such documents plays important role in determining the semantics of the data content. Presence of structure sometimes allows us to write simple rules to extract such information, but it may not be always possible due to flexibility in the structure and the frequency with which such structures are altered. In this paper, we propose a joint modeling of text and the associated structure to effectively capture the semantics of the semistructure documents. The model simultaneously learns the dense continuous representation for word tokens and the structure associated with them.We utilize the context of structures for projection such that similar structures containing semantically similar topics are close to each other in vector space. We explore two semantic text mining tasks over web data to test the effectiveness of our representation viz., document similarity, and table semantic component identification. In context of traditional rule-based approaches, both these tasks demand rich, domain-specific knowledge sources, homogeneous schema for the documents, and rules that capture the semantics. On the other hand, our approach is unsupervised and resource conscious in nature. Despite of working without knowledge resources and large training data, it performs at par with state-of-the-art rule based and other unsupervised approaches.