Joint distributed representation of text and structure of semi-structured documents

Abhishek Laddha; Salil Joshi; Samiulla Shaikh; Sameep Mehta

doi:10.1145/3209542.3209551

HT 2018

Conference paper

03 Jul 2018

Abstract

The majority of textual data on the web is in the form of semi-structured documents. The structural skeleton of such documents therefore plays an important role in determining the semantics of the data content. The presence of structure sometimes allows us to write simple rules to extract such information, but this is not always possible because the structure is flexible and is altered frequently. In this paper, we propose a joint model of text and its associated structure to effectively capture the semantics of semi-structured documents. The model simultaneously learns dense continuous representations for word tokens and the structure associated with them. We utilize the context of structures for projection such that similar structures containing semantically similar topics are close to each other in vector space. We explore two semantic text mining tasks over web data to test the effectiveness of our representation, viz., document similarity and table semantic component identification. In the context of traditional rule-based approaches, both these tasks demand rich, domain-specific knowledge sources, a homogeneous schema for the documents, and rules that capture the semantics. Our approach, on the other hand, is unsupervised and resource-conscious in nature. Despite working without knowledge resources and large training data, it performs on par with state-of-the-art rule-based and other unsupervised approaches.