LSIF: A system for large-scale Information flow detection based on topic-related semantic similarity measurement

Meng Zhao; Hao Wang; Liangliang Cao; Chen Zhang; Hongzhi Yin; Fanjiang Xu

doi:10.1109/WI-IAT.2015.2

WI-IAT Workshops 2015

Conference paper

02 Feb 2016

Abstract

Information flow detection tracks the dynamics and evolution of information as it spreads across the Web over time. The main challenges are choosing a suitable information granularity for detection and tracking how one piece of information evolves into another. In addition, doing this efficiently on large-scale data remains an open technical problem. In this paper, we propose a system, LSIF, for large-scale, topic-related semantic information flow detection. We treat the sentence as the basic information unit and represent each word or sentence as a continuous high-dimensional vector, built with word embeddings and the Fisher kernel, which is used for semantic similarity measurement. To handle large-scale information efficiently, we propose a dimension reduction framework called Random Reference Reduction (3R). Furthermore, we adopt a novel clustering algorithm to extract memes (a meme is a piece of information together with its variants) and analyze how memes evolve. We demonstrate the effectiveness of our approach on two terabyte-level datasets. The first is a dataset used by previous researchers, on which we conducted a series of experiments to evaluate performance. The results show that our approach is more effective and more efficient than state-of-the-art methods. The second is a 5-terabyte dataset crawled from 20 Chinese news sites. We visualize the detected information flows and extract 9 million memes from the Chinese dataset, which takes about two days.
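
As a rough illustration of sentence-level semantic similarity on word vectors, the sketch below averages toy word embeddings into a sentence vector and compares sentences by cosine similarity. It is not the LSIF method: plain averaging stands in for the Fisher-kernel aggregation described in the abstract, and the 3R reduction and clustering steps are not shown. The vocabulary, the 3-dimensional vectors, and the names `TOY_EMBEDDINGS`, `sentence_vector`, and `cosine_similarity` are all hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# Real embeddings are learned from a corpus and have far more dimensions.
TOY_EMBEDDINGS = {
    "stocks": [0.9, 0.1, 0.0],
    "shares": [0.8, 0.2, 0.1],
    "fell": [0.1, 0.9, 0.0],
    "dropped": [0.2, 0.8, 0.1],
    "rain": [0.0, 0.1, 0.9],
    "today": [0.3, 0.3, 0.3],
}


def sentence_vector(sentence, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of the known words in a sentence.

    Assumption: simple averaging replaces the Fisher-kernel
    aggregation that the abstract mentions.
    """
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("sentence contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


original = sentence_vector("stocks fell today")
variant = sentence_vector("shares dropped today")
unrelated = sentence_vector("rain today")

# With these toy vectors, the paraphrase scores higher than the unrelated sentence.
print(cosine_similarity(original, variant) > cosine_similarity(original, unrelated))
```

With these toy vectors, the paraphrased sentence lands closer to the original than the unrelated one, which is the property a meme-clustering step would rely on when grouping a piece of information with its variants.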
