Toward query-friendly compression of rapid graph streams
We study the problem of summary construction of massive graph streams arriving in real time. Many graphs such as those formed by the activity on social networks, communication networks, and telephone networks are defined dynamically as rapid edge streams on a massive domain of nodes. In these rapid and massive graph streams, it is often not possible to estimate the frequency of individual items (e.g., edges, nodes) with complete accuracy. Nevertheless, sketch-based stream summaries such as Count-Min can preserve frequency information of high-frequency items with a reasonable accuracy. However, these sketches lose the underlying graph structure unless one keeps information about start and end nodes of all edges, which is prohibitively expensive. For example, the existing methods can identify the high-frequency nodes and edges, but they are unable to answer more complex structural queries such as reachability defined by high-frequency edges. To this end, we design a three-dimensional sketch, gMatrix that summarizes massive graph streams in real time, while also retaining information about the structural behavior of the underlying graph dataset. We demonstrate how gMatrix, coupled with a one-time reverse hash mapping, is able to estimate important structural properties, e.g., reachability over high-frequency edges in an online manner and with theoretical performance guarantees. Our experimental results using large-scale graph streams attest that gMatrix is capable of answering both frequency-based and structural queries with high accuracy and efficiency.