Distributed middleware reliability and fault tolerance support in System S

Rohit Wagle; Henrique Andrade; Kirsten Hildrum; Chitra Venkatramani; Michael Spicer

doi:10.1145/2002259.2002304

DEBS 2011

Conference paper

26 Aug 2011

Abstract

We describe a fault-tolerance technique for implementing operations in a large-scale distributed system that ensures that all components eventually have a consistent view of the system, even in the face of component failures. To achieve this, we break each distributed operation into a series of smaller, carefully linked operations, each local to a single component. Thus, the effect of a component failure and restart in the middle of a multi-component operation is limited to that component and its immediate neighbors. This framework is used in System S, a commercial-grade stream processing platform. In that context, we show empirically that our approach is effective and imposes low overhead on distributed inter-component operations. © 2011 ACM.
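
The idea of splitting one distributed operation into linked, component-local steps can be illustrated with a minimal sketch. This is not the System S implementation: the `Component` class, the `run_chain` driver, the idempotent-step design, and the retry policy are all assumptions made for illustration, and the in-memory dictionary only stands in for whatever durable storage a real component would use.

```python
class Component:
    """Hypothetical component whose local state survives a restart."""

    def __init__(self, name):
        self.name = name
        self.durable = {}       # stand-in for a persistent store: op_id -> payload
        self.fail_next = False  # hook to simulate one crash before the step commits

    def apply(self, op_id, payload):
        """Local, idempotent step: re-applying the same op_id changes nothing."""
        if self.fail_next:
            self.fail_next = False
            raise ConnectionError(f"{self.name} crashed")
        if op_id not in self.durable:
            self.durable[op_id] = payload


def run_chain(components, op_id, payload, max_retries=3):
    """Drive one operation through the chain, one local step at a time.

    A failed step is retried against the same component only (assumed to
    have restarted); steps already committed upstream are not redone.
    Returns the total number of step attempts.
    """
    attempts = 0
    for comp in components:
        for _ in range(max_retries):
            attempts += 1
            try:
                comp.apply(op_id, payload)
                break
            except ConnectionError:
                continue  # retry the same local step after the simulated restart
        else:
            raise RuntimeError(f"{comp.name} did not recover")
    return attempts


chain = [Component("a"), Component("b"), Component("c")]
chain[1].fail_next = True  # component "b" fails once in mid-operation
total_attempts = run_chain(chain, "op-1", {"job": 42})
```

In this sketch the crash of component "b" costs one extra attempt on "b" itself, while "a" is never asked to redo its step, and every component ends up holding the same record for "op-1". Keying each step by an operation identifier is one common way to make a retry safe; the paper may link its local operations differently.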