We describe a fault-tolerance technique for implementing operations in a large-scale distributed system that ensures that all components eventually have a consistent view of the system, even in the face of component failures. To achieve this, we break each distributed operation into a series of smaller operations, each local to a single component, and link them together carefully. As a result, the effect of a component failure and restart in the middle of a multi-component operation is limited to that component and its immediate neighbors. This framework is used in System S, a commercial-grade stream processing platform. In that context, we show empirically that our approach is effective and imposes low overhead on distributed inter-component operations. © 2011 ACM.
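
The general idea of splitting a multi-component operation into linked local steps can be illustrated with a minimal sketch. This is not the System S implementation: the names `Component`, `apply_local`, `run_chain`, and `ComponentCrash` are hypothetical, an in-memory set stands in for durable local state, and a simple retry loop stands in for the upstream neighbor re-issuing a step after a restart. The sketch assumes each local step is idempotent, so a retried step never disturbs components that have already completed theirs.

```python
from dataclasses import dataclass, field


class ComponentCrash(Exception):
    """Simulated failure of a component in the middle of its local step."""


@dataclass
class Component:
    # All fields are illustrative assumptions, not part of any real API.
    name: str
    crashes_left: int = 0                       # simulated failures before success
    applied: set = field(default_factory=set)   # stands in for durable local state

    def apply_local(self, op_id: str) -> None:
        """Local, idempotent step: re-applying a known op_id is a no-op."""
        if op_id in self.applied:
            return
        if self.crashes_left > 0:
            self.crashes_left -= 1
            raise ComponentCrash(self.name)
        self.applied.add(op_id)


def run_chain(op_id: str, chain: list, max_retries: int = 3) -> int:
    """Run one local step per component, in order.

    When a component crashes, only its own step is retried; steps already
    completed by earlier components are not redone. Returns the number of
    retries that were needed, and re-raises if a step keeps failing.
    """
    retries = 0
    for component in chain:
        for attempt in range(max_retries + 1):
            try:
                component.apply_local(op_id)
                break
            except ComponentCrash:
                if attempt == max_retries:
                    raise
                retries += 1
    return retries
```

For example, running `run_chain("op-1", [Component("A"), Component("B", crashes_left=2), Component("C")])` retries only the step on `B`; once it succeeds, all three components have recorded `"op-1"`, which is the eventual consistency property in miniature.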