Abstract
In this work we are concerned with the cost associated with replicating intermediate data for dataflows in Cloud environments. This cost is attributed to the extra resources required to create and maintain the additional replicas for a given data set. Existing data-analytic platforms such as Hadoop provide fault-tolerance guarantees by relying on aggressive replication of intermediate data. We argue that the decision to replicate, along with the number of replicas, should be a function of the resource usage and the utility of the data in order to minimize the cost of reliability. Furthermore, the utility of the data is determined by the structure of the dataflow and the reliability of the system. We propose a replication technique that takes into account resource usage, system reliability, and the characteristics of the dataflow to decide what data to replicate and when to replicate it. The replication decision is obtained by solving a constrained integer programming problem given information about the dataflow up to a decision point. In addition, we built CARDIO, a working prototype of our technique, and show through experimental evaluation on a real testbed that it finds an optimal solution. © 2012 IEEE.
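To illustrate the kind of constrained integer program the abstract describes, here is a minimal sketch, not CARDIO's actual formulation: each intermediate dataset gets a binary variable indicating whether it is replicated, the objective minimizes replica storage cost, and a constraint bounds the expected recomputation cost of unreplicated data. All dataset names, sizes, costs, and the budget below are hypothetical, and the PuLP library stands in for whatever solver the paper uses.

```python
# A toy 0/1 integer program for replication decisions (illustrative only;
# all datasets, sizes, costs, and the budget are hypothetical).
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum, value

datasets = ["map_out", "shuffle_out", "reduce_out"]        # hypothetical stages
size_gb = {"map_out": 40, "shuffle_out": 25, "reduce_out": 10}
# Expected cost of recomputing a dataset if it is lost while unreplicated,
# e.g. derived from upstream stage runtimes and node failure probability.
recompute = {"map_out": 120, "shuffle_out": 300, "reduce_out": 90}

prob = LpProblem("replication_decision", LpMinimize)
x = LpVariable.dicts("replicate", datasets, cat=LpBinary)  # 1 = keep a replica

# Objective: storage/network cost of the extra replicas created.
prob += lpSum(size_gb[d] * x[d] for d in datasets)

# Reliability constraint: expected recomputation cost of the datasets
# left unreplicated must stay within a budget (hypothetical value).
prob += lpSum(recompute[d] * (1 - x[d]) for d in datasets) <= 200

prob.solve()
for d in datasets:
    print(d, "replicate" if value(x[d]) == 1 else "recompute on failure")
```

In this toy instance the solver replicates the expensive-to-recompute datasets and leaves the cheap ones to be recomputed on failure, which mirrors the paper's point that the replication decision should weigh resource usage against the data's utility for recovery.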