Merging high-performance computing and Cloud

The “as-a-service” paradigm

Overview

The landscape of modern high-performance computing (HPC) is rapidly changing. While performance and scalability remain the primary goals of HPC, other factors are becoming increasingly prominent, including

  • Performance under stringent power constraints for exascale computing,
  • Introduction of big data analytics and AI technologies into HPC workflows,
  • Requirements for better usability, portability and reproducibility of workflows,
  • Merging of native HPC and Cloud approaches in simulations,
  • Proliferation of heterogeneous and data-centric architectures.

All of the above necessitate the development of new approaches to computing that can improve the user experience while leveraging cutting-edge technological developments.

Bioinformatics as a service

Genomics and related technologies, collectively known as “omics”, have transformed life sciences research. These technologies produce mountains of data that need to be managed and analysed. Rapid developments in next-generation sequencing technologies have helped genomics become mainstream, but the compute support systems meant to enable genomics have lagged behind. As genomics makes inroads into personalised healthcare and clinical settings, it is paramount that a robust compute infrastructure be designed to meet the growing needs of the field. Infrastructure design for omics datasets is an active and critical area of research that has an important role to play in the adoption of omics in industrial healthcare and clinical settings.

We propose an as-a-service compute infrastructure for fast and scalable processing of omics datasets. Our solution addresses the three fundamental principles of scalability, portability and reproducibility, which are essential for the acceptance of results in a clinical setting. It is based on the integration of high-performance computing (HPC) with a data-centric architecture. We exploit the power of the Common Workflow Language (CWL) to develop omics pipelines that can be containerised and deployed on large HPC systems. Utilisation of HPC resources, which provide cutting-edge compute, storage and networking as well as highly optimised software stacks, brings substantial performance and scalability benefits to genomics pipelines. Virtualisation technologies such as Docker containers extend HPC capabilities even further by facilitating use of the hybrid cloud model.
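
As a minimal sketch of how such a pipeline can be driven, the example below launches a containerised CWL workflow from Python via the cwltool reference runner, delegating container execution to Singularity as is common on shared HPC systems. The workflow and input file names (variant_calling.cwl, sample_job.yml) are illustrative placeholders, not part of our actual pipelines.

```python
# Minimal sketch: driving a containerised CWL pipeline from Python.
# Assumes the cwltool reference runner is installed; the workflow and job
# file names below are illustrative placeholders, not our actual pipelines.
import subprocess
from pathlib import Path

def run_pipeline(workflow: str, job: str, outdir: str = "results") -> int:
    """Execute a CWL workflow, delegating container execution to Singularity,
    which is commonly preferred over Docker on shared HPC systems."""
    Path(outdir).mkdir(exist_ok=True)
    cmd = [
        "cwltool",
        "--singularity",     # pull Docker-format images but run them via Singularity
        "--outdir", outdir,  # collect workflow outputs in one directory
        workflow,
        job,
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    run_pipeline("variant_calling.cwl", "sample_job.yml")
```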

Ask the experts

Carlos Costa
IBM Research, T.J. Watson

Uncertainty quantification as a service

Uncertainty quantification (UQ) is a fast-growing area of modern computational science. It can be defined as the end-to-end study of the reliability of scientific inferences. UQ methods are computationally intensive and require the construction of complex workflows that rely on a number of different software components, often coming from different projects. There is a need for a portable and scalable UQ pipeline that enables efficient stochastic modelling in various domains.

Our solution, UQ as a Service (UQaaS), is a portable and scalable framework deployable on a hybrid high-performance cloud infrastructure.
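
UQaaS itself is described in the reference below; as a toy illustration of the kind of computation such a pipeline orchestrates, the following sketch propagates uncertain inputs through a stand-in model by Monte Carlo sampling and summarises the output spread. The model function and the input distributions are assumptions made purely for illustration.

```python
# Toy Monte Carlo uncertainty propagation: push samples of uncertain inputs
# through a stand-in model and summarise the spread of the outputs.
import random
import statistics

def model(diffusivity: float, source: float) -> float:
    # Placeholder for an expensive simulation (in practice an HPC job).
    return source / (1.0 + diffusivity)

def propagate(n_samples: int = 10_000) -> tuple[float, float]:
    outputs = []
    for _ in range(n_samples):
        # Uncertain inputs drawn from assumed distributions.
        d = random.gauss(mu=1.0, sigma=0.1)
        s = random.uniform(0.9, 1.1)
        outputs.append(model(d, s))
    return statistics.mean(outputs), statistics.stdev(outputs)

if __name__ == "__main__":
    mean, std = propagate()
    print(f"output mean = {mean:.4f}, output std = {std:.4f}")
```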

Reference
"Uncertainty Quantification as a Service," in Proc. CASCON ’18: 28th Int’l Conf. on Computer Science and Software Engineering, 2018.

Democratizing high‑performance computing

As high-performance computing becomes more prevalent in scientific research, the complexity of describing a scientific problem as a set of defined computational tasks is increasing. These tasks are growing in number, each with particular data dependencies, and additional steps rely on the analysed results of earlier tasks. Workflow managers help ease the burden on researchers by orchestrating these computational “experiments”. They automate how the different stages of a workflow interact: controlling the flow of data between stages, managing their dependencies and organising which jobs are launched and in what order.
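
The core of that orchestration is running tasks in an order that respects their data dependencies. The minimal sketch below does exactly that for a hypothetical four-step pipeline, using Python’s standard topological sort; a real workflow manager would additionally submit jobs to a scheduler and stage data between stages.

```python
# Minimal sketch of what a workflow manager automates: running tasks in an
# order that respects their data dependencies.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def align(): print("align reads")
def call_variants(): print("call variants")
def annotate(): print("annotate variants")
def report(): print("build report")

# Hypothetical four-step pipeline: task -> set of tasks it depends on.
dependencies = {
    "call_variants": {"align"},
    "annotate": {"call_variants"},
    "report": {"annotate", "call_variants"},
}
tasks = {"align": align, "call_variants": call_variants,
         "annotate": annotate, "report": report}

for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()  # a real manager would also submit jobs and stage data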

We are interested in removing the technical barriers to high-performance computing via the “as-a-service” paradigm, something that inherently relies on effective workflow management. Our focus is twofold:

  • Bringing smart workflow management to cloud-based high-performance computing,
  • Improving the usability of workflow managers by making them more intuitive for the end user, without the need to learn a whole new language to construct workflows (a toy illustration follows this list).
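
As a toy illustration of the second point, a workflow can be declared with ordinary Python decorators rather than a bespoke workflow syntax. The Workflow class and task decorator below are hypothetical and do not correspond to an existing tool or API.

```python
# Illustration of the "no new language" idea: a workflow declared with
# ordinary Python decorators instead of a bespoke workflow syntax.
# The Workflow class and task decorator are hypothetical, not an existing API.
from typing import Callable

class Workflow:
    def __init__(self) -> None:
        self.steps: list[Callable[[], None]] = []

    def task(self, fn: Callable[[], None]) -> Callable[[], None]:
        self.steps.append(fn)  # register steps in declaration order
        return fn

    def run(self) -> None:
        for step in self.steps:
            step()  # a real manager would dispatch steps to HPC or cloud resources

wf = Workflow()

@wf.task
def preprocess() -> None:
    print("preprocess data")

@wf.task
def simulate() -> None:
    print("run simulation")

wf.run()
```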

GPU acceleration and application self-adaptation

Top-tier supercomputers around the world are stretching the borders of scientific discovery and modelling in many domains. This is only possible due to unprecedented compute capabilities and the tight integration of accelerators such as GPUs. Whereas accelerators deliver the compute scaling that Moore’s law can no longer provide, they pose a significant challenge to programmers, who must master the complexities of a heterogeneous system.

Our research aims at lifting this burden from programmers and scientists who need to interact with these advanced systems. We do so by

  • Designing novel programming models to improve a programmer’s productivity,
  • Developing tools that can automatically infer the best code-generation techniques and runtime settings for a given problem and target,
  • Creating new environments that will enable applications to mutate themselves and become more efficient while they are being executed (see the sketch after this list).
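
As a simple illustration of the third point, the sketch below times candidate implementations of the same kernel on the actual input at run time and keeps the fastest; in a real system the variants would be CPU and GPU code paths or different code-generation and runtime settings. The variant functions here are trivial stand-ins.

```python
# Minimal sketch of runtime self-adaptation: time candidate implementations
# of the same kernel on the actual input and keep using the fastest one.
import time
from typing import Callable, Sequence

def variant_loop(data: Sequence[float]) -> float:
    total = 0.0
    for x in data:
        total += x * x
    return total

def variant_builtin(data: Sequence[float]) -> float:
    return sum(x * x for x in data)

def autotune(variants: Sequence[Callable], data: Sequence[float],
             trials: int = 3) -> Callable:
    """Return the variant with the lowest measured runtime on this input."""
    def cost(fn: Callable) -> float:
        start = time.perf_counter()
        for _ in range(trials):
            fn(data)
        return time.perf_counter() - start
    return min(variants, key=cost)

data = [float(i) for i in range(100_000)]
best = autotune([variant_loop, variant_builtin], data)
print("selected variant:", best.__name__)
```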

Ask the experts

Michael Johnston

Intelligent proactive monitoring in HPC systems

Upcoming high-performance computing (HPC) systems are on the critical path towards delivering the highest levels of performance for large-scale applications. As supercomputers become larger in the drive to achieve the next levels of performance, energy efficiency has emerged as one of the foremost design goals. Relying upon contemporary technologies alone is simply not enough, because an exascale-class system built with them would demand hundreds of megawatts of power.

New approaches to energy optimisation are being explored that optimise the entire HPC stack, from firmware and hardware through to the OS, application runtimes and workload managers. The challenge of optimising for energy efficiency requires an orchestrated approach across the different components of the infrastructure.

We are developing an intelligent proactive monitoring framework for HPC systems, which will allow us to

  • Collect a broad range of energy, power and performance metrics from different parts of a data centre infrastructure, using IoT devices as well as server/switch sensors and performance counters,
  • Gather information about workloads from resource managers, schedulers, profilers and operating systems at the job/task level and the cluster/cloud level,
  • Use AI to analyse the data and build models to optimise the energy efficiency of HPC systems (a minimal sketch of the collection step follows this list).
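
A minimal sketch of the collection step is shown below: it periodically samples per-node power readings and appends time-stamped records that a model can later be trained on. The read_power_watts function is a placeholder for a real BMC/IPMI, RAPL or IoT meter query; the node names and sampling parameters are illustrative.

```python
# Minimal sketch of the collection step: periodically sample per-node power
# and append time-stamped records that a model can later be trained on.
# read_power_watts() is a placeholder for a real BMC/IPMI, RAPL or IoT query.
import csv
import random
import time

def read_power_watts(node: str) -> float:
    # Placeholder reading; a real collector would query the node's sensors.
    return 300.0 + random.uniform(-20.0, 20.0)

def sample(nodes, interval_s: float = 1.0, n_samples: int = 5,
           out_path: str = "power_log.csv") -> None:
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "node", "power_w"])
        for _ in range(n_samples):
            now = time.time()
            for node in nodes:
                writer.writerow([now, node, read_power_watts(node)])
            time.sleep(interval_s)

sample(["node001", "node002"])
```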

References

[1] Energy Matters: Evolving Holistic Approaches to Energy and Power Management in HPC.
[2] The HPC PowerStack.

Energy-aware scheduling

Optimizing performance under power constraints

Upcoming high-performance computing (HPC) systems are on the critical path towards delivering the highest levels of performance for large-scale applications. If contemporary technology were used to build ever more powerful HPC systems, their power demand would be unsustainable, running to hundreds of megawatts. Thus, current HPC systems must be built with energy efficiency as the first and foremost design goal. To achieve a sustainable power draw, future HPC systems will have to reach a power efficiency of around 50 GFlops/Watt; at that level, an exaflop system would draw roughly 20 MW. Such power efficiency levels require novel software/hardware co-design, with software guiding static and dynamic power management.

Our goal is to use energy-aware scheduling (EAS) techniques to develop methods for controlling and reducing power consumption and for managing energy budgets and costs, whilst maintaining high application performance and high utilisation of data centre resources.

Our approach is based on creating models for power consumption and performance prediction, and then using these models to implement EAS policies in schedulers such as IBM Spectrum LSF and Kubernetes and in frameworks such as the Global Extensible Open Power Manager (GEOPM).
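
As a toy illustration of such a policy, the sketch below uses a simple affine power model to pick the highest CPU frequency whose predicted node power stays within a job's power budget. The frequency list and model coefficients are illustrative assumptions, not measured values or part of any production scheduler.

```python
# Toy energy-aware scheduling decision: pick the highest CPU frequency whose
# predicted node power stays within the job's power budget.
# The frequency list and model coefficients are illustrative, not measured.

FREQS_GHZ = [1.2, 1.6, 2.0, 2.4, 2.8]

def predicted_power_w(freq_ghz: float) -> float:
    # Simple affine power model, assumed to be fitted offline from monitoring data.
    return 80.0 + 55.0 * freq_ghz

def predicted_performance(freq_ghz: float) -> float:
    # Assume performance scales roughly with frequency for a compute-bound job.
    return freq_ghz

def choose_frequency(power_budget_w: float) -> float:
    feasible = [f for f in FREQS_GHZ if predicted_power_w(f) <= power_budget_w]
    if not feasible:
        return min(FREQS_GHZ)  # fall back to the lowest frequency
    return max(feasible, key=predicted_performance)

print("selected frequency:", choose_frequency(power_budget_w=220.0), "GHz")
```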

Our focus is on the power and performance of complex workflows, developed at the Hartree Centre, from various domains of computational science.

Publications

[1] J. Eastep, et al.,
“Global Extensible Open Power Manager: A Vehicle for HPC Community Collaboration on Co-Designed Energy Management Solutions”
in Proc. International Supercomputing Conference “ISC 2017,” J. Kunkel, R. Yokota, P. Balaji, D. Keyes (Eds) High Performance Computing, LNCS 10266, Springer, pp. 394-412, 2017.

[2] V. Elisseev, et al.,
"Energy Aware Scheduling Study on BlueWonder,"
in Proc. IEEE 4th International Workshop on Energy Efficient Supercomputing “E2SC@SC 2016,” pp. 61-68, 2016.

[3] A. Auweter, et al.,
"A Case Study of Energy Aware Scheduling on SuperMUC,"
in Proc. 29th International Supercomputing Conference “ISC 2014,” J.M. Kunkel, T. Ludwig, H.W. Meuer (Eds) Supercomputing, LNCS 8488, Springer, pp. 394-409, 2014.