Enhancing application checkpointing and migration in HPC

Dissertation, Open Access

Abstract

The number of nodes, and of cores per node, is predicted to increase rapidly in the upcoming era of exascale supercomputers. Such a growing number of hardware components decreases the Mean Time Between Failures (MTBF). Furthermore, to exploit exascale systems efficiently and fully utilize the resources of a node, multiple applications could share one node and compete for its resources, for instance computing cores and accelerators. However, applications that compete for the same resources cause resource overload. Application checkpointing and migration are promising solutions for improving fault tolerance and balancing workloads between computing nodes while avoiding resource overload. In this thesis, we address part of the challenges related to performing application checkpointing and migration in HPC. We consider the problem of checkpointing for load balancing between different resources on a heterogeneous node. This problem is associated with context switching between the host and the accelerator memory spaces. We present a tool collection, ConSerner (Context Serializer), that automatically identifies, gathers, and serializes the context of a kernel and migrates it to an accelerator's memory, where an accelerator kernel is executed with this data. We also consider the problem of reducing checkpoint size and migration time in a virtualized HPC environment. We observe that not all data objects within a virtual machine (VM) image are required for a successful checkpoint or migration from a source to a destination node. Discarding the data objects that are not required from the VM image before migration or checkpointing therefore significantly decreases the migration time and the checkpoint storage size. In this thesis, we propose a novel approach for accelerating VM migration and reducing VM checkpoint storage size.
We take advantage of the fact that the hypervisor does not recognize freed memory regions within the guest system. We therefore fill these regions with zeros so that zero-page detection and compression work more efficiently. We demonstrate that our approach can speed up migration by up to 10% when applied alone and by up to 60% when combined with compression. We also show that it reduces the checkpoint size of our tested applications by up to 9% without compression and by up to 94% with compression. Furthermore, for checkpointing, we consider the problems of scalability and checkpoint size reduction. We study a wide range of HPC applications and, for each application, show the deduplication potential of its checkpoints under different deduplication configurations. Since not all applications provide built-in checkpointing, we use DMTCP for system-level checkpointing. With this type of checkpointing, we show that there is high potential for reducing the data written to disk and for increasing checkpointing performance and scalability.
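
The zero-filling idea in the abstract can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the implementation from the thesis: guest memory is modeled as a list of 4096-byte pages, freed pages still hold stale (random) data, and zlib stands in for whatever compression the migration path uses. The names make_image, zero_fill, count_zero_pages and compressed_size are hypothetical helpers introduced for illustration.

```python
import random
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def make_image(num_pages, seed=0):
    """Build a toy memory image in which every page holds random bytes.

    Freed pages are modeled as still containing stale data, so from the
    outside they look exactly like pages that are in use.
    """
    rng = random.Random(seed)
    return [
        bytes(rng.getrandbits(8) for _ in range(PAGE_SIZE))
        for _ in range(num_pages)
    ]


def zero_fill(pages, freed):
    """Return a copy of the image with the freed page indices zeroed."""
    zero_page = bytes(PAGE_SIZE)
    return [zero_page if i in freed else page for i, page in enumerate(pages)]


def count_zero_pages(pages):
    """Count pages that a simple zero-page detector would skip."""
    zero_page = bytes(PAGE_SIZE)
    return sum(1 for page in pages if page == zero_page)


def compressed_size(pages):
    """Size in bytes of the whole image after zlib compression."""
    return len(zlib.compress(b"".join(pages)))


if __name__ == "__main__":
    image = make_image(16)
    freed = {1, 3, 5, 7}  # hypothetical set of pages the guest has freed
    cleaned = zero_fill(image, freed)

    print("zero pages before:", count_zero_pages(image))
    print("zero pages after: ", count_zero_pages(cleaned))
    print("compressed before:", compressed_size(image))
    print("compressed after: ", compressed_size(cleaned))
```

In this toy model, stale data in freed pages is indistinguishable from live data, so neither zero-page detection nor compression can remove it; once the freed pages are zeroed, both mechanisms can. The percentages reported in the abstract come from the thesis measurements and are not reproduced by this sketch.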
