Reducing resource waste in HPC through co-allocation, custom checkpoints, and lower false failure prediction rates

dc.contributor.author: Frank, Alvaro
dc.date.accessioned: 2022-09-22T06:15:52Z
dc.date.available: 2022-09-22T06:15:52Z
dc.date.issued: 2022
dc.description.abstract: High Performance Computing centers are deploying larger systems to meet the needs of modern scientific and big data applications and to accommodate the growing number of users on those systems. This thesis explores three methods to reduce wasted computational resources on modern HPC systems with thousands of components. The approaches explored here increase job throughput in HPC systems using co-allocation, reduce unnecessary checkpoints that are triggered after failure predictions, and improve checkpoint intervals for common jobs with a medium probability of failure. To accomplish these goals, the work first presents a new node sharing strategy for batch systems and shows how it can increase scheduling throughput compared to standard node allocation methods. Second, the thesis proposes a new optimal checkpoint interval for jobs with short to medium runtimes that can reduce the expected overhead from checkpointing. Finally, it introduces a node failure prediction method tailored to large HPC systems that reduces false positive rates. The thesis therefore offers new insights into the inefficiencies that follow from job failures and resource under-utilization as HPC systems grow in size, and proposes three techniques that help alleviate them. [en_GB]
dc.identifier.doi: http://doi.org/10.25358/openscience-7667
dc.identifier.uri: https://openscience.ub.uni-mainz.de/handle/20.500.12030/7682
dc.identifier.urn: urn:nbn:de:hebis:77-openscience-5c313850-3828-420a-9f38-f3a45867e2b17
dc.language.iso: eng [de]
dc.rights: CC-BY-SA-4.0 [*]
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/ [*]
dc.subject.ddc: 004 Informatik [de_DE]
dc.subject.ddc: 004 Data processing [en_GB]
dc.title: Reducing resource waste in HPC through co-allocation, custom checkpoints, and lower false failure prediction rates [en_GB]
dc.type: Dissertation [de]
jgu.date.accepted: 2022-09-12
jgu.description.extent: X, 136 pages (illustrations, diagrams) [de]
jgu.organisation.department: FB 08 Physik, Mathematik u. Informatik [de]
jgu.organisation.name: Johannes Gutenberg-Universität Mainz
jgu.organisation.number: 7940
jgu.organisation.place: Mainz
jgu.organisation.ror: https://ror.org/023b0x485
jgu.rights.accessrights: openAccess
jgu.subject.ddccode: 004 [de]
jgu.type.dinitype: PhDThesis [en_GB]
jgu.type.resource: Text [de]
jgu.type.version: Original work [de]

Files

Original bundle

Name: reducing_resource_waste_in_hp-20220913134738605.pdf
Size: 5.75 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.57 KB
Format: Item-specific license agreed upon to submission