Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

Montezanti, Diego Miguel; Rucci, Enzo; De Giusti, Armando Eduardo; Naiouf, Marcelo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio

2020

Document type: Preprint

Abstract

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
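As a rough illustration of the ideas in the abstract, the following minimal Python sketch simulates detection by replication and recovery by rollback. It is not SEDAR itself: the function `run_replicated`, the `SafeStop` exception, the per-step comparison point, the integer state, and the fault-injection parameter `fault_at` are all hypothetical names and simplifications introduced here. With no checkpointing it loosely mirrors the detection-only level (safe stop with notification); with `ckpt_every` set it rolls back to the most recent stored checkpoint. It does not model message passing, the difference between system-level and user-level checkpoints, or rollback across several checkpoints.

```python
from copy import deepcopy


class SafeStop(Exception):
    """Raised when the replicas disagree and no recovery is configured."""


def run_replicated(step, state, n_steps, ckpt_every=None, fault_at=None):
    """Run `step` on two replicas and compare their states after every step.

    step(state, i) -> new state (integer state, for simplicity).
    ckpt_every=None  : detection only; a mismatch raises SafeStop.
    ckpt_every=k     : store a checkpoint every k steps and roll back on mismatch.
    fault_at         : hypothetical fault injection; corrupts replica b once.
    Returns (final_state, number_of_rollbacks).
    """
    a, b = deepcopy(state), deepcopy(state)
    checkpoints = [(0, deepcopy(state))]  # (step index, saved state), oldest first
    rollbacks = 0
    fault_pending = fault_at is not None
    i = 0
    while i < n_steps:
        a = step(a, i)
        b = step(b, i)
        if fault_pending and i == fault_at:
            b = b + 1              # simulated transient corruption, fires once
            fault_pending = False
        if a != b:                 # detection: the replicas disagree
            if ckpt_every is None:
                raise SafeStop(f"mismatch detected at step {i}")
            i, saved = checkpoints[-1]   # roll back to the latest checkpoint
            a, b = deepcopy(saved), deepcopy(saved)
            rollbacks += 1
            continue
        i += 1
        if ckpt_every is not None and i % ckpt_every == 0:
            checkpoints.append((i, deepcopy(a)))  # states agree, so save one copy
    return a, rollbacks
```

For example, `run_replicated(lambda s, i: s + i, 0, 10, ckpt_every=3, fault_at=7)` returns `(45, 1)`: the injected fault is detected at step 7, execution resumes from the checkpoint taken after step 5, and the final result matches the fault-free run. The same call without `ckpt_every` raises `SafeStop` instead of recovering.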

General information

Publication date: 2020

Document language: English

Journal: Future Generation Computer Systems; vol. 113

Originating institution: Instituto de Investigación en Informática

Other identifiers: arXiv:2007.08552

ISSN: 0167-739X

Pages: 240-254

Keywords: Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint

Subjects: Computer Science

Created on: September 8, 2021

Available in SEDICI since: September 8, 2021

Please use one of these identifiers (URIs) to cite or link to this item:

http://sedici.unlp.edu.ar/handle/10915/124463

https://doi.org/10.1016/j.future.2020.07.003

This item appears in the following collection(s):

Instituto de Investigación en Informática (III-LIDI) → Publications

Except where explicitly stated otherwise, this item is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
