A methodology for soft errors detection and automatic recovery

Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio


Part of: 2017 International Conference on High Performance Computing & Simulation (HPCS)

2017

Document type: Conference object

Abstract

Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes to corrupted results because of undetected errors. In this article, we propose a methodology that improves system reliability against transient faults, when running parallel message-passing applications. The proposed solution, based on process replication, has the goal of helping programmers and users of parallel scientific applications to achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability to tolerate transient faults in HPC systems.

General information

Presentation date: July 2017

Publication date: 2017

Document language: English

Event: 2017 International Conference on High Performance Computing & Simulation (HPCS) (Italy, July 17 to 21, 2017)

Institution of origin: Instituto de Investigación en Informática

ISBN: 978-1-5386-3250-5

Pages: 434-441

Keywords: Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint

Subjects: Computer Science


Created on: December 6, 2021

Available in SEDICI since: December 6, 2021

Please use one of these identifiers (URI) to cite or link to this item:

http://sedici.unlp.edu.ar/handle/10915/129169

https://doi.org/10.1109/HPCS.2017.71

This item appears in the following collection(s)

Instituto de Investigación en Informática (III-LIDI) → Publications

Except where explicitly stated otherwise, this item is published under the following license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
