
dc.date.accessioned 2021-09-08T18:40:10Z
dc.date.available 2021-09-08T18:40:10Z
dc.date.issued 2020
dc.identifier.uri http://sedici.unlp.edu.ar/handle/10915/124463
dc.description.abstract Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments. en
dc.format.extent 240-254 es
dc.language en es
dc.subject Soft error detection es
dc.subject Automatic recovery es
dc.subject System-level checkpoint es
dc.subject User-level checkpoint es
dc.title Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing en
dc.type Articulo es
sedici.identifier.other arXiv:2007.08552 es
sedici.identifier.other doi:10.1016/j.future.2020.07.003 es
sedici.identifier.issn 0167-739X es
sedici.creator.person Montezanti, Diego Miguel es
sedici.creator.person Rucci, Enzo es
sedici.creator.person De Giusti, Armando Eduardo es
sedici.creator.person Naiouf, Marcelo es
sedici.creator.person Rexachs del Rosario, Dolores es
sedici.creator.person Luque Fadón, Emilio es
sedici.subject.materias Ciencias Informáticas es
sedici.description.fulltext true es
mods.originInfo.place Instituto de Investigación en Informática es
sedici.subtype Preprint es
sedici.rights.license Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
sedici.rights.uri http://creativecommons.org/licenses/by-nc-nd/4.0/
sedici.description.peerReview peer-review es
sedici.relation.journalTitle Future Generation Computer Systems es
sedici.relation.journalVolumeAndIssue vol. 113 es


