Fault-Tolerant MPI in High Performance Computing: Semantics and Application Scenarios

Dr. Graham E. Fagg
Dr. Edgar Gabriel
Dr. Jack Dongarra
University of Tennessee, Knoxville, TN, USA.

4:30PM-5:30PM New York Time
Tuesday, May 4, 2004.

As the number of processors in today's machines grows, so does the probability of node or link failures. Application-level fault tolerance is therefore becoming an increasingly important issue for both end users and the institutions running the machines. This talk presents the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which allows applications to recover from a node or link error and continue execution in a well-defined way. It then describes the architecture of FT-MPI, an implementation of MPI based on these semantics, along with benchmark results for various applications. Finally, it details an example of a fault-tolerant parallel equation solver, including performance results and the time needed to recover from a process failure.

Project homepage: http://icl.cs.utk.edu/ftmpi/

Slides