Systems built from commodity processors dominate high-performance computing today, with systems containing thousands of processors now being deployed. Similarly, large-scale Grids containing hundreds of thousands of sites are being contemplated, developed and deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, we must rethink traditional assumptions about software scaling and manageability and hardware reliability.
Although the mean time before failure (MTBF) for the individual components (i.e., processors, disks, memories, power supplies, fans and networks) is high, the large overall component count means the system itself can still fail frequently. For example, a system containing 10,000 nodes, each with a mean time to failure of 10**6 hours, would have a system mean time to failure of only 100 hours, under the generous assumption of failure independence. Historically, large-scale scientific applications have used application-mediated checkpoint and restart techniques to deal with failures. However, these schemes can be problematic in an environment where the interval between checkpoints is comparable to the MTBF.
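The arithmetic behind the 10,000-node example can be sketched as follows. This is a minimal illustration, not part of the talk: it assumes independent node failures with a constant failure rate, so that per-node failure rates simply add and the system mean time to failure is the node value divided by the node count. The function name system_mttf_hours is hypothetical and chosen only for this sketch.

```python
def system_mttf_hours(node_mttf_hours: float, node_count: int) -> float:
    """Estimate the system mean time to failure, in hours.

    Assumption (not from the source beyond its stated independence):
    nodes fail independently at a constant rate, so the system failure
    rate is node_count / node_mttf_hours and the system MTTF is its inverse.
    """
    if node_mttf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mttf_hours and node_count must be positive")
    return node_mttf_hours / node_count


if __name__ == "__main__":
    # Node MTTF of 10**6 hours, as in the example in the text.
    for nodes in (1_000, 10_000, 100_000):
        hours = system_mttf_hours(10**6, nodes)
        print(f"{nodes:>7} nodes: system MTTF of about {hours:g} hours")
```

Under these assumptions, each tenfold increase in node count cuts the expected time to the first failure by a factor of ten. Correlated failures, which the independence assumption ignores, would likely make the real figure worse.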
In contrast to parallel systems, distributed software for networks, whether transport protocols or web/Grid services, is designed to be resilient to component failures. Our thesis is that these “two worlds” of software – distributed systems and parallel systems – must meet, embodying ideas from each, if we are to build resilient systems. In this talk, after presenting examples that quantify these problems, we describe possible approaches for the design and effective use of large-scale systems. The approaches range from intelligent hardware monitoring and adaptation, through low-overhead recovery schemes, to alternative models of system software, including evolutionary adaptation.
Dan Reed is the Chancellor’s Eminent Professor at the University of North Carolina at Chapel Hill, as well as the Director of the Renaissance Computing Institute (RENCI), a venture supported by three universities – the University of North Carolina at Chapel Hill, Duke University and North Carolina State University – that is exploring the interactions of computing technology with the sciences, arts and humanities. Reed also serves as Vice-Chancellor for Information Technology and Chief Information Officer for the University of North Carolina at Chapel Hill.
Dr. Reed is a member of President George W. Bush’s Information Technology Advisory Committee, charged with providing advice on information technology issues and challenges to the President, and he chairs the subcommittee on computational science. He is a board member for the Computing Research Association, which represents the interests of the major academic departments and industrial research laboratories. He was previously Director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, where he also led the National Computational Science Alliance, a consortium of roughly fifty academic institutions and national laboratories that is developing next-generation software infrastructure for scientific computing. He was also one of the principal investigators and chief architect for the NSF TeraGrid. He received his PhD in computer science in 1983 from Purdue University.