Paul Graham, Los Alamos National Laboratory, Los Alamos, NM
Maya Gokhale, Los Alamos National Laboratory, Los Alamos, NM
Reconfigurable supercomputers are a possible next-generation architecture for traditional cluster-based supercomputers. In this cluster-based architecture, coprocessor boards will be added to each node for application-specific acceleration. Many coprocessor boards with Field-Programmable Gate Arrays (FPGAs) are well matched to supercomputing codes. Large-scale supercomputing clusters are already plagued by memory upsets caused by neutron-based terrestrial radiation, a situation bound to worsen with the addition of coprocessor boards. These memory upsets can cause silent data corruption and unreproducible system crashes. FPGA-based reconfigurable supercomputers will also be susceptible to circuit changes from memory upsets. Therefore, reliability analysis of these systems and their codes is a necessary step in designing and using these machines. In this abstract, we present an overview of a reliability analysis toolset, called the Scalable Tool for the Analysis of Reliable Systems (STAR Systems), with modules for determining the reliability of FPGA designs (STAR-Circuits) and reconfigurable supercomputers (STAR-Reconfigurable SuperComputers).
Citation:
Heather Quinn, Debayan Bhaduri, Christof Teuscher, Paul Graham, Maya Gokhale, "The STAR-C Truth: Analyzing Reconfigurable Supercomputing Reliability," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), pp. 323-324, 2006.