Title : 
LO-FA-MO: Fault Detection and Systemic Awareness for the QUonG Computing System
         
        
            Author : 
Ammendola, Roberto ; Biagioni, Andrea ; Frezza, Ottorino ; Lo Cicero, Francesca ; Lonardo, Alessandro ; Paolucci, Pier Stanislao ; Rossetti, Davide ; Simula, Francesco ; Tosoratto, Laura ; Vicini, Piero
         
        
            Author_Institution : 
Roma “Tor Vergata”, INFN, Rome, Italy
         
        
        
        
        
        
            Abstract : 
QUonG is a parallel computing platform developed at INFN and equipped with commodity multi-core CPUs coupled with latest-generation NVIDIA GPUs. Computing nodes communicate through a point-to-point, high-performance, low-latency 3D torus network implemented by the APEnet+ FPGA-based interconnect. Scaling this cluster towards peta- and possibly exascale is a prominent line of investigation, and in this context fault tolerance issues are structural. Typical fault tolerance solutions for HPC systems (e.g. checkpoint/restart) must be triggered before they can be applied in an automated and transparent way, or at least knowledge of occurring faults must be propagated to prompt a readjustment; either way, an effective tool to detect faults and make the system aware of them is required. Thus, as a first step towards a fault-tolerant QUonG, we designed the Local Fault Monitor (LO|FA|MO), an HW/SW solution aimed at providing systemic fault awareness. LO|FA|MO detects node faults through a mutual watchdog mechanism between the host and the APEnet+ NIC. Moreover, diagnostic messages can be delivered to neighbour nodes through both the 3D network and a secondary connection for service communication. The double path ensures that no fault remains unknown at the global level, guaranteeing systemic fault awareness with no single point of failure. In this paper we describe our LO|FA|MO implementation and report preliminary measurements that show its scalability and its near-zero impact on system performance.
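
The mutual watchdog idea mentioned in the abstract can be illustrated with a minimal sketch. It is only a toy model under stated assumptions, not the LO|FA|MO implementation: the `Peer` class, the logical-tick clock, the `timeout` parameter, and the rule that the NIC reports over the torus while the host reports over the service connection are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Peer:
    """One side of the mutual watchdog (hypothetical model, not the APEnet+ register layout)."""
    name: str
    alive: bool = True
    last_beat: int = 0  # logical time of the last heartbeat this peer wrote

    def beat(self, now: int) -> None:
        # A healthy peer refreshes its heartbeat; a faulty one stays silent.
        if self.alive:
            self.last_beat = now


def peer_is_faulty(peer: Peer, now: int, timeout: int) -> bool:
    """A watcher declares `peer` faulty when its heartbeat is older than `timeout` ticks."""
    return now - peer.last_beat > timeout


def simulate(ticks: int, timeout: int,
             host_dies_at: Optional[int] = None) -> List[Tuple[int, str, str]]:
    """Run the toy watchdog for `ticks` steps and return (time, path, message) diagnostics."""
    host, nic = Peer("host"), Peer("nic")
    reports: List[Tuple[int, str, str]] = []
    reported = set()
    for now in range(1, ticks + 1):
        if host_dies_at is not None and now >= host_dies_at:
            host.alive = False  # injected fault: the host stops beating
        host.beat(now)
        nic.beat(now)
        # Each side watches the other. Assumption: the NIC reports over the
        # torus network, the host reports over the service connection.
        for watcher, watched, path in ((nic, host, "torus"), (host, nic, "service")):
            if (watcher.alive and watched.name not in reported
                    and peer_is_faulty(watched, now, timeout)):
                reports.append((now, path, watched.name + " fault"))
                reported.add(watched.name)
    return reports
```

Calling `simulate(10, 2, host_dies_at=4)` models a host that stops at tick 4; the NIC side notices once the host heartbeat is more than two ticks old and emits a single diagnostic on the path it still controls. With no injected fault, no diagnostics are produced.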
         
        
            Keywords : 
fault tolerant computing; multiprocessing systems; parallel processing; system monitoring; APEnet+ FPGA-based interconnect; APEnet+ NIC; HPC systems; HW/SW solution; INFN; LO|FA|MO; NVIDIA GPU; QUonG computing system; commodity multicore CPU; computing nodes; diagnostic messages; fault tolerance; fault tolerant QUonG; high performance 3D torus network; local fault monitor; low latency 3D torus network; mutual watchdog mechanism; node fault detection; parallel computing platform; point-to-point 3D torus network; service communication; system performance; systemic awareness; systemic fault awareness; Monitoring; Peer-to-peer computing; Registers; Temperature measurement; Temperature sensors; Three-dimensional displays; Fault detection; fault tolerant systems; high performance computing; networks;
         
        
        
        
            Conference_Title : 
Reliable Distributed Systems (SRDS), 2014 IEEE 33rd International Symposium on
         
        
            Conference_Location : 
Nara
         
        
        
            DOI : 
10.1109/SRDS.2014.33