Title : 
Application-bypass reduction for large-scale clusters
         
        
            Author : 
Wagner, Adam ; Buntinas, Darius ; Panda, Dhabaleswar K. ; Brightwell, Ron
         
        
            Author_Institution : 
Dept. of Comput. & Inf. Sci., The Ohio State Univ., Columbus, OH, USA
         
        
            Abstract : 
Process skew is an important factor in the performance of parallel applications, especially on large-scale clusters. Reduction is a common collective operation that, by its nature, introduces implicit synchronization among the participating processes and is therefore highly susceptible to performance degradation due to process skew. A collective operation with application-bypass does not require the application to block in order for the operation to make progress, so application-bypass collective operations are highly tolerant of skew. In this paper, we describe the design and implementation of an application-bypass version of the reduction operation in MPICH over GM. We evaluate our implementation on a 16-node cluster. Under conditions of process skew, our application-bypass reduction improves on the default MPICH implementation by a factor of up to 3.3. In addition, this factor of improvement increases with system size, indicating that the application-bypass implementation is more scalable and skew-tolerant than the default non-application-bypass version. This framework shows promise for the design and development of high-performance, scalable collective communication libraries for next-generation large-scale clusters.
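
The effect described in the abstract can be illustrated with a toy timing model. The sketch below is not the paper's MPICH/GM implementation; it assumes a binomial reduction tree rooted at rank 0, a fixed per-message cost, and (as a simplifying assumption) that under application-bypass every non-root rank returns from the call immediately while the root still waits for the final result. All function names and the `msg_cost` parameter are hypothetical.

```python
def children(rank, n):
    """Children of `rank` in a binomial tree rooted at rank 0 over n ranks."""
    kids = []
    mask = 1
    while mask < n:
        if rank & mask:
            break  # at this bit the rank sends to its parent, so it has no more children
        child = rank | mask
        if child < n:
            kids.append(child)
        mask <<= 1
    return kids


def blocking_wait(arrive, msg_cost=1.0):
    """Time each rank spends inside a blocking reduction.

    arrive[r] is the time rank r enters the call (its skew).
    A rank can leave only after all of its children have delivered.
    """
    n = len(arrive)
    done = [0.0] * n

    def finish(rank):
        t = arrive[rank]
        for child in children(rank, n):
            t = max(t, finish(child) + msg_cost)
        done[rank] = t
        return t

    finish(0)
    return [done[r] - arrive[r] for r in range(n)]


def bypass_wait(arrive, msg_cost=1.0):
    """Same model, assuming non-root ranks never block (toy assumption)."""
    blocked = blocking_wait(arrive, msg_cost)
    return [blocked[0]] + [0.0] * (len(arrive) - 1)


if __name__ == "__main__":
    skew = [0.0] * 7 + [5.0]  # rank 7 arrives 5 time units late
    print("blocking:", blocking_wait(skew))
    print("bypass:  ", bypass_wait(skew))
```

In this model a single late leaf (rank 7) stalls every ancestor on its path to the root under the blocking scheme, whereas in the bypass variant only the root waits. The model ignores the cost of combining values and of handling late messages asynchronously, so it shows the direction of the effect rather than the measured factor.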
         
        
            Keywords : 
computer network management; message passing; parallel processing; performance evaluation; workstation clusters; GM; MPICH; application-bypass reduction; collective operation; large-scale clusters; non-application-bypass version; parallel applications; performance degradation; process skew; reduction operation; scalable collective communication libraries; skew-tolerant; synchronization; Application software; Computer networks; Concurrent computing; Degradation; Delay; Information science; Laboratories; Large-scale systems; Libraries; Visualization
         
        
            Conference_Title : 
Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on
         
        
            Print_ISBN : 
0-7695-2066-9
         
        
            DOI : 
10.1109/CLUSTR.2003.1253340