Title :
HR-NET: A Highly Reliable Message-Passing Mechanism for Cluster File System
Author :
Zhou, Jiang ; Ma, Can ; Xiong, Jin ; Meng, Dan
Author_Institution :
Nat. Res. Center for Intell. Comput. Syst., Grad. Univ. of Chinese Acad. of Sci., Beijing, China
Abstract :
As PC clusters grow in popularity and size, message passing between nodes has become an important issue because of the high failure rate of the network. A file access in a cluster file system often consists of several sub-operations, each of which includes one or more network transmissions. Any network failure can make the file system service unavailable. In this paper, we describe a highly reliable message-passing mechanism (HR-NET) that tolerates both software and hardware network failures. HR-NET provides fine-grained, connection-level failover across redundant communication paths. With it, the file system can keep passing messages until it either recovers from the network failure or fails over to a backup. Message load balancing is also provided to relieve network traffic. To handle transmission timeouts, HR-NET proposes message priority scheduling, which dynamically orders messages to tolerate request-response failures between clients and servers. Because HR-NET is completely independent, it requires no changes to standard protocol stacks and no modifications to the upper-level file system. Performance results show that HR-NET takes full advantage of network bandwidth, with an average throughput loss of 6.17%, and provides fast recovery. Experiments with a cluster file system show that the overall performance degradation due to HR-NET failover is below 8%, while reliability is greatly enhanced.
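The two ideas the abstract names, failover across redundant paths and priority-ordered message sending, can be illustrated with a minimal sketch. This is not HR-NET's implementation: the `Path`, `PathDown`, and `FailoverSender` names, the numeric priority convention (lower value is sent first), and the in-memory "paths" are all assumptions made for illustration, and load balancing is left out.

```python
import heapq
from itertools import count


class PathDown(Exception):
    """Raised by a (hypothetical) path when a send fails."""


class Path:
    """Stand-in for one network path; records messages instead of sending."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.sent = []

    def send(self, msg):
        if not self.healthy:
            raise PathDown(self.name)
        self.sent.append(msg)


class FailoverSender:
    """Sends queued messages in priority order, trying paths in turn."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._queue = []
        self._seq = count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority, msg):
        # Assumption: a lower number means a more urgent message.
        heapq.heappush(self._queue, (priority, next(self._seq), msg))

    def flush(self):
        delivered = []
        while self._queue:
            priority, seq, msg = heapq.heappop(self._queue)
            for path in self.paths:
                try:
                    path.send(msg)
                except PathDown:
                    continue  # fail over to the next redundant path
                delivered.append((msg, path.name))
                break
            else:
                # No path accepted the message: requeue it and report.
                heapq.heappush(self._queue, (priority, seq, msg))
                raise RuntimeError("all paths failed")
        return delivered


primary = Path("eth0", healthy=False)  # simulated failed link
backup = Path("eth1")
sender = FailoverSender([primary, backup])
sender.submit(5, "data-write")
sender.submit(1, "heartbeat")
result = sender.flush()
```

In this sketch the urgent message leaves first, and both messages travel over the backup path because the primary is marked as failed.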
Keywords :
message passing; protocols; resource allocation; scheduling; HR-NET; PC clusters; cluster file system; load balance; message passing mechanism; message priority scheduling; network failures; protocol stacks; transmission timeout; Fault tolerance; Fault tolerant systems; File systems; Hardware; Servers; high reliability
Conference_Title :
Networking, Architecture and Storage (NAS), 2011 6th IEEE International Conference on
Conference_Location :
Dalian, Liaoning
Print_ISBN :
978-1-4577-1172-5
Electronic_ISBN :
978-0-7695-4509-7
DOI :
10.1109/NAS.2011.21