• DocumentCode
    3091950
  • Title

    HR-NET: A Highly Reliable Message-Passing Mechanism for Cluster File System

  • Author

    Zhou, Jiang ; Ma, Can ; Xiong, Jin ; Meng, Dan

  • Author_Institution
    Nat. Res. Center for Intell. Comput. Syst., Grad. Univ. of Chinese Acad. of Sci., Beijing, China
  • fYear
    2011
  • fDate
    28-30 July 2011
  • Firstpage
    364
  • Lastpage
    371
  • Abstract
    As PC clusters grow in popularity and size, message passing between nodes has become an important issue because of the high failure rate of the network. A file access in a cluster file system often consists of several sub-operations, each of which includes one or more network transmissions. Any network failure will make the file system service unavailable. In this paper, we describe a highly reliable message-passing mechanism (HR-NET) that tolerates both software and hardware network failures. HR-NET provides fine-grained, connection-level failover across redundant communication paths. With it, the file system can keep passing messages until it either recovers from a network failure or fails over to a backup. Messages are also load-balanced to relieve network traffic. To handle transmission timeouts, HR-NET proposes message priority scheduling, which dynamically orders messages so as to tolerate request-response failures between clients and servers. Because HR-NET is completely independent, it requires neither changes to standard protocol stacks nor modifications to the upper-level file system. Performance results show that HR-NET takes full advantage of network bandwidth, with an average throughput loss of 6.17%, and provides fast recovery. Experiments with a cluster file system show that the overall performance degradation due to HR-NET failover is below 8%, while reliability is greatly enhanced.
  • Keywords
    message passing; protocols; resource allocation; scheduling; HR-NET; PC clusters; cluster file system; load balance; message passing mechanism; message priority scheduling; network failures; protocol stacks; transmission timeout; Fault tolerance; Fault tolerant systems; File systems; Hardware; Protocols; Servers; fault tolerance; high reliability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Networking, Architecture and Storage (NAS), 2011 6th IEEE International Conference on
  • Conference_Location
    Dalian, Liaoning
  • Print_ISBN
    978-1-4577-1172-5
  • Electronic_ISBN
    978-0-7695-4509-7
  • Type

    conf

  • DOI
    10.1109/NAS.2011.21
  • Filename
    6005481