• DocumentCode
    1379313
  • Title

    Chip Self-Organization and Fault Tolerance in Massively Defective Multicore Arrays

  • Author

    Collet, Jacques Henri ; Zajac, Piotr ; Psarakis, Mihalis ; Gizopoulos, Dimitris

  • Author_Institution
    Centre Nat. de la Rech. Sci., LAAS CNRS, Toulouse, France
  • Volume
    8
  • Issue
    2
  • fYear
    2011
  • Firstpage
    207
  • Lastpage
    217
  • Abstract
    We study chip self-organization and fault tolerance at the architectural level to improve dependable continuous operation of multicore arrays in massively defective nanotechnologies. Architectural self-organization results from the conjunction of self-diagnosis and self-disconnection mechanisms (to identify and isolate most permanently faulty or inaccessible cores and routers), plus self-discovery of routes to maintain the communication in the array. In the methodology presented in this work, chip self-diagnosis is performed in three steps, following an ascending order of complexity: interconnects are tested first, then routers through mutual test, and cores in the last step. The mutual testing of routers is especially important as faulty routers are disconnected by good ones with no assumption on the behavior of defective elements. Moreover, the disconnection of faulty routers is not physical (“hard”) but logical (“soft”) in that a good router simply stops communicating with any adjacent router diagnosed as defective. There is no physical reconfiguration in the chip and no need for spare elements. Ultimately, the multicore array may be viewed as a black box, which incorporates protection mechanisms and self-organizes, while the external control reduces to a simple chip validation test which, in the simplest cases, reduces to counting the number of valid and accessible cores.
  • Keywords
    fault tolerant computing; multiprocessing systems; multiprocessor interconnection networks; nanotechnology; network routing; parallel architectures; architectural level; chip self-diagnosis; chip self-organization; fault tolerance; faulty routers; massively defective nanotechnologies; multicore array; mutual testing; self-disconnection mechanisms; multicore architectures; fault diagnosis; multiprocessors
  • fLanguage
    English
  • Journal_Title
    Dependable and Secure Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5971
  • Type

    jour

  • DOI
    10.1109/TDSC.2009.53
  • Filename
    5374421