• DocumentCode
    495484
  • Title
    Ordered Minimal Perfect Hash of the Human Genome and Implications for Duplicate Finding
  • Author
    Zobrist, Albert Lindsey
  • Volume
    4
  • fYear
    2009
  • fDate
    March 31 2009-April 2 2009
  • Firstpage
    106
  • Lastpage
    111
  • Abstract
    Hashing long strings is difficult, especially when the alphabet is small. Chess and Go game board hashing has almost always been accomplished by using (letter, position) pairs to index into a table of random numbers which are exclusive-or'ed to create the hash value. The table of random numbers can be a huge source of different hash functions by varying any bit of any random number. Algorithms are developed here that can find hashes that are perfect, minimal, and even ordered for very large cases. The human genome is a great source of small alphabet strings that are long, so it is used as a test case here. An algorithm is presented that can solve for an ordered minimal perfect hash for the genome. It can also solve for the lesser cases of minimal perfect and perfect hash at higher speed. A statistical criterion is derived for obtaining the ordered minimal perfect hash with high probability. The algorithm and the statistical criterion lead to a duplicate finding algorithm that might prove to be fastest for important cases.
  • Keywords
    file organisation; random number generation; statistical analysis; Go game board; chess game board; duplicate finding algorithm; hash functions; human genome; ordered minimal perfect hash; probability; random numbers; small alphabet strings; statistical criterion; Bioinformatics; Books; Computer science; Event detection; Genomics; Humans; Probability; Programming profession; Testing; Wikipedia
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Information Engineering, 2009 WRI World Congress on
  • Conference_Location
    Los Angeles, CA
  • Print_ISBN
    978-0-7695-3507-4
  • Type
    conf
  • DOI
    10.1109/CSIE.2009.1070
  • Filename
    5170970
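
The table-driven hashing that the abstract describes can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the names `make_table`, `zobrist_hash`, `flip_bit`, and `is_perfect`, the 32-bit width, and the seeded generator are all assumptions made for illustration. It only shows how (letter, position) pairs index a table of random numbers that are XORed together, and how flipping a single table bit yields a different hash function.

```python
import random

ALPHABET = "ACGT"  # small alphabet, as in genome strings


def make_table(length, bits=32, seed=0):
    """Build one random number per (position, letter) pair (hypothetical helper)."""
    rng = random.Random(seed)
    return [{c: rng.getrandbits(bits) for c in ALPHABET} for _ in range(length)]


def zobrist_hash(s, table):
    """XOR together the table entries selected by each (letter, position) pair."""
    h = 0
    for pos, letter in enumerate(s):
        h ^= table[pos][letter]
    return h


def flip_bit(table, pos, letter, bit):
    """Vary one bit of one random number, giving a different hash function."""
    table[pos][letter] ^= 1 << bit


def is_perfect(strings, table):
    """True if no two distinct strings in the set share a hash value."""
    distinct = set(strings)
    return len({zobrist_hash(s, table) for s in distinct}) == len(distinct)


if __name__ == "__main__":
    table = make_table(8, seed=1)
    kmers = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
    for k in kmers:
        print(k, hex(zobrist_hash(k, table)))
    print("perfect on this set:", is_perfect(kmers, table))
```

Because the hash is a plain XOR, flipping one bit of the entry for a given (position, letter) pair changes the hash of exactly those strings that contain that letter at that position, and leaves all others untouched. That locality is what makes a bit-by-bit search over the table plausible; how the paper actually conducts that search is not stated in this record.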