Title :
Ordered Minimal Perfect Hash of the Human Genome and Implications for Duplicate Finding
Author :
Zobrist, Albert Lindsey
Date :
March 31 2009-April 2 2009
Abstract :
Hashing long strings is difficult, especially when the alphabet is small. Chess and Go game board hashing has almost always been accomplished by using (letter, position) pairs to index into a table of random numbers, which are exclusive-ORed together to create the hash value. The table of random numbers can be a huge source of different hash functions by varying any bit of any random number. Algorithms are developed here that can find hashes that are perfect, minimal, and even ordered for very large cases. The human genome is a great source of small-alphabet strings that are long, so it is used as a test case here. An algorithm is presented that can solve for an ordered minimal perfect hash for the genome. It can also solve for the lesser cases of minimal perfect and perfect hash at higher speed. A statistical criterion is derived for obtaining the ordered minimal perfect hash with high probability. The algorithm and the statistical criterion lead to a duplicate finding algorithm that might prove to be the fastest for important cases.
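The table-lookup scheme mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic (letter, position) hashing step over the DNA alphabet, not the paper's algorithm for finding perfect, minimal, or ordered hashes. The names `make_table` and `zobrist_hash`, the 64-bit width, and the fixed seed are assumptions made for this illustration.

```python
import random

ALPHABET = "ACGT"  # small alphabet, as in genome strings


def make_table(length, bits=64, seed=0):
    """Build one random number per (position, letter) pair.

    Hypothetical helper: the bit width and seed are illustrative choices.
    """
    rng = random.Random(seed)
    return [{ch: rng.getrandbits(bits) for ch in ALPHABET}
            for _ in range(length)]


def zobrist_hash(s, table):
    """XOR together the table entries selected by each (letter, position)."""
    h = 0
    for pos, ch in enumerate(s):
        h ^= table[pos][ch]
    return h


table = make_table(8)
h = zobrist_hash("ACGTACGT", table)

# Changing one letter only needs two XORs: remove the old entry, add the new.
h_updated = h ^ table[3]["T"] ^ table[3]["A"]
```

Because XOR is its own inverse, `h_updated` equals the hash of "ACGAACGT" computed from scratch. Changing any bit of any table entry yields a different hash function, which is the degree of freedom the abstract refers to.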
Keywords :
file organisation; random number generation; statistical analysis; GO game board; chess game board; duplicate finding algorithm; hash functions; human genome; ordered minimal perfect hash; probability; random numbers; small alphabet strings; statistical criterion; Bioinformatics; Books; Computer science; Event detection; Genomics; Humans; Probability; Programming profession; Testing; Wikipedia;
Conference_Title :
Computer Science and Information Engineering, 2009 WRI World Congress on
Conference_Location :
Los Angeles, CA
Print_ISBN :
978-0-7695-3507-4
DOI :
10.1109/CSIE.2009.1070