• DocumentCode
    3062070
  • Title

    A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression

  • Author

    Sadakane, Kunihiko

  • Author_Institution
    Dept. of Inf. Sci., Tokyo Univ., Japan
  • fYear
    1999
  • fDate
    29-31 Mar 1999
  • Firstpage
    548
  • Abstract
    Summary form only given. The suffix array is a memory-efficient data structure for searching any substring of a text. It is also used for defining the Burrows-Wheeler transformation (BWT), which is the core of block sorting. When a compressed text is decoded, the inverse of BWT, which is faster than forward transformation, is performed and in the process the suffix array of the text is also obtained. This means that we can compress and transfer a text and its suffix array by simply using block sorting. This fact can be used for creating large full-text databases. We propose a modified Burrows-Wheeler transformation. By using our transformation, we obtain a suffix array from a compressed text which can be used for case-insensitive searches. An exact query can be done from the result of a case-insensitive search because we can decode the original text from the compressed text. It is available for case-insensitive and more general character conversions. We call the conversion unification and the text after conversion unified text. The proposed transformation is defined by the suffix array of the unified text. Our transformation is not a permutation of an alphabet followed by the original transformation but a combination of unification and the original transformation. From a compressed text using our transformation we can obtain the original text and the suffix array of the unified text. After decoding we can perform ambiguous searches like case-insensitive search by using the suffix array. Experimental results show that our transformation decreases the compression ratio very little. Though decompression and search takes more time than decoding of the original block sorting plus grep command, finding positions of keywords is quite fast which is available for advanced searches
  • Keywords
    data compression; data structures; decoding; query formulation; sorting; string matching; transforms; Burrows-Wheeler transformation; block sorting; case-insensitive search; character conversions; compressed text; decoding; exact query; full-text databases; inverse BWT; keywords; memory-efficient data structure; suffix array compression; text substring; unified text; Data compression; Data structures; Databases; Decoding; Encoding; Information science; Sorting; Writing;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference, 1999. Proceedings. DCC '99
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-0096-X
  • Type

    conf

  • DOI
    10.1109/DCC.1999.785705
  • Filename
    785705