Title :
Finding patterns in strings using suffixarrays
Author :
Stehouwer, Herman; van Zaanen, M.
Author_Institution :
Tilburg Centre for Cognition & Commun., Tilburg Univ., Tilburg, Netherlands
Abstract :
Finding regularities in large data sets requires systems that are efficient in both time and space. Here, we describe a newly developed system that exploits the internal structure of the enhanced suffixarray to find significant patterns in a large collection of sequences. The system searches exhaustively for all significantly compressing patterns, where a pattern may consist of symbols and skips or wildcards. We demonstrate a possible application of the system by detecting interesting patterns in a Dutch and an English corpus.
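Illustration :
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of reading repeated patterns off a suffix array and its longest-common-prefix (LCP) table. It is not the authors' system: it handles contiguous substrings only (no skips or wildcards), uses a naive construction that would not scale to a large corpus, and reports raw occurrence counts rather than the compression-based significance the abstract mentions. All function names and the `min_length` parameter are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction for illustration; too slow for a large corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, suffix_array):
    """lcp[k] = length of the common prefix of the suffixes ranked k-1 and k."""
    lcp = [0] * len(suffix_array)
    for k in range(1, len(suffix_array)):
        a, b = suffix_array[k - 1], suffix_array[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_substrings(text, min_length=2):
    """Map every substring of at least min_length symbols that occurs
    more than once to its number of occurrences (overlaps included)."""
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    counts = {}
    for length in range(min_length, max(lcp, default=0) + 1):
        run = 0  # consecutive LCP entries that are >= length
        for k in range(1, len(sa) + 1):
            if k < len(sa) and lcp[k] >= length:
                run += 1
            else:
                if run:
                    # A run of r entries means r + 1 adjacent suffixes
                    # share the same prefix of this length.
                    start = sa[k - 1]
                    counts[text[start:start + length]] = run + 1
                run = 0
    return counts


if __name__ == "__main__":
    print(repeated_substrings("banana"))
```

Because all occurrences of a substring sit next to each other in the sorted suffix order, a single pass over the LCP table per pattern length is enough to count them. A real system would replace the naive sort with a dedicated suffix array construction algorithm.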
Keywords :
data compression; natural language processing; string matching; Dutch corpus; English corpus; compressing patterns; interesting patterns; large data sets; strings; suffixarrays; Arrays; Buildings; Cognition; Natural languages; Software; Sorting;
Conference_Title :
Computer Science and Information Technology (IMCSIT), Proceedings of the 2010 International Multiconference on
Conference_Location :
Wisla
Print_ISBN :
978-1-4244-6432-6
DOI :
10.1109/IMCSIT.2010.5679928