DocumentCode :
1897597
Title :
Token-based dictionary pattern matching for text analytics
Author :
Polig, Raphael ; Atasu, Kubilay ; Hagleitner, Christoph
Author_Institution :
IBM Res. - Zurich, Rueschlikon, Switzerland
fYear :
2013
fDate :
2-4 Sept. 2013
Firstpage :
1
Lastpage :
6
Abstract :
When performing queries for text analytics on unstructured text data, a large share of the processing time is spent on regular expressions and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for token pattern sequence detection. The architecture is capable of detecting several hundred dictionaries, each containing thousands of elements, at high throughput. A programmable state machine is used as the pattern detection engine to achieve deterministic performance while maintaining low storage requirements. For the detection of token sequences, dedicated circuitry is compiled based on a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented the architecture on an Altera Stratix IV GX530 and were able to process up to 16 documents in parallel at a peak throughput rate of 9.7 Gb/s.
Keywords :
dictionaries; finite state machines; pattern matching; query processing; text analysis; Altera Stratix IV GX530; cascaded result lookup; compilable architecture; dedicated circuitry; deterministic performance; dictionary detection; dictionary matching; multitoken elements; nondeterministic automaton; pattern detection engine; programmable state machine; text analytics querying; token pattern sequence detection; token sequence detection; token-based dictionary pattern matching; unstructured text data; Automata; Computer architecture; Dictionaries; Doped fiber amplifiers; Engines; Pattern matching; Throughput;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on
Conference_Location :
Porto
Type :
conf
DOI :
10.1109/FPL.2013.6645535
Filename :
6645535