Title :
Character gazetteer for Named Entity Recognition with linear matching complexity
Author :
Dlugolinsky, Stefan ; Nguyen, Giang ; Laclavik, Michal ; Seleng, Martin
Author_Institution :
Inst. of Inf., Bratislava, Slovakia
Abstract :
A large amount of unstructured data is produced daily through the numerous media around us. Although computer systems, including commodity hardware, are becoming more powerful, processing such data and extracting useful information in a time-efficient manner remains a problem. One domain of unstructured data processing is Natural Language Processing (NLP), which covers areas such as information extraction, machine translation, word sense disambiguation, and automated question answering. All of these areas require fast and precise Named Entity Recognition (NER), which is not a trivial task because of the size and heterogeneity of the processed data. Our effort in this research area is to provide fast tokenization and precise NER with linear complexity. In this paper, we present a character gazetteer that performs both tokenization and NER in linear time, and we compare two tree data structure representations of it: a multiway tree implemented with hash maps and a first child-next sibling binary tree. Our measurements show that one representation outperforms the other in processing time, while the other is more efficient in memory consumption.
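The two tree representations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `HashMapNode` and `SiblingNode`, the helpers `build` and `find_entities`, and the sample entries are all hypothetical, and the greedy longest-match scan ignores token boundaries for brevity. The scan does work proportional to the text length times the longest gazetteer entry, so it is linear in the text length only for a fixed gazetteer; the paper's actual matching procedure may differ.

```python
class HashMapNode:
    """Multiway tree node: children stored in a dict keyed by character."""
    __slots__ = ("children", "label")

    def __init__(self):
        self.children = {}
        self.label = None  # entity type if a gazetteer entry ends here

    def child(self, ch):
        return self.children.get(ch)

    def add_child(self, ch):
        node = self.children.get(ch)
        if node is None:
            node = self.children[ch] = HashMapNode()
        return node


class SiblingNode:
    """First child-next sibling node: children form a singly linked list."""
    __slots__ = ("char", "first_child", "next_sibling", "label")

    def __init__(self, char=None):
        self.char = char
        self.first_child = None
        self.next_sibling = None
        self.label = None

    def child(self, ch):
        # Walk the sibling list until the character is found or the list ends.
        node = self.first_child
        while node is not None and node.char != ch:
            node = node.next_sibling
        return node

    def add_child(self, ch):
        node = self.child(ch)
        if node is None:
            node = SiblingNode(ch)
            node.next_sibling = self.first_child  # prepend to sibling list
            self.first_child = node
        return node


def build(node_cls, entries):
    """Insert every (text, label) gazetteer entry character by character."""
    root = node_cls()
    for text, label in entries.items():
        node = root
        for ch in text:
            node = node.add_child(ch)
        node.label = label
    return root


def find_entities(root, text):
    """Greedy longest match, restarting after each match (assumed strategy)."""
    results = []
    i = 0
    while i < len(text):
        node, j, best = root, i, None
        while j < len(text):
            node = node.child(text[j])
            if node is None:
                break
            j += 1
            if node.label is not None:
                best = (j, node.label)  # remember the longest match so far
        if best is not None:
            end, label = best
            results.append((text[i:end], label))
            i = end
        else:
            i += 1
    return results
```

Both node classes expose the same `child`/`add_child` interface, so `find_entities` runs unchanged on either tree. The dict-based node offers constant-time child lookup on average, whereas the sibling-list node stores only two references per node but scans siblings linearly. This illustrates the kind of time versus memory trade-off the abstract reports, without implying which structure wins in the authors' measurements.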
Keywords :
computational complexity; natural language processing; pattern matching; tree data structures; NER; NLP; automated question answering; character gazetteer; computer systems; first child-next sibling binary tree; hash maps; information extraction; linear matching complexity; linear tokenization; machine translation; memory consumption efficiency; multiway tree; named entity recognition; natural language processing; tree data structure representations; unstructured data; unstructured data processing; word sense disambiguation; Complexity theory; Data mining; Electronic publishing; Encyclopedias; Internet; Logic gates; gazetteer; named entity recognition; natural language processing; text processing; tokenization;
Conference_Title :
Information and Communication Technologies (WICT), 2013 Third World Congress on
Conference_Location :
Hanoi
DOI :
10.1109/WICT.2013.7113096