Title :
An Algebraic Approach to Rule-Based Information Extraction
Author :
Reiss, Frederick ; Raghavan, Sriram ; Krishnamurthy, Rajasekar ; Zhu, Huaiyu ; Vaithyanathan, Shivakumar
Author_Institution :
Almaden Res. Center, IBM, San Jose, CA
Abstract :
Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach through extensive experiments over real-world blog data.
Keywords :
algebra; grammars; knowledge engineering; query processing; algebraic approach; grammar-based systems; large data sets; query optimization; real-world blog data; regular expression grammars; rule-based extraction programs; rule-based information extraction; text-specific characteristics; traditional database research; Algebra; Data mining; Databases; Information services; Instruments; Intelligent structures; Internet; Query processing; Scalability; Web sites
Conference_Title :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
DOI :
10.1109/ICDE.2008.4497502