DocumentCode :
2216860
Title :
Analysis of grammatical evolutionary approaches to regular expression induction
Author :
Gonzalez-Pardo, Antonio ; Camacho, David
Author_Institution :
Dept. de Ing. Inf., Univ. Autonoma de Madrid, Madrid, Spain
fYear :
2011
fDate :
5-8 June 2011
Firstpage :
639
Lastpage :
646
Abstract :
Regular expressions, or regexes, have traditionally been used as a pattern-matching tool to search for structures in a set of objects, such as files, text documents, or folders. Pattern matching can be used to look for files whose names contain a given string, to search for files that contain a specific pattern, or simply to extract text from a set of documents. Regexes are commonly applied to detect and extract patterns that represent phone numbers, URLs, email addresses, etc. This kind of information is characterized by a well-defined structure. Nevertheless, regexes are not used very frequently because their high complexity, in both syntax and grammatical rules, makes them difficult to understand. For this reason, the development of programs able to automatically generate and evaluate regexes has become a valuable task. This work analyzes the performance of different grammatical evolutionary approaches in the generation of regexes able to extract URL patterns. Four different types of grammars have been evaluated: a context-free grammar, a context-free grammar with a penalized fitness function, an extensible context-free grammar, and a Christiansen grammar. For the considered problem, the experimental results show that the best performance of the system, measured as cumulative success rate, is achieved using Christiansen grammars.
Keywords :
computational linguistics; context-free grammars; evolutionary computation; string matching; text analysis; Christiansen grammar; URL patterns; email addresses; extensible context-free grammar; grammars; grammatical evolutionary approaches; grammatical rules; pattern matching tool; penalized fitness function; regexes; regular expression; search files; string matching; syntax rules; text documents; text extraction; Context; Evolution (biology); Evolutionary computation; Genetic algorithms; Grammar; Positron emission tomography; Production;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Evolutionary Computation (CEC), 2011 IEEE Congress on
Conference_Location :
New Orleans, LA
ISSN :
Pending
Print_ISBN :
978-1-4244-7834-7
Type :
conf
DOI :
10.1109/CEC.2011.5949679
Filename :
5949679