Title :
Lexical semantic based Bayesian model for adaptive wrapper generation
Author :
Kesavan, R. Nandhi ; Latha, K.
Author_Institution :
Comput. Sci. & Eng. Dept., Anna Univ. of Technol., Tiruchirappalli, India
Abstract :
This paper presents an unsupervised information extraction system. Two kinds of features of text fragments from Web documents are investigated: site-invariant features and site-dependent features. A feature selection algorithm is used to generate a wrapper from the site-invariant and site-dependent information. Beyond generating the wrapper, the system discovers new attributes and adapts the wrapper to them using Bayesian learning and the E-M algorithm together with a lexical semantics search method. Our wrapper can adapt to new, unseen sites. The system's efficiency is evaluated with performance measures, and its effectiveness is evaluated on real-time Web sites using precision, recall, F-measure, true positives and false positives.
Keywords :
Bayes methods; Internet; Web sites; expectation-maximisation algorithm; information retrieval; learning (artificial intelligence); text analysis; Bayesian learning method; E-M algorithm; Web documents; adaptive wrapper generation; f-measure; false positive; feature selection algorithm; lexical semantic based Bayesian model; lexical semantics search method; performance metrics; precision; real time Web sites; recall; site-dependent feature; site-invariant feature; text fragments; true positive; unsupervised information extraction system; Bayesian methods; Data mining; Semantics; Training; Web pages; Bayesian learning; Unsupervised information extraction; lexical semantics
Conference_Title :
Data Science & Engineering (ICDSE), 2012 International Conference on
Conference_Location :
Cochin, Kerala
Print_ISBN :
978-1-4673-2148-8
DOI :
10.1109/ICDSE.2012.6281907