DocumentCode :
3575283
Title :
Hidden web data extraction using wordnet ontology's
Author :
Ponnam, Vidya Sagar ; Anne, V. P. Krishna ; Konki, Venkata Kishore
Author_Institution :
Dept. of Comput. Sci. & Eng., K.L. Univ., Vaddeswaram, India
fYear :
2014
Firstpage :
1
Lastpage :
4
Abstract :
In response to a search engine crawler's queries, application servers generate information and deliver it directly to the user. This generated information forms the hidden web (also called the deep web or invisible web), because it is usually wrapped in HyperText Markup Language (HTML) pages as data records. Because these data records are generated dynamically, current search engines (whether general or commercial) are unable to index the HTML pages properly. We propose an Ontological Wrapper (OW) for the extraction and alignment of data records, using a lightweight ontological technique driven by WordNet repositories. The core of the wrapper checks the similarity of the data records themselves, rather than relying on visual cues alone, by stripping away the HTML markup. The wrapper design has three main components: parsing, performed with the TEXT MDL algorithm; extraction, which begins by stripping irrelevant HTML; and alignment of data records for classification. After this three-step process, we are left with pure-text data records, stripped of HTML content, that can be searched by humans or search engine crawlers. Our approach adapts to most websites with distinct visual cues and yields better data extraction results at higher speed than prior systems; a practical implementation validates this claim.
Keywords :
Internet; Web sites; data mining; hypermedia markup languages; ontologies (artificial intelligence); query processing; search engines; text analysis; HTML content; HTML pages; OW; TEXT MDL algorithm; Web mining; Websites; WordNet ontology; WordNet repositories; data record alignment; data record extraction; data record similarity; deep Web; hidden Web data extraction; hyper text markup language pages; invisible Web; irrelevant HTML stripping; ontological technique; ontological wrapper; parsing process; search engine crawler; visual cues; HTML; Information filters; Integrated circuits; Labeling; Performance evaluation; Servers; DOM tree; HTML Text; MinHash; Ontological Wrapper; Template extraction; clustering; minimum description length principle;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
IT in Business, Industry and Government (CSIBIG), 2014 Conference on
Print_ISBN :
978-1-4799-3063-0
Type :
conf
DOI :
10.1109/CSIBIG.2014.7056971
Filename :
7056971