Title :
Signature word extracting research based on web metadata
Author :
Pang, Ning ; Lai, Zhen-dan
Author_Institution :
Coll. of Appl. Sci., Taiyuan Univ. of Sci. & Technol., Taiyuan, China
Abstract :
Extracting the signature words of a text is a useful technique for summarizing Web page text, and it provides technical support for text classification, information extraction and other related tasks. This paper attempts to partition a document into a hierarchical structure by analyzing the semantic distance between adjacent paragraphs in the Web page content. On the basis of this hierarchical structure, the metadata and special HTML tags are used to design a weighting function that considers the frequency, length and location of each word. Finally, the contributions of the various location factors to the system are compared and analyzed.
Keywords :
Internet; classification; hypermedia markup languages; information retrieval; meta data; text analysis; HTML tags; Web metadata; Web page content; Web page text abstraction; hierarchical structure; information extraction; partition document; semantic distance parsing; signature word extraction research; text classification; text extraction; weighting function design; word frequency factors; word length factors; word location factors; Computational modeling; Data mining; Educational institutions; HTML; Noise; Semantics; Web pages; signature word extracting; web metadata; weighting function
Conference_Title :
Instrumentation & Measurement, Sensor Network and Automation (IMSNA), 2012 International Symposium on
Conference_Location :
Sanya
Print_ISBN :
978-1-4673-2465-6
DOI :
10.1109/MSNA.2012.6324633