Title :
Data extraction from Web forums based on similarity of page layout
Author :
Wang, Yun ; Li, Bicheng ; Lin, Chen
Author_Institution :
Inf. Process. Dept, Inf. Technol. Inst., Zhengzhou, China
Abstract :
Web forums contain a wealth of information resources, and forum data can be widely used in areas such as Internet community mining, information retrieval, and public opinion analysis. This paper addresses two problems: what should be extracted from Web forums, and how to extract it. To overcome the limitations of current methods for extracting data from Web forums, an automated method is proposed to extract metadata from Web forum pages. The method proceeds in two steps. It first recognizes the topic-block by making full use of the special layout of Web forum pages, then extracts metadata from the topic-block by exploiting the statistical regularity of the metadata. The whole process requires no manual work. Experimental results show that the method performs well in both adjustability and accuracy.
Keywords :
Internet; Web sites; data mining; information retrieval; Internet community mining; Web forum pages; data extraction; information resources; page layout similarity; public opinion analysis; Databases; Discussion forums; HTML; Information analysis; Information processing; Information technology; Visual effects; similarity; web forum
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
DOI :
10.1109/NLPKE.2009.5313736