DocumentCode :
1982232
Title :
A template-based forum posts content extraction method
Author :
Jiaquan Si ; Wei Wang
Author_Institution :
Inf. Security Res. Center, Harbin Eng. Univ., Harbin, China
fYear :
2011
fDate :
16-18 Sept. 2011
Firstpage :
38
Lastpage :
41
Abstract :
In the management of online public opinion and Internet intelligent information, people need to obtain the content of forum threads for further research on topic emotion and the dissemination of forum topics. This paper presents a template-based method for extracting web forum content. The proposed method overcomes the problems caused by changes in web page structure and content, and extracts the content effectively. In this method, web pages are converted into DOM (Document Object Model) trees, which are then matched against the templates. When a template does not match, fuzzy matching and repetition matching are used. Finally, the web page content is extracted. Experiments show that this method achieves better accuracy and recall.
Keywords :
Internet; content management; document handling; pattern matching; DOM tree; Internet intelligent information; Web forum content extraction; Web page content; content extraction method; document object model; forum thread; fuzzy matching; online public opinion; repetition matching; template-based forum post; topic emotion; Algorithm design and analysis; Data mining; Educational institutions; Feature extraction; Floors; Internet; Web pages; DOM tree; contents extraction; forum; online public opinion; template;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical and Control Engineering (ICECE), 2011 International Conference on
Conference_Location :
Yichang
Print_ISBN :
978-1-4244-8162-0
Type :
conf
DOI :
10.1109/ICECENG.2011.6057476
Filename :
6057476