DocumentCode :
3209540
Title :
Automatic Data Extraction from Web Discussion Forums
Author :
Li, Suke ; Tang, Liyong ; Hu, Jianbin ; Chen, Zhong
Author_Institution :
Sch. of Electron. Eng. & Comput. Sci., Peking Univ., Beijing, China
fYear :
2009
fDate :
17-19 Dec. 2009
Firstpage :
219
Lastpage :
225
Abstract :
This paper presents an approach for automatically extracting information from Web discussion forums. HTML tag paths built from an HTML DOM tree are used to generate the post extraction template. Visual text features and HTML structure information from the same page are then combined to extract the author profile, posting date, and post content. Experimental results show that our approach is effective.
Keywords :
Internet; data analysis; hypermedia markup languages; text analysis; HTML DOM tree; HTML structure information; HTML tag paths; automatic data extraction; information extraction; visual text features; web discussion forums; Computer science; Computer science education; Data engineering; Data mining; Discussion forums; Educational technology; HTML; Laboratories; Navigation; Web pages; Data Extraction; Data Mining; Web Forum Mining;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-0-7695-3932-4
Electronic_ISBN :
978-1-4244-5467-9
Type :
conf
DOI :
10.1109/FCST.2009.20
Filename :
5392915