DocumentCode
1850366
Title
FODEX -- Towards Generic Data Extraction from Web Forums
Author
Pretzsch, Sebastian ; Muthmann, Klemens ; Schill, Alexander
Author_Institution
Fac. of Comput. Sci., Tech. Univ. Dresden, Dresden, Germany
fYear
2012
fDate
26-29 March 2012
Firstpage
821
Lastpage
826
Abstract
The web is a large source of valuable data. Today, this data is provided not only by professional publishers but also by everyone, in the form of user-generated content. A large part of such content is located in web forums. As platforms for sharing knowledge, they are easily accessible to everyone. However, their sheer volume makes it hard to find discussions on a specific topic. Automatic systems can filter the content and point to relevant information. Unfortunately, the content is presented in a human-readable layout and is not intended to be processed by automatic systems. Therefore, it is necessary to separate the content of a web forum discussion from its layout before doing any further information mining. This paper presents FODEX, a system for automatic forum data extraction. It extracts data from any forum and maps it to a unified data schema.
Keywords
Internet; information resources; FODEX; Web forums; World Wide Web; data source; generic data extraction; Accuracy; Data mining; Feature extraction; HTML; Layout; Message systems; User-generated content; Information Extraction; Social Media; Web Scraping
fLanguage
English
Publisher
IEEE
Conference_Titel
Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on
Conference_Location
Fukuoka
Print_ISBN
978-1-4673-0867-0
Type
conf
DOI
10.1109/WAINA.2012.134
Filename
6185496