Title :
FODEX -- Towards Generic Data Extraction from Web Forums
Author :
Pretzsch, Sebastian ; Muthmann, Klemens ; Schill, Alexander
Author_Institution :
Fac. of Comput. Sci., Tech. Univ. Dresden, Dresden, Germany
Abstract :
The web is a large source of valuable data. Today, this data is provided not only by professional publishers but also by everyone, in the form of user-generated content. A large part of such content is located in web forums. As platforms for sharing knowledge, they are easily accessible to everyone. However, the sheer volume of forum content makes it hard to find discussions on a specific topic. Automatic systems can filter content and point to relevant information. Unfortunately, the content is presented in a human-readable layout and is not intended to be processed by automatic systems. Therefore, it is necessary to separate the content of a web forum discussion from its layout before any further information mining. This paper presents FODEX, a system for automatic forum data extraction. It extracts data from any forum and maps it to a unified data schema.
Keywords :
Internet; information resources; FODEX; Web forums; World Wide Web; data source; generic data extraction; Accuracy; Data mining; Feature extraction; HTML; Layout; Message systems; User-generated content; Information Extraction; Social Media; Web Scraping
Conference_Title :
Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on
Conference_Location :
Fukuoka
Print_ISBN :
978-1-4673-0867-0
DOI :
10.1109/WAINA.2012.134