Title :
InForCE: Forum data crawling with information extraction
Author :
Zhang, Can ; Zhang, Jingwei
Author_Institution :
Inst. of Massive Comput., East China Normal Univ., Shanghai, China
Abstract :
Forum data acquisition is a prerequisite for forum data analysis, such as opinion analysis and online advertising. Since the structure of forum data usually has casual relationships with the page structure, effective forum data acquisition requires the integration of Web page crawling and information extraction. In this paper, we propose a system, InForCE, for this purpose. The system consists of two parts. First, Web pages are downloaded from different forums and HTML documents are generated. Second, structured data are extracted from the HTML documents according to user requirements. For the extraction process, a novel algorithm is proposed that automatically transforms user requirements into XSLT. Our experimental results show that structured data extraction is feasible and efficient.
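The abstract does not describe how user requirements are turned into XSLT, so the following is only a minimal sketch of the general idea, not the algorithm from the paper. It assumes a user requirement can be written as a mapping from output field names to XPath expressions, plus one XPath that selects each forum post. The function name `requirement_to_xslt`, the output element names `posts` and `post`, and the sample XPaths are all hypothetical.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def requirement_to_xslt(record_xpath, fields):
    """Build an XSLT 1.0 stylesheet (as a string) from a simple requirement.

    record_xpath: XPath selecting one node per forum post (assumed input form).
    fields: dict mapping an output element name to an XPath evaluated
            relative to each selected post node.
    """
    ET.register_namespace("xsl", XSL_NS)

    def q(tag):
        # Qualify a tag name with the XSLT namespace.
        return "{%s}%s" % (XSL_NS, tag)

    stylesheet = ET.Element(q("stylesheet"), {"version": "1.0"})
    template = ET.SubElement(stylesheet, q("template"), {"match": "/"})
    posts = ET.SubElement(template, "posts")
    for_each = ET.SubElement(posts, q("for-each"), {"select": record_xpath})
    post = ET.SubElement(for_each, "post")
    for name, xpath in fields.items():
        # One output element per requested field, filled by xsl:value-of.
        field = ET.SubElement(post, name)
        ET.SubElement(field, q("value-of"), {"select": xpath})
    return ET.tostring(stylesheet, encoding="unicode")


# Hypothetical requirement: author and body of every post on a page.
xslt = requirement_to_xslt(
    "//div[@class='post']",
    {"author": ".//span[@class='author']", "body": ".//div[@class='content']"},
)
print(xslt)
```

The sketch only generates the stylesheet. Applying it would need an XSLT processor, which the Python standard library does not include, and downloaded HTML would first have to be converted to well-formed XML.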
Keywords :
Internet; data analysis; hypermedia markup languages; information retrieval; HTML document generation; InForCE; Web pages crawling; XSLT; forum data acquisition; forum data analysis; forum data crawling; information extraction; online advertisement; opinion analysis; Crawlers; Data analysis; Data mining; HTML; Transforms; Web pages; XML;
Conference_Title :
Universal Communication Symposium (IUCS), 2010 4th International
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-7821-7
DOI :
10.1109/IUCS.2010.5666252