DocumentCode
1607797
Title
InForCE: Forum data crawling with information extraction
Author
Zhang, Can ; Zhang, Jingwei
Author_Institution
Inst. of Massive Comput., East China Normal Univ., Shanghai, China
fYear
2010
Firstpage
367
Lastpage
373
Abstract
Forum data acquisition is a prerequisite for forum data analysis tasks such as opinion analysis and online advertisement. Since the structure of forum data usually has casual relationships with the page structure, effective forum data acquisition requires integrating Web page crawling with information extraction. In this paper, we propose a system, InForCE, for this purpose. The system has two parts. First, it downloads Web pages from different forums and generates HTML documents. Second, it extracts structured data from the HTML documents according to user requirements. For the extraction step, we propose a novel algorithm that automatically transforms user requirements into XSLT. Our experimental results show that structured data extraction is feasible and efficient.
Keywords
Internet; data analysis; hypermedia markup languages; information retrieval; HTML document generation; InForCE; Web pages crawling; XSLT; forum data acquisition; forum data analysis; forum data crawling; information extraction; online advertisement; opinion analysis; Crawlers; Data analysis; Data mining; HTML; Transforms; Web pages; XML;
fLanguage
English
Publisher
IEEE
Conference_Titel
Universal Communication Symposium (IUCS), 2010 4th International
Conference_Location
Beijing
Print_ISBN
978-1-4244-7821-7
Type
conf
DOI
10.1109/IUCS.2010.5666252
Filename
5666252