Title :
Exploiting Attribute Redundancy in Extracting Open Source Forge Websites
Author :
Li, Xiang ; Zhu, Yanxu ; Yin, Gang ; Wang, Tao ; Wang, Huaimin
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
Open Source Forge (OSF) websites provide information on a massive number of open source software projects, and extracting this web data is important for open source research. Traditional extraction methods use string matching among pages to detect the page template, which is time-consuming. A recent work published in VLDB exploits entities that are redundant across websites to detect the web page coordinates of these entities. Its experiments give good results when these coordinates are used to extract other entities from the target site. However, OSF websites have few redundant project entities. This paper proposes a modified version of that redundancy-based method tailored for OSF websites, which relies on a similar yet weaker presumption: that entity attributes, rather than whole entities, are redundant. Like the previous work, we construct a seed database to detect the web page coordinates of the redundancies, but entirely at the attribute level. In addition, we apply attribute name verification to reduce false positives during extraction. The experimental results indicate that our approach is capable of extracting OSF websites, a scenario in which the previous method cannot be applied.
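The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, written under stated assumptions: a seed database maps attribute names to known values, a "web page coordinate" is approximated by a tag path with sibling indices, and attribute name verification is approximated by requiring the preceding text node to mention the attribute name. All function names, the path format, and the sample pages are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TextPathParser(HTMLParser):
    """Collect (path, text) pairs for every non-empty text node.

    Assumption: the markup is well-formed and every element has an explicit
    closing tag (no void elements such as <br>).
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # path steps of the currently open elements
        self.counts = [{}]  # per-depth counters of child tags seen so far
        self.nodes = []     # (path, text) in document order

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_coordinates(html, seeds):
    """Find the path of each attribute whose value is redundant with the seeds."""
    nodes = text_nodes(html)
    coords = {}
    for i, (path, text) in enumerate(nodes):
        for attr, values in seeds.items():
            if text not in values:
                continue
            # Hypothetical attribute name verification: the preceding text
            # node must mention the attribute name, otherwise reject the match.
            label = nodes[i - 1][1] if i > 0 else ""
            if attr.lower() in label.lower():
                coords[attr] = path
    return coords


def extract(html, coords):
    """Read the text found at each learned coordinate on another page."""
    by_path = dict(text_nodes(html))
    return {attr: by_path.get(path) for attr, path in coords.items()}


# Hypothetical seed database: attribute name -> known attribute values.
SEEDS = {"License": {"GPL", "MIT"}, "Language": {"Java", "Python"}}

SEED_PAGE = (
    "<html><body>"
    "<div><span>License</span><span>GPL</span></div>"
    "<div><span>Language</span><span>Java</span></div>"
    "<p>MIT</p>"  # matches a seed value but has no 'License' label before it
    "</body></html>"
)
TARGET_PAGE = (
    "<html><body>"
    "<div><span>License</span><span>Apache-2.0</span></div>"
    "<div><span>Language</span><span>C++</span></div>"
    "<p>news</p>"
    "</body></html>"
)

coords = learn_coordinates(SEED_PAGE, SEEDS)
result = extract(TARGET_PAGE, coords)
```

In this sketch the stray "MIT" paragraph matches a seed value but fails the label check, so it does not overwrite the coordinate learned for the license field. The target page shares no project entity with the seed page, only attribute positions, which is the weaker presumption the abstract describes.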
Keywords :
Web sites; information retrieval; public domain software; string matching; OSF Website extraction; VLDB; Web data; Webpage coordinate detection; attribute name verification; attribute redundancy based method; attribute-level; entity attributes; false positive reduction; open source forge Websites; open source research; open source software projects; page template detection; seed database construction; Data mining; Databases; HTML; Licenses; Measurement; Redundancy; attribute; open source software; redundancy; verification; web extraction
Conference_Title :
Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012 International Conference on
Conference_Location :
Sanya
Print_ISBN :
978-1-4673-2624-7
DOI :
10.1109/CyberC.2012.12