DocumentCode :
3256832
Title :
A splog filtering method based on string copy detection
Author :
Takeda, Takaharu ; Takasu, Atsuhiro
Author_Institution :
Grad. Univ. for Adv. Studies, Hayama
fYear :
2008
fDate :
4-6 Aug. 2008
Firstpage :
543
Lastpage :
548
Abstract :
Many people now publish blogs, and the blogosphere has become an important information source, used for purposes such as trend and reputation analysis and marketing. Like e-mail and web links, however, the blogosphere suffers from spam: many spam blogs (splogs) are generated to drive users to specific sites. This paper proposes a splog filtering method. Splogs are usually generated automatically by copying words and phrases from other documents. The proposed method therefore detects strings that appear in multiple blogs and uses the copy rate of strings as a key feature for splog filtering. To evaluate the method, we constructed an evaluation corpus by gathering blogs at random over a certain period of time and manually judging whether each blog is a splog. Experiments on this corpus reveal several features of splog filtering by copied-string detection. The proposed method uses a suffix array for copied-substring detection and can judge each blog with time complexity O(m^2 log n), where n denotes the total length of the documents used for copy detection and m the length of the blog to be judged.
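The idea of a suffix-array-based copy rate can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_len` threshold, the separator character, and the definition of copy rate as the fraction of blog characters covered by copied substrings are all assumptions made for illustration. The suffix array is built naively by sorting, which is adequate only for small inputs.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a small demo; real systems use faster constructions.
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs(text, sa, pattern):
    # Binary search for the first suffix whose prefix is >= pattern,
    # then check whether that suffix actually starts with pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text.startswith(pattern, sa[lo])


def copy_rate(blog, text, sa, min_len=5):
    # Assumed definition: fraction of blog characters that lie inside a
    # substring of length >= min_len that also occurs in the reference text.
    covered = [False] * len(blog)
    for start in range(len(blog)):
        # Extend the match from `start` one character at a time.
        length = 0
        while (start + length < len(blog)
               and occurs(text, sa, blog[start:start + length + 1])):
            length += 1
        if length >= min_len:
            for k in range(start, start + length):
                covered[k] = True
    return sum(covered) / len(blog) if blog else 0.0


# Hypothetical reference documents, joined with a separator that is
# assumed not to appear in any blog, so matches cannot span documents.
corpus = ["buy cheap watches online today", "the weather is nice"]
text = "\x00".join(corpus)
sa = build_suffix_array(text)
rate = copy_rate("cheap watches xyz", text, sa)
```

For a blog of length m, this sketch issues on the order of m^2 binary searches over a suffix array of length n, which is one plausible reading of the stated bound; the paper itself may organize the search differently.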
Keywords :
Web sites; computational complexity; electronic mail; information filtering; security of data; Web links; blogosphere; document length; spam like e-mail; splog filtering method; string copy detection; time complexity; Blogs; Character generation; Electronic mail; Frequency; Informatics; Information analysis; Information filtering; Information filters; Search engines; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Applications of Digital Information and Web Technologies, 2008. ICADIWT 2008. First International Conference on the
Conference_Location :
Ostrava
Print_ISBN :
978-1-4244-2623-2
Electronic_ISBN :
978-1-4244-2624-9
Type :
conf
DOI :
10.1109/ICADIWT.2008.4664407
Filename :
4664407