Abstract:
Security monitoring of the vast volume of Internet content is limited by time efficiency. To improve efficiency, we studied and compared several typical multi-pattern matching algorithms, including the Aho-Corasick (AC) and Wu-Manber (WM) algorithms, in both English and Chinese environments. Test results show that the classic multi-pattern matching algorithms are less efficient on Chinese text than on English text. We analyze the factors behind this: the Chinese character set is far larger than the 26-letter English alphabet; English letters recur frequently within a text, whereas Chinese characters do not; and Chinese keywords are much shorter than English ones. Based on these factors, this paper presents a novel fast multi-pattern matching algorithm, the Byte-Coding (BC) algorithm, and a fast semantic content filtering algorithm based on simple semantic characteristics. Assigning weights of different sizes to the keywords improves both the accuracy and the speed of the filtering system. We thoroughly compare the filtering speed of our algorithm with that of the conventional ones. The results show that in multi-pattern mode it is at least ten times faster than the traditional AC and WM algorithms and scales better as the number of patterns increases; in simple semantic mode with frequency calculation, the algorithm remains applicable and is still much faster. The algorithm can also be applied to multi-language environments and can serve as a core module in fast parallel or distributed monitoring systems.
Keywords:
Internet; character recognition; natural languages; pattern matching; security of data; Chinese characters; Internet content; core module; distributed monitoring system; double-byte-coding algorithm; fast semantic content security filtering; multipattern matching algorithm; parallel monitoring system; security monitoring; Algorithm design and analysis; Filtering algorithms; Information filtering; Information filters; Information security; Monitoring; National security; Testing; Byte-Coding algorithm; Internet content security; multi-pattern matching; text filtering
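
The keyword weighting with frequency calculation mentioned in the abstract can be illustrated with a minimal sketch. This is not the Byte-Coding algorithm, whose details the abstract does not give: plain substring counting stands in for the multi-pattern matcher, and the function name `score_text`, the weight table, and the threshold are hypothetical choices made only for illustration.

```python
def score_text(text, weighted_keywords, threshold):
    """Score a text by weighted keyword frequency.

    Assumption: naive substring counting replaces the multi-pattern
    matcher (AC, WM, or BC) that a real filter would use.
    Returns (score, hits, flagged).
    """
    hits = {}
    score = 0.0
    for keyword, weight in weighted_keywords.items():
        # str.count returns the number of non-overlapping occurrences.
        count = text.count(keyword)
        if count:
            hits[keyword] = count
            score += weight * count
    # The text is flagged when the accumulated weight reaches the threshold.
    return score, hits, score >= threshold


# Hypothetical weights: more significant keywords get larger weights.
example_weights = {"spam": 2.0, "广告": 3.0, "free": 0.5}
example_result = score_text("free 广告 free offer", example_weights, 3.0)
```

Because Python strings are sequences of Unicode code points, the same call handles English and Chinese keywords; a byte-oriented matcher would instead operate on the encoded bytes.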