Title :
Finding surprising patterns in textual data streams
Author :
Snowsill, Tristan ; Nicart, Florent ; Stefani, Marco ; De Bie, Tijl ; Cristianini, Nello
Author_Institution :
Intell. Syst. Lab., Univ. of Bristol, Bristol, UK
Abstract :
We address the task of detecting surprising patterns in large textual data streams. These patterns can reveal events in the real world when the data streams are generated by online news media, emails, Twitter feeds, movie subtitles, scientific publications, and more. The volume of such text streams often exceeds human capacity for analysis, so automatic pattern recognition tools are indispensable. In particular, we are interested in surprising changes in the frequency of n-grams of words, or more generally of symbols from an alphabet of unlimited size. Although the number of possible n-grams is exponentially large in the size of the alphabet (which is itself unbounded), we show how such changes can be detected efficiently. To this end, we rely on a data structure known as a generalised suffix tree, which is additionally annotated with a limited amount of statistical information. Crucially, we show how the generalised suffix tree, as well as these statistical annotations, can be updated efficiently in an on-line fashion.
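As a rough illustration of what a "surprising change in the frequency of n-grams" could mean, the sketch below compares n-gram counts in a current window of text against a reference window. It is not the method of the paper: it uses plain dictionaries rather than a generalised suffix tree, it is not updated on-line, and the scoring rule (a z-score under a binomial model with additive smoothing) is an assumption chosen for simplicity. The names `ngram_counts`, `surprise_scores`, and the `smoothing` parameter are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def surprise_scores(reference_tokens, current_tokens, n=2, smoothing=1.0):
    """Score each n-gram of the current window against a reference window.

    Assumption (not from the paper): the count of an n-gram in the current
    window is modelled as binomial, with a success probability estimated
    from the reference window using additive smoothing. The score is the
    number of standard deviations by which the observed count deviates
    from the expected count.
    """
    ref = ngram_counts(reference_tokens, n)
    cur = ngram_counts(current_tokens, n)
    ref_total = sum(ref.values())
    cur_total = sum(cur.values())
    # Number of distinct n-grams seen in either window, used for smoothing.
    vocab = len(set(ref) | set(cur))

    scores = {}
    for gram, observed in cur.items():
        p = (ref[gram] + smoothing) / (ref_total + smoothing * vocab)
        expected = cur_total * p
        std = sqrt(cur_total * p * (1.0 - p))
        # A zero standard deviation carries no information; score it as 0.
        scores[gram] = (observed - expected) / std if std > 0 else 0.0
    return scores
```

Calling `surprise_scores(old_tokens, new_tokens, n=2)` and sorting the result by score would list the bigrams whose frequency rose most relative to the reference window. This brute-force version recounts everything for each window and each value of n, which is the cost that an incrementally updated, annotated suffix tree is meant to avoid.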
Keywords :
data structures; pattern recognition; statistical analysis; text analysis; Twitter feeds; automatic pattern recognition tools; data structure; emails; generalised suffix tree; n-grams; online news media; statistical annotations; surprising pattern detection; textual data streams; Data structures; Event detection; Frequency estimation; Markov processes; Nickel; Testing; Time frequency analysis;
Conference_Title :
Cognitive Information Processing (CIP), 2010 2nd International Workshop on
Conference_Location :
Elba
Print_ISBN :
978-1-4244-6457-9
DOI :
10.1109/CIP.2010.5604085