Title :
Ranking documents by internal variability
Author :
Skillicorn, D.B. ; Chandrasekaran, P.K.
Author_Institution :
Sch. of Comput., Queen's Univ., Kingston, ON, Canada
Abstract :
An analyst, presented with a corpus too large to read every document, must find some selection mechanism. A model for interestingness can be used to rank the documents so that only the subset at the top of the ranking need be examined. However, in many open-source intelligence settings, such a model is not known in advance. We design three measures for ranking documents by internal variability as a weak surrogate for interestingness. Selecting those documents ranked highly by these measures selects a superset of the documents an analyst might need to read, no matter what the specific model, and reduces the size of the corpus by an order of magnitude. We also discover that many corpora contain documents that are highly variable, but not interesting, and show how to remove them.
Keywords :
document handling; internal variability; open source intelligence settings; ranking documents; selection mechanism; Analytical models; Bayesian methods; Educational institutions; Humans; Loss measurement; Shape; Text analysis
Conference_Title :
Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on
Conference_Location :
Arlington, VA
Print_ISBN :
978-1-4673-2105-1
DOI :
10.1109/ISI.2012.6284292