• DocumentCode
    53380
  • Title
    Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection
  • Author
    da Cruz Nassif, L.F. ; Hruschka, E.R.
  • Author_Institution
    Brazilian Fed. Police Dept., Sao Paulo, Brazil
  • Volume
    8
  • Issue
    1
  • fYear
    2013
  • fDate
    Jan. 2013
  • Firstpage
    46
  • Lastpage
    54
  • Abstract
    In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to the forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.
  • Keywords
    data mining; digital forensics; pattern clustering; text analysis; CSPA; K-medoids clustering; average link clustering; complete link clustering; computer forensic analysis; document clustering algorithms; forensic analysis; k-mean clustering; police investigations; single link clustering; text mining; unstructured text; Algorithm design and analysis; Clustering algorithms; Clustering; forensic computing
  • fLanguage
    English
  • Journal_Title
    Information Forensics and Security, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1556-6013
  • Type
    jour
  • DOI
    10.1109/TIFS.2012.2223679
  • Filename
    6327658