• DocumentCode
    3586243
  • Title
    A Comparative Study of Likelihood Ratio Based Forensic Text Comparison Procedures: Multivariate Kernel Density with Lexical Features vs. Word N-grams vs. Character N-grams
  • Author
    Ishihara, Shunichi
  • Author_Institution
    Dept. of Linguistics, Australian Nat. Univ., Canberra, ACT, Australia
  • fYear
    2014
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    This is a comparative study to empirically investigate the performances of three different procedures for calculating authorship attribution likelihood ratios (LR). The procedures to be compared are: 1) a procedure based on multivariate kernel density (MVKD) with lexical features; 2) a procedure based on word N-grams; and 3) a procedure based on character N-grams. Furthermore, the best-performing LRs of these three procedures are fused into combined single LRs using a logistic-regression fusion, in order to investigate the extent of the improvement/deterioration that the fusion brings about. This study uses chatlog messages, which were presented as evidence to prosecute paedophiles, for testing. The numbers of word tokens used to model the authorship attribution of each message group are 500 and 1000 words. This was done to examine the effect of sample size on the performance of a system. The performance of a system is assessed with regard to its validity (= accuracy) and reliability (= precision) using the log-likelihood-ratio cost (Cllr) and 95% credible intervals (CI), respectively. While describing the different characteristics of these three procedures in their outcomes, this study demonstrates that the MVKD procedure was the best-performing procedure out of the three in terms of Cllr. This study also demonstrates that a logistic-regression fusion is useful for combining the LRs obtained from the three procedures in question, resulting in a good improvement in performance.
  • Keywords
    digital forensics; feature selection; regression analysis; text analysis; LR; MVKD; authorship attribution likelihood ratio; character n-gram; chatlog message; forensic text comparison; lexical feature; logistic-regression fusion; multivariate kernel density; paedophile prosecution; word n-gram; Calibration; Covariance matrices; DNA; Databases; Forensics; Kernel; Testing; 95% credible intervals; Tippett plots; character N-grams; lexical features; likelihood ratio; log likelihood ratio cost; word N-grams
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cybercrime and Trustworthy Computing Conference (CTC), 2014 Fifth
  • Print_ISBN
    978-1-4799-8824-2
  • Type
    conf
  • DOI
    10.1109/CTC.2014.9
  • Filename
    7087322
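
The abstract assesses validity with the log-likelihood-ratio cost (Cllr). As a rough illustration only, the sketch below computes Cllr using the standard definition from the likelihood-ratio literature; it is not taken from the paper itself. The function name `cllr` and its two parameters are hypothetical, and the inputs are assumed to be plain (not log-scaled) likelihood ratios from same-author and different-author comparisons.

```python
import math


def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost for two lists of positive likelihood ratios.

    Assumed definition (standard in the LR literature, not quoted from the paper):
    Cllr = 0.5 * (mean(log2(1 + 1/LR_same)) + mean(log2(1 + LR_different)))
    """
    if not same_author_lrs or not different_author_lrs:
        raise ValueError("both lists of likelihood ratios must be non-empty")
    if any(lr <= 0 for lr in list(same_author_lrs) + list(different_author_lrs)):
        raise ValueError("likelihood ratios must be positive")

    # Same-author comparisons are penalised when the LR is small.
    same_cost = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs)
    same_cost /= len(same_author_lrs)

    # Different-author comparisons are penalised when the LR is large.
    diff_cost = sum(math.log2(1 + lr) for lr in different_author_lrs)
    diff_cost /= len(different_author_lrs)

    return 0.5 * (same_cost + diff_cost)
```

Under this definition, a system that always outputs LR = 1 (no information) scores exactly 1, values closer to 0 indicate better performance, and values above 1 indicate LRs that are misleading on average.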