• DocumentCode
    2484800
  • Title

    Document Length Normalization by Statistical Regression

  • Author

    Lamprier, Sylvain ; Amghar, Tassadit ; Levrat, Bernard ; Saubion, Frederic

  • Author_Institution
    Univ. of Angers, Angers
  • Volume
    2
  • fYear
    2007
  • fDate
    29-31 Oct. 2007
  • Firstpage
    11
  • Lastpage
    18
  • Abstract
    The document-length normalization problem has been widely studied in the field of information retrieval. Cosine normalization (Baeza-Yates and Ribeiro-Neto, 1999), maximum tf normalization (Allan et al., 1997) and byte length normalization (Robertson et al., 1992) are the most commonly used normalization techniques. Singhal et al. (1996) studied the retrieval probability of documents with respect to their size, using different similarity measures, and showed that none of the existing measures retrieves documents of different lengths with the same probability. We first show here that document and query sizes do strongly influence the expected similarity score. We therefore propose to perform a statistical regression of the similarity score distribution with respect to document and query sizes in order to normalize the scores. Experimental results appear to indicate that our approach judges similarities more fairly, both in classical information retrieval and when applied to a document clustering process.
  • Keywords
    document handling; information retrieval; regression analysis; byte length normalization; cosine normalization; document length normalization; information retrieval; maximum tf normalization; statistical regression; Artificial intelligence; Computer science; Frequency; Indexing; Information retrieval; Length measurement; Probability; Publishing; Registers; Size measurement
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Tools with Artificial Intelligence, 2007. ICTAI 2007. 19th IEEE International Conference on
  • Conference_Location
    Patras
  • ISSN
    1082-3409
  • Print_ISBN
    978-0-7695-3015-4
  • Type

    conf

  • DOI
    10.1109/ICTAI.2007.57
  • Filename
    4410350