• DocumentCode
    607615
  • Title

    A proposal for corpus normalization

  • Author

    Karaoglan, Bahar ; Kisla, T. ; Dincer, Bekir T. ; Metin, S.K.

  • Author_Institution
    Uluslararasi Bilgisayar Enstitusu, Ege Univ., İzmir, Turkey
  • fYear
    2013
  • fDate
    24-26 April 2013
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    In order to compare work done in natural language processing, the corpora involved in different studies should be standardized/normalized. Entropy, used as a language model performance metric, depends entirely on signal information. When language is considered, however, semantic information should also be taken into account. Here we propose a metric that exploits Zipf's and Heaps' power laws to represent semantic information in terms of signal information, and that estimates the amount of information anticipated from a corpus of a given length in words. The proposed metric is tested on sub-corpora of 20 different lengths drawn from a major Turkish corpus (METU). While the entropy changed with the length of the corpus, the value of the proposed metric stayed almost constant, which supports our claim about normalizing the corpus.
  • Keywords
    entropy; natural language processing; corpus normalization; power laws; semantic information; signal information; Computers; Educational institutions; Measurement; Presses; Semantics; Weaving; corpus comparison; cross entropy; language model performance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing and Communications Applications Conference (SIU), 2013 21st
  • Conference_Location
    Haspolat
  • Print_ISBN
    978-1-4673-5562-9
  • Electronic_ISBN
    978-1-4673-5561-2
  • Type

    conf

  • DOI
    10.1109/SIU.2013.6531217
  • Filename
    6531217