• DocumentCode
    185585
  • Title
    Non-standard words as features for text categorization
  • Author
    Beliga, Slobodan ; Martincic-Ipsic, Sanda
  • Author_Institution
    Dept. of Inf., Univ. of Rijeka, Rijeka, Croatia
  • fYear
    2014
  • fDate
    26-30 May 2014
  • Firstpage
    1165
  • Lastpage
    1169
  • Abstract
    This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third representation combines the first two feature sets. The Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in the text categorization experiments. The best categorization results are achieved using the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs should be considered for further Croatian text categorization experiments.
  • Keywords
    pattern classification; text analysis; C4.5; CN2; Croatian NSW taxonomy; Croatian language; Croatian text categorization; Naive Bayes; SKIPEZ collection; bag-of-NSWs; classification trees; inflectional languages; kNN; nonstandard words; random forest algorithms; standard lemmatization procedures; text categorization experiment; Accuracy; Educational institutions; Feature extraction; Support vector machine classification; Taxonomy; Text categorization; Vectors; accuracy; collection representation; features; non-standard words; text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on
  • Conference_Location
    Opatija
  • Print_ISBN
    978-953-233-081-6
  • Type
    conf
  • DOI
    10.1109/MIPRO.2014.6859744
  • Filename
    6859744
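
The three feature representations described in the abstract (NSW frequencies, statistical measures of NSWs, and their combination) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the regular expressions are a simplified, hypothetical stand-in for the Croatian NSW taxonomy, which this record does not reproduce; the function names are invented; the sample sentence is made up; and the statistics are computed over the per-category counts of a single document, which may differ from how the authors computed them.

```python
import re
import statistics

# Hypothetical, simplified NSW patterns for illustration only; the Croatian
# NSW taxonomy used in the paper is not reproduced in this record.
# Order matters: more specific categories are matched (and removed) first.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\.?"),
    "currency": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-Z]{2,}\b"),
    "number": re.compile(r"\b\d+(?:[.,]\d+)?\b"),
}


def nsw_frequencies(text):
    """First representation (assumed form): NSW counts per category."""
    counts = {}
    for name, pattern in NSW_PATTERNS.items():
        # subn replaces each match with a space and returns the match count,
        # so a date is not counted again later as a plain number.
        text, n_matches = pattern.subn(" ", text)
        counts[name] = n_matches
    return counts


def nsw_statistics(counts):
    """Second representation (assumed form): statistics over the counts."""
    values = list(counts.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return {
        "variance": statistics.pvariance(values),
        "std": std,
        # Coefficient of variation; defined as 0.0 here when the mean is 0.
        "cv": std / mean if mean else 0.0,
    }


def combined_features(text):
    """Third representation: frequencies and statistics together."""
    counts = nsw_frequencies(text)
    return {**counts, **nsw_statistics(counts)}


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the SKIPEZ collection.
    sample = "Dana 26.05.2014. MIPRO je objavio 390 dokumenata po cijeni od 100 kn."
    print(combined_features(sample))
```

Each document then becomes a short fixed-length vector regardless of vocabulary size, which is the dimensionality reduction the abstract refers to. Such vectors could be passed to any of the listed classifiers; that step is not shown here.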