• DocumentCode
    2866337
  • Title

    An experiment stemming non-traditional text

  • Author

    Nascimento, Mario A. ; Cunha, Adriano C R da

  • Author_Institution
    CNPTIA-EMBRAPA, Sao Paulo, Brazil
  • fYear
    1998
  • fDate
    9-11 Sep 1998
  • Firstpage
    75
  • Lastpage
    80
  • Abstract
    Stemming is a technique which aims to extract common suffixes of words. Thus, words which are literally different but share a common stem may be abstracted by that stem. The underlying goal when using a stemming technique is to improve recall, possibly at the expense of precision. A well-known technique for stemming text is M.F. Porter's (1980) algorithm, which is based on a set of rules extracted from the English language. We argue that such an algorithm is not efficient for non-traditional texts, e.g., one made up mainly of medical terms. We thus investigate the use of a technique called Peak-and-Plateau, which is based on tries, and compare it to Porter's algorithm. Our experiments have shown that using Porter's algorithm or none at all makes no difference as far as precision and recall go. On the other hand, using the Peak-and-Plateau technique we improved recall by about 15% and decreased precision by an average of 40%. Moreover, it compressed the original text by 40% and the inverted file by 45%.
  • Keywords
    computational linguistics; information retrieval; tree data structures; trees (mathematics); word processing; English language; Peak-and-Plateau; common stem; common suffixes; medical terms; non traditional text stemming; precision loss; recall; rule extraction; stemming technique; text compression; tries; Fires; Information retrieval; Insects; Read only memory
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    String Processing and Information Retrieval: A South American Symposium, 1998. Proceedings
  • Conference_Location
    Santa Cruz de La Sierra
  • Print_ISBN
    0-8186-8664-2
  • Type

    conf

  • DOI
    10.1109/SPIRE.1998.712985
  • Filename
    712985