• DocumentCode
    3696973
  • Title

    A Parallel Algorithm for Statistical Multiword Term Extraction from Very Large Corpora

  • Author

    Gonçalves;Joaquim F. Silva;Jose C. Cunha

  • Author_Institution
    Inst. Super. de Eng. de Lisboa, Inst. Politec. de Lisboa, Lisbon, Portugal
  • fYear
    2015
  • Firstpage
    219
  • Lastpage
    224
  • Abstract
    Multi-word Relevant Expressions (REs) can be defined as sequences of words (n-grams) with strong semantic meaning, such as "ice melting" and "Ministère des Affaires Étrangères", which are useful in Information Retrieval, Document Clustering or Classification, and Indexing of Documents. The need to extract REs in several languages has led research toward statistical approaches rather than symbolic methods, since the former allow language independence. Based on the assumption that REs have strong cohesion between their consecutive n-grams, the LocalMaxs algorithm is a language-independent approach that extracts REs. Despite its good precision, this extractor is time-consuming and is impractical for Big Data if implemented sequentially. This paper presents the first parallel and distributed version of this algorithm, achieving almost linear speedup and sizeup when processing corpora of up to 1 billion words, using up to 54 virtual machines in a public cloud. This parallel version of the algorithm exploits the statistical knowledge of the n-grams in the corpus to promote locality of reference.
  • Keywords
    "Parallel processing","Cloud computing","Algorithm design and analysis","Servers","Data models","Random access memory","System analysis and design"
  • Publisher
    ieee
  • Conference_Titel
    High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
  • Type

    conf

  • DOI
    10.1109/HPCC-CSS-ICESS.2015.72
  • Filename
    7336167