• DocumentCode
    3313741
  • Title

    Automatic extraction and filtration of multiword units

  • Author

    Ying Liu ; Zheng Tie

  • Author_Institution
    Dept. of Chinese Language & Literature, Tsinghua Univ., Beijing, China
  • Volume
    4
  • fYear
    2011
  • fDate
    26-28 July 2011
  • Firstpage
    2591
  • Lastpage
    2595
  • Abstract
    We use five statistical models, namely the Dice coefficient (Dice), the Φ2 coefficient (Φ2), the log likelihood ratio (LLR), symmetrical conditional probability (SCP), and normalized expectation (NE), to extract multiword unit candidates from a patent corpus. Comparing the five models, we find that NE yields the most multiword unit candidates and Dice achieves the highest precision, whereas Dice yields the fewest candidates and SCP has the lowest precision. The candidates are then filtered using several strategies: stop words, a threshold, higher frequency, first stop words, last stop words, and context entropy. After filtration, NE still yields the most multiword units and Dice the highest precision, while Dice yields the fewest units and SCP the lowest precision. Every filtration strategy helps to identify wrong or unreasonable multiword units and improves the precision of the extracted multiword units.
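    As an illustration only, and not the authors' implementation, the sketch below scores adjacent word pairs with two of the measures named in the abstract (Dice and SCP) and computes a right-context entropy of the kind a context-entropy filter could use. It assumes the textbook bigram definitions, Dice = 2 f(x,y) / (f(x) + f(y)) and SCP = p(x,y)^2 / (p(x) p(y)); the function names and the toy sentence are hypothetical, and the paper's exact formulas, thresholds, and n-gram lengths are not given in this record.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (dice, scp)} for every adjacent word pair.

    Assumed textbook definitions, not taken from the paper:
      dice = 2 * f(w1, w2) / (f(w1) + f(w2))
      scp  = p(w1, w2)^2 / (p(w1) * p(w2))
    With all probabilities estimated as relative frequencies over the
    same N, N cancels and scp reduces to f(w1, w2)^2 / (f(w1) * f(w2)).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        f1, f2 = unigrams[w1], unigrams[w2]
        dice = 2 * f12 / (f1 + f2)
        scp = f12 ** 2 / (f1 * f2)
        scores[(w1, w2)] = (dice, scp)
    return scores


def right_context_entropy(tokens, w1, w2):
    """Shannon entropy (in bits) of the words that follow the bigram (w1, w2).

    Returns 0.0 when the bigram never occurs with a following word.
    """
    followers = Counter(
        tokens[i + 2]
        for i in range(len(tokens) - 2)
        if tokens[i] == w1 and tokens[i + 1] == w2
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


if __name__ == "__main__":
    # Hypothetical toy text standing in for a patent corpus.
    text = "the fuel cell stack and the fuel cell unit use a fuel pump"
    toks = text.split()
    dice, scp = bigram_scores(toks)[("fuel", "cell")]
    print(dice, scp, right_context_entropy(toks, "fuel", "cell"))
```

    Under this reading, a candidate that is followed by many different words has high right-context entropy, which suggests it behaves as a complete unit; a filter could drop candidates whose entropy falls below a chosen cutoff. The cutoff itself is an assumption and would need tuning on real data.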
  • Keywords
    information filtering; probability; text analysis; Φ2 coefficient; automatic extraction; context entropy; dice coefficient; filtration; log likelihood ratio; multiword unit candidate; normalized expectation; patent corpus; statistical model; stop word; symmetrical conditional probability; Computers; Correlation; Equations; Filtration; Mathematical model; Patents; Syntactics; Φ2; Dice; LLR; NE; SCP; extract; filtrate; multiword unit
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-61284-180-9
  • Type

    conf

  • DOI
    10.1109/FSKD.2011.6020036
  • Filename
    6020036