• DocumentCode
    243597
  • Title

    Domain-Independent Unsupervised Text Segmentation for Data Management

  • Author

    Sakahara, Makoto ; Okada, Shogo ; Nitta, Katsumi

  • Author_Institution
    Tokyo Inst. of Technol., Tokyo, Japan
  • fYear
    2014
  • fDate
    14-14 Dec. 2014
  • Firstpage
    481
  • Lastpage
    487
  • Abstract
    In this study, we propose a domain-independent unsupervised text segmentation method that is applicable even to a single unseen document. The proposed method segments text documents by evaluating the similarity between sentences. It is generally difficult to calculate the semantic similarity between the words that make up sentences when domain knowledge is insufficient, and this difficulty reduces segmentation accuracy. To address this problem, we use word2vec to calculate the semantic similarity between words. Using word2vec, we embed semantic relationships between words in a vector space by training on large domain-independent corpora. Furthermore, we combine semantic and collocation similarities, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After the similarity between sentences is calculated, segmentation boundaries are automatically optimized using dynamic programming. Experimental results obtained on two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and performs on par with state-of-the-art domain-dependent approaches such as those that use topic modeling.
  • Keywords
    dynamic programming; text analysis; unsupervised learning; affinity propagation clustering; collocation similarity; data management; domain knowledge; domain-independent unsupervised text segmentation method; dynamic programming; segmentation boundary; semantic similarity; similarity propagation clustering; text document segmentation; topic modeling; vector space; Correlation; Cost function; Data mining; Measurement; Semantics; Training; Vectors; domain-independent; text segmentation; unsupervised;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshop (ICDMW), 2014 IEEE International Conference on
  • Conference_Location
    Shenzhen
  • Print_ISBN
    978-1-4799-4275-6
  • Type

    conf

  • DOI
    10.1109/ICDMW.2014.118
  • Filename
    7022635
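
The abstract states that segmentation boundaries are optimized with dynamic programming once sentence similarities are known, but it does not give the cost function. The following is a minimal sketch of that final step only, under assumptions that are not taken from the paper: the function name `segment`, the within-segment cost (sum of `1 - similarity` over sentence pairs), and the per-segment `penalty` are all hypothetical choices made for illustration. The sentence similarity matrix is taken as given; in the paper it would come from the earth mover's distance between topical cluster frequencies.

```python
from typing import List, Sequence


def segment(sim: Sequence[Sequence[float]], penalty: float) -> List[int]:
    """Return the start index of each segment (the first is always 0).

    sim[a][b] is an assumed symmetric sentence similarity in [0, 1].
    The cost function below is an illustrative assumption, not the
    cost function used in the paper.
    """
    n = len(sim)

    def cost(i: int, j: int) -> float:
        # Total dissimilarity over all sentence pairs inside [i, j).
        return sum(1.0 - sim[a][b] for a in range(i, j) for b in range(a + 1, j))

    # best[j]: minimum total cost of segmenting the first j sentences.
    best = [0.0] + [float("inf")] * n
    # back[j]: start index of the last segment in that optimal solution.
    back = [0] * (n + 1)

    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                back[j] = i

    # Walk the back-pointers from the end to recover segment starts.
    starts = []
    j = n
    while j > 0:
        starts.append(back[j])
        j = back[j]
    return starts[::-1]


# Toy input: sentences 0-2 are mutually similar, as are sentences 3-4.
toy = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(segment(toy, penalty=0.5))
```

In this sketch the penalty controls how many segments are produced: a larger value favors fewer, longer segments. The cost is recomputed for every candidate span for clarity, so the sketch is only suitable for short documents.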