Title :
Domain-Independent Unsupervised Text Segmentation for Data Management
Author :
Sakahara, Makoto ; Okada, Shogo ; Nitta, Katsumi
Author_Institution :
Tokyo Inst. of Technol., Tokyo, Japan
Abstract :
In this study, we propose a domain-independent unsupervised text segmentation method that can be applied even to a single, previously unseen document. The proposed method segments text documents by evaluating similarity between sentences. It is generally difficult to calculate semantic similarity between the words that make up sentences when domain knowledge is insufficient, and this difficulty reduces segmentation accuracy. To address this problem, we use word2vec to calculate semantic similarity between words. Using word2vec, we embed semantic relationships between words in a vector space by training on large domain-independent corpora. Furthermore, we combine semantic and collocation similarities, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After calculating similarity between sentences, segmentation boundaries are automatically optimized using dynamic programming. The experimental results obtained using two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and performs on par with state-of-the-art domain-dependent approaches such as those that use topic modeling.
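The final step described above, choosing segment boundaries by dynamic programming once sentence similarities are known, can be illustrated with a minimal sketch. The abstract does not state the cost function, so the objective below (segment length times mean pairwise similarity) is an assumption for illustration only, as are the function name `segment_sentences` and the fixed segment count `k`. The similarity matrix is taken as given; the word2vec, affinity propagation, and earth mover's distance stages are not reproduced here.

```python
from typing import List, Sequence


def segment_sentences(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Split n sentences into k contiguous segments; return each segment's start index.

    Hypothetical objective (not the paper's cost function): maximize the sum,
    over segments, of segment length times mean pairwise similarity.
    """
    n = len(sim)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of sentences")

    # 2-D prefix sums: prefix[i][j] = sum of sim[a][b] for a < i and b < j.
    prefix = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            prefix[i + 1][j + 1] = (
                sim[i][j] + prefix[i][j + 1] + prefix[i + 1][j] - prefix[i][j]
            )

    def score(i: int, j: int) -> float:
        # Segment covers sentences i..j-1. The block sum divided by the
        # length equals length * mean similarity over the length^2 pairs.
        total = prefix[j][j] - prefix[i][j] - prefix[j][i] + prefix[i][i]
        return total / (j - i)

    neg_inf = float("-inf")
    # best[m][j]: best score for the first j sentences split into m segments.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                if best[m - 1][i] == neg_inf:
                    continue
                candidate = best[m - 1][i] + score(i, j)
                if candidate > best[m][j]:
                    best[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers from the end to recover segment start indices.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return starts[::-1]
```

For a block-structured matrix in which sentences 0 to 2 are mutually similar and sentences 3 and 4 are mutually similar, calling `segment_sentences(sim, 2)` returns `[0, 3]`, placing the boundary before sentence 3. Fixing `k` in advance is a simplification; the abstract says only that boundaries are optimized automatically.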
Keywords :
dynamic programming; text analysis; unsupervised learning; affinity propagation clustering; collocation similarity; data management; domain knowledge; domain-independent unsupervised text segmentation method; segmentation boundary; semantic similarity; similarity propagation clustering; text document segmentation; topic modeling; vector space; Correlation; Cost function; Data mining; Measurement; Semantics; Training; Vectors; domain-independent; text segmentation; unsupervised
Conference_Title :
Data Mining Workshop (ICDMW), 2014 IEEE International Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-1-4799-4275-6
DOI :
10.1109/ICDMW.2014.118