DocumentCode
243597
Title
Domain-Independent Unsupervised Text Segmentation for Data Management
Author
Sakahara, Makoto ; Okada, Shogo ; Nitta, Katsumi
Author_Institution
Tokyo Inst. of Technol., Tokyo, Japan
fYear
2014
fDate
14 Dec. 2014
Firstpage
481
Lastpage
487
Abstract
In this study, we propose a domain-independent unsupervised text segmentation method that is applicable even to a single unseen document. The proposed method segments text documents by evaluating the similarity between sentences. It is generally difficult to calculate semantic similarity between the words that make up sentences when domain knowledge is insufficient, and this difficulty affects segmentation accuracy. To address this problem, we use word2vec to calculate semantic similarity between words. Using word2vec, we embed semantic relationships between words in a vector space by training on large domain-independent corpora. Furthermore, we combine semantic similarity with collocation similarity, i.e., the features between words within a document. The proposed method applies this combined similarity to affinity propagation clustering. Similarity between sentences is defined based on the earth mover's distance between the frequencies of the obtained topical clusters. After calculating similarity between sentences, segmentation boundaries are automatically optimized using dynamic programming. Experimental results on two datasets show that the proposed method clearly outperforms state-of-the-art domain-independent approaches and performs on par with state-of-the-art domain-dependent approaches such as those that use topic modeling.
Keywords
dynamic programming; text analysis; unsupervised learning; affinity propagation clustering; collocation similarity; data management; domain knowledge; domain-independent unsupervised text segmentation method; segmentation boundary; semantic similarity; similarity propagation clustering; text document segmentation; topic modeling; vector space; Correlation; Cost function; Data mining; Measurement; Semantics; Training; Vectors; domain-independent; text segmentation; unsupervised
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining Workshop (ICDMW), 2014 IEEE International Conference on
Conference_Location
Shenzhen
Print_ISBN
978-1-4799-4275-6
Type
conf
DOI
10.1109/ICDMW.2014.118
Filename
7022635