DocumentCode :
3143062
Title :
Segmenting documents using multiple lexical features
Author :
Jobbins, Amanda C. ; Evett, Lindsay J.
Author_Institution :
Dept. of Comput., Nottingham Trent Univ., UK
fYear :
1999
fDate :
20-22 Sep 1999
Firstpage :
721
Lastpage :
724
Abstract :
A method is presented for segmenting documents into conceptually related areas. Determining the equivalence of text is often based on the number of word repetitions. This approach is unsuitable for detecting short segments because terms tend not to be repeated across just a few sentences. We investigate the contribution of two other lexical features to find related words: collocation and relation weights (which identify semantic relations). An experiment was conducted on a set of test data with known topic changes; the performances of the three features were independently compared. A combination of all features was the most reliable indicator of a topic change. In another experiment, CNN news summaries were segmented into their individual news stories. Precision and recall rates of around 90% are reported for news story boundary detection.
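The precision and recall figures quoted in the abstract can be illustrated with a minimal sketch. It assumes boundaries are represented as sentence indices and that a predicted boundary counts as correct only on an exact match; the abstract does not state the matching criterion the authors used, so the function name `boundary_precision_recall` and the sample data are purely hypothetical.

```python
def boundary_precision_recall(predicted, actual):
    """Score predicted segment boundaries against known ones.

    Assumption (not from the paper): boundaries are sentence indices
    and only exact matches count as hits.
    """
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    # Precision: fraction of predicted boundaries that are real.
    precision = hits / len(predicted) if predicted else 0.0
    # Recall: fraction of real boundaries that were found.
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: four predicted boundaries, four true boundaries.
p, r = boundary_precision_recall([3, 7, 12, 20], [3, 7, 15, 20])
print(p, r)
```

A tolerance window (counting a prediction as correct when it falls within a sentence or two of a true boundary) is a common variant, but whether the paper used one is not stated here.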
Keywords :
computational linguistics; document image processing; text analysis; CNN news summaries; collocation; conceptually related areas; document segmentation; multiple lexical features; news stories; news story boundary detection; relation weights; semantic relations; short segments; word repetitions; Cellular neural networks; Concatenated codes; Filters; Testing; Thesauri
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
Conference_Location :
Bangalore
Print_ISBN :
0-7695-0318-7
Type :
conf
DOI :
10.1109/ICDAR.1999.791889
Filename :
791889