DocumentCode :
3318089
Title :
Extract list data from semi-structured document using clustering
Author :
Xu, Hui ; Li, Juanzi ; Xu, Peng
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
fYear :
2005
fDate :
30 Oct.-1 Nov. 2005
Firstpage :
559
Lastpage :
564
Abstract :
This paper addresses list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. Lists, which have structured characteristics, are used to store highly structured, database-like information in many semi-structured documents, such as business annual reports, online airport listings, catalogs, and hotel directories. List data extraction benefits text mining applications on semi-structured documents. Several research efforts have addressed structured data extraction from semi-structured documents by utilizing word layout and arrangement information. However, as far as we know, few previous studies have sufficiently investigated list data extraction that makes use of semantic information. In this paper, we propose a clustering-based method that uses not only the layout and arrangement information but also the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from the Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.
Keywords :
data mining; document handling; information retrieval; pattern clustering; text analysis; business annual reports; catalogs; clustering; hotel directories; list data extraction; online airport listings; semantic information; semistructured document; text mining; Airports; Catalogs; Clustering algorithms; Computer applications; Computer science; Data mining; Databases; Humans; Stock markets; Text mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE '05. Proceedings of 2005 IEEE International Conference on
Print_ISBN :
0-7803-9361-9
Type :
conf
DOI :
10.1109/NLPKE.2005.1598800
Filename :
1598800