DocumentCode
2352883
Title
Correlating summarization of a pair of multilingual documents
Author
Ji, Xiang ; Zha, Hongyuan
Author_Institution
Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
fYear
2003
fDate
10-11 March 2003
Firstpage
39
Lastpage
46
Abstract
With the emergence of an enormous number of documents in multiple languages, it is desirable to construct text mining methods that can compare them and highlight their similarities. In this paper, we explore the research issue of comparative summarization for a pair of multilingual documents. A bipartite-graph-based algorithm is proposed to correlate textual content across sources in different languages. The algorithm aligns the (sub)topics of a pair of multilingual documents and summarizes their correlation by sentence extraction. A pair of documents in different languages is modeled as a weighted bipartite graph. A mutual reinforcement principle is applied to identify a dense subgraph of the weighted bipartite graph. Sentences corresponding to the subgraph are well correlated in textual content and convey the dominant shared topic of the pair of documents. As a further enhancement, a bi-clustering algorithm can first be used to partition the bipartite graph into several clusters, each containing sentences from both documents. These clusters correspond to shared subtopics, and the mutual reinforcement principle can then be applied within each cluster to extract the topic sentences of that subtopic.
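The mutual reinforcement step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the sketch assumes a common form: each sentence in one document is scored by the weighted sum of the scores of the sentences it links to in the other document, with normalization after every step (a power iteration on the weight matrix). The function names, the toy weight matrix, and the top-k selection are hypothetical and are not taken from the paper.

```python
import math


def mutual_reinforcement(weights, iterations=100, tol=1e-9):
    """Assumed update rule: u <- W v, v <- W^T u, each L2-normalized.

    weights[i][j] is a hypothetical cross-lingual similarity between
    sentence i of document A and sentence j of document B.
    """
    n_a, n_b = len(weights), len(weights[0])
    u = [1.0 / math.sqrt(n_a)] * n_a
    v = [1.0 / math.sqrt(n_b)] * n_b
    for _ in range(iterations):
        # Score each sentence of A from the current scores of B.
        new_u = [sum(weights[i][j] * v[j] for j in range(n_b)) for i in range(n_a)]
        norm_u = math.sqrt(sum(x * x for x in new_u)) or 1.0
        new_u = [x / norm_u for x in new_u]
        # Score each sentence of B from the updated scores of A.
        new_v = [sum(weights[i][j] * new_u[i] for i in range(n_a)) for j in range(n_b)]
        norm_v = math.sqrt(sum(x * x for x in new_v)) or 1.0
        new_v = [x / norm_v for x in new_v]
        delta = max(
            max(abs(a - b) for a, b in zip(u, new_u)),
            max(abs(a - b) for a, b in zip(v, new_v)),
        )
        u, v = new_u, new_v
        if delta < tol:
            break
    return u, v


def top_sentences(scores, k):
    """Indices of the k highest-scoring sentences."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy example (made-up weights): sentences 0 and 1 of each document
# are strongly linked, sentence 2 of each is only weakly connected.
W = [
    [0.9, 0.8, 0.0],
    [0.7, 0.9, 0.1],
    [0.0, 0.1, 0.2],
]
u, v = mutual_reinforcement(W)
top_a = top_sentences(u, 2)
top_b = top_sentences(v, 2)
```

Under this assumption, the high-scoring sentences on both sides form the densely connected part of the graph and would serve as the extracted summary of the shared topic. For the bi-clustering enhancement, the same routine could be run on the weight submatrix of each cluster.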
Keywords
data mining; graph theory; natural languages; text analysis; biclustering algorithm; bipartite graph-based algorithm; comparative summarization; multilingual documents; multiple languages; mutual reinforcement principle; sentence extraction; text mining; textual content; weighted bipartite graph; Algorithm design and analysis; Bipartite graph; Computer science; Data mining; Explosions; Feature extraction; Natural languages; Pressing; Web pages; Web sites;
fLanguage
English
Publisher
IEEE
Conference_Titel
Research Issues in Data Engineering: Multi-lingual Information Management, 2003. RIDE-MLIM 2003. Proceedings. 13th International Workshop on
ISSN
1066-1395
Print_ISBN
0-7803-7868-7
Type
conf
DOI
10.1109/RIDE.2003.1249844
Filename
1249844