Title :
Correlating summarization of a pair of multilingual documents
Author :
Ji, Xiang ; Zha, Hongyuan
Author_Institution :
Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
Abstract :
With the emergence of an enormous number of documents in multiple languages, it is desirable to construct text mining methods that can compare such documents and highlight their similarities. In this paper, we explore the research issue of comparative summarization for a pair of multilingual documents. A bipartite-graph-based algorithm is proposed to correlate textual content across sources in various languages. The algorithm aligns the (sub)topics of a pair of multilingual documents and summarizes their correlation by sentence extraction. A pair of documents in different languages is modeled as a weighted bipartite graph. A mutual reinforcement principle is applied to identify a dense subgraph of the weighted bipartite graph. Sentences corresponding to the subgraph are well correlated in textual content and convey the dominant shared topic of the pair of documents. As a further enhancement, a bi-clustering algorithm can first be used to partition the bipartite graph into several clusters, each containing sentences from both documents. These clusters correspond to shared subtopics, and the mutual reinforcement principle can then be applied to extract topic sentences within each subtopic group.
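The abstract does not state the update rules behind the mutual reinforcement principle, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a HITS-style alternating update on the weight matrix of the bipartite graph, where a sentence in one document scores highly if it is strongly linked to high-scoring sentences in the other. The names `mutual_reinforcement` and `top_sentences`, and the toy weights, are hypothetical; how cross-language sentence similarity is computed is left out entirely.

```python
import math


def _normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def mutual_reinforcement(weights, iterations=100, tol=1e-9):
    """Score sentences on both sides of a weighted bipartite graph.

    weights[i][j] is an assumed non-negative similarity between sentence i
    of document A and sentence j of document B.
    Returns (u, v): saliency scores for the sentences of A and of B.
    """
    n_a, n_b = len(weights), len(weights[0])
    u = [1.0 / math.sqrt(n_a)] * n_a
    v = [1.0 / math.sqrt(n_b)] * n_b
    for _ in range(iterations):
        # A-side scores are reinforced by the B-side sentences they link to.
        new_u = _normalize(
            [sum(weights[i][j] * v[j] for j in range(n_b)) for i in range(n_a)]
        )
        # B-side scores are reinforced by the updated A-side scores.
        new_v = _normalize(
            [sum(weights[i][j] * new_u[i] for i in range(n_a)) for j in range(n_b)]
        )
        delta = max(
            max(abs(a - b) for a, b in zip(u, new_u)),
            max(abs(a - b) for a, b in zip(v, new_v)),
        )
        u, v = new_u, new_v
        if delta < tol:
            break
    return u, v


def top_sentences(scores, k):
    """Return the indices of the k highest-scoring sentences."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy weights (made up for illustration): sentences 0 and 1 of each
# document are strongly linked; sentence 2 is only weakly connected.
W = [
    [0.9, 0.8, 0.0],
    [0.7, 0.9, 0.1],
    [0.0, 0.1, 0.2],
]
u, v = mutual_reinforcement(W)
summary_a = top_sentences(u, 2)
summary_b = top_sentences(v, 2)
```

In this sketch the top-scoring sentences on each side play the role of the dense subgraph mentioned in the abstract. For the subtopic variant, the same function could be run separately on the sub-matrix of each cluster produced by a bi-clustering step.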
Keywords :
data mining; graph theory; natural languages; text analysis; biclustering algorithm; bipartite graph-based algorithm; comparative summarization; multilingual documents; multiple languages; mutual reinforcement principle; sentence extraction; text mining; textual content; weighted bipartite graph; Algorithm design and analysis; Bipartite graph; Computer science; Data mining; Explosions; Feature extraction; Natural languages; Pressing; Web pages; Web sites;
Conference_Title :
Proceedings of the 13th International Workshop on Research Issues in Data Engineering: Multi-lingual Information Management (RIDE-MLIM 2003)
Print_ISBN :
0-7803-7868-7
DOI :
10.1109/RIDE.2003.1249844