Title :
Toward Parallel Document Clustering
Author :
Mogill, Jace A. ; Haglin, David J.
Author_Institution :
Pacific Northwest Nat. Lab., Richland, WA, USA
Abstract :
A key challenge to automated clustering of documents in large text corpora is the high cost of comparing documents in a multi-million dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a distance metric that obeys the triangle inequality. The algorithm strives to minimize the number of distance calculations needed to cluster the documents into "anchors" around reference documents called "pivots". We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure which affords efficient searching, and a simple data structure which requires repeated sorting. The sorting implementation is integrated with a text corpora "Bag of Words" program, and initial performance results of the end-to-end document processing workflow are reported.
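The pruning idea summarized in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' implementation and does not include their parallel extension. The names `build_anchors` and `euclidean` are hypothetical, and Euclidean distance over small tuples stands in for a document distance. Each anchor keeps its owned points sorted by decreasing distance to its pivot. When a new pivot is added, a scan of an existing anchor can stop as soon as a point lies closer to its current pivot than half the pivot-to-pivot distance, because the triangle inequality guarantees that neither it nor any later point is closer to the new pivot.

```python
import math


def euclidean(a, b):
    """Stand-in metric (assumption); any metric obeying the triangle inequality works."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_anchors(points, k, dist=euclidean):
    """Group points into up to k anchors; return (anchors, number_of_distance_calls).

    Each anchor is a dict with a "pivot" index and an "owned" list of
    (distance_to_pivot, point_index) pairs sorted by decreasing distance.
    """
    calls = 0

    def d(i, j):
        nonlocal calls
        calls += 1
        return dist(points[i], points[j])

    # The first anchor uses point 0 as its pivot and owns every point.
    owned = sorted(((d(0, i), i) for i in range(len(points))), reverse=True)
    anchors = [{"pivot": 0, "owned": owned}]

    while len(anchors) < k:
        # The farthest point of the widest anchor becomes the next pivot.
        donor = max(anchors, key=lambda a: a["owned"][0][0])
        radius, new_pivot = donor["owned"][0]
        if radius == 0:
            break  # every point coincides with its pivot; nothing left to split

        stolen = []
        for anchor in anchors:
            gap = d(new_pivot, anchor["pivot"])
            kept = []
            for pos, (dx, i) in enumerate(anchor["owned"]):
                if dx < gap / 2:
                    # Triangle inequality: this point and all remaining ones
                    # are closer to their current pivot, so stop scanning.
                    kept.extend(anchor["owned"][pos:])
                    break
                dn = d(new_pivot, i)
                if dn < dx:
                    stolen.append((dn, i))
                else:
                    kept.append((dx, i))
            anchor["owned"] = kept

        stolen.sort(reverse=True)
        anchors.append({"pivot": new_pivot, "owned": stolen})

    return anchors, calls
```

Calling `build_anchors(points, k)` on a list of coordinate tuples returns the anchors together with the number of distance evaluations, which can be compared against the `len(points) * k` evaluations a brute-force assignment to k pivots would need. Re-sorting the stolen points for each new anchor loosely mirrors the "repeated sorting" trade-off mentioned in the abstract, although the details of the authors' data structures are not given here.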
Keywords :
data structures; document handling; parallel processing; anchors hierarchy; automated clustering; distance metric; document processing workflow; fast data structure; large text corpora; multimillion dimensional document space; parallel document clustering; Clustering algorithms; Concurrent computing; Data structures; Indexes; Parallel algorithms; Semantics; Synchronization;
Conference_Title :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2011.327