Title :
An automated management tool for unstructured data
Author :
Ceglowski, Maciej ; Coburn, Aaron ; Cuadrado, John L.
Abstract :
The rapidly growing quantity of online data has created a need for automated, content-based categorization and search tools. We describe an open-source, Web-based archive management tool that uses latent semantic indexing (LSI), coupled with vector clustering techniques, to provide users with a fully searchable and automatically categorized interface to a data collection. The default English document parser included in the project uses part-of-speech tagging and recursive maximal noun phrase extraction to create a more effective term list for LSI than traditional stop list techniques. The archive interface supports multiple user views of the data collection. Advanced search features are implemented through relevance feedback and do not require users to learn a query syntax.
Keywords :
Internet; content management; grammars; relevance feedback; search engines; English document parser; Web-based archive management; archive interface; automated content-based categorization; data collection; latent semantic indexing; online data; part-of-speech tagging; recursive maximal noun phrase extraction; search tools; vector clustering techniques; Data mining; Educational technology; Feedback; Indexing; Information retrieval; Large scale integration; Open source software; Organizing; Speech; Tagging
Conference_Title :
Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
Print_ISBN :
0-7695-1932-6
DOI :
10.1109/WI.2003.1241266
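Note :
The abstract says that advanced search works through relevance feedback rather than a query syntax, but this record does not give the formula used. As a rough illustration only, the sketch below uses a Rocchio-style update (a standard textbook technique, assumed here and not confirmed as the authors' method) on document vectors in a small concept space such as LSI might produce. The function names, the weights `alpha` and `beta`, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


def rocchio(query, relevant, alpha=1.0, beta=0.75):
    """Move the query toward the centroid of documents the user marked relevant.

    alpha and beta are assumed illustrative weights, not values from the paper.
    """
    if not relevant:
        return list(query)
    dims = len(query)
    centroid = [sum(doc[i] for doc in relevant) / len(relevant) for i in range(dims)]
    return [alpha * q + beta * c for q, c in zip(query, centroid)]


# Hypothetical two-dimensional concept space with three documents.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
query = [1.0, 0.2]

initial_ranking = rank(query, docs)
# The user marks document "b" as relevant; no query syntax is involved.
updated_query = rocchio(query, [docs["b"]])
updated_ranking = rank(updated_query, docs)
```

In this sketch the user only marks a result as relevant; the query vector is shifted toward that document and the collection is re-ranked with the updated vector.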