Title :
Discovering interchangeable words from string databases
Author :
Alvarez, Marco A. ; Lim, SeungJin
Author_Institution :
Dept. of Comput. Sci., Utah State Univ., Logan, UT
Abstract :
This paper presents a solution to the problem of finding interchangeable words in the context of an input collection of strings. Interchangeable words are words that can replace one another in phrases or free text without changing the meaning. Under restricted conditions, pairs of interchangeable words might be useful for data deduplication, copy detection, and software localization, among other applications. Calculating the degree of interchangeability involves accurately calculating the semantic similarity between pairs of words and searching for candidate pairs in the overall search space imposed by the input collection. The solution presented in this paper is composed of a search method for candidate pairs that uses the Levenshtein distance algorithm and a novel algorithm, SSA, for calculating the semantic similarity between words. The proposed solution was implemented and tested within a real-world application involving a string message database from a software development company. The system was used to build an ontology with clusters of interchangeable words.
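The abstract does not spell out how the Levenshtein distance drives the candidate search, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes the distance is applied at the word level and that two strings of equal token length at distance 1 differ by a single substituted word, which is then proposed as a candidate pair. The function names `levenshtein` and `candidate_pairs` and the sample messages are hypothetical, and the SSA similarity step is not modeled here.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(strings):
    """Assumed heuristic (not from the paper): strings with the same number
    of tokens and a word-level distance of 1 differ by exactly one
    substituted word, which yields one candidate pair."""
    pairs = set()
    for s, t in combinations(strings, 2):
        a, b = s.lower().split(), t.lower().split()
        if len(a) == len(b) and levenshtein(a, b) == 1:
            # Equal length plus distance 1 means a single mismatching position.
            (x, y), = [(u, v) for u, v in zip(a, b) if u != v]
            pairs.add(tuple(sorted((x, y))))
    return pairs


# Hypothetical message strings, for illustration only.
messages = ["Delete the file", "Remove the file", "Open the file now"]
candidates = candidate_pairs(messages)
```

In a pipeline like the one the abstract describes, pairs found this way would still have to be scored by a semantic similarity measure before being treated as interchangeable.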
Keywords :
database management systems; word processing; Levenshtein distance algorithm; copy detection; data deduplication; interchangeable words; semantic similarity; software localization; string databases; string message database; Application software; Clustering algorithms; Computer science; Databases; Educational institutions; Marine animals; Ontologies; Programming; Search methods; Software testing;
Conference_Title :
Digital Information Management, 2007. ICDIM '07. 2nd International Conference on
Conference_Location :
Lyon
Print_ISBN :
978-1-4244-1475-8
Electronic_ISBN :
978-1-4244-1476-5
DOI :
10.1109/ICDIM.2007.4444195