Title :
Using the Web 1T 5-Gram Database for Attribute Selection in Formal Concept Analysis to Correct Overstemmed Clusters
Author :
Hall, Guymon R. ; Taghva, Kazem
Author_Institution :
Dept. of Comput. Sci., Univ. of Nevada, Las Vegas, Las Vegas, NV, USA
Abstract :
As part of information retrieval processes, words are often stemmed to a common root. The Porter Stemming Algorithm operates as a rule-based suffix-removal process. Stemming can be viewed as a way to cluster related words together under one common stem. The Porter algorithm sometimes places unrelated words in the same cluster, a problem known as overstemming. This experiment attempts to correct such clusters using Formal Concept Analysis (FCA). FCA is the process of formulating formal concepts from a given formal context. A formal context consists of objects, attributes, and a binary relation that indicates which attributes each object possesses. A formal concept is formed by computing the closure of subsets of objects and attributes. Using the Cranfield document collection, this experiment crafted a comparison measure between the words in each stemmed cluster using the Google Web 1T 5-gram data set. When FCA was used to correct the clusters, the results showed a varying level of success depending on the error threshold allowed.
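The FCA definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the words, the attribute names, and the helper functions below are hypothetical and chosen only for illustration, and the attributes are not derived from the Web 1T 5-gram data. The sketch builds a tiny formal context and enumerates its formal concepts by closing every subset of objects.

```python
from itertools import combinations

# Hypothetical toy formal context (an assumption, not data from the paper):
# objects are words assumed to share one stem, attributes are made-up
# context features. The dict encodes the binary relation object -> attributes.
CONTEXT = {
    "general": {"ctx_rule", "ctx_speaking"},
    "generally": {"ctx_rule", "ctx_speaking"},
    "generous": {"ctx_donor"},
}


def common_attributes(objects, context):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def common_objects(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing each subset of objects.

    Brute force and exponential in the number of objects, so it is only
    suitable for small clusters like this toy example.
    """
    concepts = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = common_objects(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
```

In this toy context, "general" and "generally" fall into one concept while "generous" falls into another, which is the kind of separation that could be used to split an overstemmed cluster. How the paper selects attributes and applies the error threshold is not shown here.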
Keywords :
Internet; formal concept analysis; information retrieval; search engines; FCA; Google Web 1T 5-gram data set; Porter stemming algorithm; Web 1T 5-gram database; attribute selection; binary relation; Cranfield document collection; error threshold; information retrieval processes; overstemmed cluster correction; rule-based suffix-removal process; Algorithm design and analysis; Clustering algorithms; Context; Standards; Testing; Training; stemming
Conference_Title :
Information Technology - New Generations (ITNG), 2015 12th International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4799-8827-3
DOI :
10.1109/ITNG.2015.109