• Title of article

    Long distance bigram models applied to word clustering

  • Author/Authors

    Bassiou, Nikoletta and Kotropoulos, Constantine

  • Issue Information
    Journal with serial number, year 2011
  • Pages
    14
  • From page
    145
  • To page
    158
  • Abstract
    Two novel word clustering techniques are proposed which employ long distance bigram language models. The first technique is built on a hierarchical clustering algorithm and minimizes the sum of the Mahalanobis distances of all words, after a cluster merger, from the centroid of the class created by the merger. The second technique resorts to probabilistic latent semantic analysis (PLSA). Next, interpolated long distance bigrams are considered in the context of the aforementioned clustering techniques. Experiments conducted on the English Gigaword corpus (second edition) demonstrate that: (1) the long distance bigrams, when employed in the two clustering techniques under study, yield word clusters of better quality than the baseline bigrams; (2) the interpolated long distance bigrams outperform the long distance bigrams in the same respect; (3) the long distance bigrams perform better than the bigrams that incorporate trigger-pairs selected at various distances; and (4) the best word clustering is achieved by the PLSA that employs interpolated long distance bigrams. Both proposed techniques outperform spectral clustering based on k-means. To assess the quality of the created clusters objectively, relative cluster validity indices are estimated, and the average cluster sense precision, the average cluster sense recall, and the F-measure are computed by exploiting ground truth extracted from WordNet.
  • Keywords
    Language modeling, Distance bigrams, Trigger-pairs, Cluster dispersion, Probabilistic latent semantic analysis, Spectral clustering, Cluster sense recall, WordNet, Word clustering, Relative cluster validity indices, Cluster sense precision
  • Journal title
    PATTERN RECOGNITION
  • Serial Year
    2011
  • Record number

    1733886