• DocumentCode
    1857602
  • Title
    Domain-independent topic segmentation using a string kernel on recognized sub-word sequences
  • Author
    Sadohara, K.; Lee, S.-w.; Kojima, H.
  • Author_Institution
    National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba
  • fYear
    2006
  • fDate
    10-13 Dec. 2006
  • Firstpage
    30
  • Lastpage
    33
  • Abstract
    The goal of the present paper is to explore the feasibility of a topic segmentation method without using large vocabulary continuous speech recognition (LVCSR). The proposed method is domain-independent in the sense that it is not constrained by vocabulary and does not require training data. For a sequence of sub-word units obtained using a continuous sub-word recognizer, the proposed method merges similar adjacent parts of the sequence in an agglomerative manner to produce a hierarchical cluster tree. The proposed method uses a string kernel to efficiently compute the similarity between two strings of sub-word units based on the frequencies of any sub-strings appearing in the strings. By carefully excluding the influence of the sub-strings that are irrelevant to the topic of interest, topically coherent clusters are formed without linguistic knowledge. An empirical study on a Japanese news speech corpus shows that the method performs better than a topic segmenter using LVCSR.
  • Keywords
    pattern clustering; string matching; vocabulary; word processing; Japanese news speech corpus; continuous sub-word recognizer; domain-independent topic segmentation; hierarchical cluster tree; recognized sub-word sequences; string kernel; Clustering algorithms; Clustering methods; Frequency; Kernel; Paper technology; Speech analysis; Speech recognition; Training data; Unsupervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Spoken Language Technology Workshop, 2006. IEEE
  • Conference_Location
    Palm Beach
  • Print_ISBN
    1-4244-0872-5
  • Type
    conf
  • DOI
    10.1109/SLT.2006.326809
  • Filename
    4123354
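
  • Illustration
    The abstract describes two steps: a string kernel that scores two sub-word sequences by the sub-strings they share, and agglomerative merging of similar adjacent parts into a hierarchical cluster tree. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It swaps in a simple normalized sub-string count kernel limited to `max_len` units, it takes pre-cut segments as input, and it omits the step that excludes topic-irrelevant sub-strings. The names `spectrum`, `kernel`, `merge_adjacent`, and `max_len` are hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum(units, max_len=3):
    """Count every contiguous sub-string of 1..max_len units (assumed cap)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts


def kernel(a, b):
    """Cosine-normalized inner product of two sub-string count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge_adjacent(segments, max_len=3):
    """Repeatedly merge the most similar ADJACENT pair; return a nested-tuple tree."""
    # Each cluster is (tree, units); leaves are the original segment indices.
    clusters = [(i, list(seg)) for i, seg in enumerate(segments)]
    while len(clusters) > 1:
        sims = [
            kernel(spectrum(clusters[i][1], max_len),
                   spectrum(clusters[i + 1][1], max_len))
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(sims)), key=sims.__getitem__)
        (left_tree, left_units), (right_tree, right_units) = clusters[best], clusters[best + 1]
        clusters[best:best + 2] = [((left_tree, right_tree), left_units + right_units)]
    return clusters[0][0]


if __name__ == "__main__":
    # Toy "sub-word" sequences: two segments over {a, b}, two over {c, d}.
    toy = [list("abab"), list("abba"), list("cdcd"), list("dcdc")]
    print(merge_adjacent(toy))
```

    In this toy input the two halves share no units, so the segments within each half are merged before the halves are joined. Recomputing the sub-string counts at every step keeps the sketch short; a practical version would cache them.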