• DocumentCode
    59184
  • Title

    Compact Multiview Representation of Documents Based on the Total Variability Space

  • Author

    Morchid, Mohamed ; Bouallegue, Mohamed ; Dufour, Richard ; Linares, Georges ; Matrouf, Driss ; De Mori, Renato

  • Author_Institution
    Lab. d'Inf. d'Avignon (LIA), Univ. of Avignon, Avignon, France
  • Volume
    23
  • Issue
    8
  • fYear
    2015
  • fDate
    Aug. 2015
  • Firstpage
    1295
  • Lastpage
    1308
  • Abstract
    Mapping text documents into an LDA-based topic space is a classical way to extract a high-level representation of text documents. Unfortunately, LDA is highly sensitive to hyper-parameters related to the number of classes, or word and topic distribution, and there is no systematic way to pre-estimate optimal configurations. Moreover, various hyper-parameter configurations offer complementary views on the document. In this paper, we propose a method based on a two-step process that, first, expands the representation space by using a set of topic spaces and, second, compacts the representation space by removing poorly relevant dimensions. These two steps are based respectively on multi-view LDA-based representation spaces and factor-analysis models. This model provides a view-independent representation of documents while extracting complementary information from a massive multi-view representation. Experiments are conducted on the DECODA conversation corpus and the Reuters-21578 textual dataset. Results show the efficiency of the proposed multiview compact representation paradigm. The proposed categorization system reaches an accuracy of 86.5% with automatic transcriptions of conversations from the DECODA corpus and a Macro-F1 of 80% on a classification task over the well-known Reuters-21578 corpus, with a significant gain compared to the baseline (best single topic space configuration), as well as to methods and document representations previously studied.
  • Keywords
    pattern classification; text analysis; DECODA conversation corpus; LDA-based topic-space; Reuters-21578 textual dataset; categorization system; classification task; compact multiview document representation; factor analysis models; hyper-parameter configuration; linear discriminant analysis; representation space; text document mapping; text document representation; total variability space; view-independent document representation; Aerospace electronics; IEEE transactions; Noise measurement; Resource management; Speech; Speech processing; Vocabulary; C-vector; classification; factor analysis; latent Dirichlet allocation;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type

    jour

  • DOI
    10.1109/TASLP.2015.2431854
  • Filename
    7105388