• DocumentCode
    259735
  • Title

    Iterative Hard Thresholding for Keyword Extraction from Large Text Corpora

  • Author

    Yadlowsky, Steve ; Nakkiran, Preetum ; Wang, Jingyan ; Sharma, Rishi ; El Ghaoui, Laurent

  • Author_Institution
    Electr. Eng. & Comput. Sci., Univ. of California, Berkeley, Berkeley, CA, USA
  • fYear
    2014
  • fDate
    3-6 Dec. 2014
  • Firstpage
    588
  • Lastpage
    593
  • Abstract
    To better understand and analyze text corpora, such as the news, it is often useful to extract keywords that are meaningfully associated with a given topic. A corpus of documents labeled by their topic can be used to approach this as a learning problem. We consider this problem through the lens of statistical text analysis, using bag-of-words frequencies as features for a sparse linear model. We demonstrate, through numerical experiments, that iterative hard thresholding (IHT) is a practical and effective algorithm for keyword extraction from large text corpora. In fact, our implementation of IHT can quickly analyze more than 800,000 documents, returning keywords comparable to those of algorithms that solve a Lasso problem formulation, in significantly less computation time. Further, we generalize the analysis of the IHT algorithm to show that it is stable for rank-deficient matrices, as those arising from our bag-of-words model often are.
  • Keywords
    information retrieval; iterative methods; statistical analysis; text analysis; IHT algorithm; Lasso problem-formulation; bag-of-words frequencies; iterative hard thresholding; keyword extraction; rank deficient matrices; sparse linear model; statistical text analysis; text corpora
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications (ICMLA), 2014 13th International Conference on
  • Conference_Location
    Detroit, MI
  • Type

    conf

  • DOI
    10.1109/ICMLA.2014.101
  • Filename
    7033182