DocumentCode
259735
Title
Iterative Hard Thresholding for Keyword Extraction from Large Text Corpora
Author
Yadlowsky, Steve ; Nakkiran, Preetum ; Wang, Jingyan ; Sharma, Rishi ; El Ghaoui, Laurent
Author_Institution
Electr. Eng. & Comput. Sci., Univ. of California, Berkeley, Berkeley, CA, USA
fYear
2014
fDate
3-6 Dec. 2014
Firstpage
588
Lastpage
593
Abstract
To better understand and analyze text corpora, such as the news, it is often useful to extract keywords that are meaningfully associated with a given topic. A corpus of documents labeled by topic allows this to be approached as a learning problem. We consider this problem through the lens of statistical text analysis, using bag-of-words frequencies as features for a sparse linear model. We demonstrate, through numerical experiments, that iterative hard thresholding (IHT) is a practical and effective algorithm for keyword extraction from large text corpora. In fact, our implementation of IHT can quickly analyze more than 800,000 documents, returning keywords comparable to those from algorithms that solve a Lasso problem formulation, in significantly less computation time. Further, we generalize the analysis of the IHT algorithm to show that it is stable for rank-deficient matrices, as those arising from our bag-of-words model often are.
Keywords
information retrieval; iterative methods; statistical analysis; text analysis; IHT algorithm; Lasso problem-formulation; bag-of-words frequencies; iterative hard thresholding; keyword extraction; rank deficient matrices; sparse linear model; statistical text analysis; text corpora
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications (ICMLA), 2014 13th International Conference on
Conference_Location
Detroit, MI
Type
conf
DOI
10.1109/ICMLA.2014.101
Filename
7033182