DocumentCode :
165888
Title :
A supervised approach to distinguish between keywords and stopwords using probability distribution functions
Author :
Sharan, Aditi ; Siddiqi, Sifatullah
Author_Institution :
Sch. of Comput. & Syst. Sci., Jawaharlal Nehru Univ., New Delhi, India
fYear :
2014
fDate :
24-27 Sept. 2014
Firstpage :
1074
Lastpage :
1080
Abstract :
This paper presents a novel probability-based approach for distinguishing between keywords and stopwords in a text corpus. The approach has many applications, including the automatic construction of stopword lists. The first objective of this paper is to investigate the role of probability distributions in distinguishing between keywords and stopwords. The second objective is to compare the performance of the probability distributions of various weighting measures for identifying keywords and stopwords. The main characteristics of our method are that it is corpus-based, supervised, and computationally very efficient. Being corpus-based, the method is independent of the language used; however, we have tested the approach on a domain-specific corpus in Hindi. For Hindi (and many other Indian languages), this is of great significance, as a standard list of stopwords is not available. The results are encouraging, and we are able to achieve 74% accuracy. However, as this is a preliminary attempt, there is considerable scope for improvement.
Keywords :
data mining; information retrieval; learning (artificial intelligence); natural language processing; statistical distributions; text analysis; Hindi; Indian language; keyword identification; probability distribution function; stopword identification; supervised approach; term weighting measures; Computational modeling; Frequency measurement; Probability distribution; Standards; Testing; Training; Weight measurement; Corpus Statistics; Hindi; Keywords; Probability distribution; Stopwords; Term Weighting Measures;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on
Conference_Location :
New Delhi
Print_ISBN :
978-1-4799-3078-4
Type :
conf
DOI :
10.1109/ICACCI.2014.6968206
Filename :
6968206