Title :
Urdu noun phrase chunking: HMM based approach
Author :
Ali, Wajid ; Malik, M. Kamran ; Hussain, Sarmad ; Siddiq, Shahid ; Ali, Aasim
Author_Institution :
Dept. of Comput. Sci., Nat. Univ. of Comput. & Emerging Sci. (NUCES), Lahore, Pakistan
Abstract :
Extracting noun phrases (NPs) from text is useful for many natural language processing applications, such as named entity recognition, indexing, searching, and parsing. We present a statistical noun phrase chunker for Urdu. A 100,000-word Urdu corpus is manually tagged with NP chunk tags and used to develop the chunker. Initially, an approach based on a standard HMM is developed for automatic NP chunking. In Urdu phrases, the case marker (CM) is appended at the end of a noun phrase and thus indicates where the phrase ends. Scanning the sentence in reverse order may therefore predict phrase endings better, so the technique is enhanced by changing the scanning direction. It is further enhanced by merging chunk and POS tags to achieve maximum accuracy. Results are reported for all experiments; the maximum overall accuracy of 97.61% is achieved using the HMM-based approach with the extended tagset and right-to-left (RTL) scanning.
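The two enhancements described in the abstract (reversed scanning and merged POS/chunk tags) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the observations are POS tags, that the hidden states are merged tags of the form `POS_CHUNK`, that a bigram HMM with add-one smoothing is decoded with Viterbi, and that "RTL scanning" means processing each sentence from its last token to its first. The tag names (`NN`, `CM`, `ADJ`, `VB`, `B`, `I`, `O`), the toy corpus, the choice to label the case marker `O`, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sequence symbol


def train(corpus):
    """Count HMM events over reversed sentences with merged POS_CHUNK states."""
    trans = defaultdict(Counter)  # previous state -> next state counts
    emit = defaultdict(Counter)   # state -> observed POS counts
    for sentence in corpus:
        prev = START
        for pos, chunk in reversed(sentence):  # scan from the sentence end
            state = f"{pos}_{chunk}"           # merged (extended) tag
            trans[prev][state] += 1
            emit[state][pos] += 1
            prev = state
    states = sorted(emit)
    n_obs = len({pos for counts in emit.values() for pos in counts})
    return states, trans, emit, n_obs


def log_prob(table, given, event, size):
    """Add-one smoothed log P(event | given)."""
    row = table.get(given, {})
    return math.log((row.get(event, 0) + 1) / (sum(row.values()) + size))


def viterbi(observations, states, trans, emit, n_obs):
    """Return the most probable state sequence for the observations."""
    n = len(states)
    best = {
        s: (log_prob(trans, START, s, n)
            + log_prob(emit, s, observations[0], n_obs), [s])
        for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[p][0] + log_prob(trans, p, s, n), best[p][1])
                 for p in states),
                key=lambda item: item[0],
            )
            step[s] = (score + log_prob(emit, s, obs, n_obs), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


def chunk_reversed(pos_tags, model):
    """Chunk a POS sequence by decoding it in reverse order."""
    states, trans, emit, n_obs = model
    path = viterbi(list(reversed(pos_tags)), states, trans, emit, n_obs)
    # Restore the original order and keep only the chunk part of each state.
    return [state.rsplit("_", 1)[1] for state in reversed(path)]


# Toy data for illustration only; not drawn from the paper's corpus.
TOY_CORPUS = [
    [("ADJ", "B"), ("NN", "I"), ("CM", "O"), ("VB", "O")],
    [("NN", "B"), ("CM", "O"), ("NN", "B"), ("VB", "O")],
]

model = train(TOY_CORPUS)
result = chunk_reversed(["NN", "CM", "VB"], model)
print(result)
```

Because decoding starts at the end of the sentence, the state chosen for a case marker is already fixed when the preceding noun is scored, which is the intuition the abstract gives for reversing the scan. Merging the POS tag into the state makes each state's emission distribution sharply peaked on a single POS tag, at the cost of a larger state set.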
Keywords :
cognition; hidden Markov models; natural language processing; NP chunk tags; POS tags; Urdu noun phrase chunking; automatic NP chunking; case marker; chunk merging; noun phrase chunker; noun phrase extraction; phrase endings; scanning direction; standard HMM model; Cardiology; Random access memory; Testing; HMM based chunking; NP chunking; Statistical Chunking; Urdu Noun Phrase; chunking
Conference_Title :
Educational and Information Technology (ICEIT), 2010 International Conference on
Conference_Location :
Chongqing
Print_ISBN :
978-1-4244-8033-3
Electronic_ISBN :
978-1-4244-8035-7
DOI :
10.1109/ICEIT.2010.5607623