Title :
Extracting Thai Compounds Using Collocations and POS Bigram Probabilities without a POS Tagger
Author :
Aroonmanakun, Wirote
Author_Institution :
Dept. of Linguistics, Chulalongkorn Univ., Bangkok, Thailand
Abstract :
This paper presents a simple method for extracting compounds using statistical collocations and POS bigram probabilities, without a POS tagger. Statistical collocation was used to measure the strength of word co-occurrences. Probabilities of POS sequences, estimated from compounds found in the dictionary, were used to adjust the collocation strength within a possible compound. Word bigrams and trigrams extracted from a corpus of 28 million words were ranked in two ways: by collocation score alone, and by collocation score weighted by POS pattern probability. Precision was calculated at cutoffs of every 200 ranked candidates for both methods. The results showed that probabilities of POS sequences could increase the precision of compound extraction to a certain degree. The system can extract 2-word and 3-word compounds with precision of up to 63% and 35%, respectively. When bigram candidates that could be part of an extracted trigram are eliminated, precision increases to as much as 71%.
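The ranking and evaluation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration for bigrams only, not the paper's implementation. Several details are assumptions the abstract does not state: the collocation measure is taken to be pointwise mutual information clipped at zero, the POS weight is the highest pattern probability over all tag pairs a lexicon allows for the two words (one way to avoid running a tagger), and the weight is combined with the score by multiplication. The names pmi_scores, pos_weight, rank_candidates and cutoff_precision are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by positive pointwise mutual information.

    PMI is an assumed stand-in for the paper's collocation measure.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # Clip at zero so that multiplying by a weight keeps the order sensible.
        scores[(w1, w2)] = max(math.log2(p_xy / (p_x * p_y)), 0.0)
    return scores


def pos_weight(pair, lexicon, pos_bigram_prob):
    """Highest POS-pattern probability over every tag pair the lexicon allows.

    lexicon maps a word to its possible tags; pos_bigram_prob maps a
    (tag1, tag2) pattern to a probability, e.g. estimated from dictionary
    compounds. Unknown words or unseen patterns get weight 0.0.
    """
    tags1 = lexicon.get(pair[0], ())
    tags2 = lexicon.get(pair[1], ())
    patterns = product(tags1, tags2)
    return max((pos_bigram_prob.get(p, 0.0) for p in patterns), default=0.0)


def rank_candidates(scores, lexicon=None, pos_bigram_prob=None):
    """Rank by collocation score, optionally weighted by POS-pattern probability."""
    def key(pair):
        if lexicon is None or pos_bigram_prob is None:
            return scores[pair]
        return scores[pair] * pos_weight(pair, lexicon, pos_bigram_prob)

    return sorted(scores, key=key, reverse=True)


def cutoff_precision(ranked, gold, step=200):
    """Precision of the top k candidates for k = step, 2 * step, ..."""
    results = []
    hits = 0
    for i, candidate in enumerate(ranked, start=1):
        hits += candidate in gold
        if i % step == 0:
            results.append((i, hits / i))
    return results
```

Calling rank_candidates once without and once with the lexicon and pattern probabilities, then passing each ranking to cutoff_precision with a set of known compounds, gives the kind of side-by-side comparison the abstract reports. Trigrams would need an analogous score and POS trigram patterns, which are omitted here.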
Keywords :
grammars; linguistics; natural language processing; probability; statistical analysis; word processing; POS bigram probabilities; POS sequences; Thai; bigram extractions; bigram word; compounds extraction; part-of-speech; statistical collocations; trigram extraction; trigram word; word co-occurrences strength; Data mining; Dictionaries; Filters; Frequency; Morphology; Mutual information; Natural languages; Probability; Speech processing; Statistical analysis; Thai; collocation; compound extraction;
Conference_Title :
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.33