DocumentCode :
3728627
Title :
A multi-strategy approach for information extraction of financial report documents
Author :
Siti Mariyah;Dwi Hendratmo Widyantoro
Author_Institution :
Department of Statistical Computation, Sekolah Tinggi Ilmu Statistik, Jakarta
fYear :
2015
Firstpage :
169
Lastpage :
174
Abstract :
Information extraction studies have been conducted to improve the efficiency and accuracy of information retrieval. We developed information extraction techniques to automatically extract the company name, document period, currency, revenue, and number of employees from financial report documents. Unlike other works, we applied a multi-strategy approach to developing the extraction techniques. We grouped the information by shared characteristics before designing the extraction techniques, on the assumption that differences in the characteristics of each kind of information call for different strategies. The first strategy constructs extraction techniques with a rule-based method for information that shows strong regularity in orthographic and layout features, such as company name, document period, and currency. The second strategy applies a machine learning-based method to information that has rich contextual and list look-up features, such as revenue and number of employees. In the first strategy, rule patterns are defined by combining orthographic, layout, and limited contextual features. The defined rule patterns successfully extract the information, achieving precision, recall, and F1-measure above 0.98. In the second strategy, we treated the extraction task as a classification task. First, we built classification models using the Naive Bayes and Support Vector Machine algorithms. Then, we extracted the most informative features to train the classification models. The best classification model is used for the extraction task. Contextual and list look-up features play an important role in improving extraction performance. The second strategy successfully extracts revenue and number-of-employees information, achieving precision, recall, and F1-measure above 0.93.
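The first (rule-based) strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not list the paper's actual rule patterns, so every pattern, function name, and sample string below is a hypothetical stand-in chosen only to show how orthographic cues (capitalization, legal suffixes), layout cues (position near the top of the document), and limited contextual cues (trigger phrases) might be combined.

```python
import re

# Hypothetical contextual rule: a trigger phrase followed by a currency name.
CURRENCY_RULE = re.compile(
    r"\b(?:expressed|stated|presented)\s+in\s+(?:[a-z]+\s+){0,3}?"
    r"(rupiah|us dollars?|euros?)\b",
    re.IGNORECASE,
)

# Hypothetical contextual rule: a trigger phrase followed by a date.
PERIOD_RULE = re.compile(
    r"\bfor the year ended\s+(\d{1,2}\s+[a-z]+\s+\d{4})",
    re.IGNORECASE,
)

# Hypothetical orthographic cue: legal-entity markers in a company name.
LEGAL_MARKER = re.compile(r"\b(PT|TBK|INC|LTD)\b")


def extract_company(lines, title_area=5):
    """Return the first all-uppercase line in the title area that
    carries a legal-entity marker, or None if no line qualifies."""
    for line in lines[:title_area]:  # layout cue: top of the document
        candidate = line.strip()
        if candidate and candidate == candidate.upper() and LEGAL_MARKER.search(candidate):
            return candidate
    return None


def extract_header_fields(text):
    """Apply the illustrative rules and return the fields that matched."""
    currency = CURRENCY_RULE.search(text)
    period = PERIOD_RULE.search(text)
    return {
        "company": extract_company(text.splitlines()),
        "period": period.group(1) if period else None,
        "currency": currency.group(1) if currency else None,
    }


# Invented sample text, not taken from the paper's data set.
sample = (
    "PT CONTOH MAJU TBK\n"
    "FINANCIAL STATEMENTS\n"
    "For the year ended 31 December 2014\n"
    "(Expressed in millions of Rupiah)\n"
)
fields = extract_header_fields(sample)
```

The second strategy (classifying candidate values with Naive Bayes or Support Vector Machines) is not sketched here, since the abstract does not specify the feature set or training data.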
Keywords :
"Feature extraction","Data mining","Companies","Information retrieval","Layout","Internet"
Publisher :
IEEE
Conference_Titel :
Information & Communication Technology and Systems (ICTS), 2015 International Conference on
Print_ISBN :
978-1-5090-0095-1
Type :
conf
DOI :
10.1109/ICTS.2015.7379893
Filename :
7379893