DocumentCode
3728627
Title
A multi-strategy approach for information extraction of financial report documents
Author
Siti Mariyah;Dwi Hendratmo Widyantoro
Author_Institution
Department of Statistical Computation, Sekolah Tinggi Ilmu Statistik, Jakarta
fYear
2015
Firstpage
169
Lastpage
174
Abstract
Information extraction studies have been conducted to improve the efficiency and accuracy of information retrieval. We developed information extraction techniques to automatically extract the company name, document period, currency, revenue, and number of employees from financial report documents. Unlike other works, we applied a multi-strategy approach to developing the extraction techniques. We grouped the information by shared characteristics before designing the extraction techniques, on the assumption that information with different characteristics calls for different strategies. The first strategy constructs extraction techniques using a rule-based extraction method for information that has good regularity in orthographic and layout features, such as company name, document period, and currency. The second strategy applies a machine learning-based extraction method to information that has rich contextual and list look-up features, such as revenue and number of employees. In the first strategy, rule patterns are defined by combining orthographic, layout, and limited contextual features. The defined rule patterns successfully extract the information, achieving precision, recall, and F1-measure above 0.98. In the second strategy, we treated the extraction task as a classification task. First, we built classification models using the Naive Bayes and Support Vector Machines algorithms. Then, we extracted the most informative features to train the classification models. The best classification model is used for the extraction task. Contextual and list look-up features play an important role in improving extraction performance. The second strategy successfully extracts revenue and number of employees, achieving precision, recall, and F1-measure above 0.93.
Keywords
"Feature extraction","Data mining","Companies","Information retrieval","Layout","Internet"
Publisher
ieee
Conference_Titel
Information & Communication Technology and Systems (ICTS), 2015 International Conference on
Print_ISBN
978-1-5090-0095-1
Type
conf
DOI
10.1109/ICTS.2015.7379893
Filename
7379893