A multi-strategy approach for information extraction of financial report documents

Author

Siti Mariyah;Dwi Hendratmo Widyantoro

Author_Institution

Department of Statistical Computation, Sekolah Tinggi Ilmu Statistik, Jakarta

fYear

2015

Firstpage

169

Lastpage

174

Abstract

Information extraction studies have been conducted to improve the efficiency and accuracy of information retrieval. We developed information extraction techniques that automatically extract the company name, document period, currency, revenue, and number of employees from financial report documents. Unlike other work, we applied a multi-strategy approach to developing the extraction techniques. We grouped the information items by shared characteristics before designing the extraction techniques, on the assumption that items with different characteristics call for different strategies. The first strategy builds extraction techniques with a rule-based method for information that shows strong regularity in orthographic and layout features, such as company name, document period, and currency. The second strategy applies a machine learning-based method to information that has rich contextual and list look-up features, such as revenue and number of employees. In the first strategy, rule patterns are defined by combining orthographic, layout, and limited contextual features. The defined rule patterns successfully extract the information, achieving precision, recall, and F1-measure above 0.98. In the second strategy, we treated the extraction task as a classification task. First, we built classification models using the Naive Bayes and Support Vector Machine algorithms. Then, we extracted the most informative features to train the classification models. The best classification model is used for the extraction task. Contextual and list look-up features play an important role in improving extraction performance. The second strategy successfully extracts revenue and number-of-employees information, achieving precision, recall, and F1-measure above 0.93.
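
The abstract does not list the actual rule patterns, so the following is only a minimal sketch of what the first (rule-based) strategy might look like. The three regular expressions, the `RULES` table, and the `extract_fields` helper are hypothetical names and patterns chosen for illustration. Each rule combines an orthographic cue (capitalisation, digit shapes), a layout cue (position at the start of a line), or a limited contextual cue (a trigger phrase such as "for the year ended").

```python
import re

# Hypothetical rule patterns; the paper's real rules are not given in the abstract.
RULES = {
    # Layout + orthographic cue: a line that starts with "PT" and ends with "Tbk".
    "company": re.compile(r"^[ \t]*(PT\s+[A-Z][A-Za-z .&]+?\s+Tbk)\b", re.MULTILINE),
    # Contextual cue (trigger phrase, case-insensitive) + orthographic cue (date shape).
    "period": re.compile(
        r"(?i:for the year ended|as of)\s+(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})"
    ),
    # Contextual cue: "expressed in ... <currency>" with up to three filler words.
    "currency": re.compile(
        r"(?:expressed|stated|presented)\s+in\s+(?:\w+\s+){0,3}?"
        r"(rupiah|us dollars?|idr|usd)",
        re.IGNORECASE,
    ),
}


def extract_fields(text):
    """Apply each rule once and return the first captured value, or None."""
    result = {}
    for name, rule in RULES.items():
        match = rule.search(text)
        result[name] = match.group(1) if match else None
    return result


# Made-up cover-page text, used only to exercise the rules.
SAMPLE = """PT Contoh Makmur Tbk
Consolidated Financial Statements
For the year ended 31 December 2014
(Expressed in millions of Rupiah)
"""

fields = extract_fields(SAMPLE)
```

On the made-up sample, `fields` holds the company name, period, and currency captured by the three rules. A field whose rule finds no match is reported as `None` rather than guessed.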

Keywords

"Feature extraction","Data mining","Companies","Information retrieval","Layout","Internet"

Publisher

IEEE

Conference_Titel

Information & Communication Technology and Systems (ICTS), 2015 International Conference on

Print_ISBN

978-1-5090-0095-1

Type

conf

DOI

10.1109/ICTS.2015.7379893

Filename

7379893