DocumentCode :
1688398
Title :
Multi-threaded data mining of EDGAR CIKs (Central Index Keys) from ticker symbols
Author :
Lyon, Douglas A.
Author_Institution :
Comput. Eng. Dept., Fairfield Univ., Fairfield, CT
fYear :
2008
Firstpage :
1
Lastpage :
7
Abstract :
This paper describes how to use the Java Swing HTMLEditorKit to perform multi-threaded web data mining on the EDGAR system (Electronic Data Gathering, Analysis, and Retrieval system). EDGAR is the means by which the SEC (U.S. Securities and Exchange Commission) automates the collection, validation, indexing, acceptance, and forwarding of submissions. Some entities are regulated by the SEC (e.g. publicly traded firms) and are required, by law, to file with the SEC. Our focus is on making use of EDGAR to get information about company filings. These filings are associated with companies through their Central Index Key (CIK). The CIK is used on the SEC's computer system to identify entities that have filed a disclosure with the SEC. We show how to map a stock ticker symbol into a CIK. The methodology for converting the web data source into internal data structures is based on using HTML as the input into a context-sensitive parser-callback facility. Screen scraping is a popular means of data mining, but the unstructured nature of HTML pages makes this a challenge. The stop-and-wait nature of HTTP queries, as well as the non-deterministic nature of the response time, adversely impacts performance. We show that a combination of caching and multi-threading can improve performance by several orders of magnitude.
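The approach summarized in the abstract (a Swing parser callback, a cache, and a pool of worker threads) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name CikLookupSketch, the "CIK=<digits>" link pattern, the canned HTML pages, and the made-up tickers and CIK values are all hypothetical. The fetcher is injected as a function so the example runs without network access; a real fetcher would issue the HTTP query for a ticker and return the response body.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CikLookupSketch {
    // Assumption: the CIK shows up as a "CIK=<digits>" query parameter in a link.
    private static final Pattern CIK_PARAM = Pattern.compile("CIK=(\\d{1,10})");

    // Cache of ticker -> CIK (or empty), shared by all worker threads.
    private final ConcurrentMap<String, Optional<String>> cache = new ConcurrentHashMap<>();
    // Hypothetical fetcher: maps a ticker to the HTML text of a result page.
    private final Function<String, String> fetcher;

    public CikLookupSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Runs the Swing HTML parser and returns the first CIK found in an anchor href. */
    public static Optional<String> extractCik(String html) {
        final String[] found = new String[1];
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (found[0] != null || tag != HTML.Tag.A) return;
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) return;
                Matcher m = CIK_PARAM.matcher(href.toString());
                if (m.find()) found[0] = m.group(1);
            }
        };
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Optional.ofNullable(found[0]);
    }

    /** Returns the cached result if present; otherwise fetches, parses, and caches. */
    public Optional<String> lookup(String ticker) {
        return cache.computeIfAbsent(ticker, t -> extractCik(fetcher.apply(t)));
    }

    /** Looks up many tickers concurrently so slow responses overlap instead of queueing. */
    public Map<String, Optional<String>> lookupAll(List<String> tickers, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Optional<String>>> pending = new LinkedHashMap<>();
            for (String t : tickers) {
                pending.put(t, pool.submit(() -> lookup(t)));
            }
            Map<String, Optional<String>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Optional<String>>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Canned pages with made-up tickers and CIKs stand in for HTTP responses.
        Map<String, String> pages = Map.of(
            "AAA", "<html><body><a href=\"/lookup?CIK=0000000001\">AAA Corp</a></body></html>",
            "BBB", "<html><body><p>No match</p></body></html>");
        CikLookupSketch sketch = new CikLookupSketch(pages::get);
        System.out.println(sketch.lookupAll(List.of("AAA", "BBB"), 4));
    }
}
```

Two design choices in this sketch are worth noting. The cache means a repeated ticker costs one map lookup rather than another stop-and-wait HTTP round trip, and the fixed thread pool lets the waits for different tickers overlap. Doing the fetch inside computeIfAbsent keeps the example short, but it can block other updates that hash to the same bin of the map, so a production version might cache futures instead.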
Keywords :
Java; data mining; hypermedia markup languages; information networks; multi-threading; EDGAR CIK; EDGAR system; HTMLEditorKit; HTTP queries; Java Swing; Web data source; central index keys; company filings; context-sensitive parser-callback facility; internal data structures; multithreaded Web data mining; multithreaded data mining; screen scraping; ticker symbols; Data analysis; Data mining; Data security; Data structures; HTML; Indexing; Information retrieval; Information security; Java; Performance analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2008.4536453
Filename :
4536453